Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board
David Hutchison, Lancaster University, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Alfred Kobsa, University of California, Irvine, CA, USA
Friedemann Mattern, ETH Zurich, Switzerland
John C. Mitchell, Stanford University, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz, University of Bern, Switzerland
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, TU Dortmund University, Germany
Madhu Sudan, Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Gerhard Weikum, Max Planck Institute for Informatics, Saarbruecken, Germany
6522
Marcos K. Aguilera Haifeng Yu Nitin H. Vaidya Vikram Srinivasan Romit Roy Choudhury (Eds.)
Distributed Computing and Networking
12th International Conference, ICDCN 2011
Bangalore, India, January 2-5, 2011
Proceedings
Volume Editors Marcos K. Aguilera Microsoft Research Silicon Valley 1065 La Avenida – bldg. 6, Mountain View, CA 94043, USA E-mail:
[email protected] Haifeng Yu National University of Singapore School of Computing, COM2-04-25 15 Computing Drive, Republic of Singapore 117418 E-mail:
[email protected] Nitin H. Vaidya University of Illinois at Urbana-Champaign 458 Coordinated Science Laboratory MC-228, 1308 West Main Street, Urbana, IL 61801, USA E-mail:
[email protected] Vikram Srinivasan Alcatel-Lucent Technologies Manyata Technology Park, Nagawara, Bangalore 560045, India E-mail:
[email protected] Romit Roy Choudhury Duke University, ECE Department 130 Hudson Hall, Box 90291, Durham, NC 27708, USA E-mail:
[email protected]

Library of Congress Control Number: 2010940620
CR Subject Classification (1998): C.2, D.1.3, D.2.12, D.4, F.2, F.1.2, H.4
LNCS Sublibrary: SL 1 – Theoretical Computer Science and General Issues
ISSN 0302-9743
ISBN-10 3-642-17678-X Springer Berlin Heidelberg New York
ISBN-13 978-3-642-17678-4 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. springer.com © Springer-Verlag Berlin Heidelberg 2011 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper 06/3180
Message from the General Chairs
On behalf of the Conference Committee for ICDCN 2011, it is our pleasure to welcome you to Bangalore, India, for the 12th International Conference on Distributed Computing and Networking. ICDCN is a premier international forum for distributed computing and networking researchers, vendors, practitioners, application developers, and users, organized every year with support from industry and academic sponsors. Since the first conference on distributed computing held in 2000, ICDCN has become a leading forum for researchers and practitioners to exchange ideas and share best practices in the field of distributed computing and networking. In addition, ICDCN serves as a forum for PhD students to share their research ideas and get quality feedback from renowned experts in the field. This reputation rests on the quality of the work submitted, the standard of the tutorials and workshops organized, the dedication and sincerity of the Technical Program Committee members, the quality of the keynote speakers, the ability of the Steering Committee to react to change, and the policy of being friendly to students by sponsoring a large number of travel grants and keeping the registration cost the lowest among international conferences anywhere in the world.

This 12th ICDCN illustrates the intense productivity and cutting-edge research of the members of the distributed computing and networking community across the globe. This is the first time ICDCN is hosted in Bangalore, the Silicon Valley of India. It is jointly hosted by Infosys Technologies, a leading global information technology company headquartered in Bangalore, and the International Institute of Information Technology-Bangalore (IIIT-B), a renowned research and academic institution. Bangalore is the hub of information technology companies and the seat of innovation in India's high-tech industry. The richness of its culture and history, blended with a modern lifestyle and the vibrancy of its young professional population, together with its position at the heart of southern India, makes Bangalore one of the major Indian tourist destinations.

We are grateful for the generous support of our numerous sponsors: Infosys, Google, Microsoft Research, HP, IBM, Alcatel-Lucent, NetApp, and NIIT University. Their sponsorship is critical to the success of this conference. The success of the conference also depended on the help of many other people, and our thanks go to each and every one of them: the Steering Committee, which helped us in all stages of the conference; the Technical Program Committee, which meticulously evaluated each and every paper submitted to the conference; the Workshop and Tutorial Committee, which put together top-notch and topical workshops and tutorials; the Local Arrangements and Finance Committee, who worked day in and day out to make sure that each and every attendee of the conference feels
at home before and during the conference; and the other Chairs, who toiled hard to maintain the high standards of the conference, making it a great success. Welcome, and enjoy ICDCN 2011, Bangalore and India.

January 2011
Sanjoy Paul Lorenzo Alvisi
Message from the Technical Program Chairs
The 12th International Conference on Distributed Computing and Networking (ICDCN 2011) continues to grow as a leading forum for disseminating the latest research results in distributed computing and networking. It is our greatest pleasure to present the proceedings of the technical program of ICDCN 2011.

This year we received 140 submissions from all over the world, including Austria, Canada, China, Finland, France, Germany, India, Iran, Israel, The Netherlands, Portugal, Singapore, Spain, Sri Lanka, Switzerland, and USA. These submissions were carefully reviewed and evaluated by the Program Committee, which consisted of 36 members for the Distributed Computing track and 48 members for the Networking track. For some submissions, the Program Committee further solicited additional help from external reviewers. The Program Committee eventually selected 31 regular papers and 3 short papers for inclusion in the proceedings and presentation at the conference.

It is our distinct honor to recognize the paper "Generating Fast Indulgent Algorithms" by Dan Alistarh, Seth Gilbert, Rachid Guerraoui, and Corentin Travers as the Best Paper in the Distributed Computing track, and the paper "GoDisco: Selective Gossip-Based Dissemination of Information in Social Community-Based Overlays" by Anwitaman Datta and Rajesh Sharma as the Best Paper in the Networking track. In both the reviewing and the best paper selection processes, PC members and PC Chairs who had a conflict of interest with any given paper were excluded from the decision-making process related to that paper.

Besides the core technical program, ICDCN 2011 offers a number of other stimulating events. Before the main conference program, we have a full day of tutorials. During the main conference, we are fortunate to have several distinguished scientists as keynote speakers. The main conference is further followed by several other exciting events including the PhD forum.

We thank all authors who submitted a paper to ICDCN 2011, which allowed us to select a strong technical program. We thank the Program Committee members and external reviewers for their diligence and commitment, both during the reviewing process and during the online discussion phase. We thank the conference General Chairs and other Organizing Committee members for working with us to make ICDCN 2011 a success.

January 2011
Marcos K. Aguilera Romit Roy Choudhury Vikram Srinivasan Nitin Vaidya Haifeng Yu
Organization
General Chairs
Lorenzo Alvisi, University of Texas at Austin, USA (Distributed Computing Track)
Sanjoy Paul, Infosys Technologies, Bangalore, India (Networking Track)
Program Chairs

Networking Track
Vikram Srinivasan (Co-chair), Alcatel-Lucent, India
Nitin Vaidya (Co-chair), University of Illinois at Urbana-Champaign, USA
Romit Roy Choudhury (Vice Chair), Duke University, USA
Distributed Computing Track
Marcos K. Aguilera (Co-chair), Microsoft Research Silicon Valley, USA
Haifeng Yu (Co-chair), National University of Singapore, Singapore
Keynote Chair
Sajal Das, University of Texas at Arlington and NSF, USA
Prasad Jayanti, Dartmouth College, USA
Tutorial Chairs
Vijay Garg, University of Texas at Austin, USA
Samir Das, Stony Brook University, USA
Publication Chair
Marcos K. Aguilera, Microsoft Research Silicon Valley, USA
Haifeng Yu, National University of Singapore, Singapore
Vikram Srinivasan, Alcatel-Lucent, India
Publicity Chair
Luciano Bononi, University of Bologna, Italy
Dipanjan Chakraborty, IBM Research Lab, India
Anwitaman Datta, NTU, Singapore
Rui Fan, Microsoft, USA
Industry Chairs
Ajay Bakre, Intel, India
Finance Chair
Santonu Sarkar, Infosys Technologies, India
PhD Forum Chairs
Mainak Chatterjee, University of Central Florida, USA
Sriram Pemmaraju, University of Iowa, Iowa City, USA
Local Arrangements Chairs
Srinivas Padmanabhuni, Infosys Technologies, India
Amitabha Das, Infosys Technologies, India
Debabrata Das, International Institute of Information Technology, Bangalore, India
International Advisory Committee
Prith Banerjee, HP Labs, USA
Prasad Jayanti, Dartmouth College, USA
Krishna Kant, Intel and NSF, USA
Dipankar Raychaudhuri, Rutgers University, USA
S. Sadagopan, IIIT Bangalore, India
Rajeev Shorey, NIIT University, India
Nitin Vaidya, University of Illinois at Urbana-Champaign, USA
Roger Wattenhofer, ETH Zurich, Switzerland
Program Committee: Networking Track
Arup Acharya, IBM Research, USA
Habib M. Ammari, Hofstra University, USA
Vartika Bhandari, Google, USA
Bharat Bhargava, Purdue University, USA
Saad Biaz, Auburn University, USA
Luciano Bononi, University of Bologna, Italy
Mainak Chatterjee, University of Central Florida, USA
Mun Choon Chan, National University of Singapore, Singapore
Carla-Fabiana Chiasserini, Politecnico di Torino, Italy
Romit Roy Choudhury, Duke University, USA
Marco Conti, University of Bologna, Italy
Amitabha Das, Infosys, India
Samir Das, Stony Brook University, USA
Roy Friedman, Technion, Israel
Marco Gruteser, Rutgers University, USA
Katherine H. Guo, Bell Labs, USA
Mahbub Hassan, University of New South Wales, Australia
Gavin Holland, HRL Laboratories, USA
Sanjay Jha, University of New South Wales, Australia
Andreas Kassler, Karlstad University, Sweden
Salil Kanhere, University of New South Wales, Australia
Jai-Hoon Kim, Ajou University, South Korea
Myungchul Kim, Information and Communication University, South Korea
Young-Bae Ko, Ajou University, South Korea
Jerzy Konorski, Gdansk University of Technology, Poland
Bhaskar Krishnamachari, University of Southern California, USA
Mohan Kumar, University of Texas at Arlington, USA
Joy Kuri, IISc, Bangalore, India
Baochun Li, University of Toronto, Canada
Xiangyang Li, Illinois Institute of Technology, USA
Ben Liang, University of Toronto, Canada
Anutosh Maitra, Infosys, India
Archan Misra, Telcordia Lab, USA
Mehul Motani, National University of Singapore, Singapore
Asis Nasipuri, University of North Carolina at Charlotte, USA
Srihari Nelakuditi, University of South Carolina, USA
Sotiris Nikoletseas, Patras University, Greece
Kumar Padmanabh, Infosys, India
Chiara Petrioli, University of Rome La Sapienza, Italy
Bhaskaran Raman, IIT Bombay, India
Catherine Rosenberg, University of Waterloo, Canada
Rajashri Roy, IIT Kharagpur, India
Bahareh Sadeghi, Intel, USA
Moushumi Sen, Motorola, India
Srinivas Shakkottai, Texas A&M University, USA
Wang Wei, ZTE, China
Xue Yang, Intel, USA
Yanyong Zhang, Rutgers University, USA
Program Committee: Distributed Computing Track
Mustaque Ahamad, Georgia Institute of Technology, USA
Hagit Attiya, Technion, Israel
Rida A. Bazzi, Arizona State University, USA
Ken Birman, Cornell University, USA
Pei Cao, Stanford University, USA
Haowen Chan, Carnegie Mellon University, USA
Wei Chen, Microsoft Research Asia, China
Gregory Chockler, IBM Research Haifa Labs, Israel
Jeremy Elson, Microsoft Research, USA
Rui Fan, Technion, Israel
Christof Fetzer, Dresden University of Technology, Germany
Pierre Fraigniaud, CNRS and University of Paris Diderot, France
Seth Gilbert, National University of Singapore, Singapore
Rachid Guerraoui, EPFL, Switzerland
Tim Harris, Microsoft Research, UK
Maurice Herlihy, Brown University, USA
Prasad Jayanti, Dartmouth College, USA
Chip Killian, Purdue University, USA
Arvind Krishnamurthy, University of Washington, USA
Fabian Kuhn, University of Lugano, Switzerland
Zvi Lotker, Ben-Gurion University of the Negev, Israel
Victor Luchangco, Sun Labs, Oracle, USA
Petros Maniatis, Intel Labs Berkeley, USA
Alessia Milani, Universite Pierre & Marie Curie, France
Yoram Moses, Technion, Israel
Gopal Pandurangan, Brown University and Nanyang Technological University, Singapore
Sergio Rajsbaum, Universidad Nacional Autonoma de Mexico, Mexico
C. Pandu Rangan, Indian Institute of Technology Madras, India
Andre Schiper, EPFL, Switzerland
Stefan Schmid, T-Labs/TU Berlin, Germany
Neeraj Suri, TU Darmstadt, Germany
Srikanta Tirthapura, Iowa State University, USA
Sam Toueg, University of Toronto, Canada
Mark Tuttle, Intel Corporation, USA
Krishnamurthy Vidyasankar, Memorial University of Newfoundland, Canada
Hakim Weatherspoon, Cornell University, USA
Additional Referees: Networking Track
Rik Sarkar, Kangseok Kim, Maheswaran Sathiamoorthy, Karim El Defrawy, Sangho Oh, Michele Nati, Sung-Hwa Lim, Yi Gai, Tam Vu, Young-June Choi, Jaehyun Kim, Amitabha Ghosh, Giordano Fusco, Ge Zhang, Sanjoy Paul, Aditya Vashistha, Bo Yu, Vijayaraghavan Varadharajan, Ying Chen, Francesco Malandrino, Majed Alresaini, Pralhad Deshpande
Additional Referees: Distributed Computing Track
John Augustine, Ioannis Avramopoulos, Binbin Chen, Atish Das Sarma, Carole Delporte-Gallet, Michael Elkin, Hugues Fauconnier, Danny Hendler, Damien Imbs, Maleq Khan, Huijia Lin, Danupon Nanongkai, Noam Rinetzky, Nuno Santos, Andreas Tielmann, Amitabh Trehan, Maysam Yabandeh
Table of Contents
The Inherent Complexity of Transactional Memory and What to Do about It (Invited Talk) ..... 1
   Hagit Attiya

Sustainable Ecosystems: Enabled by Supply and Demand Management (Invited Talk) ..... 12
   Chandrakant D. Patel

Unclouded Vision (Invited Talk) ..... 29
   Jon Crowcroft, Anil Madhavapeddy, Malte Schwarzkopf, Theodore Hong, and Richard Mortier

Generating Fast Indulgent Algorithms ..... 41
   Dan Alistarh, Seth Gilbert, Rachid Guerraoui, and Corentin Travers

An Efficient Decentralized Algorithm for the Distributed Trigger Counting Problem ..... 53
   Venkatesan T. Chakaravarthy, Anamitra R. Choudhury, Vijay K. Garg, and Yogish Sabharwal

Deterministic Dominating Set Construction in Networks with Bounded Degree ..... 65
   Roy Friedman and Alex Kogan

PathFinder: Efficient Lookups and Efficient Search in Peer-to-Peer Networks ..... 77
   Dirk Bradler, Lachezar Krumov, Max Mühlhäuser, and Jussi Kangasharju

Single-Version STMs Can Be Multi-version Permissive (Extended Abstract) ..... 83
   Hagit Attiya and Eshcar Hillel

Correctness of Concurrent Executions of Closed Nested Transactions in Transactional Memory Systems ..... 95
   Sathya Peri and Krishnamurthy Vidyasankar

Locality-Conscious Lock-Free Linked Lists ..... 107
   Anastasia Braginsky and Erez Petrank

Specification and Constant RMR Algorithm for Phase-Fair Reader-Writer Lock ..... 119
   Vibhor Bhatt and Prasad Jayanti

On the Performance of Distributed Lock-Based Synchronization ..... 131
   Yuval Lubowich and Gadi Taubenfeld

Distributed Generalized Dynamic Barrier Synchronization ..... 143
   Shivali Agarwal, Saurabh Joshi, and Rudrapatna K. Shyamasundar

A High-Level Framework for Distributed Processing of Large-Scale Graphs ..... 155
   Elzbieta Krepska, Thilo Kielmann, Wan Fokkink, and Henri Bal

Affinity Driven Distributed Scheduling Algorithm for Parallel Computations ..... 167
   Ankur Narang, Abhinav Srivastava, Naga Praveen Kumar, and Rudrapatna K. Shyamasundar

Temporal Specifications for Services with Unboundedly Many Passive Clients ..... 179
   Shamimuddin Sheerazuddin

Relating L-Resilience and Wait-Freedom via Hitting Sets ..... 191
   Eli Gafni and Petr Kuznetsov

Load Balanced Scalable Byzantine Agreement through Quorum Building, with Full Information ..... 203
   Valerie King, Steven Lonargan, Jared Saia, and Amitabh Trehan

A Necessary and Sufficient Synchrony Condition for Solving Byzantine Consensus in Symmetric Networks ..... 215
   Olivier Baldellon, Achour Mostéfaoui, and Michel Raynal

GoDisco: Selective Gossip-Based Dissemination of Information in Social Community-Based Overlays ..... 227
   Anwitaman Datta and Rajesh Sharma

Mining Frequent Subgraphs to Extract Communication Patterns in Data-Centres ..... 239
   Maitreya Natu, Vaishali Sadaphal, Sangameshwar Patil, and Ankit Mehrotra

On the Hardness of Topology Inference ..... 251
   H.B. Acharya and M.G. Gouda

An Algorithm for Traffic Grooming in WDM Mesh Networks Using Dynamic Path Selection Strategy ..... 263
   Sukanta Bhattacharya, Tanmay De, and Ajit Pal

Analysis of a Simple Randomized Protocol to Establish Communication in Bounded Degree Sensor Networks ..... 269
   Bala Kalyanasundaram and Mahendran Velauthapillai

Reliable Networks with Unreliable Sensors ..... 281
   Srikanth Sastry, Tsvetomira Radeva, Jianer Chen, and Jennifer L. Welch

Energy Aware Fault Tolerant Routing in Two-Tiered Sensor Networks ..... 293
   Ataul Bari, Arunita Jaekel, and Subir Bandyopadhyay

Scheduling Randomly-Deployed Heterogeneous Video Sensor Nodes for Reduced Intrusion Detection Time ..... 303
   Congduc Pham

An Integrated Routing and Medium Access Control Framework for Surveillance Networks of Mobile Devices ..... 315
   Nicholas Martin, Yamin Al-Mousa, and Nirmala Shenoy

Security in the Cache and Forward Architecture for the Next Generation Internet ..... 328
   G.C. Hadjichristofi, C.N. Hadjicostis, and D. Raychaudhuri

Characterization of Asymmetry in Low-Power Wireless Links: An Empirical Study ..... 340
   Prasant Misra, Nadeem Ahmed, Diethelm Ostry, and Sanjay Jha

Model Based Bandwidth Scavenging for Device Coexistence in Wireless LANs ..... 352
   Anthony Plummer Jr., Mahmoud Taghizadeh, and Subir Biswas

Minimal Time Broadcasting in Cognitive Radio Networks ..... 364
   Chanaka J. Liyana Arachchige, S. Venkatesan, R. Chandrasekaran, and Neeraj Mittal

Traffic Congestion Estimation in VANETs and Its Application to Information Dissemination ..... 376
   Rayman Preet Singh and Arobinda Gupta

A Tiered Addressing Scheme Based on a Floating Cloud Internetworking Model ..... 382
   Yoshihiro Nozaki, Hasan Tuncer, and Nirmala Shenoy

DHCP Origin Traceback ..... 394
   Saugat Majumdar, Dhananjay Kulkarni, and Chinya V. Ravishankar

A Realistic Framework for Delay-Tolerant Network Routing in Open Terrains with Continuous Churn ..... 407
   Veeramani Mahendran, Sivaraman K. Anirudh, and C. Siva Ram Murthy

Author Index ..... 419
Invited Paper: The Inherent Complexity of Transactional Memory and What to Do about It

Hagit Attiya
Department of Computer Science, Technion
[email protected]

Abstract. This paper overviews some of the lower bounds on the complexity of implementing software transactional memory, and explains their underlying assumptions. It discusses how these lower bounds align with experimental results and design choices made in existing implementations to indicate that the transactional approach for concurrent programming must compromise either programming simplicity or scalability. There are several contemporary research avenues that address the challenge of concurrent programming. For example, optimizing coarse-grained techniques, and concurrent programming with mini-transactions, i.e., simple atomic operations on a small number of locations.
1 The TM Approach to Concurrent Programming

As anyone with a laptop or an Internet connection (that is, everyone) knows, the multicore revolution is here. Almost any computing appliance contains several processing cores, and the number of cores in servers is in the low teens. With the improved hardware comes the need to harness the power of concurrency, since the processing power of individual cores does not increase. Applications must be restructured in order to reap the benefits of multiple processing units, without paying a costly price for coordination among them.

It has been argued that writing concurrent applications is significantly more challenging than writing sequential ones. Surely, there is a longer history of creating and analyzing sequential code, and this is reflected in undergraduate education. Many programmers are mystified by the intricacies of interaction between multiple processes or threads, and the need to coordinate and synchronize them.

Transactional memory (TM) has been suggested as a way to deal with the alleged difficulty of writing concurrent applications. In its simplest form, the programmer need only wrap code with operations denoting the beginning and end of a transaction. The transactional memory will take care of synchronizing the shared memory accesses so that each transaction seems to execute sequentially and in isolation.

Originally suggested as a hardware platform by Herlihy and Moss [29], TM resurfaced as a software mechanism a couple of years later. The first software implementation of transactional memory [43] provided, in essence, support for multi-word synchronization operations on a static set of data items, in terms of a unary operation (LL/SC), somewhat optimized over prior implementations, e.g., [46, 9]. Shavit and Touitou coined the term software transactional memory (STM) to describe this implementation.
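To make the programming model concrete before turning to formal definitions, the sketch below shows what "wrapping code in a transaction" looks like from the programmer's side. The API names (Stm, atomically) are invented for this illustration, and the toy runtime simply serializes all transactions behind one global lock; a real STM would instead synchronize optimistically and abort and retry on conflicts.

```java
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

// Hypothetical STM facade: the programmer marks only the beginning and end
// of a transaction; all synchronization is left to the runtime.
final class Stm {
    private static final ReentrantLock GLOBAL = new ReentrantLock();

    static <T> T atomically(Supplier<T> body) {
        GLOBAL.lock();          // "begin transaction"
        try {
            return body.get();  // appears to run sequentially, in isolation
        } finally {
            GLOBAL.unlock();    // "end transaction" (commit)
        }
    }
}

final class Account {
    long balance;
    Account(long b) { balance = b; }
}

final class Bank {
    // Inside the transaction, the concurrent logic reads like sequential code.
    static void transfer(Account from, Account to, long amount) {
        Stm.atomically(() -> {
            from.balance -= amount;
            to.balance += amount;
            return null;
        });
    }
}
```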
Only when the termination condition was relaxed to obstruction freedom (see Section 2.2) was the first STM handling a dynamic set of data items presented [28]. Work by Rajwar et al., e.g., [37, 42], helped to popularize the TM approach in the programming languages and hardware communities.
2 Formalizing TM

This section outlines how transactional memory can be formally captured. A comprehensive, in-depth treatment can be found in [33]. The model encompasses at least two levels of abstraction: the high level has transactions, each of which is a sequence of operations accessing data items; at the low level, the operations are translated into executions in which a sequence of events applies primitive operations to base objects, containing the data and the meta-data needed for the implementation.

A transaction is a sequence of operations executed by a single process on a set of data items, shared with other transactions. Data items are accessed by read and write operations; some systems also support other operations. The interface also includes try-commit and try-abort operations, with which a transaction requests to commit or abort, respectively. Any of these operations, not just try-abort, may cause the transaction to abort; in this case, we say that the transaction is forcibly aborted. The collection of data items accessed by a transaction is its data set; the items written by the transaction are its write set, with the other items being its read set.

A software implementation of transactional memory (abbreviated STM) provides data representation for transactions and data items using base objects, and algorithms, specified as primitive operations (abbreviated primitives) on the base objects. These procedures are followed by asynchronous processes in order to execute the operations of transactions. The primitives can be simple reads and writes, but also more sophisticated ones, like CAS or DCAS, typically applied to memory locations, which are the base objects for the implementation.

When processes invoke these procedures in an interleaved manner, we obtain executions, in the standard sense of asynchronous distributed computing (cf. [8]). Executions consist of configurations, describing a complete state of the system, and events, describing a single step by an individual process, including an application of a single primitive to base objects (possibly several objects, e.g., in the case of DCAS).

The interval of a transaction T is the execution interval that starts at the first event of T and ends at the last event of T. If T does not have a last event in the execution, then the interval of T is the (possibly infinite) execution interval starting at the first event of T. Two transactions overlap if their intervals overlap.

2.1 Safety: Consistency Properties of TM

An STM is serializable if transactions appear to execute sequentially, one after the other [39]. An STM is strictly serializable if the serialization order preserves the order
of non-overlapping transactions [39]. This notion is called order-preserving serializability in [47], and is the analogue of linearizability [31] for transactions.¹ Opacity [23] further demands that even partially executed transactions, which may later abort, must be serializable (in an order-preserving manner). Opacity also accommodates operations beyond read and write.

While opacity is a stronger condition than serializability, snapshot isolation [10] is a consistency condition weaker than serializability. Roughly stated, snapshot isolation ensures that all read operations in a transaction return the most recent value as of the time the transaction starts; the write sets of concurrent transactions must be disjoint. (Cf. [47, Definition 10.3].)

¹ Linearizability, like sequential consistency [36], talks about implementing abstract data structures, and hence they involve one abstraction: from the high-level operations of the data structure to the low-level primitives. It also provides the semantics of the operations, and their expected results at the high level, on the data structure.
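To see what opacity adds over serializability, consider the classical "zombie transaction" scenario, sketched below as commented Java-style pseudocode (txRead stands for a generic transactional read; the variable names are illustrative):

```java
// Every committed transaction preserves the invariant x == y.
// A doomed ("zombie") transaction T can still misbehave before it aborts:
//
//   int a = txRead(x);              // T reads x == 0
//   /* an updater commits x = 1; y = 1 here */
//   int b = txRead(y);              // T reads y == 1: an inconsistent view
//   int r = 1 / (1 - (b - a));      // division by zero, before T can abort
//
// Serializability constrains only committed transactions, so T's
// inconsistent snapshot is allowed and the program crashes (or could loop
// forever). Opacity also constrains partially executed, later-aborted
// transactions, ruling this behavior out.
```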
2.2 Progress: Termination Guarantees for TM

One of the innovations of TM is in allowing transactions not to commit when they are faced with conflicting transactions. This, however, admits trivial implementations where no progress is ever made. Finding the right balance between nontriviality and efficiency has led to several progress properties. They are first and foremost distinguished by whether locking is accommodated or not.

When locks are not allowed, the strongest requirement—rarely provided—is of wait-freedom, namely, that each transaction has to eventually commit. A weaker property ensures that some transaction eventually commits, or that a transaction commits when running by itself. The last property is called obstruction-freedom [28] (see further discussion in [3]). A lock-based STM (e.g., TL2 [16]) is often required to be weakly progressive [24], namely, a transaction that does not encounter conflicts must commit. Several lower bounds assume a minimal progress property, ensuring that a transaction terminates successfully if it runs alone, from a situation in which no other transaction is pending. This property encompasses both obstruction freedom and weak progressiveness.

Related definitions [34, 20, 24] further attempt to capture the distinction between aborts that are necessary in order to maintain the safety properties (e.g., opacity) and spurious aborts that are not mandated by the consistency property, and to measure their ratio. Strong progressiveness [24] ensures that even when there are conflicts, some transaction commits. More specifically, an STM is strongly progressive if a transaction without nontrivial conflicts is not forcibly aborted, and if a set of transactions have nontrivial conflicts on a single item then not all of them are forcibly aborted. (Recall that a transaction is forcibly aborted when the abort was not requested by a try-abort operation of the transaction, i.e., the abort is in response to try-commit, read or write operations.) Another definition [40] says that an STM is multi-version (MV)-permissive if a transaction is forcibly aborted only if it is an update transaction that has a nontrivial conflict with another update transaction.
Strong progressiveness and MV-permissiveness are incomparable: the former allows a read-only transaction to abort if it has a conflict with another update transaction, while the latter does not guarantee that at least one transaction is not forcibly aborted in case of a conflict. Strictly speaking, these properties are not liveness properties in the classical sense [35], since they can be checked in finite executions.

2.3 Predicting Performance

There have been some theoretical attempts to predict how well TM implementations will scale, resulting in definitions that postulate behaviors that are expected to yield superior performance.

Disjoint-access parallelism. The most accepted such notion is disjoint-access parallelism, capturing the requirement that unrelated transactions progress independently, even if they occur at the same time. That is, an implementation should not cause two transactions, which are unrelated at the high level, to simultaneously access the same low-level shared memory. We explain what it means for two transactions to be unrelated through a conflict graph that represents the relations between transactions.

The conflict graph of an execution interval I is an undirected graph, where vertices represent transactions and edges connect transactions that share a data item. Two transactions T1 and T2 are disjoint access if there is no path between the vertices representing them in the conflict graph of their execution intervals; they are strictly disjoint access if there is no edge between these vertices. Two events contend on a base object o if they both access o, and at least one of them applies a non-trivial primitive to o.² Transactions concurrently contend on a base object o if they have pending events at the same configuration that contend on o.

Property 1 (Disjoint-access parallelism). An STM implementation is disjoint-access parallel if two transactions concurrently contend on the same base object only if they are not disjoint access.

This definition captures the first condition of the disjoint-access parallelism property of Israeli and Rappoport [32], in accordance with most of the literature (cf. [30]). It is somewhat weaker, as it allows two processes to apply a trivial primitive on the same base object, e.g., read, even when executing disjoint-access transactions. Moreover, this definition only prohibits concurrent contending accesses, allowing transactions to contend on a base object o at different points of the execution. The original disjoint-access parallelism definition [32] also restricts the impact of concurrent transactions on the step complexity of a transaction. For more precise definitions and discussion, see [7].

² A primitive is non-trivial if it may change the value of the object, e.g., a write or CAS; otherwise, it is trivial, e.g., a read.
Invisible reads. It is expected that many typical applications will generate workloads that include a significant portion of read-only transactions. This includes, for example, transactions that search a data structure and find whether it contains a particular data item. Many STMs attempt to optimize read-only transactions and, more generally, the implementation of read operations inside a transaction. By their very nature, read operations, and even more so read-only transactions, need not leave a mark on the shared memory; therefore, it is desirable to avoid writing in such transactions, i.e., to make sure that reads are invisible, and certainly that read-only transactions do not write at all.

Remark 1. Some authors [15] refer to a transaction as having invisible reads even if it writes, but the information is not sufficiently detailed to supply the exact details about the transaction's data set. (In their words, "the STM does not know which, or even how many, readers are accessing a given memory location.") This behavior is captured by the stronger notion of an oblivious STM [5].
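The distinction can be made concrete with a small sketch (class and field names are invented for this illustration; real STMs differ in many details):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// An invisible read leaves no trace in shared memory: the reader only logs,
// in its own private read set, which item it read and which version it saw.
final class TxnDescriptor {
    final List<long[]> privateReadSet = new ArrayList<>(); // {itemId, version}

    void invisibleRead(long itemId, long versionSeen) {
        privateReadSet.add(new long[] { itemId, versionSeen });
    }
}

// A visible read announces the reader in shared metadata attached to the
// item, so conflicting writers can detect it -- at the cost of a write to
// shared memory on every read, which is exactly what the text above says
// read-only transactions would like to avoid.
final class DataItem {
    final Set<TxnDescriptor> visibleReaders = ConcurrentHashMap.newKeySet();

    void visibleRead(TxnDescriptor reader) {
        visibleReaders.add(reader);
    }
}
```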
3 Some Lower Bound Results

This section overviews some of the recent work on showing the inherent complexity of TM. This includes a few impossibility results showing that certain properties simply cannot be achieved by a TM, and several worst-case lower bounds showing that other properties put a high price on the TM, often in terms of the number of steps that must be performed, or as bounds on the local computation involved. The rest of the section mentions some of these results.

3.1 The Cost of Validation

A very interesting result shows the additional cost of opacity over serializability, namely, of making sure that the values read by a transaction are consistent while it is in progress (and not just at commit time, as done in many database implementations). Guerraoui and Kapalka [23] showed that the number of steps in a read operation is linear in the size of the invoking transaction's read set, assuming that reads are invisible, the STM keeps only a single version of each data item, and the STM is progressive (i.e., it never aborts a transaction unless it conflicts with another pending transaction). In contrast, when only serializability is guaranteed, the values read need only be validated at commit time, leading to significant savings.
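The linear cost has a simple operational reading: in a single-version STM with invisible reads, a standard way to preserve opacity is incremental validation, where every new read re-checks the whole read set. The sketch below (illustrative names; the subtleties of reading a value and its version atomically are glossed over) shows why the k-th read then costs time proportional to k:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

final class AbortException extends Exception {}

final class VersionedItem {
    final AtomicLong version = new AtomicLong(); // bumped by committing writers
    volatile long value;
}

final class IncrementallyValidatingTxn {
    private final List<VersionedItem> readSet = new ArrayList<>();
    private final List<Long> versionsSeen = new ArrayList<>();

    long read(VersionedItem item) throws AbortException {
        long v = item.version.get();
        long val = item.value;
        if (item.version.get() != v) throw new AbortException();
        // Re-validate everything read so far: the i-th read does i-1 checks,
        // so reads are linear in the read set, matching the lower bound.
        for (int i = 0; i < readSet.size(); i++) {
            if (readSet.get(i).version.get() != versionsSeen.get(i)) {
                throw new AbortException(); // an item changed under our feet
            }
        }
        readSet.add(item);
        versionsSeen.add(v);
        return val;
    }
}
```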
3.2 The Consensus Number of TM

It has been shown that lock-based and obstruction-free TMs can solve consensus for at most two processes [22], that is, their consensus number [26] is 2. An intermediate step shows that such TMs are equivalent to shared objects that fail in a very clean manner [3]. Roughly speaking, this is a consensus object providing a familiar propose operation, allowing a thread to provide an input and wait for a unanimous decision value; however, the propose operation may return a definite fail indication, which ensures that the proposed value will not be decided upon. Intuitively, an aborted transaction corresponds to a propose operation returning false. To get the full result, further mechanisms are needed to handle the long-lived nature of a transactional memory.
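The interface of such a cleanly failing consensus object can be sketched as follows. For concreteness, this sketch realizes the semantics directly with compareAndSet rather than with a TM, so it says nothing about consensus numbers; it only illustrates the propose-or-fail contract that an aborted transaction induces (the class name is invented for this illustration):

```java
import java.util.concurrent.atomic.AtomicReference;

final class FailingConsensus<V> {
    private final AtomicReference<V> decision = new AtomicReference<>();

    // Returns the unanimous decision, or null as the definite "fail"
    // indication: the caller then knows its value was not, and never will
    // be, decided -- the analogue of a forcibly aborted transaction.
    V propose(V mine) {
        if (decision.compareAndSet(null, mine)) {
            return mine;                        // our value was decided
        }
        V winner = decision.get();
        return (winner == mine) ? mine : null;  // null plays the abort role
    }
}
```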
3.3 Achieving Disjoint-Access Parallelism

Guerraoui and Kapalka [22] prove that obstruction-free implementations of software transactional memory cannot ensure strict disjoint-access parallelism. This property requires transactions with disjoint data sets not to access a common base object. This notion is stronger than disjoint-access parallelism (Property 1), which allows two transactions with disjoint data sets to access the same base objects, provided they are connected via other transactions. Note that the lower bound does not hold under this more standard notion, as Herlihy et al. [28] present an obstruction-free and disjoint-access parallel STM.

For the stronger case of wait-free read-only transactions, the assumption of strict disjoint-access parallelism can be replaced with the assumption that read-only transactions are invisible. We have proved [7] that an STM cannot be disjoint-access parallel and have invisible read-only transactions that always terminate successfully. A read-only transaction not only has to write, but the number of writes is linear in the size of its read set. Both results hold for strict serializability, and hence also for opacity. With a slight modification of the notion of disjoint-access parallelism, they also hold for serializability and snapshot isolation.

3.4 Privatization

An important goal for STM is to access certain items by simple reads and writes, without paying the overhead of the transactional memory. It has been shown [21] that, in many cases, this cannot be achieved without prior privatization [45, 44], namely, invoking a privatization transaction, or some other kind of privatizing barrier [15]. We have recently proved [5] that, unless parallelism (in terms of progressiveness) is greatly compromised or detailed information about non-conflicting transactions is tracked (i.e., the STM is not oblivious), the privatization cost must be linear in the number of items that are privatized.
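The pattern these privatization results speak about looks roughly as follows (a minimal sketch; the synchronized block stands in for the privatizing transaction, and all names are invented for this illustration):

```java
// A thread first detaches a node inside a (privatizing) transaction; once
// that transaction commits, no other transaction can reach the node, so the
// thread may access it with plain, uninstrumented reads and writes.
final class Node {
    int payload;
    Node next;
}

final class PrivatizingList {
    private Node head; // shared, normally accessed transactionally

    int detachAndUseHead() {
        Node mine;
        synchronized (this) {    // stand-in for the privatizing transaction
            mine = head;
            if (mine != null) head = mine.next;
        }
        if (mine == null) return 0;
        // From here on the node is private: native access, free of the STM's
        // instrumentation. The lower bound discussed above concerns exactly
        // the cost of making this hand-off safe.
        mine.payload += 1;
        return mine.payload;
    }
}
```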
3.5 Avoiding Aborts

It has been shown [34] that an opaque, strongly progressive STM requires NP-complete local computation, while a weaker, online notion requires visible reads.

4 Interlude: How Well Does TM Work in Practice?

Collectively, the results described here demonstrate that TM faces significant limitations: it cannot provide clean semantics without weakening the consistency semantics or compromising the progress guarantees. The implementations are also significantly limited in their scalability. Finally, it is not clear how expressive the programming idiom they provide is (since their consensus number is only two).

One might argue that these are just theoretical results, which anyway (mostly) describe only the worst case, so, in practice, we are just fine. However, while the results
are mostly stated for the worst case, these are often not corner cases that are unlikely to happen in practice, but natural cases, representative of typical scenarios. Moreover, it is difficult to design an STM that behaves differently in different scenarios, or to expose these specific scenarios to the programmer using intricate guarantees.

There is evidence that implementation-focused research has also been hitting a similar wall [11]. Design choices made in existing TMs, whether in hardware or in software, compromise either the claimed simplicity of the model (e.g., elastic transactions [19]) or its transparency and generality (e.g., transactional boosting [27]). Alternatively, there are TMs with reduced scalability, weakened progress guarantees, or weakened performance.
5 Concurrent Programming in a Post-TM Era

The TM approach "infantilizes" programmers, telling them that the TM will take care of making sure their programs run correctly and efficiently, even in a concurrent setting. Given that this approach may not be able to deliver the promised combination of efficiency and programming simplicity, and that it must expose many of the complications of consistency or progress guarantees, perhaps we should stop sheltering the programmer from the reality of concurrency?

It might be possible to expose a cleaner model of a multi-core system to programmers, while providing them with better methodologies, tools and programming patterns that will simplify the design of concurrent code, without hiding its tradeoffs. It is my belief that a multitude of approaches should be proposed, besides TM, catering to different needs and setups. This section mentions two, somewhat complementary, approaches to alleviating the difficulty of designing concurrent applications.

5.1 Optimizing Coarse-Grain Programming

For small-scale applications, or with a moderate amount of contention for the data, the overhead of managing the memory might outweigh the cost of delays due to synchronization [17]. In such situations, it might be simpler to rely on coarse-grained synchronization, that is, to design applications in which shared data is mostly accessed "in exclusion". This does not mean a return to simplistic programming with critical sections and mutexes. Instead, this recommends the use of novel methods that have several processes compete for the lock and then, to avoid additional contention, have the lock holder carry out all (or many of) the pending operations on the data [25], as in the sketch below. For non-locking algorithms, this can be seen as a return to Herlihy's universal construction [26], somewhat optimized to improve memory utilization [13].
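A minimal sketch of this combining idea follows (names invented for this illustration; production designs such as flat combining [25] differ in many details, e.g., they use per-thread publication records rather than a queue):

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.locks.ReentrantLock;

// Threads publish requests; whichever thread holds the lock (the combiner)
// applies *all* pending requests, so the others never touch the shared state.
final class CombiningCounter {
    private static final class Request {
        final int delta;
        volatile boolean done;
        Request(int d) { delta = d; }
    }

    private final ReentrantLock lock = new ReentrantLock();
    private final ConcurrentLinkedQueue<Request> pending =
            new ConcurrentLinkedQueue<>();
    private long total; // coarse-grained shared state, guarded by the lock

    void add(int delta) {
        Request req = new Request(delta);
        pending.add(req);
        while (!req.done) {
            if (lock.tryLock()) {
                try {
                    Request r;
                    while ((r = pending.poll()) != null) {
                        total += r.delta;  // combiner applies everyone's op
                        r.done = true;
                    }
                } finally {
                    lock.unlock();
                }
            } else {
                Thread.onSpinWait();  // the current combiner works for us
            }
        }
    }
}
```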
5.2 Programming with Mini-transactions

A complementary approach is motivated by the observation that many of the lower bounds rely on the fact that TM must deal with large, unpredictable (dynamic) data sets, accessed with an arbitrary set of operations, and interleaved with generic calculations (including I/O).

What if the TM had to deal only with short transactions, with simple functionality and small, known-in-advance (static) data sets, to which only simple arithmetic, comparison, and memory access operations are applied? My claim is that such mini-transactions could greatly alleviate the burden of concurrent programming, while still allowing efficient implementation. It is obvious that mini-transactions avoid many of the costs indicated by the lower bounds and complexity results for TM, because they are so restricted. Looking from an implementation perspective, mini-transactions are a design choice that simplifies and improves the performance of TM. Indeed, they are already almost provided by recent hardware TM proposals from AMD [2] and Sun [12]. The support is best-effort in nature since, in addition to data conflicts, transactions can be aborted due to other reasons, for example, TLB misses, interrupts, certain function-call sequences and instructions like division [38].

Mini-transactions. Mini-transactions are a simple extension of DCAS, or its extension to kCAS, with small values of k, e.g., 3CAS, 4CAS. In fact, mini-transactions are a natural, multi-location variant of the LL/SC pair supported in IBM's PowerPC [1] and DEC Alpha [18]. A mini-transaction works on a small and, if possible, static data set, and applies simple functionality, without I/O, out-of-core memory accesses, etc. It is supposed to be short, in order to ensure success. Yet, even if all these conditions are satisfied, the application should be prepared to deal with spurious failures, and not violate the integrity of the data. An important issue is to allow "native" (uninstrumented) access to the locations accessed by mini-transactions, through a clean, implicit mechanism. Thus, they are subject to concerns similar to those arising when privatizing transactional data.

Algorithmic challenges. Mini-transactions can provide a significant handle on the difficult task of writing concurrent applications, based on our experience of leveraging even the fairly restricted DCAS [6, 4], and others' experience in utilizing recent hardware TM support [14]. Nevertheless, using mini-transactions still leaves several algorithmic challenges. The first, already discussed above, is the design of algorithms accommodating the best-effort nature of mini-transactions. Another one is to deal with their limited arity, i.e., the small data set, in a systematic manner. An interesting approach could be to understand how mini-transactions can support the needs of amorphous data parallelism [41]. Finally, even with convenient synchronization of accesses to several locations, it is still necessary to find ways to exploit the parallelism, by having threads make progress on their individual tasks, without interfering with each other, and while helping each other as necessary. Part of the challenge is to span the full spectrum: from virtually sequential situations in which threads operate almost in isolation from each other, all the way to highly parallel situations, where many concurrent threads should be harnessed to perform work efficiently, rather than making slow progress due to high contention.
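As a sketch of the interface (not of a realistic implementation), a kCAS-style mini-transaction can be pictured as below. The toy version simply serializes operations behind one lock; hardware or software best-effort support could also fail spuriously, so callers would wrap it in a retry loop with a fallback. All names here are invented for the illustration:

```java
import java.util.concurrent.atomic.AtomicLongArray;
import java.util.concurrent.locks.ReentrantLock;

// kCAS over a small, statically known set of locations: if every
// memory[addr[i]] holds expected[i], install update[i] everywhere, atomically.
final class MiniTx {
    private final ReentrantLock lock = new ReentrantLock();
    private final AtomicLongArray memory;

    MiniTx(int size) { memory = new AtomicLongArray(size); }

    // Returns false on a conflict; a best-effort implementation may also
    // return false spuriously, which the caller must be prepared to handle.
    boolean kcas(int[] addr, long[] expected, long[] update) {
        lock.lock();
        try {
            for (int i = 0; i < addr.length; i++) {
                if (memory.get(addr[i]) != expected[i]) return false;
            }
            for (int i = 0; i < addr.length; i++) {
                memory.set(addr[i], update[i]);
            }
            return true;
        } finally {
            lock.unlock();
        }
    }
}
```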
6 Summary

This paper describes recent research on formalizing transactional memory and exploring its inherent limitations. It suggests ways to facilitate the design of efficient and correct concurrent applications in the post-TM era, while still capitalizing on the lessons learned in designing TM and the wide interest it has generated. The least explored of these is the design of algorithms and programming patterns that accommodate best-effort mini-transactions, in a way that does not compromise safety and guarantees liveness in an eventual sense.

Acknowledgements. I have benefited from discussions about concurrent programming and transactional memory with many people, but would like to especially thank my Ph.D. student Eshcar Hillel for many illuminating discussions and comments on this paper. Part of this work was done while the author was on sabbatical at EPFL. The author is supported in part by the Israel Science Foundation (grants 953/06 and 1227/10).
References

1. PowerPC Microprocessor Family: The Programming Environment (1991)
2. Advanced Micro Devices, Inc.: Advanced Synchronization Facility - Proposed Architectural Specification, 2.1 edn. (March 2009)
3. Attiya, H., Guerraoui, R., Hendler, D., Kuznetsov, P.: The complexity of obstruction-free implementations. J. ACM 56(4) (2009)
4. Attiya, H., Hillel, E.: Built-in coloring for highly-concurrent doubly-linked lists. In: Dolev, S. (ed.) DISC 2006. LNCS, vol. 4167, pp. 31–45. Springer, Heidelberg (2006)
5. Attiya, H., Hillel, E.: The cost of privatization. In: Lynch, N.A., Shvartsman, A.A. (eds.) Distributed Computing. LNCS, vol. 6343, pp. 35–49. Springer, Heidelberg (2010)
6. Attiya, H., Hillel, E.: Highly-concurrent multi-word synchronization. In: Rao, S., Chatterjee, M., Jayanti, P., Murthy, C.S.R., Saha, S.K. (eds.) ICDCN 2008. LNCS, vol. 4904, pp. 112–123. Springer, Heidelberg (2008)
7. Attiya, H., Hillel, E., Milani, A.: Inherent limitations on disjoint-access parallel implementations of transactional memory. In: SPAA 2009 (2009)
8. Attiya, H., Welch, J.L.: Distributed Computing: Fundamentals, Simulations and Advanced Topics, 2nd edn. Wiley, Chichester (2004)
9. Barnes, G.: A method for implementing lock-free shared-data structures. In: SPAA 1993, pp. 261–270 (1993)
10. Berenson, H., Bernstein, P., Gray, J., Melton, J., O'Neil, E., O'Neil, P.: A critique of ANSI SQL isolation levels. In: SIGMOD 1995, pp. 1–10 (1995)
11. Cascaval, C., Blundell, C., Michael, M., Cain, H.W., Wu, P., Chiras, S., Chatterjee, S.: Software transactional memory: why is it only a research toy? Commun. ACM 51(11), 40–46 (2008)
12. Chaudhry, S., Cypher, R., Ekman, M., Karlsson, M., Landin, A., Yip, S., Zeffer, H., Tremblay, M.: Rock: A high-performance SPARC CMT processor. IEEE Micro 29(2), 6–16 (2009)
13. Chuong, P., Ellen, F., Ramachandran, V.: A universal construction for wait-free transaction friendly data structures. In: SPAA 2010, pp. 335–344 (2010)
14. Dice, D., Lev, Y., Marathe, V., Moir, M., Olszewski, M., Nussbaum, D.: Simplifying concurrent algorithms by exploiting hardware TM. In: SPAA 2010, pp. 325–334 (2010)
15. Dice, D., Matveev, A., Shavit, N.: Implicit privatization using private transactions. In: Transact 2010 (2010)
16. Dice, D., Shalev, O., Shavit, N.: Transactional locking II. In: Dolev, S. (ed.) DISC 2006. LNCS, vol. 4167, pp. 194–208. Springer, Heidelberg (2006)
17. Dice, D., Shavit, N.: What really makes transactions fast? In: Transact 2006 (2006)
18. Digital Equipment Corporation: Alpha Architecture Handbook (1992)
19. Felber, P., Gramoli, V., Guerraoui, R.: Elastic transactions. In: Lynch, N.A., Shvartsman, A.A. (eds.) Distributed Computing. LNCS, vol. 6343, pp. 93–107. Springer, Heidelberg (2010)
20. Gramoli, V., Harmanci, D., Felber, P.: Towards a theory of input acceptance for transactional memories. In: Baker, T.P., Bui, A., Tixeuil, S. (eds.) OPODIS 2008. LNCS, vol. 5401, pp. 527–533. Springer, Heidelberg (2008)
21. Guerraoui, R., Henzinger, T., Kapalka, M., Singh, V.: Transactions in the jungle. In: SPAA 2010, pp. 263–272 (2010)
22. Guerraoui, R., Kapalka, M.: On obstruction-free transactions. In: SPAA 2008, pp. 304–313 (2008)
23. Guerraoui, R., Kapalka, M.: On the correctness of transactional memory. In: PPoPP 2008, pp. 175–184 (2008)
24. Guerraoui, R., Kapalka, M.: The semantics of progress in lock-based transactional memory. In: POPL 2009, pp. 404–415 (2009)
25. Hendler, D., Incze, I., Shavit, N., Tzafrir, M.: Flat combining and the synchronization-parallelism tradeoff. In: SPAA 2010, pp. 355–364 (2010)
26. Herlihy, M.: Wait-free synchronization. ACM Trans. Program. Lang. Syst. 13(1), 124–149 (1991)
27. Herlihy, M., Koskinen, E.: Transactional boosting: a methodology for highly-concurrent transactional objects. In: PPoPP 2008, pp. 207–216 (2008)
28. Herlihy, M., Luchangco, V., Moir, M., Scherer III, W.N.: Software transactional memory for dynamic-sized data structures. In: PODC 2003, pp. 92–101 (2003)
29. Herlihy, M., Moss, J.E.B.: Transactional memory: Architectural support for lock-free data structures. In: ISCA 1993 (1993)
30. Herlihy, M., Shavit, N.: The Art of Multiprocessor Programming. Morgan Kaufmann, San Francisco (2008)
31. Herlihy, M.P., Wing, J.M.: Linearizability: a correctness condition for concurrent objects. ACM Trans. Program. Lang. Syst. 12(3), 463–492 (1990)
32. Israeli, A., Rappoport, L.: Disjoint-access-parallel implementations of strong shared memory primitives. In: PODC 1994, pp. 151–160 (1994)
33. Kapalka, M.: Theory of Transactional Memory. Ph.D. thesis Nr. 4664, EPFL (2010)
34. Keidar, I., Perelman, D.: On avoiding spare aborts in transactional memory. In: SPAA 2009, pp. 59–68 (2009)
35. Lamport, L.: Proving the correctness of multiprocess programs. IEEE Transactions on Software Engineering SE-3(2), 125–143 (1977)
36. Lamport, L.: How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Transactions on Computers C-28(9), 690–691 (1979)
37. Larus, J.R., Rajwar, R.: Transactional Memory. Morgan and Claypool, San Francisco (2007)
38. Moir, M., Moore, K., Nussbaum, D.: The adaptive transactional memory test platform: A tool for experimenting with transactional code for Rock. In: Transact 2008 (2008)
39. Papadimitriou, C.H.: The serializability of concurrent database updates. J. ACM 26(4), 631–653 (1979)
40. Perelman, D., Fan, R., Keidar, I.: On maintaining multiple versions in STM. In: PODC 2010, pp. 16–25 (2010)
41. Pingali, K., Kulkarni, M., Nguyen, D., Burtscher, M., Mendez-Lojo, M., Prountzos, D., Sui, X., Zhong, Z.: Amorphous data-parallelism in irregular algorithms. Technical Report TR-0905, The University of Texas at Austin, Department of Computer Sciences (2009)
42. Rajwar, R., Goodman, J.R.: Transactional lock-free execution of lock-based programs. In: ASPLOS 2002, pp. 5–17 (2002)
43. Shavit, N., Touitou, D.: Software transactional memory. In: PODC 1995, pp. 204–213 (1995)
44. Shpeisman, T., Menon, V., Adl-Tabatabai, A.-R., Balensiefer, S., Grossman, D., Hudson, R.L., Moore, K.F., Saha, B.: Enforcing isolation and ordering in STM. SIGPLAN Not. 42(6), 78–88 (2007)
45. Spear, M.F., Marathe, V.J., Dalessandro, L., Scott, M.L.: Privatization techniques for software transactional memory. Technical Report TR 915, Dept. of Computer Science, Univ. of Rochester (2007)
46. Turek, J., Shasha, D., Prakash, S.: Locking without blocking: making lock based concurrent data structure algorithms nonblocking. In: PODS 1992, pp. 212–222 (1992)
47. Weikum, G., Vossen, G.: Transactional Information Systems: Theory, Algorithms, and the Practice of Concurrency Control and Recovery. Morgan Kaufmann, San Francisco (2001)
Sustainable Ecosystems: Enabled by Supply and Demand Management

Chandrakant D. Patel, IEEE Fellow
Hewlett Packard Laboratories, Palo Alto, CA 94304, USA
[email protected]

Abstract. Continued population growth, coupled with increased per capita consumption of resources, poses a challenge to the quality of life of current and future generations. We cannot expect to meet the future needs of society simply by extending existing infrastructures. The necessary transformation can be enabled by a sustainable IT ecosystem made up of billions of service-oriented client devices and thousands of data centers. The IT ecosystem, with data centers at its core and pervasive measurement at the edges, will need to be seamlessly integrated into future communities to enable need-based provisioning of critical resources. Such a transformation requires a systemic approach based on the supply and demand of resources. A supply side perspective necessitates using local resources of available energy, alongside design and management that minimizes the energy required to extract, manufacture, mitigate waste, transport, operate and reclaim components. The demand side perspective requires provisioning resources based on the needs of the user by using flexible building blocks, pervasive sensing, communications, knowledge discovery and policy-based control. This paper presents a systemic framework for supply-demand management in IT, focusing in particular on building sustainable data centers, and suggests how the approach can be extended to manage resources at the scale of urban infrastructures.

Keywords: available energy, exergy, energy, data center, IT, sustainable, ecosystems, sustainability.
1 Introduction

1.1 Motivation

Environmental sustainability has gained great mindshare. Actions and behaviors are often classified as either "green" or "not green" using a variety of metrics. Many of today's "green" actions are based on products that are already built but classified as "environmentally friendly" based on greenhouse gas emissions and energy consumption in the use phase. Such compliance-time thinking lacks a sustainability framework that could holistically address global challenges associated with resource consumption.

These resource consumption challenges will stem from various drivers. The world population is expected to reach 9 billion by 2050 [1]. How do we deal with the increasing strain that economic growth is placing on our dwindling natural
resources? Can we expect to meet the needs of society by solely relying on replicating and extending the existing physical infrastructure to cope with economic and population growth? Indeed, anecdotal evidence of the strain that society is placing on the supply side—the resources used for goods and services—is apparent: rising prices for critical materials, such as copper and steel; the dramatic reduction in output of the Pemex Cantarell oil field in Mexico, one of the largest in the world; and limitations in city scale waste disposal. Furthermore, a rise in the price of fuel has led to an inflationary impact that could threaten the quality of life of billions. Thus, the depletion of limited natural resources and increases in the cost of basic goods necessitate new business models and infrastructures that are designed, built and operated using the least possible amount of appropriate materials and energy. The supply side must be considered together with the societal demand for resources. This paper presents a holistic framework for sustainable design and management. Unlike past work that has mostly focused on operational energy considerations of devices, this contribution weaves lifecycle implications into a broader supply-demand framework. The following are the salient contributions:
• Use of available energy (also called exergy) from the 2nd law of thermodynamics as a metric for quantifying sustainability.
• Formulation of a supply-demand framework based on available energy.
• Application of this framework to IT, in particular to data centers.
• Extension of the framework to other ecosystems such as cities.
1.2 Role of the IT Ecosystem

Consider the information technology (IT) ecosystem made up of billions of service oriented client devices, thousands of data centers and digital print factories. As shown in Figure 1, the IT ecosystem and other human managed ecosystems such as transportation, waste management, power delivery, industrial systems, etc. draw from a pool of available energy. In this context, IT has the opportunity to change existing business models and deliver a net positive impact with respect to consumption of available energy. To do so, sustainability of the IT ecosystem itself must be addressed holistically.

Given a sustainable IT ecosystem, imagine the scale of impact when billions in growth economies like India utilize IT services to conduct transactions such as purchasing railway tickets, banking, availing healthcare, government services, etc. As the billions board the IT bus, and shun other business models, such as ones that require the use of physical transportation means like an auto-rickshaw to go to the train station to buy tickets, the net reduction in the consumption of available energy can be significant. Indeed, when one overlays a scenario where everything will be delivered as a service, a picture emerges of billions of end users utilizing trillions of applications through a cloud of networked data centers. However, to reach the desired price point where such services will be feasible—especially in emerging economies, where Internet access is desired at approximately US $1 per month—the total cost-of-ownership (TCO) of the physical infrastructure that supports the cloud will need to be revisited. There are about 81 million Internet connections in India [2]. There has been progress in reducing the cost of access devices [3], but the cost to avail services still needs to be addressed. In this regard, without addressing the cost of data centers—the foundation for services to the masses—scaling to billions of users is not possible.
Fig. 1. Consumption of Available Energy
With respect to data centers, prior work has shown that a significant fraction of the TCO comes from the recurring energy consumed in the operation of the data center, and from the burdened capital expenditures associated with the supporting physical infrastructure [4]. The burdened cost of power and cooling, inclusive of redundancy, is estimated to be 25% to 30% of the total cost of ownership in typical enterprise data centers [4]. These power and cooling infrastructure costs may match, or even exceed, the cost of the IT hardware within the data center. Thus, including the cost of IT hardware, over half of the TCO in a typical data center is associated with the design and management of the physical infrastructure. For Internet service providers, with thinner layers of software and licensing costs, the physical infrastructure could be responsible for as much as 75% of the TCO. Conventional approaches to building data centers with multiple levels of redundancy and excessive material—an "always-on" mantra with no regard to service level agreements, and a lack of dynamic provisioning of resources—lead to excessive over-provisioning and cost. Therefore, cost reduction requires an end to end approach that delivers least-materials, least-energy data centers. Indeed, contrary to the oft-held view of sustainability as "paying more to be green", minimizing the overall lifecycle available energy consumption, and thereby building sustainable data centers, leads to the lowest cost data centers.
2 Available Energy or Exergy as a Metric

2.1 Exergy

IT and other ecosystems draw from a pool of available energy, as shown in Figure 1. Available energy, also called exergy, refers to energy that is available for performing work [5]. While energy measures quantity, exergy quantifies the useful portion (or "quality") of energy. As an example, in a vehicle, the combustion of a
given mass of fuel such as diesel results in propulsion of the vehicle (useful work done), dissipation of heat energy, and a waste stream of exhaust gases at a given temperature. From the first law of thermodynamics, the quantity of energy is conserved in the combustion process, as the sum of the energy in the products equals that in the fuel. However, from the 2nd law of thermodynamics, the usefulness of the energy was destroyed, since there is not much useful work that can be harnessed from the waste streams, e.g., exhaust gases. One can also state that the combustion of fuel resulted in an increase of entropy, or disorder, in the universe—going from a more ordered state in the fuel to a less ordered state in the waste streams. As all processes result in an increase in entropy, and consequent destruction of exergy due to entropy generation, minimizing the destruction of exergy is an important sustainability consideration. From a holistic supply-demand point of view, one can say that we are drawing from a finite pool of available energy, and minimizing destruction of available energy is key for future generations to enjoy the same quality of life as the current generation. With respect to making the most of available energy, it is also important to understand and avail opportunities in extracting available energy from waste streams. Indeed, it is instructive to examine the combustion example further to understand the exergy content of waste streams. Classical thermodynamics dictates the upper limit of the work, A, that could be recovered from a heat source, Q (in Joules), at temperature T_j (in Kelvin) emitting to a reservoir at ground state temperature T_a as:
A = (1 − T_a / T_j) Q    (1)
For example, with reference to Equation 1, relative to a ground temperature of 298 K (25 °C), 1 Joule of heat energy at 773 K (500 °C)—such as exhaust gases from a gas turbine—can give 0.614 Joules of available energy. Therefore, a waste stream at this high temperature has good availability (61%) that can be harvested. By the same token, the same Joule at 323 K (50 °C)—such as exhaust air from a high power server—can only give 0.077 Joules of work. While this determines the theoretical maximum Carnot work that can be availed with a perfect reversible engine, the actual work is much less due to irreversible losses such as friction. Stated simply, the 2nd law of thermodynamics places a limit on the amount of energy that can be converted from one form to another. Similarly, the laws of thermodynamics can be applied to other conversion means, e.g., electrochemical reactions in fuel cells, to estimate the portion of reaction enthalpy that can be converted to electricity [6].
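The availability figures above follow directly from Equation 1; a minimal sketch, using the temperatures and heat values from the example in the text:

def available_work(q_joules, t_source_k, t_ambient_k=298.0):
    """Upper bound on work recoverable from heat q_joules at t_source_k
    (Equation 1, Carnot limit) relative to an ambient at t_ambient_k."""
    return (1.0 - t_ambient_k / t_source_k) * q_joules

# Gas turbine exhaust at 773 K: ~0.614 J of available energy per Joule of heat
print(available_work(1.0, 773.0))   # 0.6145...
# Server exhaust air at 323 K: only ~0.077 J per Joule
print(available_work(1.0, 323.0))   # 0.0774...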
Traditional methods of design involve the completion of an energy balance based on the conservation theory of the first law of thermodynamics. Such a balance can provide the information necessary to reduce thermal losses or enhance heat recovery, but an energy analysis fails to account for degradation in the quality of energy due to irreversibilities predicted by the second law of thermodynamics. Thus, an approach based on the second law of thermodynamics is necessary for analyzing available energy or exergy consumption across the lifecycle of a product—from "cradle to cradle". Furthermore, it can also be used to create the analytics necessary to run operations that minimize destruction of exergy, and to create inference analytics that can enable need-based provisioning of resources. Lastly, exergy analysis is important to determine the value of a waste stream and tie it to an appropriate process that can make the most of it. For example, converting exhaust heat energy to electrical energy using a thermo-electric conversion process may apply in some cases, but not in others, once one takes into account the exergy required to build and operate the thermo-electric conversion means.

2.2 Exergy Chain in IT

Electrical energy is produced from conversion of energy from one form to another. A common chain starts with converting the chemical energy in the fuel to thermal energy from the combustion of fuel, to mechanical energy in a rotating physical device, to electrical energy from a magnetically based dynamo. Alternatively, available energy in water—such as potential energy at a given height in a dam—can be converted to mechanical energy and then to electrical energy. The electrical energy is 100% available. However, as electrical energy is transmitted and distributed from the source to the point of use, losses along the way in transmission and distribution lead to destruction of availability. The source of power for most data centers (i.e., a thermal power station) operates at an efficiency in the neighborhood of 35% to 60% [7]. Transmission and distribution losses can range from 5% to 12%. System level efficiency in the data center power delivery infrastructure (i.e., from building to chip) can range from 60% to 85% depending on the component efficiency and load; around 80% is typical for a fully loaded state-of-the-art data center. Overall, out of every watt generated at the source, only about 0.3 W to 0.4 W is used for computation. If the generation cycle itself as well as the overhead of the data center infrastructure (i.e., cooling) is taken into account, the coal-to-chip power delivery efficiency will be around 5% to 12%. In addition to the consumption of exergy in operation, the material within the data center has exergy embedded in it. The embedded exergy stems from the exergy required to extract, manufacture, mitigate waste, and reclaim the material. Exergy is also embedded in IT as a result of direct use of water (for cooling) and indirect use of water (for production of parts, electricity, etc.). Water too can be represented using exergy. As an example, assuming nature desalinates water and there is sufficient fresh water available from the natural cycle, one can represent the exergy embedded in water as a result of distribution (exergy required to pump) and treatment (exergy required to treat waste water). On average, treatment and distribution of a million gallons of surface water requires 1.5 MWh of electrical energy. Similarly, treatment of a million gallons of waste water consumes 2.5 MWh of electrical energy [8].
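The chain losses quoted above compound multiplicatively. A minimal sketch, with each stage efficiency being an assumed point inside the ranges quoted in the text, not a measurement:

# Illustrative coal-to-chip efficiency chain (stage values are assumptions
# picked from the ranges quoted in the text).
transmission = 0.90        # 5-12% transmission and distribution losses
building_to_chip = 0.70    # data center power delivery, 60-85%
cooling_overhead = 0.50    # assume roughly half of delivered power feeds cooling/support

per_watt_generated = transmission * building_to_chip * cooling_overhead
print(f"Of every watt generated, ~{per_watt_generated:.2f} W reaches computation")  # ~0.32

generation = 0.35          # thermal power station, 35-60%
print(f"Coal-to-chip efficiency: ~{generation * per_watt_generated:.2f}")           # ~0.11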
3 Supply Side and Demand Side Management

3.1 Architectural Framework

In order to build sustainable ecosystems, the following systemic framework articulates the management of the supply and demand sides of available energy based on the needs of the users.
• On the supply side:
  o minimizing the exergy required to extract, manufacture, mitigate waste, transport, operate and reclaim components;
  o design and management using local sources of available energy to minimize the destruction of exergy in transmission and distribution, e.g., dissipation in transmission; and taking advantage of the exergy in waste streams, e.g., exhaust heat from a turbine.
• On the demand side:
  o minimizing the consumption of exergy by provisioning resources based on the needs of the user, by using flexible building blocks, pervasive sensing, communications, knowledge discovery and policy based control.
Sustainable ecosystems, given the supply and demand side definition above, are then built on delivering to the needs of the user. The needs of the user are derived from the service level agreement (SLA), decomposed into lower level metrics that can be applied in the field to enable integrated management of supply and demand. The balance of the paper steps through the framework by examining lifetime exergy consumption in IT, evolving a supply-demand framework for data centers, and closing by extending the framework to other ecosystems.

3.2 Quantifying Lifetime Exergy Consumption

As noted earlier, exergy or available energy, stemming from the second law of thermodynamics, fuses information about materials and energy use into a single meaningful measure. It estimates the maximum work in Joules that could theoretically have been extracted from a given amount of material or energy. By expressing a given system in terms of its lifetime exergy consumption, it becomes possible to remove dependencies on the type of material or the type of energy (heat, electricity, etc.) consumed. Therefore, given the lifecycle of a product, as shown in Figure 2, one can now create an abstract information plane that can be commonly applied across any arbitrary infrastructure. Lifecycle design then implies inputting the entire supply chain from "cradle to cradle" to account for exergy consumed in extraction, manufacturing, waste mitigation, transportation, operation and reclamation. From a supply side perspective, designers can focus on minimizing lifetime energy consumption through de-materialization, material choices, transportation, process choices, etc. across the lifecycle. The design toolkit requires a repository of available energy consumption data for various materials and processes. With respect to IT, the following provides an overview of the salient "hotspots" discerned using an exergy based lifetime analysis [9]:
• For service oriented access devices such as laptops, given a typical residential usage pattern, the lifetime operational exergy consumption is 20-30% of the total exergy consumed, while the rest is embedded (exergy consumed in extraction, manufacturing, transportation, reclamation).
  o Of the 70-80% of the lifetime exergy consumption that is embedded, the display is a big component.
Fig. 2. Lifecycle of a product
• For data centers, for a given server, the lifetime operational exergy consumption is about 60% to 80% of the total lifetime exergy consumption [10].
  o The large operational component stems from high electricity consumption in the IT equipment and the data center level cooling infrastructure [11][12].
From a strategic perspective, for handhelds, laptops and other forms of access devices, reducing the embedded exergy is critical. And, in order to minimize embedded exergy, least exergy process and material innovations are important. As an example, innovations in display technologies can reduce the embedded footprint of laptops and handhelds. On the other hand, for data centers, it is important to devise an architecture that focuses on minimizing the electricity (100% available energy) consumed in operation. The next section presents a supply-demand based architecture for peak exergetic efficiency associated with the synthesis and operation of a data center. A sustainable data center—built on lifetime exergy considerations, flexible and configurable resource micro-grids, pervasive sensing, communications and aggregation of sensed data, knowledge discovery and policy based autonomous control—is proposed.

3.3 Synthesis of a Sustainable Data Center Using Lifetime Exergy Analysis

Business goals drive the synthesis of a data center. For example, assuming a million users subscribing to a service at US $1/month, the expected revenue would be US $12 million per year. Correspondingly, it may be desirable to limit the infrastructure TCO (excluding software licenses and personnel) to 1/5th of that amount, or roughly US $2.4 million per year. A simple understanding of the impact of power can be had by estimating the cost implications in areas where low cost utility power is not available and diesel generators are used as a primary source. For example, if the data center supporting the million users consumes 1 MW of power at 1 W per user, or 8.76 million kWh per year, the cost of just powering it with diesel at approximately $0.25 per kWh will be about $2.2 million per year. Thus, growth economies strained on the resource side and reliant on local power generation with diesel will be at a great disadvantage.
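A minimal sketch of the cost arithmetic above; all figures are the example values from the text:

users = 1_000_000
revenue_per_user_month = 1.0                      # US $1/month service
annual_revenue = users * revenue_per_user_month * 12
target_tco = annual_revenue / 5                   # limit infrastructure TCO to 1/5 of revenue

power_mw = 1.0                                    # 1 W per user
annual_kwh = power_mw * 1000 * 8760               # 8.76 million kWh per year
diesel_cost = annual_kwh * 0.25                   # ~$0.25 per kWh for diesel generation

print(f"Annual revenue: ${annual_revenue:,.0f}")  # $12,000,000
print(f"Target TCO:     ${target_tco:,.0f}")      # $2,400,000
print(f"Diesel power:   ${diesel_cost:,.0f}")     # $2,190,000 -- nearly the whole budget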
Having understood such constraints, the data center must meet the target total cost of ownership (TCO) and uptime based on the service level agreements for a variety of workloads. Data center synthesis can be enabled by using a library of IT and facility templates to create a variety of design options. Given the variety of supply-demand design options, the key areas of analysis for sustainability and cost become:

1. Lifecycle analysis to evaluate each IT-Facility template and systematically dematerialize to drive towards a least lifetime embedded exergy design and lowest capital outlay, e.g., systematically reduce the number of IT, power and cooling units, remove excessive material in the physical design, etc.
   • The overall exergy analysis should also consider exergy in waste streams, and the locality of the data center to avail supply side resources (power and cooling).
2. Performance modeling toolkit to determine the performance of the ensemble and estimate the consumption of exergy during operation.
3. Reliability modeling toolkit to discover and design for various levels of uptime within the data center.
4. Performance modeling toolkit to determine the ability to meet the SLAs for a given IT-Facility template.
5. TCO modeling to estimate the deviation from the target TCO.

Combining all the key elements noted above enables a structured analysis of the set of applicable data center design templates for a given business need. The data center can be benchmarked in terms of performance per Joule of lifetime available energy or exergy destroyed. The lifetime exergy can be incorporated in a total cost of ownership model that includes software, personnel and licenses to determine the total cost of ownership of a rack [4], and used to price a cloud business model such as "infrastructure as a service".

3.4 Demand Side Management of Sustainable Data Centers

On the demand side, it is instructive to trace the energy flow in a data center. Electrical energy, all of which is available to do useful work, is transferred to the IT equipment. Most of the available electrical energy drawn by the IT hardware is dissipated as heat energy, while useful work is availed through information processing. The amount of useful work is not proportional to the power consumed by the IT hardware. Even in idle mode, IT hardware typically consumes more than 50% of its maximum power consumption [32]. As noted in [32], it is important to devise energy-proportional machines. However, it is also important to increase the utilization of IT hardware and reduce the total amount of required hardware. [31] presents such a resource management architecture. A common approach to increasing utilization is executing applications in virtual machines and consolidating the virtual machines onto fewer, larger servers [33]. As shown in [15], workload consolidation has the potential to reduce the IT power demand significantly.
Next, additional exergy is used to actively transfer the heat energy from the chip to the external ambient. Not all of the exergy delivered to the cooling equipment is used to effect the heat transfer. While a fraction of the electrical energy provided to a blower or pump is converted to flow work (the product of pressure in N/m² and volume flow in m³/s), and likewise a portion of the electrical energy applied to a compressor is converted to thermodynamic work (to reduce the temperature of the data center), the bulk of the exergy provided is destroyed due to irreversibility. Therefore, in order to build an exergy efficient system, the mantra for demand side management in the data center becomes one of allocating IT (compute, networking and storage), power and cooling resources based on need, with the following salient considerations:

• Decompose SLAs into Service Level Objectives:
  o based on the SLOs, allocate appropriate IT resources while meeting the performance and uptime requirements [28][29][30];
  o account for the spatial and temporal efficiencies and redundancies associated with thermo-fluids behavior in a given data center, based on heat loads and cooling boundary conditions [14].
• Consolidate workloads while taking into account the spatial and temporal efficiencies noted above, e.g., place critical workloads in "gold zones" of the data center which have inherent redundancies due to the intersection of fluid flows from multiple air conditioning units, and turn off or scale back power to IT equipment not in use [13][14][31] (a consolidation sketch follows this list).
• Enable the data center cooling equipment to scale based on the heat load distribution in the data center [11].
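As an illustration of the consolidation idea, a minimal first-fit-decreasing sketch; the workload figures and server capacity are hypothetical, and the papers cited above describe the actual controllers:

def consolidate(demands_watts, server_capacity_watts):
    """Pack workload power demands onto as few servers as possible
    (first-fit decreasing); idle servers can then be powered down."""
    servers = []  # residual capacity of each active server
    for d in sorted(demands_watts, reverse=True):
        for i, free in enumerate(servers):
            if d <= free:
                servers[i] -= d
                break
        else:
            servers.append(server_capacity_watts - d)
    return len(servers)

demands = [120, 80, 300, 150, 60, 220, 90]   # hypothetical VM power demands (W)
print(consolidate(demands, 400))              # 3 servers instead of 7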
Dynamic implementation of the key points described above can result in better utilization of resources, a reduction of active redundant components, and a reduction of electrical power consumption by half [13][15]. As an example, a data center designed for 1 MW of power at maximum IT load can run at up to 80% capacity with workload consolidation and dynamic control. The balance of 200 kW can be used for cooling and other support equipment. Indeed, besides availing failover margin by operating at 80%, the coefficient of performance of the power and cooling ensemble is often optimal at about 80% loading, given the efficiency curves of UPSs and mechanical equipment such as blowers, compressors, etc.

3.5 Coefficient of Performance of the Ensemble

Figure 3 shows a schematic of energy transfer in a typical air-cooled data center through flow and thermodynamic processes. Heat is transferred from the heat sinks on a variety of chips—microprocessors, memory, etc.—to the cooling fluid in the system: driven by fans, air as a coolant enters the system, undergoes a temperature rise based on the mass flow, and is exhausted into the room. Fluid streams from different servers undergo mixing and other thermodynamic and flow processes in the exhaust area of the racks. As an example, for air cooled servers and racks, the dominant irreversibilities that lead to destruction of exergy arise from the mixing of cold and hot air streams and the mechanical inefficiency of air moving devices. These streams (or some fraction of them) flow back to the modular computer room air conditioning units
(CRACs) and transfer heat to the chilled water (or refrigerant) in the cooling coils. Heat transferred to the chilled water at the cooling coils is transported to the chillers through a hydronics network. The coolant in the hydronics network, water in this case, undergoes a pressure drop and heat transfer until it loses heat to the expanding refrigerant in the evaporator coils of the chiller. The heat extracted by the chiller is dissipated through the cooling tower. Work is added at each stage to change the flow and thermodynamic state of the fluid. While this example shows a chilled water infrastructure, the problem definition and analysis can be extended to other forms of cooling infrastructure.

Fig. 3. Energy Flow in the IT stack – Supply and Demand Side (power grid and cooling grid feeding the data center; the heat load Q_datacenter is moved by the work inputs W_system, W_blower, W_pump, W_compressor and W_coolingtower, with the systems modeled as exergo-thermo-volumes)
Development of a performance model at each stage in the heat flow path can enable efficient cooling equipment design, and provide a holistic view of operational exergy consumption from the chips to the cooling tower. The performance model should be agnostic and applicable to an ensemble of components for any environmental control infrastructure. [16] proposes such an operational metric to quantify the performance of the ensemble from chips to cooling tower. The metric, called the coefficient of performance of the ensemble, COP_G, builds on the thermodynamic metric called coefficient of performance [16]. Maximizing the coefficient of performance of the ensemble leads to minimization of the exergy required to operate the cooling equipment. In Figure 3, the systems—such as processor, networking and storage blades—are modeled as "exergo-thermo-volumes" (ETV), an abstraction to represent the lifetime exergy consumption of the IT building blocks and their cooling performance [17][18]. The thermo-volumes portion represents the coolant volume flow and the resistance to the flow, characterized by volume flow (V̇) and pressure drop (ΔP) respectively, to effect heat energy removal from the ETVs. The product of pressure drop (ΔP in N/m²) and volume flow (V̇ in m³/s) determines the flow work required to move a given coolant (air here) through the given IT building block represented as an ETV. The minimum coolant volume flow (V̇) for a given temperature rise required through the ETV, shown by a dashed line in Figure 3, can be determined from the energy equation (Eq. 2).
Q̇ = ṁ C_p (T_out − T_in),  where ṁ = ρ V̇    (2)
where Q̇ is the heat dissipated in Watts, ṁ is the mass flow in kg/s, ρ is the density of the coolant in kg/m³ (air in this example), C_p is the specific heat capacity of air (J/kg-K), and T_in and T_out represent the inlet and outlet temperatures of the air. As shown in Equation 3, the electrical power (100% available energy) required by the blower, W_b, is the ratio of the calculated flow work to the blower wire-to-air efficiency, ζ_b. The blower characteristic curves show the efficiency (ζ_b) and are important for understanding the optimal capacity at the ensemble level.

W_b = (ΔP_etv × V̇_etv) / ζ_b    (3)
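A minimal sketch of Equations 2 and 3 together; the heat load, temperature rise, pressure drop and efficiency are hypothetical values chosen for illustration:

RHO_AIR = 1.2     # kg/m^3
CP_AIR = 1005.0   # J/(kg K)

def min_volume_flow(q_watts, delta_t_k):
    """Minimum coolant volume flow for a given heat load and air temperature
    rise (Eq. 2): Q = m_dot * Cp * dT, with m_dot = rho * V_dot."""
    return q_watts / (RHO_AIR * CP_AIR * delta_t_k)

def blower_power(delta_p_pa, v_dot_m3s, zeta_b):
    """Electrical power drawn by the blower (Eq. 3): flow work divided by
    the wire-to-air efficiency."""
    return (delta_p_pa * v_dot_m3s) / zeta_b

v_dot = min_volume_flow(q_watts=10_000, delta_t_k=10.0)          # 10 kW rack, 10 K rise
print(f"Minimum airflow: {v_dot:.2f} m^3/s")                      # ~0.83 m^3/s
print(f"Blower power: {blower_power(200.0, v_dot, 0.3):.0f} W")   # ~553 W at 200 Pa, 30% eff.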
The total heat load of the data center is assumed to be a direct summation of the power delivered to the computational equipment via the UPS and PDUs. Extending the coefficient of performance (COP) to encompass the power required by the cooling resources, in the form of flow and thermodynamic work, the ratio of total heat load to the power consumed by the cooling infrastructure is defined as:

COP = Total Heat Dissipation / (Flow Work + Thermodynamic Work of the Cooling System)
    = Heat Extracted by Air Conditioners / Net Work Input    (4)
The ensemble COP is then represented as shown below; the reader is referred to [16] for details.

COP_G = Q_datacenter / (Σ_k W_system + Σ_l W_blower + Σ_m W_pump + Σ_n W_compressor + Σ_o W_coolingtower)    (5)
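A minimal sketch of the ensemble metric; the component work values are hypothetical:

def cop_ensemble(q_datacenter_w, w_system, w_blowers, w_pumps,
                 w_compressors, w_coolingtowers):
    """Coefficient of performance of the ensemble (Eq. 5): total heat load
    over the sum of all work inputs from chips to cooling tower."""
    total_work = (sum(w_system) + sum(w_blowers) + sum(w_pumps)
                  + sum(w_compressors) + sum(w_coolingtowers))
    return q_datacenter_w / total_work

# 1 MW heat load; hypothetical work inputs in watts
print(cop_ensemble(1_000_000,
                   w_system=[60_000], w_blowers=[40_000], w_pumps=[30_000],
                   w_compressors=[150_000], w_coolingtowers=[20_000]))  # ~3.33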
3.6 Supply Side of a Sustainable Data Center Design

The supply side motivation to design and manage the data center using local sources of available energy is intended to minimize the destruction of exergy in distribution. Besides reducing exergy loss in distribution, a local micro-grid [19] can take advantage of resources that might otherwise remain unutilized, and also presents an opportunity to use the exergy in waste streams. Figure 4 shows a power grid with a solar and methane based generator at a dairy farm. The methane is produced by anaerobic digestion of manure from dairy cows [20]. The use of biogas from manure is well known and has been used all over the world [7].

Fig. 4. Biogas and solar electric
The advantage of co-location [20] stems from the use of the heat energy exhausted by the server racks—one Joule of which has a maximum theoretical available energy of 0.077 Joules at 323 K (50 °C)—to enhance methane production, as shown in Figure 4. The hot water from the data center is circulated through the "soup" in the digester. Furthermore, [20] also suggests the use of available energy in the exhaust stream of the electric generator to drive an adsorption refrigeration cycle to cool the data center. Thus, multiple locally sourced options for power—wind, sun, biogas and natural gas—can power a data center. And high and low grade exergy in waste streams, such as exhaust gases, can be utilized to drive other systems. Indeed, cooling for the data center ought to follow the same principles: a cooling grid made up of local sources. A cooling grid can be made up of ground coupled loops to dissipate heat into the ground, and use outside air (Figure 3), when at appropriate temperature and humidity, to cool the data center.

3.7 Integrated Supply-Demand Management of Sustainable Data Centers

Figure 5 shows the architectural framework for integrated supply-demand management of a data center. The key components of the data center—IT (compute, networking and storage), power and cooling—have five key horizontal elements. The foundational design elements of the data center are lifecycle design using exergy as a measure, and flexible micro-grids of power, cooling and IT building blocks. The micro-grids give the integrated manager the ability to choose between multiple supply side sources of power, multiple supply side sources of cooling, and multiple types of IT hardware and software. The flexibility in power and cooling provides the ability to set power levels of IT systems and the ability to vary cooling (speed of the blowers, etc.). The design flexibility in IT comes from an intelligent scheduling framework, multiple power states and virtualization [15]. On this design foundation, the management layers are sensing and aggregation, knowledge discovery, and policy based control.
Fig. 5. Architectural framework for a Sustainable Data Center (verticals: IT, Power, Cooling; horizontal layers: Policy Based Control; Knowledge Discovery & Visualization; Pervasive Sensing; Scalable, Configurable Resource Micro-grids; Lifetime Based Design)
At runtime, the integrated IT-Facility manager maintains the run-time status of the IT and facility elements of the data center. A lower level facility manager collects physical, environmental and process data from racks, room, chillers, power distribution components, etc. Based on the higher level requirements passed down by the integrated manager, the facility management system creates low-level SLAs for the operation of power and cooling devices, e.g., translating high-level energy efficiency goals into lower-level temperature and utilization levels for facility elements to guarantee SLAs. The integrated manager has a knowledge discovery and visualization module with data analytics for monitoring lifetime reliability, availability and downtimes for preventive maintenance. It has modules that provide insights that are otherwise not apparent at runtime, e.g., temporal data mining of facility historical data. As an example, data mining techniques have been explored for more efficient operation of an ensemble of chillers [23][34]. In [23], operational patterns (or motifs) are mined in historical data pertaining to an ensemble of water and air cooled chillers. These patterns are characterized in terms of their COP_G, thus allowing comparison in terms of their operational energy efficiency. At the control level in Figure 5, a variety of approaches can be taken. As an example, in one approach, the cooling controller maintains dynamic control of the facility infrastructure (including CRACs, UPS, chillers, supply side power, etc.) at levels determined by the facility manager to optimize the coefficient of performance of the ensemble, COP_G, e.g., providing the requisite air flow to the racks to maintain the temperature at the inlet of the racks between 25 °C and 30 °C [11][12][14][21]. While exercising dynamic cooling control through the facility manager, the controller also provides information to the IT manager to consolidate workloads to optimize the performance of the data center, e.g., the racks are ranked based on thermo-fluids efficiency at a given time, and the ranking is used in workload placement [15]. IT equipment not in use is scaled down by the integrated manager [15]. Furthermore, in order to reduce the redundancy in the data center, working in conjunction with the IT and Facility managers, the integrated manager uses virtualization and power scaling as flexible means to mitigate failures, e.g., air conditioner failures [12][14][15]. Based on past work, the total energy consumed—in power and cooling—with these demand management techniques would be half of that of state of the art designs. Coupling the demand side management with the supply side options from the local
power grid and the local cooling grid opens up a completely new approach to integrated supply-demand side management [24], and leads to a "net zero" data center.
4 Applying the Supply-Demand Framework to Other Ecosystems

In the previous sections, the supply-demand framework was applied in devising least lifetime exergy data centers. Sustainable IT can now become IT for sustainability, enabling need-based provisioning of resources—power, water, waste, transportation, etc.—at the scale of cities, and can thus deliver a net positive impact by reducing the consumption and depletion of precious Joules of available energy. Akin to the sustainable data center, the foundation of the "Sustainable City" or "City 2.0" is comprehensive lifecycle design [25][26]. Unlike previous generations, where cities were built predominantly focusing on the cost and functionality desired by inhabitants, sustainable cities will require a comprehensive life-cycle view, where systems are designed not just for operation but for optimality across resource extraction, manufacturing and transport, operation, and end-of-life. The next distinction within sustainable cities will arise in the supply-side resource pool. The inhabitants of sustainable cities are expected to desire on-demand, just-in-time access to resources at affordable costs. Instead of following a centralized production model with large distribution and transmission networks, a more distributed model is proposed: augmentation of existing centralized infrastructure with local resource micro-grids. As shown for the sustainable data center, there is an opportunity to exploit the resources locally available to create local supply side grids made up of multiple local sources, e.g., power generation by photo-voltaic cells on rooftops, and utilization of the exergy available in city waste streams such as municipal waste and sewage, together with natural gas fired turbines with full utilization of the waste stream from the turbine. Similarly, for other key verticals such as water, there is an opportunity to leverage past experience to build water micro-grids using local sources, e.g., harvesting rain water to charge local man-made reservoirs and underground aquifers. Indeed, past examples such as the Amber Fort in Jaipur, Rajasthan, India show such considerations in arid regions [27].
Fig. 6. Architectural Framework for a Sustainable City (verticals: Electricity, Water, Transport, Waste; horizontal design and management layers: Policy Based Control; Knowledge Discovery & Visualization; Pervasive Sensing; Scalable, Configurable Resource Micro-grids; Lifetime Based Design)
As shown in Figure 6, having constructed lifecycle based physical infrastructures consisting of configurable resource micro-grids, the next key element is a pervasive sensing layer. Such a sensing infrastructure can generate data streams pertaining to the current supply and demand of resources emanating from disparate geographical regions, their operational characteristics, performance and sustainability metrics, and the availability of transmission paths between the different micro-grids. The great strides made in building high density, small lifecycle footprint IT storage can enable archival storage of the aggregated data about the state of each micro-grid. Sophisticated data analysis and knowledge discovery methods can be applied to both streaming and archival data to infer trends and patterns, with the goal of transforming the operational state of the systems towards least exergy operations. The data analysis can also enable the construction of models, using advanced statistical and machine learning techniques, for optimization, control and fault detection. Intelligent visualization techniques can provide high-level indicators of the 'health' of each system being monitored. The analysis can enable end of life replacement decisions, e.g., when to replace pumps in a water distribution system to make the most of the lifecycle. Lastly, while a challenging task—given the flexible and configurable resource pools, pervasive sensing, data aggregation and knowledge discovery mechanisms—an opportunity exists to devise a policy-based control system. As an example, given a sustainability policy, upstream and downstream pumps in a water micro-grid can operate to maintain a balance of demand and supply.
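As a toy illustration of the closing example, a minimal sketch of a policy-driven pump controller for a water micro-grid; the flow readings, thresholds and control interface are hypothetical:

def pump_setpoint(demand_lps, reservoir_level, policy_min_level=0.2):
    """Return a pump duty cycle in [0, 1] that tracks demand (litres/s) while
    a sustainability policy protects the reservoir from over-draw."""
    if reservoir_level < policy_min_level:
        return 0.0                       # policy: stop drawing below the reserve
    duty = min(1.0, demand_lps / 100.0)  # hypothetical 100 L/s pump capacity
    return duty * reservoir_level        # throttle as the reservoir depletes

print(pump_setpoint(demand_lps=60.0, reservoir_level=0.8))  # ~0.48
print(pump_setpoint(demand_lps=60.0, reservoir_level=0.1))  # 0.0 (policy floor)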
5 Summary and Conclusions

This paper presented a supply-demand framework to enable sustainable ecosystems, and suggested that a sustainable IT ecosystem, built using such a framework, can enable IT to drive sustainability in other human managed ecosystems. With respect to the IT ecosystem, an architecture for a sustainable data center composed of three key components—IT, power and cooling—and five key design and management elements striped across the verticals was presented (Figure 5). The key elements—lifecycle design, scalable and configurable resource micro-grids, sensing, knowledge discovery and policy based control—enable the supply-demand management of the key components. Next, as shown in Figure 6, an architecture for "sustainable cities", built on the same principles by integrating IT elements across key city verticals such as power, water, waste, transport, etc., was presented. One hopes that cities are able to incorporate the key elements of the proposed architecture along one vertical, and when a sufficient number of verticals has been addressed, a unified city scale architecture can be achieved. The instantiation of the sustainability framework and the architecture for data centers and cities will require a multi-disciplinary workforce. Given the need to develop the human capital, the specific call to action at this venue is to:
• Leverage the past, and return to "old school" core engineering, to build the foundational elements of the supply-demand architecture using lifecycle design and supply side design principles.
• Create a multi-disciplinary curriculum composed of various fields of engineering, e.g., melding computer science and mechanical engineering to scale the supply-demand side management of power, etc.
  o The curriculum also requires social and economic tracks, as sustainability in its broadest definition is defined by the economic, social and environmental spheres—the triple bottom line—and requires us to operate at the intersection of these spheres.
References

1. United Nations Population Division, http://www.un.org/esa/population
2. Aguiar, M., Boutenko, V., Michael, D., Rastogi, V., Subramanian, A., Zhou, Y.: The Internet's New Billion. Boston Consulting Group Report (2010)
3. Chopra, A.: $35 Computer Taps India's Huge Low-Income Market. Christian Science Monitor (2010)
4. Patel, C., Shah, A.: Cost Model for Planning, Development and Operation of a Data Center. HP Laboratories Technical Report HPL-2005-107R1, Palo Alto, CA (2005)
5. Moran, M.J.: Availability Analysis: A Guide to Efficient Energy Use. Prentice-Hall, Englewood Cliffs (1982)
6. Barbir, F.: PEM Fuel Cells, pp. 18-25. Elsevier Academic Press (2005)
7. Rao, S., Parulekar, B.B.: Energy Technology. Khanna Publishers (2005)
8. Sharma, R., Shah, A., Bash, C.E., Patel, C.D., Christian, T.: Water Efficiency Management in Datacenters. In: International Conference on Water Scarcity, Global Changes and Groundwater Management Responses, Irvine, CA (2008)
9. Shah, A., Patel, C., Carey, V.: Exergy-Based Metrics for Environmentally Sustainable Design. In: 4th International Exergy, Energy and Environment Symposium, Sharjah (2009)
10. Hannemann, C., Carey, V., Shah, A., Patel, C.: Life-cycle Exergy Consumption of an Enterprise Server. International Journal of Exergy 7(4), 439-453 (2010)
11. Patel, C.D., Bash, C.E., Sharma, R., Friedrich, R.: Smart Cooling of Data Centers. In: ASME IPACK, Maui, Hawaii (2003)
12. Patel, C., Sharma, R., Bash, C., Beitelmal, A.: Thermal Considerations in Data Center Design. In: IEEE-Itherm, San Diego (2002)
13. Bash, C.E., Patel, C.D., Sharma, R.K.: Dynamic Thermal Management of Air Cooled Data Centers. In: IEEE-Itherm (2006)
14. Beitelmal, A., Patel, C.D.: Thermo-fluids Provisioning of a High Density Data Center. HP Labs External Technical Report HPL-2004-146R1 (2004)
15. Chen, Y., Gmach, D., Hyser, C., Wang, Z., Bash, C., Hoover, C., Singhal, S.: Integrated Management of Application Performance, Power and Cooling in Data Centers. In: 12th IEEE/IFIP Network Operations and Management Symposium (NOMS), Osaka (2010)
16. Patel, C.D., Sharma, R.K., Bash, C.E., Beitelmal, M.: Energy Flow in the Information Technology Stack: Introducing the Coefficient of Performance of the Ensemble. In: ASME International Mechanical Engineering Congress & Exposition, Chicago, Illinois (2006)
17. Shah, A., Patel, C.D.: Exergo-Thermo-Volumes: An Approach for Environmentally Sustainable Thermal Management of Energy Conversion Devices. Journal of Energy Resource Technology, Special Issue on Energy Efficiency, Sources and Sustainability 2 (2010)
18. Shah, A., Patel, C.: Designing Environmentally Sustainable Systems using Exergo-Thermo-Volumes. International Journal of Energy Research 33 (2009)
19. Sharma, R., Bash, C.E., Marwah, M., Christian, T., Patel, C.D.: MICROGRIDS: A New Approach to Supply-Side Design of Data Centers. In: IMECE 2009, Lake Buena Vista, FL (2009)
20. Sharma, R., Christian, T., Arlitt, M., Bash, C., Patel, C.: Design of Farm Waste-driven Supply Side Infrastructure for Data Centers. In: ASME Energy Sustainability (2010)
21. Sharma, R., Bash, C.E., Patel, C.D., Friedrich, R.S., Chase, J.: Balance of Power: Dynamic Thermal Management of Internet Data Centers. IEEE Computer (2003)
22. Marwah, M., Sharma, R.K., Patel, C.D., Shih, R., Bhatia, V., Rajkumar, V., Mekanapurath, M., Velayudhan, S.: Data Analysis, Visualization and Knowledge Discovery in Sustainable Data Centers. In: Compute 2009, Bangalore (2009)
23. Patnaik, D., Marwah, M., Sharma, R., Ramakrishnan, N.: Sustainable Operation and Management of Data Center Chillers using Temporal Data Mining. In: ACM KDD (2009)
24. Gmach, D., Rolia, J., Bash, C., Chen, Y., Christian, T., Shah, A., Sharma, R., Wang, Z.: Capacity Planning and Power Management to Exploit Sustainable Energy. In: 6th International Conference on Network and Service Management, Niagara Falls, Canada (2010)
25. Bash, C., Christian, T., Marwah, M., Patel, C., Shah, A., Sharma, R.: City 2.0: Leveraging Information Technology to Build a New Generation of Cities. Silicon Valley Engineering Council (SVEC) Journal 1(1), 1-6 (2009)
26. Hoover, C.E., Sharma, R., Watson, B., Charles, S.K., Shah, A., Patel, C.D., Marwah, M., Christian, T., Bash, C.E.: Sustainable IT Ecosystems: Enabling Next-Generation Cities. HP Laboratories Technical Report HPL-2010-73 (2010)
27. Water harvesting, http://megphed.gov.in/knowledge/RainwaterHarvest/Chap2.pdf
28. Cunha, I., Almeida, J., Almeida, V., Santos, M.: Self-adaptive Capacity Management for Multi-tier Virtualized Environments. In: 10th IFIP/IEEE IM (2007)
29. Khanna, G., Beaty, K., Kar, G., Kochut, A.: Application Performance Management in Virtualized Server Environments. In: IEEE/IFIP NOMS (2006)
30. Gmach, D., Rolia, J., Cherkasova, L., Kemper, A.: Capacity Management and Demand Prediction for Next Generation Data Centers. In: IEEE ICWS, Salt Lake City (2007)
31. Kephart, J., Chan, H., Das, R., Levine, D., Tesauro, G., Rawson, F., Lefurgy, C.: Coordinating Multiple Autonomic Managers to Achieve Specified Power-Performance Tradeoffs. In: 4th IEEE Int. Conf. on Autonomic Computing (ICAC) (2007)
32. Barroso, L.A., Hölzle, U.: The Case for Energy-Proportional Computing. IEEE Computer 40(12), 33-37 (2007)
33. Andrzejak, A., Arlitt, M., Rolia, J.: Bounding the Resource Savings of Utility Computing Models. HP Laboratories Technical Report HPL-2002-339 (2002)
34. Patnaik, D., Marwah, M., Sharma, R., Ramakrishnan, N.: Data Mining for Modeling Chiller Systems in Data Centers. In: IDA (2010)
Unclouded Vision

Jon Crowcroft¹, Anil Madhavapeddy¹, Malte Schwarzkopf¹, Theodore Hong¹, and Richard Mortier²

¹ Cambridge University Computer Laboratory, 15 JJ Thomson Avenue, Cambridge CB3 0FB, UK
[email protected]
² Horizon Digital Economy Research, University of Nottingham, Triumph Road, Nottingham NG7 2TU, UK
[email protected]

Abstract. Current opinion and debate surrounding the capabilities and use of the Cloud is particularly strident. By contrast, the academic community has long pursued completely decentralised approaches to service provision. In this paper we contrast these two extremes, and propose an architecture, Droplets, that enables a controlled trade-off between the costs and benefits of each. We also provide indications of implementation technologies and three simple sample applications that substantially benefit by exploiting these trade-offs.
1 Introduction
The commercial reality of the Internet and mobile access to it is muddy. Generalising, we have a set of cloud service providers (e.g. Amazon, Facebook, Flickr, Google and Microsoft, to name a representative few), and a set of devices that many – and soon most – people use to access these resources (so-called smartphones, e.g., Blackberry, iPhone, Maemo, Android devices). This combination of hosted services and smart access devices is what many people refer to as "The Cloud" and is what makes it so pervasive. But this situation is not entirely new. Once upon a time, looking as far back as the 1970s, we had "thin clients" such as ultra-thin glass ttys accessing timesharing systems. Subsequently, the notion of thin client has periodically resurfaced in various guises such as the X-Terminal, and Virtual Networked Computing (VNC) [14]. Although the world is not quite the same now as back in those thin client days, it does seem similar in economic terms. But why is it not the same? Why should it not be the same? The short answer is that the end user, whether in their home or on the top of the Clapham Omnibus,¹ has in their pocket a device with vastly more resource than a mainframe of the 1970s by any measure, whether processing speed, storage capacity or network access rate. With this much power at our fingertips, we should be able to do something smarter than simply using our devices as vastly over-specified dumb terminals.
¹ http://en.wikipedia.org/wiki/The_man_on_the_Clapham_omnibus
Meanwhile, the academic reality is that many people have been working at the opposite extreme from this commercial reality, trying to build "ultra-distributed" systems, such as peer-to-peer file sharing, swarms,² ad-hoc mesh networks, and mobile decentralised social networks,³ in complete contrast to the centralisation trends of the commercial world. We choose to coin the name "The Mist" for these latter systems. The defining characteristic of the Mist is that data is dispersed among a multitude of responsible entities (typically, though not exclusively, ordinary users), rather than being under the control of a single monolithic provider. Haggle [17], Mirage [11] and Nimbus [15] are examples of architectures for, respectively, the networking, operating system and storage components of the Mist. The Cloud and the Mist are extreme points in a spectrum, each with its upsides and downsides. Following a discussion of users' incentives (§2), we will expand on the capabilities of two instances of these ends later (§3). We will then describe our proposed architecture (§4) and discuss its implications for three particular application domains (§5), before concluding (§6).
2 User Incentives
For the average user, accustomed to doing plain old storage and computation on their own personal computer or mobile (what we might term “The Puddle”), there are multiple competing incentives pushing in many directions: both towards and away from the Cloud, and towards and away from the Mist (see Figure 1).
Fig. 1. Incentives pushing users toward the centralised Cloud vs. the decentralised Mist (dimensions between the Cloud, the Puddle and the Mist: Sharing, Sync, Data Location, Speed, Security; Cloud-side pulls: social, easy to use, centrally managed, scalable, virus protection; Mist-side pulls: default privacy, no lock-in, physical control, high bandwidth, hacking protection)

² http://bittorrent.com/
³ http://joindiaspora.com, http://peerson.net/
Consider some of the forms of utility a user wants from their personal data:

– Sharing. There is a tension between the desire to share some personal data easily with selected peers (or even publicly), and the need for control over more sensitive information. The social Cloud tends to share data, whereas the decentralised Mist defaults to privacy at the cost of making social sharing more difficult.
– Synchronization. The Cloud provides a centralised naming and storage service to which all other devices can point. As a downside, this service typically incurs an ongoing subscription charge while remaining vulnerable to the provider stopping the service. Mist devices work in a peer-to-peer fashion which avoids provider lock-in, but have to deal with synchronisation complexity.
– Data Location. The Cloud provides a convenient, logically centralised data storage point, but the specific location of any component is hard for the data owner to control.⁴ In contrast, the decentralised Mist permits physical control over where the devices are, but makes it hard to reliably ascertain how robustly stored and backed-up the data is.
– Speed. A user must access a centralised Cloud via the Internet, which limits access speeds and creates high costs for copying large amounts of data. In the Mist, devices are physically local and hence have higher bandwidth. However, Cloud providers can typically scale their service much better than individuals for those occasions when "flash traffic" drives a global audience to popular content.
– Security. A user of the Mist is responsible for keeping their devices updated and can be vulnerable to malware if they fall behind. However, the damage of intrusion is limited only to their devices. In contrast, a Cloud service is usually protected by dedicated staff and systems, but presents a valuable hacking target in which any failures can have widespread consequences, exposing the personal data of millions of users.

These examples demonstrate the clear tension between what users want from services managing their personal data vs. how Cloud providers operate in order to keep the system economically viable. Ideally, the user would like to keep their personal data completely private while still hosting it on the Cloud. On the other hand, the cloud provider needs to recoup hosting costs by, e.g., selling advertising against users' personal data. Even nominally altruistic Mist networks need incentives to keep them going: e.g., in BitTorrent it was recently shown that a large fraction of the published content is driven by profit-making companies rather than altruistic amateur filesharers [2]. Rather than viewing this as a zero-sum conflict between users and providers, we seek to leverage the smart capabilities of our devices to provide happy compromises that can satisfy the needs of all parties. By looking more closely at the true underlying interests of the different sides, we can often discover solutions that achieve seemingly incompatible goals [6].

⁴ http://articles.latimes.com/2010/jul/24/business/la-fi-google-la-20100724
3 The Cloud vs. the Mist
To motivate the Droplets architecture, we first examine the pros and cons of the Cloud and the Mist in more detail.

The Cloud's Benefits: Centralising resources brings several significant benefits, specifically:

– economies of scale,
– reduction in operational complexity, and
– commercial gain.

Perhaps the most significant of these is the offloading of the configuration and management burden traditionally imposed by computer systems of all kinds. Additionally, cloud services are commonly implemented using virtualisation technology, which enables statistical multiplexing and greater efficiencies of scale while still retaining "Chinese walls" that protect users from one another. As cloud services have grown, they have constructed specialised technology dedicated to the task of large data storage and retrieval, for example the new crop of "NoSQL" databases in recent years [10]. Most crucially, centralised cloud services have built up valuable databases of information that did not previously exist. Facebook's "social graph" contains detailed information on the interactions of hundreds of millions of individuals every day, including private messages and media. These databases are not only commercially valuable in themselves, they can also reinforce a monopoly position, as the network effect of having sole access to this data can prevent other entrants from constructing similar databases.

The Cloud's Costs: Why should we trust a cloud provider with our personal data? There are many ways in which they might abuse that trust, data protection legislation notwithstanding. The waters are further muddied by the various commercial terms and conditions to which users initially sign up, but which providers often evolve over time. When was the last time you checked the URL to which your providers post alterations to their terms and conditions, privacy policies, etc.? Even if you object to a change, can you get your data back and move it to another provider, and ensure that they have really deleted it?

The Mist's Benefits: Accessing the Cloud can be financially costly due to the need for constant high-bandwidth access. Using the Mist, we can reduce our access costs because data is stored (cached) locally and need only be uploaded to others selectively and intermittently. We keep control over privacy, choosing exactly what to share with whom and when. We also have better access to our data: we retain control over the interfaces used to access it; we are immune to service disruptions which might affect the network or cloud provider; and we cannot be locked out from our own data by a cloud provider.

The Mist's Costs: Ensuring reliability and availability in a distributed decentralised system is extremely complex. In particular, a new vector for breach of
personal data is introduced: we might leave our fancy device on top of the aforesaid Clapham Omnibus with our data in it! We have to manage the operation of the system ourselves, and need to be connected often enough for others to be able to contact us.

Droplets: A Happy Compromise? In between these two extremes should lie the makings of a design that has all the positives and none of the negatives. In fact, a hint of a way forward is contained in the comments above. If data is encrypted on both our personal computer/device and in the Cloud, then for privacy purposes it doesn't really matter where it is physically stored. However, for performance reasons, we do care. Hence we'd like to carry information of immediate value close to us. We would also like it replicated in multiple places for reliability reasons. We also observe that the vast majority of user-generated content is of interest only within the small social circle of the content's subject/creator/producer/owner, and thus note that interest/popularity in objects tends to be Zipf-distributed.

In the last paragraph, it might be unclear who "we" are: "we" refers to Joe Public, whether sitting at home or on the top of that bus. However, there is another important set of stakeholders: those who provide The Cloud and The Net. These stakeholders need to make money lest all of this fail. The service provider needs revenue to cover operational expenses and to make a profit, but is loath to charge the user directly. Even in the case of the network, ISPs (and 3G providers) are mostly heading toward flat data rates. As well as targeted advertisements and associated "click-through" revenue, service providers also want to carry out data mining to do market research of a more general kind.

Fortunately, recent advances in cryptography and security hint at ways to continue to support the two-sided business models that abound in today's Internet. In the case of advertising, the underlying interest of the Cloud provider is actually the ability to sell targeted ads, not to know everything about its users. Privacy-preserving query techniques can permit ads to be delivered to users matching certain criteria without the provider actually knowing which users they were [8,9,16]. In the case of data mining on the locations or transactions of users, techniques such as differential privacy [5] and k-anonymity [18] can allow providers to make queries on aggregate data without being able to determine information about specific users.

So we propose Droplets, half way between the Cloud and the Mist. Droplets make use of the Mirage operating system [11], Nimbus storage [15] and Haggle networking [17]. They float between the personal device and the cloud, using technologies such as social networks, virtualisation and migration [1,3], and they provide the basic components of a Personal Container [12]. They condense within social networks, where privacy is assured by society, but in the great unwashed Internet, they stay opaque. The techniques referred to above allow the service providers to continue to provide the storage, computation, indexing, search and transmission services that they do today, with the same wide range of business models.
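As a toy illustration of the aggregate-query techniques mentioned above, a minimal sketch of the Laplace mechanism from differential privacy [5]; the count and epsilon are hypothetical, and production systems need careful sensitivity and budget analysis:

import random

def dp_count(true_count, epsilon=0.5, sensitivity=1.0):
    """Release a count with Laplace noise of scale sensitivity/epsilon, so a
    provider can mine aggregates without pinpointing any single user."""
    noise = (random.expovariate(epsilon / sensitivity)
             - random.expovariate(epsilon / sensitivity))  # difference of exponentials is Laplace
    return true_count + noise

# e.g., "how many users matched this ad criterion today?"
print(round(dp_count(1342)))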
4 Droplets
Droplets are units of network-connected computation and storage, designed to migrate around the Internet and personal devices. At a droplet’s core is the Mirage operating system, which compiles high-level language code into specialised targets such as Xen micro-kernels, UNIX binaries, or even Javascript applications. The same Mirage source code can thus run on a cloud computing platform, within a user’s web browser, on a smart-phone, or even as a plugin on a social network’s own servers. As we note in Table 1, there is no single “perfect” location where a Droplet should run all the time, and so this agility of placement is crucial to maximising satisfaction of the users’ needs while minimising their costs and risks.

Table 1. Comparison of different potential Droplets platforms

Platform           | Storage  | Bandwidth | Accessibility | Computation         | Cost      | Reliability
Google AppEngine   | moderate | high      | always on     | limited             | free      | high
VM (e.g., on EC2)  | moderate | high      | always on     | flexible, plentiful | expensive | high
Home Computer      | high     | limited   | variable      | flexible, limited   | cheap     | medium (failure)
Mobile Phone       | low      | low       | variable      | limited             | cheap     | low (loss)
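Read as a placement policy, Table 1 suggests choosing the cheapest platform that satisfies a Droplet’s current requirements. The following Python sketch is our own illustration, not part of the Droplets design; the numeric rankings simply encode the table’s qualitative entries (0 = free/none, up to 3 = high/expensive).

# Our illustrative encoding of Table 1; scores are ordinal stand-ins
# for the table's qualitative entries.
PLATFORMS = {
    "Google AppEngine":  {"storage": 2, "bandwidth": 3, "always_on": True,
                          "computation": 1, "cost": 0, "reliability": 3},
    "VM (e.g., on EC2)": {"storage": 2, "bandwidth": 3, "always_on": True,
                          "computation": 3, "cost": 3, "reliability": 3},
    "Home Computer":     {"storage": 3, "bandwidth": 1, "always_on": False,
                          "computation": 2, "cost": 1, "reliability": 2},
    "Mobile Phone":      {"storage": 1, "bandwidth": 1, "always_on": False,
                          "computation": 1, "cost": 1, "reliability": 1},
}

def choose_platform(min_storage=0, min_bandwidth=0, need_always_on=False):
    """Pick the cheapest platform meeting the stated minimums."""
    candidates = [(a["cost"], name) for name, a in PLATFORMS.items()
                  if a["storage"] >= min_storage
                  and a["bandwidth"] >= min_bandwidth
                  and (a["always_on"] or not need_always_on)]
    if not candidates:
        raise ValueError("no platform satisfies the requirements")
    return min(candidates)[1]

# A popular, frequently served Droplet lands in the free cloud tier:
#   choose_platform(min_bandwidth=3, need_always_on=True)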
Storage in such an environment presents a notable challenge, which we address via the Nimbus system, a distributed, encrypted and delay-tolerant personal data store. Working on the assumption that personal data access follows a Zipf power-law distribution, popular objects can be kept live on relatively expensive but low-latency platforms such as a Cloud virtual machine, while older objects can be archived inexpensively but safely on a storage device at home. Nimbus also provides local attestation in the form of “trust fountains,” which let nodes provide a cryptographic attestation witnessing another node’s presence or ownership of some data. Trust fountains are entirely peer-to-peer, and so proof is established socially (similarly to the use of lawyers or public notaries) rather than via a central authority.

Haggle provides a delay-tolerant networking platform, in which all nodes are mobile and can relay messages via various routes. Even with the use of central “stable” nodes such as the Cloud, outages will still occur due to the scale and dynamics of the Cloud and the Net, as has happened several times to such high-profile and normally robust services as GMail. During such events, the user must not lose all access to their data, and so the Haggle delay-tolerant model is a good fit. It is also interesting to observe that many operations performed by users are quite latency-insensitive, e.g. backups can be performed incrementally, possibly overnight.
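As a toy illustration of the Zipf assumption behind Nimbus’s tiering described above (our sketch, not the actual Nimbus implementation), objects can be ranked by recent access count and split between a low-latency cloud tier and a cheap home archive:

# Toy popularity-based tiering under a Zipf assumption; illustrative only.
def assign_tiers(access_counts, hot_fraction=0.1):
    """Keep the most popular fraction of objects on the low-latency cloud
    tier; archive the long tail on cheap home storage."""
    ranked = sorted(access_counts, key=access_counts.get, reverse=True)
    cutoff = max(1, int(len(ranked) * hot_fraction))
    return {"cloud": ranked[:cutoff], "home_archive": ranked[cutoff:]}

counts = {"photo_2010.jpg": 412, "thesis.pdf": 57, "tax_2004.zip": 1}
# Under a Zipf distribution a small hot set absorbs most accesses:
print(assign_tiers(counts))  # {'cloud': ['photo_2010.jpg'], 'home_archive': [...]}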
4.1 Deployment Model
Droplets are a compromise between the extremely-distributed Mist model and the more centralised Cloud. They store a user’s data and provide a network interface to this data rather than exposing it directly. The nature of this access depends on where the Droplet has condensed.
– Internet droplet. If the Droplet is running exposed to the wild Internet, then the network interfaces are kept low-bandwidth and encrypted by default. To prevent large-scale data leaks, the Droplet rejects operations that would download or erase a large body of data.
– Social network droplet. For hosting data, a droplet can condense directly within a social network, where it provides access to its database to the network, e.g., for data mining, in return for “free” hosting. Rather than allowing raw access, it can be configured to only permit aggregate queries that help populate the provider’s larger database, while still keeping track of its own data.
– Mobile droplet. The Droplet provides high-bandwidth, unfettered access to data. It also regularly checks with any known peers to see whether a remote wipe instruction has been issued, in which case it permanently stops serving data.
– Archiver droplet. Usually runs on a low-power device, e.g., an ARM-based BeagleBoard, accepting streams of data changes but not itself serving data. Its resources are used to securely replicate long-term data, ensuring it remains live, and to alert the user in case of significant degradation.
– Web droplet. A Droplet in a web browser executes as a local Javascript application, where it can provide web bookmarklet services, e.g., trusted password storage. It uses cross-domain AJAX to update a more reliable node with pertinent data changes.
Droplets can thus adapt their external interfaces depending on where they are deployed, allowing negotiation of an acceptable compromise between hosting costs and the desire for privacy.
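As a hypothetical sketch of this adaptation (ours, with all policy names and values invented), a Droplet can consult a per-location policy table before serving any request:

# Hypothetical per-location policy; numbers and names are invented.
POLICY = {
    "internet":       {"max_objects": 10,    "allow_bulk_erase": False,
                       "aggregate_only": False},
    "social_network": {"max_objects": 0,     "allow_bulk_erase": False,
                       "aggregate_only": True},
    "mobile":         {"max_objects": 10**6, "allow_bulk_erase": True,
                       "aggregate_only": False},
}

class Droplet:
    def __init__(self, location):
        self.policy = POLICY[location]

    def serve(self, request):
        p = self.policy
        if p["aggregate_only"] and request.get("kind") != "aggregate":
            raise PermissionError("only aggregate queries at this location")
        if request.get("kind") == "erase_all" and not p["allow_bulk_erase"]:
            raise PermissionError("bulk erase refused at this location")
        if request.get("n_objects", 0) > p["max_objects"]:
            raise PermissionError("request too large for this location")
        return "ok"  # ...fetch, decrypt and return the data here...

# An Internet droplet refuses large-scale downloads:
#   Droplet("internet").serve({"kind": "read", "n_objects": 5000})  # raises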
4.2 Trust Fountains
To explain trust fountains by way of example, consider the following. As part of the instantiation of their Personal Container, Joe Public runs an instance of a Nimbus trust fountain. When creating a droplet from some data stored in his Personal Container, this trust fountain creates a cryptographic attestation proving Joe’s ownership of the data at that time, in the form of a time-dependent hash token. The droplet is then encrypted under this hash token using a fast, medium-strength cipher⁵ and pushed out to the cloud. By selectively publishing the token, Joe can grant access to the published droplet, e.g., allowing data mining access to a provider in exchange for free data storage and hosting. Alternatively, the token might only be shared with a few friends via an ad hoc wireless network in a coffee shop, granting them access only to that specific data at that particular time.

⁵ Strong encryption is not required as the attestations are unique for each droplet publication and breaking one does not grant an attacker access to any other droplets.
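Our reading of the time-dependent hash token can be sketched with standard primitives; this is illustrative Python, not the Nimbus code, and the HMAC construction and field layout are our assumptions.

import hashlib, hmac, time

# The fountain keeps fountain_secret private and logs every token issued.
fountain_secret = b"never-leaves-the-trust-fountain"
issue_log = []

def attest(data):
    """Issue a time-dependent token witnessing ownership of data now."""
    timestamp = int(time.time())
    msg = hashlib.sha256(data).digest() + timestamp.to_bytes(8, "big")
    token = hmac.new(fountain_secret, msg, hashlib.sha256).digest()
    issue_log.append((timestamp, hashlib.sha256(data).hexdigest()))
    return timestamp, token

# The droplet would then be encrypted under a key derived from token and
# pushed to the cloud; publishing token selectively grants access.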
4.3 Backwards Provenance
A secondary purpose of the attestation is to enable “backwards provenance”, i.e., a way to prove ownership. Imagine that Joe publishes a picture of some event which he took using his smartphone while driving past it on that oft-considered bus. A large news agency picks up and uses that picture after Joe publishes it to his Twitter stream using a droplet. The attached attestations then enable the news agency to compensate both the owner and potentially the owner’s access provider, who takes a share in all profits made from Joe’s digital assets in exchange for serving them. Furthermore, Joe is given a tool to counter “hijacking” of his creation even if the access token becomes publicly known: using the cryptographic properties of the token, the issue log of his trust fountain together with his provider’s confirmation of receipt of the attested droplet forms sufficient evidence to prove ownership and take appropriate legal action. Note that Joe Public can also deny ownership if he chooses, as only his trust fountain holds the crucial information necessary to regenerate the hash token and thus prove the attestation’s origin.
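Continuing the hypothetical sketch above (same invented attest() scheme), backwards provenance amounts to the fountain regenerating the token from its issue log:

def prove_ownership(data, claimed_timestamp, published_token):
    """Only the fountain holding fountain_secret can regenerate the token,
    so a match, together with the logged issue entry and the provider's
    receipt, evidences that this fountain attested the data at that time."""
    digest = hashlib.sha256(data).digest()
    msg = digest + claimed_timestamp.to_bytes(8, "big")
    expected = hmac.new(fountain_secret, msg, hashlib.sha256).digest()
    logged = (claimed_timestamp, hashlib.sha256(data).hexdigest()) in issue_log
    return logged and hmac.compare_digest(expected, published_token)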
4.4 Handling 15 Minutes of Fame
Of course, whenever a droplet becomes sufficiently popular to merit condensation into a cloud burst of marketing, then we have the means to support this transition, and we have the motivation and incentives to make sure the right parties are rewarded. In this last paragraph, “we” refers to all stakeholders: users, government and business. It seems clear that the always-on, everywhere-logged, ubiquitously-connected vision will continue to be built, while real people become increasingly concerned about their privacy [4]. Without such privacy features, it is unclear for how much longer the commercial exploitation of personal data will continue to be acceptable to the public; but without such exploitation, it is unclear how service providers can continue to provide the many “free” Internet services on which we have come to rely.
5 Droplications
The Droplet model requires us to rethink how we construct applications – rather than building centralised services, they must now be built according to a distributed, delay-tolerant model. In this section, we discuss some of the early services we are building.
5.1 Digital Yurts
In the youthful days of the Internet, there was a clear division between public data (web homepages, FTP sites, etc.) and private (e-mail, personal documents, etc.). It was common to archive personal e-mail, home directories and so on, and thus to keep a simple history of all our digital activities. The pace of change in recent years has been tremendous, not only in the variety of personal data, but in where that data is held. It has moved out of the confines of desktop computers to data-centres hosted by third parties such as Google, Yahoo and Facebook, who provide “free” hosting of data in return for mining information from millions of users to power advertising platforms. These sites are undeniably useful, and hundreds of millions of users voluntarily surrender private data in order to easily share information with their circle of friends. Hence, the variety of personal data available online is booming – from media (photographs, videos), to editorial (blogging, status updates), to streaming (location, activity).

However, privacy is rapidly rising up the agenda as companies such as Facebook and Google collect vast amounts of data from hundreds of millions of users. Unfortunately, the only alternative that privacy-sensitive users currently have is to delete their online accounts, losing both access to and what little control they have over their online social networks. Often, deletion does not even completely remove their online presence. We have become digital nomads: we have to fetch data from many third-party hosted sites to recover a complete view of our online presence. Why is it so difficult to go back to managing our own information, using our own resources? Can we do so while keeping the “good bits” of existing shared systems, such as ease-of-use, serendipity and aggregation?

Although the immediate desire to regain control of our privacy is a key driver, there are several other longer-term concerns about third parties controlling our data. The incentives of hosting providers are not aligned with the individual: we care about preserving our history over our lifetime, whereas the provider will choose to discard information when it ceases to be useful for advertising. This is where the Droplet model is useful – rather than dumbly storing data, we can also negotiate access to that data with hosting providers via an Internet droplet, and arrive at a compromise between letting them data mine it, versus the costs of hosting it. When the hosting provider loses interest in the older, historical tail of data, the user can deploy an archival droplet to catch the data before it disappears, and archive it for later retrieval.

5.2 Dust Clouds
Dust Clouds [13] is a proposal for the provision of secure anonymous services using extremely lightweight virtual machines hosted in the cloud. As they are lightweight, they can be created and destroyed with very short lifetimes, yet still achieve useful work. However, several tensions exist between the requirements of users and cloud providers in such a system.

For example, cloud providers have a strong requirement for a variety of auditing functions. They need to know who consumed what resources in order to bill, to provision appropriately, to ensure that, e.g., upstream service level agreements with other providers are met, and so on. They would tend to prefer centralisation for reasons already mentioned (efficiency, economy of scale, etc.). By contrast, individual consumers use such a system precisely because it provides anonymity while they do things that they do not wish to have attributed to them, e.g., to avoid arrest. Anonymity in a dust cloud is largely provided by having a rich mixnet of traffic and other resource consumption. Consumers would also prefer diversity, in both geography and provider, to ensure that they are not at the mercy of a single judicial/regulatory system.

Pure cloud approaches fail the users’ requirements by putting too much control in the hands of one (or a very small number of) cloud providers. Pure mist approaches fail the user by being unable to provide the richness of mixing needed for sufficient anonymity: many of the devices in the mist are either insufficiently powerful or insufficiently well-connected to support a large enough number of users’ processes. By taking a droplets approach we obviate both these issues: the lightweight nature of VM provisioning means that it becomes largely infeasible for the cloud provider to track in detail what users are doing, particularly when critical parts of the overall distributed process/communication are hosted on non-cloud infrastructure. Local auditing for payment recovery based on resources used is still possible, but the detailed correlation and reconstruction of an individual process’s behaviour becomes effectively impossible. At the same time, the scalable and generally efficient nature of cloud-hosted resources can be leveraged to ensure that the end result is itself suitably scalable.

5.3 Evaporating Droplets
With Droplets, we also have a way of creating truly ephemeral data items in a partially trusted or untrusted environment, such as a social network, or the whole Internet. Since Droplets have the ability to do computation, they can refuse to serve data if access prerequisites are not met: for example, time-dependent hashes created from a key and a time stamp can be used to control access to data in a Droplet. Periodically, the user’s “trust fountain” will issue new keys, notifying the Droplet that it should now accept only the new key. To “evaporate” data in a Droplet, the trust fountain simply ceases to provide keys for it, thus making users unable to access the Droplet, even if they still have the binary data or even the Droplet itself (assuming, of course, that brute-forcing the hash key is not a worthwhile option). Furthermore, their access is revoked even in a disconnected state, i.e. when the Droplet cannot be notified to accept only the new hash key: since the time stamp and key must both be provided as authentication tokens in order for the Droplet to generate the correct hash, expired keys can no longer be used, as they have to be provided along with their genuine origin time stamp. Additionally, as a more secure approach, the Droplet could even periodically re-encrypt its contents in order to combat brute-forcing.

This subject has been of some research interest recently. Another approach [7] relies on statistical metrics that require increasingly large amounts of data from a DHT to be available to an attacker in order to reconstruct the data, but
is vulnerable to certain Sybil attacks [19]. Droplets, however, are able to ensure that all access to data is completely revoked, even when facing a powerful adversary in a targeted attack. Furthermore, as a side effect of the hash-key based access control, the evaporating Droplet could serve different views, or stages of evaporation, to different requesters depending on the access key they use (or its age). Finally, the “evaporating Droplet” can be made highly accessible from a user perspective by utilising a second Droplet: a Web Droplet (see §4.1) that integrates with a browser can automate the process of requesting access keys from trust fountains and unlocking the evaporating Droplet’s contents.
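Our reading of the evaporation mechanism can be sketched as follows (illustrative Python; the epoch scheme and all constants are our assumptions, and key rotation upon fountain notification is omitted):

import hashlib, time

EPOCH = 3600  # assumed key lifetime in seconds

def access_hash(key, epoch):
    return hashlib.sha256(key + epoch.to_bytes(8, "big")).hexdigest()

class EvaporatingDroplet:
    def __init__(self, current_key):
        self.current_epoch = int(time.time()) // EPOCH
        self.expected = access_hash(current_key, self.current_epoch)

    def request(self, key, key_epoch):
        # Expired keys fail even if the droplet was never notified: the
        # genuine origin epoch must be supplied with the key, and only
        # the current epoch's hash is accepted.
        if key_epoch != self.current_epoch:
            raise PermissionError("key epoch expired")
        if access_hash(key, key_epoch) != self.expected:
            raise PermissionError("bad key")
        return b"...decrypted contents..."

Once the fountain stops issuing a key for the new epoch, no valid (key, epoch) pair exists any longer and the data has effectively evaporated.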
6 Conclusions and Future Work
In this paper, we have discussed the tension between the capabilities of and demands on the Cloud and the Mist. We concluded that both systems are at opposite ends of a spectrum of possibilities and that compromise between providers and users is essential. From this, we derived an architecture for an alternative system, Droplets, that enables control over the trade-offs involved, resulting in systems acceptable to both hosting providers and users. Having realised two of the main components involved in Droplets, Haggle networking and the Mirage operating system, we are now completing realisation of the third, Nimbus storage, as well as building some early “droplications”.
References
1. Clark, C., Fraser, K., Hand, S., Hansen, J.G., Jul, E., Limpach, C., Pratt, I., Warfield, A.: Live migration of virtual machines. In: USENIX Symposium on Networked Systems Design & Implementation (NSDI), pp. 273–286. USENIX Association, Berkeley (2005)
2. Cuevas, R., Kryczka, M., Cuevas, A., Kaune, S., Guerrero, C., Rejaie, R.: Is content publishing in BitTorrent altruistic or profit-driven? (July 2010), http://arxiv.org/abs/1007.2327
3. Cully, B., Lefebvre, G., Meyer, D.T., Karollil, A., Feeley, M.J., Hutchinson, N.C., Warfield, A.: Remus: High availability via asynchronous virtual machine replication. In: USENIX Symposium on Networked Systems Design & Implementation (NSDI). USENIX Association, Berkeley (April 2008)
4. Doctorow, C.: The Things that Make Me Weak and Strange Get Engineered Away. Tor.com (August 2008), http://www.tor.com/stories/2008/08/weak-and-strange
5. Dwork, C.: Differential privacy. In: Bugliesi, M., Preneel, B., Sassone, V., Wegener, I. (eds.) ICALP 2006. LNCS, vol. 4052, pp. 1–12. Springer, Heidelberg (2006)
6. Fisher, R., Patton, B.M., Ury, W.L.: Getting to Yes: Negotiating Agreement Without Giving In. Houghton Mifflin (April 1992)
7. Geambasu, R., Kohno, T., Levy, A., Levy, H.M.: Vanish: Increasing data privacy with self-destructing data. In: Proceedings of the USENIX Security Symposium (August 2009)
8. Guha, S., Reznichenko, A., Tang, K., Haddadi, H., Francis, P.: Serving Ads from localhost for Performance, Privacy, and Profit. In: Proceedings of Hot Topics in Networking (HotNets), New York, NY (October 2009)
9. Haddadi, H., Hui, P., Brown, I.: MobiAd: Private and scalable mobile advertising. In: Proceedings of MobiArch (to appear, 2010)
10. Leavitt, N.: Will NoSQL databases live up to their promise? Computer 43(2), 12–14 (2010), http://dx.doi.org/10.1109/MC.2010.58
11. Madhavapeddy, A., Mortier, R., Sohan, R., Gazagnaire, T., Hand, S., Deegan, T., McAuley, D., Crowcroft, J.: Turning down the LAMP: software specialisation for the cloud. In: HotCloud 2010: Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, p. 11. USENIX Association, Berkeley (2010)
12. Mortier, R., et al.: The Personal Container, or Your Life in Bits. In: Proceedings of Digital Futures (October 2010)
13. Mortier, R., Madhavapeddy, A., Hong, T., Murray, D., Schwarzkopf, M.: Using Dust Clouds to enhance anonymous communication. In: Proceedings of the Eighteenth International Workshop on Security Protocols, IWSP (April 2010)
14. Richardson, T., Stafford-Fraser, Q., Wood, K.R., Hopper, A.: Virtual network computing. IEEE Internet Computing 2(1), 33–38 (1998)
15. Schwarzkopf, M., Hand, S.: Nimbus: Intelligent Personal Storage. Poster at the Microsoft Research Summer School 2010, Cambridge, UK (2010)
16. Shikfa, A., Önen, M., Molva, R.: Privacy in content-based opportunistic networks. In: AINA Workshops, pp. 832–837 (2009)
17. Su, J., Scott, J., Hui, P., Crowcroft, J., De Lara, E., Diot, C., Goel, A., Lim, M.H., Upton, E.: Haggle: Seamless networking for mobile applications. In: Krumm, J., Abowd, G.D., Seneviratne, A., Strang, T. (eds.) UbiComp 2007. LNCS, vol. 4717, pp. 391–408. Springer, Heidelberg (2007)
18. Sweeney, L.: k-anonymity: a model for protecting privacy. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 10(5), 557–570 (2002)
19. Wolchok, S., Hofmann, O., Heninger, N., Felten, E., Halderman, J., Rossbach, C., Waters, B., Witchel, E.: Defeating Vanish with low-cost Sybil attacks against large DHTs. In: Proceedings of the 17th Network and Distributed System Security Symposium (NDSS), pp. 37–51 (2010)
Generating Fast Indulgent Algorithms

Dan Alistarh¹, Seth Gilbert², Rachid Guerraoui¹, and Corentin Travers³

¹ EPFL, Switzerland
² National University of Singapore
³ Université de Bordeaux 1, France
Abstract. Synchronous distributed algorithms are easier to design and prove correct than algorithms that tolerate asynchrony. Yet, in the real world, networks experience asynchrony and other timing anomalies. In this paper, we address the question of how to efficiently transform an algorithm that relies on synchronization into an algorithm that tolerates asynchronous executions. We introduce a transformation technique from synchronous algorithms to indulgent algorithms [1], which induces only a constant overhead in terms of time complexity in well-behaved executions. Our technique is based on a new abstraction we call an asynchrony detector, which the participating processes implement collectively. The resulting transformation works for a large class of colorless tasks, including consensus and set agreement. Interestingly, we also show that our technique is relevant for colored tasks, by applying it to the renaming problem, to obtain the first indulgent renaming algorithm.
1 Introduction

The feasibility and complexity of distributed tasks has been thoroughly studied both in the synchronous and asynchronous models. To better capture the properties of real-world systems, Dwork, Lynch, and Stockmeyer [2] proposed the partially synchronous model, in which the distributed system may alternate between synchronous and asynchronous periods. This line of research inspired the introduction of indulgent algorithms [1], i.e. algorithms that guarantee correctness and efficiency when the system is synchronous, and maintain safety even when the system is asynchronous. Several indulgent algorithms have been designed for specific distributed problems, such as consensus (e.g., [3, 4]). However, designing and proving correctness of such algorithms is usually a difficult task, especially if the algorithm has to provide good performance guarantees.

Contribution. In this paper, we introduce a general transformation technique from synchronous algorithms to indulgent algorithms, which induces only a constant overhead in terms of time complexity. Our technique is based on a new primitive called an asynchrony detector, which identifies periods of asynchrony in a fault-prone asynchronous system. We showcase the resulting transformation to obtain indulgent algorithms for a large class of colorless agreement tasks, including consensus and set agreement. We also apply our transformation to the distinct class of colored tasks, to obtain the first indulgent renaming algorithm.

Detecting Asynchrony. Central to our technique is a new abstraction, called an asynchrony detector, which we design as a distributed service for detecting periods of
asynchrony. The service detects asynchrony both at a local level, by determining whether the view of a process is consistent with a synchronous execution, and at a global level, by determining whether the collective view of a set of processes could have been observed in a synchronous execution. We present an implementation of an asynchrony detector, based on the idea that each process maintains a log of the messages sent and received, which it exchanges with other processes. This creates a view of the system for every process, which we use to detect asynchronous executions.

The Transformation Technique. Based on this abstraction, we introduce a general technique allowing synchronous algorithms to tolerate asynchrony, while maintaining time efficiency in well-behaved executions. The main idea behind the transformation is the following: as long as the asynchrony detector signals a synchronous execution, processes run the synchronous algorithm. If the system is well behaved, then the synchronous algorithm yields an output, on which the process decides. Otherwise, if the detector notices asynchrony, we revert to an existing asynchronous backup algorithm with weaker termination and performance guarantees.

Transforming Agreement Algorithms. We first showcase the technique by transforming algorithms for a large class of agreement tasks, called colorless tasks, which includes consensus and set agreement. Intuitively, a colorless task allows processes to adopt each other’s output values without violating the task specification, while ensuring that every value returned has been proposed by a process. We show that any synchronous algorithm solving a colorless task can be made indulgent at the cost of two rounds of communication. For example, if a synchronous algorithm solves synchronous consensus in t + 1 rounds, where t is the maximum number of crash failures (i.e., the algorithm is time-optimal), then the resulting indulgent algorithm will solve consensus in t + 3 rounds if the system is initially synchronous, or will revert to a safe backup, e.g. Paxos [4, 5] or ASAP [6], otherwise. The crux of the technique is the hand-off procedure: we ensure that, if a process decides using the synchronous algorithm, any other process either decides or adopts a state which is consistent with the decision. In this second case, we show that a process can recover a consistent state by examining the views of other processes. The validity property will ensure that the backup protocol generates a valid output configuration.

Transforming Renaming Algorithms. We also apply our technique to the renaming problem [7], and obtain the first indulgent renaming algorithm. Starting from the synchronous protocol of [8], our protocol renames in a tight namespace of N names and terminates in (log N + 3) rounds in synchronous executions. In asynchronous executions, the protocol renames in a namespace of size N + t.

Roadmap. In Section 2, we present the model, while Section 3 presents an overview of related work. We define asynchrony detectors in Section 4. Section 5 presents the transformation for colorless agreement tasks, while Section 6 applies it to the renaming problem. In Section 7 we discuss our results. Due to space limitations, the proofs of some basic results are omitted, and we present detailed sketches for some of the proofs.
2 Model

We consider an eventually synchronous system with N processes Π = {p1, p2, . . . , pN}, in which t < N/2 processes may fail by crashing. Processes communicate via message-passing in rounds, which we model much as in [3, 9, 10]. In particular, time is divided into rounds, which are synchronized. However, the system is asynchronous, i.e. there is no guarantee that a message sent in a round is also delivered in the same round. We do assume that processes receive at least N − t messages in every round, and that a process always receives its own message in every round. Also, we assume that there exists a global stabilization time GST ≥ 0 after which the system becomes synchronous, i.e. every message is delivered in the same round in which it was sent. We denote such a system by ES(N, t). Although indulgent algorithms are designed to work in this asynchronous setting, they are optimized for the case in which the system is initially synchronous, i.e. when GST = 0. We denote the synchronous message-passing model with t < N failures by S(N, t). In case the system stabilizes at a later point in the execution, i.e. 0 < GST < ∞, the algorithms are still guaranteed to terminate, although they might be less efficient. If the system never stabilizes, i.e. GST = ∞, indulgent algorithms might not terminate, although they always maintain safety.

In the following, we say that an execution is synchronous if every message sent by a correct process in the course of the execution is delivered in the same round in which it was sent. Alternatively, if process pi receives a message m from process pj in round r ≥ 2, then every process received all messages sent by process pj in all rounds r′ < r. The view of a process p at a round r is given by the messages that p received at round r and in all previous rounds. We say that the view of process p is synchronous at round r if there exists an r-round synchronous execution which is indistinguishable from p’s view at round r.
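For intuition, here is a toy simulation of one communication round in ES(N, t) (ours, not part of the paper): every process receives its own message plus at least N − t − 1 others, and before GST the remaining messages may simply be delayed.

import random

def one_round(N, t, inboxes, messages, synchronous):
    """messages[j] is the message process j sends this round."""
    for i in range(N):
        senders = list(range(N))
        if not synchronous:
            others = [j for j in senders if j != i]
            random.shuffle(others)
            # own message plus at least N - t - 1 others are delivered
            senders = [i] + others[:N - t - 1]
        inboxes[i].extend((j, messages[j]) for j in senders)
    return inboxes

N, t = 5, 2
inboxes = [[] for _ in range(N)]
one_round(N, t, inboxes, [f"m{j}" for j in range(N)], synchronous=False)
assert all(len(box) >= N - t for box in inboxes)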
3 Related Work

Starting with seminal work by Dwork, Lynch and Stockmeyer [2], a variety of different models have been introduced to express relaxations of the standard asynchronous model of computation. These include failure detectors [11], round-by-round fault detectors (RRFD) [12], and, more recently, indulgent algorithms [1].

In [3, 9], Guerraoui and Dutta address the complexity of indulgent consensus in the presence of an eventually perfect failure detector. They prove a tight lower bound of t + 2 rounds on the time complexity of the problem, even in synchronous runs, thus proving that there is an inherent price to tolerating asynchronous executions. Our approach is more general than that of this reference, since we transform a whole class of synchronous distributed algorithms, solving various tasks, into their indulgent counterparts. On the other hand, since our technique induces a delay of two rounds of communication over the synchronous algorithm, in the case of consensus, we miss the lower bound of t + 2 rounds by one round.

Recent work studied the complexity of agreement problems, such as consensus [6] and k-set agreement [10], if the system becomes synchronous after an unknown stabilization time GST. In [6], the authors present a consensus algorithm that terminates in
f + 2 rounds after GST, where f is the number of failures in the system. In [10], the authors consider k-set agreement in the same setting, proving that t/k + 4 rounds after GST are enough for k-set agreement, and that at least t/k + 2 rounds are required. The algorithms from these references work with the same time complexity in the indulgent setting, where GST = 0. On the other hand, the transformation in the current paper does not immediately yield algorithms that would work in a window of synchrony. From the point of view of the technique, references [6, 10] also use the idea of “detecting asynchrony” as part of the algorithms, although this technique has been generalized in the current work to address a large family of distributed tasks. Reference [13] considered a setting in which failures stop after GST, in which case 3 rounds of communication are necessary and sufficient. Leader-based, Paxos-like algorithms, e.g. [4, 5], form another class of algorithms that tolerate asynchrony, and can also be seen as indulgent algorithms. A precise definition of colorless tasks is given in [14]. Note that, in this paper, we augment their definition to include the standard validity property (see Section 5).
4 Asynchrony Detectors

An asynchrony detector is a distributed service that detects periods of asynchrony in an asynchronous system that may be initially synchronous. The service returns a YES/NO indication at the end of every round, and has the property that processes which receive YES at some round share a synchronous execution prefix. Next, we make this definition precise.

Definition 1 (Asynchrony Detector). Let d be a positive integer. A d-delay asynchrony detector in ES(N, t) is a distributed service that, in every round r, returns either YES or NO at each process. The detector ensures the following properties.
– (Local detection) If process p receives YES at round r, then there exists an r-round synchronous execution in which p has the same view as its current view at round r.
– (Global detection) For all processes that receive YES in round r, there exists an (r − d)-round synchronous execution prefix S[1, 2, . . . , r − d] that is indistinguishable from their views at the end of round r − d.
– (Non-triviality) The detector never returns NO during a synchronous execution.

The local detection property ensures that, if the detector returns YES, then there exists a synchronous execution consistent with the process’ view. On the other hand, the global detection property ensures that, for processes that receive YES from the detector, the (r − d)-round execution prefix was “synchronous enough”, i.e. there exists a synchronous execution consistent with what these processes perceived during the prefix. The non-triviality property ensures that there are no false positives.
4.1 Implementing an Asynchrony Detector
Next, we present an implementation of a 2-delay asynchrony detector in ES(N, t), which we call AD(2). The pseudocode is presented in Figure 1.
The main idea behind the detector, implemented in the process procedure, is that processes maintain a detailed view of the state of the system by aggregating all messages received in every round. For each round, each process maintains an Active set of processes, i.e. processes that sent at least one message in the round; all other processes are in the Failed set for that round (lines 2–4). Whenever a process receives a new message, it merges the contents of the Active and Failed sets of the sender with its own (lines 8–9). Asynchrony is detected by checking if there exists any process that is in the Active set in some round r, while being in the Failed set in some previous round r′ < r (lines 10–12). In the next round, each process sends its updated view of the system together with a synch flag, which is set to false if asynchrony was detected.
1  procedure detector()i
2    msgi ← ⊥; synchi ← true;
3    Activei ← [ ]; Failedi ← [ ];
4    for each round Rc do
5      send(msgi)
6      msgSeti ← receive()
7      (synchi, msgi) ← process(msgSeti, Rc)
8      if synchi = true then output YES else output NO

1  procedure process(msgSeti, r)i
2    if synchi = true then
3      Activei[Rc] ← processes from which pi receives a message in round Rc
4      Failedi[Rc] ← processes from which pi did not receive a message in round Rc
5    if there exists pj ∈ msgSeti with synchj = false then synchi ← false
6    for every msgj ∈ msgSeti do
7      for round r from 1 to Rc do
8        Activei[r] ← msgj.Activej[r] ∪ Activei[r]
9        Failedi[r] ← msgj.Failedj[r] ∪ Failedi[r]
10   for round r from 1 to Rc − 1 do
11     for round k from r + 1 to Rc do
12       if Activei[k] ∩ Failedi[r] ≠ ∅ then synchi ← false
13   if synchi = true then
14     msgi ← (synchi, (Activei[r])r∈[1,Rc], (Failedi[r])r∈[1,Rc])
15   else msgi ← (synchi, ⊥, ⊥)
16   return (synchi, msgi)

Fig. 1. The AD(2) asynchrony detection protocol
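For intuition, the merge-and-check core of the process() procedure (lines 6–12 of Figure 1) can be phrased in a few lines of Python; this is our transliteration, not code from the paper.

def merge_and_check(active, failed, msg_set, current_round):
    """active[r] and failed[r] are sets of process ids, r = 1..current_round;
    each msg in msg_set carries the sender's own such dictionaries."""
    synch = True
    for msg in msg_set:                       # lines 6-9: merge histories
        for r in range(1, current_round + 1):
            active[r] |= msg["active"].get(r, set())
            failed[r] |= msg["failed"].get(r, set())
    for r in range(1, current_round):         # lines 10-12: find violations
        for k in range(r + 1, current_round + 1):
            if active[k] & failed[r]:         # active later, failed earlier
                synch = False
    return synch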
4.2 Proof of Correctness

In this section, we prove that the protocol presented in Section 4.1 satisfies the definition of an asynchrony detector. First, to see that the local detection condition is satisfied, notice that the contents of the Active and Failed sets at each process p can be used to construct a synchronous execution which is coherent with process p’s view. In the following, we focus on the global detection property. We show that, for a fixed round r > 0, given a set of processes P ⊆ Π that receive YES from AD(2) at the end of
round r + 2, there exists an r-round synchronous execution S[1, r] such that the views of processes in P at the end of round r are consistent with S[1, r].

We begin by proving that if two processes receive YES from the asynchrony detector in round r + 2, then they must have received each other’s round r + 1 messages, either directly, or through a relay. Note that, because of the round structure, a process’s round r + 1 message only contains information that it has acquired up to round r. In the following, we will use a superscript notation to denote the round at which the local variables are seen. For example, Active_q^{r+2}[r + 1] denotes the set Active[r + 1] at process q, as seen from the end of round r + 2.

Lemma 1. Let p and q be two processes that receive YES from AD(2) at the end of round r + 2. Then p ∈ Active_q^{r+2}[r + 1] and q ∈ Active_p^{r+2}[r + 1].

Proof. We prove that p ∈ Active_q^{r+2}[r + 1]; the proof of the second statement is symmetric. Assume, for the sake of contradiction, that p ∉ Active_q^{r+2}[r + 1]. Then, by lines 8–9 of the process() procedure, none of the processes that send a message to q in round r + 2 received a message from p in round r + 1. However, this set of processes contains at least N − t > t elements, and therefore, in round r + 2, process p receives a message from at least one process that did not receive a message from p in round r + 1. Therefore p ∈ Active_p^{r+2}[r + 2] ∩ Failed_p^{r+2}[r + 1] (recall that p receives its own message in every round). Following the process() procedure for p, we obtain that synch_p = false in round r + 2, which means that process p receives NO from AD(2) in round r + 2, a contradiction.

Lemma 2. Let p and q be two processes in P. Then, for all rounds k < l ≤ r, Active_p^r[l] ∩ Failed_q^r[k] = ∅ and Active_q^r[l] ∩ Failed_p^r[k] = ∅, where the Active and Failed sets are seen from the end of round r.

Proof. We prove that, given r ≥ l > k, Active_p^r[l] ∩ Failed_q^r[k] = ∅. Assume, for the sake of contradiction, that there exist rounds k < l ≤ r and a processor s such that s ∈ Active_p^r[l] ∩ Failed_q^r[k]. Lemma 1 ensures that p and q communicate in round r + 1, therefore it follows that s ∈ Failed_p^{r+2}[k]. This means that s ∈ Active_p^{r+2}[l] ∩ Failed_p^{r+2}[k], for k < l, therefore p cannot receive YES in round r + 2, a contradiction.

The next lemma provides a sufficient condition for a set of processes to share a synchronous execution up to the end of some round R. The proof follows from the observation that the required synchronous execution E′ can be constructed by exactly following the contents of the Active and Failed sets of the processes at every round in the execution.

Lemma 3. Let E be an R-round execution in ES(N, t), and P be a set of processes in Π such that, at the end of round R, the following two properties are satisfied:
1. For any p and q in P, and any round r ∈ {1, 2, . . . , R − 1}, Active_p^R[r + 1] ∩ Failed_q^R[r] = ∅.
2. |⋂_{p∈P} Active_p^R[R]| ≥ N − t.
Then there exists a synchronous execution E′ which is indistinguishable from the views of processes in P at the end of round R.
Finally, we prove that if a set of processes P receive YES from AD(2) at the end of some round R + 2, then there exists a synchronous execution consistent with their views at the end of round R, for any R > 0, i.e. that AD(2) is indeed a 2-delay asynchrony detector. The proof follows from the previous results.

Lemma 4. Let R > 0 be a round and P be a set of processes that receive YES from AD(2) at the end of round R + 2. Then there exists a synchronous execution consistent with their views at the end of round R.
5 Generating Indulgent Algorithms for Colorless Tasks

5.1 Task Definition

In the following, a task is a tuple (I, O, Δ), where I is the set of vectors of input values, O is a set of vectors of output values, and Δ is a total relation from I to O. A solution to a task, given an input vector I, yields an output vector O ∈ O such that O ∈ Δ(I). Intuitively, a colorless task is a terminating task in which any process can adopt any input or output value of any other process, without violating the task specification, and in which any (decided) output value is a (proposed) input value. We also assume that the output values have to verify a predicate P, such as agreement or k-agreement. For example, in the case of consensus, the predicate P states that all output values should be equal. Let val(V) be the set of values in a vector V. We precisely define this family of tasks as follows. A colorless task satisfies the following properties: (1) Termination: every correct process eventually outputs; (2) Validity: for every O ∈ Δ(I), val(O) ⊆ val(I); (3) The Colorless property: if O ∈ Δ(I), then for every I′ with val(I′) ⊆ val(I): I′ ∈ I and Δ(I′) ⊆ Δ(I); also, for every O′ with val(O′) ⊆ val(O): O′ ∈ O and O′ ∈ Δ(I). Finally, we assume that the outputs satisfy a generic property, the (4) Output Predicate: every O ∈ O satisfies a given predicate P. Consensus and k-set agreement are canonical examples of colorless tasks.

5.2 Transformation Description

We present an emulation technique that generates an indulgent protocol in ES(N, t) out of any protocol in S(N, t) solving a given colorless task T, at the cost of two communication rounds. If the system is not synchronous, the generated protocol will run a given backup protocol Backup which ensures safety, even in asynchronous executions. For example, if a protocol solves synchronous consensus in t + 1 rounds (i.e. it is optimal), then the resulting protocol will solve consensus in t + 3 rounds if the system is initially synchronous. Otherwise, the protocol reverts to a safe backup, e.g. Paxos [5], or ASAP [6].

We fix a protocol A solving a colorless task in the synchronous model S(N, t). The running time of the synchronous protocol is known to be of R rounds. In the first phase of the transformation, each process p runs the AD(2) asynchrony detector in parallel with the protocol A, as long as the detector returns a YES indication at every round. Note that the protocol’s messages are included in the detector’s messages (or vice versa), preventing the possibility that the protocol encounters asynchronous message
deliveries without the detector noticing. If the detector returns NO during this phase, the process stops running the synchronous protocol, and continues running only AD(2). If the process receives YES at the end of round R + 2, then it returns the decision value that A produced at the end of round R.¹ On the other hand, if the process receives NO from AD(2) in round R + 2, i.e. asynchrony was detected, then the process will run the second phase of the transformation.

¹ Since AD(2) returns YES at process p at the end of round R + 2, it follows that it must have returned YES at p at the end of round R as well. The local detection property of the asynchrony detector implies that the protocol A has to return a decision value, since it executes a synchronous execution.

More precisely, in phase two, the process will run a backup agreement protocol that tolerates periods of asynchrony (for example, the K4 protocol [10], if the task is k-set agreement). The main question is how to initialize the backup protocol, given that some of the processes may have already decided in phase one, without breaking the properties of the task. We solve this problem as follows. Let Supp (the support set) be the set of processes that received YES from AD(2) in round R + 1 and from which process p receives messages in round R + 2. There are two cases. (1) If the set Supp is empty, then the process starts running the backup protocol using its initial proposal value. (2) If the set Supp is non-empty, then the process obtains a new proposal value as follows. It picks one process from Supp and adopts its state at the end of round R − 1. Then, in round R, it simulates receiving the messages in the set ⋃_{j∈Supp} msgSet_j^{R+1}[R], where we maintain the notation used in Section 4. We will show that in this case, the simulated protocol A will necessarily return a decision value at the end of simulated round R. The process p then runs the backup protocol, using as initial value the decision value resulting from the simulation of the first R rounds.
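The hand-off can be summarised in Python-flavoured pseudocode (our sketch; the process, state and message objects are assumed interfaces, not the paper’s actual data structures):

def phase_two_input(p, round_R2_msgs, initial_proposal):
    """Choose the value with which process p enters the Backup protocol."""
    # Supp: senders of round-(R+2) messages that got YES from AD(2) at R+1
    supp = [m for m in round_R2_msgs if m.synch_at_R_plus_1]
    if not supp:
        return initial_proposal           # case (1): no support
    # case (2): adopt a supporter's state at the end of round R-1, then
    # simulate round R with the union of the supporters' views of round R
    state = supp[0].state_at_R_minus_1
    simulated_round_R = set().union(*(m.msg_set_at_R for m in supp))
    return p.protocol_A.run_round(state, simulated_round_R)  # A decides here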
5.3 Proof of Correctness
We now prove that the resulting protocol verifies the task specification. The proofs of termination, validity, and the colorless property follow from the properties of the A and Backup protocols, therefore we will concentrate on proving that the resulting protocol also satisfies the output predicate P.

Theorem 1 (Output Predicate). The indulgent transformation protocol satisfies the output predicate P associated to the task T.

Assume for the sake of contradiction that there exists an execution in which the output of the transformation breaks the output predicate P. If all process decisions are made at the end of round R + 2, then, by the global detection property of AD(2), there exists a synchronous execution of A in which the same outputs are decided, which break the predicate P, a contradiction. If all decisions occur after round R + 2, first notice that, by the validity and colorless properties, the inputs processes propose to the Backup protocol are always valid inputs for the task. It follows that, since all decisions are output by Backup, there exists an execution of the Backup protocol in which the predicate P is broken, again a contradiction. Therefore, at least one process outputs at the end of round R + 2, and some processes decide at some later round. We prove the following claim.
Claim. If a process decides at the end of round R + 2, then (i) all correct processes will have a non-empty support set Supp and (ii) there exists an R-round synchronous execution consistent with the views that all correct processes adopt at the end of round R + 2.

Proof (Sketch). First, let d be a process that decides at the end of round R + 2. Then, in round R + 2, process d received a message from at least N − t processes that got YES from AD(2) at the end of round R + 1. Since N ≥ 2t + 1, it follows that every process that has not crashed by the end of round R + 2 will have received at least one message from a process that has received YES from AD(2) in round R + 1; therefore, all non-crashed processes that get NO from AD(2) in round R + 2 will execute case 2, which ensures the first claim.

Let Q = {q1, . . . , qℓ} be the non-crashed processes at the end of round R + 2. By the above claim, we know that these processes either decide or simulate an execution. We prove that all views simulated in this round are consistent with a synchronous execution up to the end of round R, in the sense of Lemma 3. To prove that the intersection of their simulated views in round R contains at least (N − t) messages, notice that the processes from which process d receives messages in round R + 2 are necessarily in this intersection, since otherwise process d would receive NO in round R + 2. To prove the first condition of Lemma 3, note that process d’s view of round R, i.e. the set msgSet_d^{R+2}[R], contains all messages simulated as received in round R by the processes that receive NO in round R + 2. Since N − t > t, every process that receives NO in round R + 2 from the detector also receives a message supporting d’s decision in round R + 2; process d receives the same message and does not notice any asynchrony. Therefore, we can apply Lemma 3 to obtain that there exists a synchronous execution of the protocol A in which the processes in Q obtain the same decision values as the values obtained through the simulation or decision at the end of round R + 2.

Returning to the proof of the output predicate, recall that there exists at least one process d which outputs at the end of round R + 2. From the above Claim, it follows that all non-crashed processes simulate synchronous views of the first R rounds. Therefore all non-crashed processes will receive an output from the synchronous protocol A. Moreover, these synchronous views of the processes are consistent with a synchronous execution, therefore the set of outputs received by non-crashed processes verifies the predicate P. Hence all the inputs that the processes propose to the Backup protocol verify the predicate P. Since Backup respects validity, it follows that the outputs of Backup will also verify the predicate P.
6 A Protocol for Strong Indulgent Renaming

6.1 Protocol Description

In this section, we present an emulation technique that transforms any synchronous renaming protocol into an indulgent renaming protocol. For simplicity, we will assume that the synchronous renaming protocol is the one by Herlihy et al. [8], which is time-optimal, terminating in log N + 1 synchronous rounds. The resulting indulgent protocol will rename in N names using log N + 3 rounds of communication if the system is initially synchronous, and will eventually rename into N + t names if the system is
asynchronous, by safely reverting to a backup constituted by the asynchronous renaming algorithm of Attiya et al. [7]. Again, the protocol is structured into two phases.

First Phase. During the first ⌈log N⌉ + 1 rounds, processes run the AD(2) asynchrony detector in parallel with the synchronous renaming algorithm. Note that the protocol’s messages are included in the detector’s messages. If the detector returns NO at one of these rounds, then the process stops running the synchronous algorithm, and continues only with the detector. If, at the end of round ⌈log N⌉ + 1, the process receives YES from AD(2), then it also receives a name namei as the decision value of the synchronous protocol.

Second Phase. At the end of round ⌈log N⌉ + 1, the processes start the asynchronous renaming algorithm of [7]. More precisely, each process builds a vector V with a single entry, which contains the tuple ⟨vi, namei, Ji, bi, ri⟩, where vi is the process’s initial value. The entry namei is the proposed name, which is either the name returned by the synchronous renaming algorithm, if the process received YES from the detector, or ⊥, otherwise. The entry Ji counts the number of times the process proposed a name; it is 1 if the process has received YES from the detector, and 0 otherwise. The entry bi is the decision bit, which is initially 0. Finally, ri is the round number² when the entry was last updated, which is in this case log N + 1.

² This entry in the vector is implied in the original version of the algorithm [7].

The processes broadcast their vectors V for the next two rounds, while continuing to run the asynchrony detector in parallel. The contents of the vector V are updated at every round, as follows: if a vector V containing new entries is received, the process adds all the new entries to its vector; if there are conflicting entries corresponding to the same process, the tie is broken using the round number ri. If, at the end of round log N + 3, the process receives YES from the detector, then it decides on namei. Otherwise, it continues running the AttiyaRenaming algorithm until decision is possible.

6.2 Proof of Correctness

The first step in the proof of correctness of the transformation provides some properties of the asynchronous renaming algorithm of [7]. More precisely, the first Lemma states that the asynchronous renaming algorithm remains correct even though processes propose names initially, that is, at the beginning of round log N + 2. The proof follows from an examination of the protocol and proofs from [7].

Lemma 5. The asynchronous renaming protocol of [7] ensures termination, name uniqueness, and a namespace bound of N + t, even if processes propose names at the beginning of the first round.

The previous Lemma ensures that the transformation guarantees termination, i.e. that every correct process eventually returns a name. The non-triviality property of the asynchrony detector ensures that the resulting algorithm will terminate in log N + 3 rounds in any synchronous run. In the following, we will concentrate on the uniqueness of the names and on the bounds on the resulting namespace. We start by proving that the protocol does not generate duplicate names.
Lemma 6 (Uniqueness). Given any two names ni, nj returned by processes in an execution, we have that ni ≠ nj.

Proof (Sketch). Assume for the sake of contradiction that there exists a run in which two processes pi, pj decide on the same name n0. First, we consider the case in which both decisions occurred at round log N + 3, the first round at which a process can decide using our emulation. Notice that, if a decision is made, the processes necessarily decide on the decision value of the simulated synchronous protocol.³ By the global detection property of AD(2) it then follows that there exists a synchronous execution of the synchronous renaming protocol in which two distinct processes return the same value, contradicting the correctness of the protocol. Similarly, we can show that if both decisions occur after round log N + 3, we can reduce the correctness of the transformation to the correctness of the asynchronous protocol.

Therefore, the remaining case is that in which one of the decisions occurs at round log N + 3, and the other decision occurs at a later round, i.e. it is a decision made by the asynchronous renaming protocol. In this case, let pi be the process that decides on n0 at the end of round log N + 3. This implies that process pi received YES at the end of round log N + 3 from AD(2). Therefore, since pi sees a synchronous view, there exists a set S of at least N − t processes that received pi’s message reserving name n0 in round log N + 2. It then follows that each non-crashed process receives a message from a process in the set S in round log N + 3. By the structure of the protocol, we obtain that each process has the entry ⟨vi, n0, 1, 0, log N + 1⟩ in its V vector at the end of round log N + 3. It follows from the structure of the asynchronous protocol that no process other than pi will ever decide on the name n0 at any later round, which concludes the proof of the Lemma.

³ A simple analysis of the asynchronous renaming protocol shows that a process cannot decide after two rounds of communication, unless it had already proposed a value at the beginning of the first round.

Finally, we prove that the transformation ensures the following guarantees on the size of the namespace.

Lemma 7 (Namespace Size). The transformation ensures the following properties: (1) In synchronous executions, the resulting algorithm will rename in a namespace of at most N names. (2) In any execution, the resulting algorithm will rename in a namespace of at most N + t names.

Proof (Sketch). For the proof of the first property, notice that, in a synchronous execution, any output combination for the transformation is an output combination for the synchronous renaming protocol. For the second property, let ℓ ≥ 0 be the number of names decided on at the end of round log N + 3 in a run of the protocol. These names are clearly between 1 and N. Lemma 6 guarantees that none of these names is decided on in the rest of the execution. On the other hand, Lemma 5 and the namespace bound of N + t for the asynchronous protocol ensure that the asynchronous protocol decides exclusively on names between 1 and N + t, which concludes the proof of the claim.
7 Conclusions and Future Work

In this paper, we have introduced a general transformation technique from synchronous algorithms to indulgent algorithms, and applied it to obtain indulgent solutions for a
large class of distributed tasks, including consensus, set agreement and renaming. Our results suggest that, even though it is generally hard to design asynchronous algorithms in fault-prone systems, one can obtain efficient algorithms that tolerate asynchronous executions starting from synchronous algorithms. In terms of future work, we first envision generalizing our technique to generate algorithms that also work in a window of synchrony, and investigating its limitations in terms of time and communication complexity. Another interesting research direction would be to analyze whether similar techniques exist in the case of Byzantine failures; in particular, whether, starting from a synchronous fault-tolerant algorithm, one can obtain a Byzantine fault-tolerant algorithm tolerating asynchronous executions.

Acknowledgements. The authors would like to thank Prof. Hagit Attiya and Nikola Knežević for their help on previous drafts of this paper, and the anonymous reviewers for their useful feedback.
References
1. Guerraoui, R.: Indulgent algorithms. In: PODC 2000, pp. 289–297. ACM, New York (July 2000)
2. Dwork, C., Lynch, N.A., Stockmeyer, L.: Consensus in the presence of partial synchrony. J. ACM 35, 288–323 (1988)
3. Dutta, P., Guerraoui, R.: The inherent price of indulgence. In: PODC 2002: Proceedings of the Annual ACM Symposium on Principles of Distributed Computing, pp. 88–97 (2002)
4. Lamport, L.: Fast Paxos. Distributed Computing 19(2), 79–103 (2006)
5. Lamport, L.: Generalized consensus and Paxos. Microsoft Research Technical Report MSR-TR-2005-33 (March 2005)
6. Alistarh, D., Gilbert, S., Guerraoui, R., Travers, C.: How to solve consensus in the smallest window of synchrony. In: Taubenfeld, G. (ed.) DISC 2008. LNCS, vol. 5218, pp. 32–46. Springer, Heidelberg (2008)
7. Attiya, H., Bar-Noy, A., Dolev, D., Peleg, D., Reischuk, R.: Renaming in an asynchronous environment. Journal of the ACM 37(3), 524–548 (1990)
8. Chaudhuri, S., Herlihy, M., Tuttle, M.R.: Wait-free implementations in message-passing systems. Theor. Comput. Sci. 220(1), 211–245 (1999)
9. Dutta, P., Guerraoui, R.: The inherent price of indulgence. Distributed Computing 18(1), 85–98 (2005)
10. Alistarh, D., Gilbert, S., Guerraoui, R., Travers, C.: Of choices, failures and asynchrony: The many faces of set agreement. In: Dong, Y., Du, D.-Z., Ibarra, O. (eds.) ISAAC 2009. LNCS, vol. 5878. Springer, Heidelberg (2009)
11. Chandra, T.D., Toueg, S.: Unreliable failure detectors for asynchronous systems (preliminary version). In: ACM Symposium on Principles of Distributed Computing, pp. 325–340 (August 1991)
12. Gafni, E.: Round-by-round fault detectors (extended abstract): Unifying synchrony and asynchrony. In: Proceedings of the 17th Symposium on Principles of Distributed Computing (1998)
13. Dutta, P., Guerraoui, R., Keidar, I.: The overhead of consensus failure recovery. Distributed Computing 19(5-6), 373–386 (2007)
14. Delporte-Gallet, C., Fauconnier, H., Guerraoui, R., Tielmann, A.: The disagreement power of an adversary. In: Keidar, I. (ed.) DISC 2009. LNCS, vol. 5805, pp. 8–21. Springer, Heidelberg (2009)
An Efficient Decentralized Algorithm for the Distributed Trigger Counting Problem
Venkatesan T. Chakaravarthy¹, Anamitra R. Choudhury¹, Vijay K. Garg², and Yogish Sabharwal¹
¹ IBM Research - India, New Delhi
{vechakra,anamchou,ysabharwal}@in.ibm.com
² University of Texas at Austin
[email protected] Abstract. Consider a distributed system with n processors, in which each processor receives some triggers from an external source. The distributed trigger counting problem is to raise an alert and report to a user when the number of triggers received by the system reaches w, where w is a user-specified input. The problem has applications in monitoring, global snapshots, synchronizers and other distributed settings. The main result of the paper is a decentralized and randomized algorithm with expected message complexity O(n log n log w). Moreover, every processor in this algorithm receives no more than O(log n log w) messages with high probability. All the earlier algorithms for this problem have maximum message load of Ω(n log w).
1 Introduction
In this paper, we study the distributed trigger counting (DTC) problem. Consider a distributed system with n processors, in which each processor receives some triggers from an external source. The distributed trigger counting problem is to raise an alert and report to a user when the number of triggers received by the system reaches w, where w is a user-specified input. We note that w may be much larger than n. The sequence of processors receiving the w triggers is not known a priori to the system. Moreover, the number of triggers received by each processor is also not known. We are interested in designing distributed algorithms for the DTC problem that are communication efficient and are also decentralized. The DTC problem arises in applications such as distributed monitoring and global snapshots. Monitoring is an important issue in networked systems such as sensor networks and data networks. Sensor networks are typically employed to monitor physical or environmental conditions such as traffic volume, wildlife behavior, troop movements and atmospheric conditions, among others. For example, in traffic management, one may be interested in raising an alarm when the number of vehicles on a highway exceeds a certain threshold. Similarly, one may wish to monitor a wildlife region for the sightings of a particular species, and raise an alert when the number crosses a threshold. In the case of data networks,
example applications are monitoring the volume of traffic or the number of remote logins. See, for example, [7] for a discussion of applications of distributed monitoring. In the context of global snapshots (for example, checkpointing), a distributed system must record all the in-transit messages in order to declare the snapshot to be valid. Garg et al. [4] showed that the problem of determining whether all the in-transit messages have been received can be reduced to the DTC problem (they call this the distributed message counting problem). In the context of synchronizers [1], a distributed system is required to generate the next pulse when all the messages generated in the current pulse have been delivered. Any message in the current pulse can be viewed as a trigger of the DTC problem. Our goal is to design a distributed algorithm for the DTC problem that is communication efficient and decentralized. We use the following two natural parameters that measure these two important aspects.
– The message complexity, i.e., the number of messages exchanged between the processors.
– The MaxRcvLoad, i.e., the maximum number of messages received by any processor in the system.
Garg et al. [4] studied the DTC problem for a general distributed system. They presented two algorithms: a centralized algorithm and a tree-based algorithm. The centralized algorithm has message complexity O(n log w). However, the MaxRcvLoad of this algorithm can be as high as Ω(n log w). The tree-based algorithm has message complexity O(n log n log w). This algorithm is more decentralized in a heuristic sense, but its MaxRcvLoad can be as high as O(n log n log w) in the worst case. They also proved a lower bound on the message complexity: any deterministic algorithm for the DTC problem must have message complexity Ω(n log(w/n)). So, the message complexity of the centralized algorithm is asymptotically optimal. However, this algorithm has MaxRcvLoad as high as the message complexity. In this paper, we consider a general distributed system where any processor can communicate with any other processor and all the processors are capable of performing basic computations. We assume an asynchronous model of computation and messages. We assume that the messages are guaranteed to be delivered, but there is no fixed upper bound on the message arrival time. Also, messages are not corrupted or spuriously introduced. This setting is common in data networks. We also assume that the processors are fault-free and do not fail. Our main result is a decentralized randomized algorithm called LayeredRand that is efficient in terms of both the message complexity and the MaxRcvLoad. Its message complexity is O(n log n log w). Moreover, with high probability, its MaxRcvLoad is O(log n log w). The message complexity of our algorithm is the same as that of the tree-based algorithm of Garg et al. [4]. However, the MaxRcvLoad of our algorithm is significantly better than that of both their tree-based and centralized algorithms. It is important to minimize the MaxRcvLoad in many applications. For example, in sensor networks where the message processing may
Algorithm         Message Complexity    MaxRcvLoad
Tree-based [4]    O(n log n log w)      O(n log n log w)
Centralized [4]   O(n log w)            O(n log w)
LayeredRand       O(n log n log w)      O(log n log w)
Fig. 1. Summary of DTC Algorithms
consume limited power available at the node, a high MaxRcvLoad may reduce the lifetime of a node. Another important aspect of our algorithm is its simplicity. In particular, our algorithm is much simpler than both of the algorithms of Garg et al. A comparison of our algorithm with the earlier results is summarized in Fig. 1. Designing an algorithm with message complexity O(n log w) and MaxRcvLoad O(log w) remains a challenging open problem. Our main result is formally stated next. For 1 ≤ i ≤ w, the external source delivers the i-th trigger to some processor xi. We call the sequence x1, x2, ..., xw a trigger pattern.
Theorem 1. Fix any trigger pattern. The message complexity of the LayeredRand algorithm is O(n log n log w). Furthermore, there exist constants c and d ≥ 1 such that
Pr[MaxRcvLoad ≥ c log n log w] ≤ 1/n^d.
The above bounds hold for any trigger pattern, even one fixed by an adversary.
Related work. Most prior work (e.g., [3,7,6]) primarily considers the DTC problem in a centralized setting where one of the processors acts as a master and coordinates the system, and the other processors act as slaves. The slaves can communicate only with the master (they cannot communicate among themselves). Such a scenario applies where a communication network linking the slaves does not exist or the slaves have only limited computational power. Prior work addresses various issues arising in such a setup, such as message complexity. It also considers variations and generalizations of the DTC problem. One such variation is approximate threshold computation, where the system need not raise an alert on seeing exactly w triggers; it suffices if the alert is raised upon seeing at most (1 + ε)w triggers, where ε is some user-specified tolerance parameter. Prior work also considers aggregate functions more general than counting. Here, each input trigger i is associated with a value αi. The goal is to raise an alert when some aggregate of these values crosses the threshold (an example aggregate function is sum). Note that the Echo or Wave algorithms [2,9,10] and the framework of repeated global computation [5] are not easily applicable to the DTC problem, because the triggers arrive at processors asynchronously at unknown times. Computing the sum of all the trigger counts just once is not enough, and repeated computation results in an excessive number of messages.
2 A Deterministic Algorithm
For the DTC problem, Garg et al. [4] presented an algorithm with message complexity O(n log w). In this section, we describe a simple alternative deterministic algorithm having the same message complexity. The aim of presenting this algorithm is to highlight the difficulties in designing an algorithm that simultaneously achieves good message complexity and MaxRcvLoad bounds. A naive algorithm for the DTC problem works as follows. One of the processors acts as a master, and every processor sends a message to the master upon receiving each trigger. The master keeps count of the total number of triggers received. When the count reaches w, the user is informed and the protocol ends. The disadvantage of this algorithm is that its message complexity is O(w). A natural idea is to avoid sending a message to the master for every trigger received. Instead, a processor will send one message for every B triggers received. Clearly, setting B to a high value will reduce the number of messages. However, care should be taken to ensure that the system does not enter a dead state. For instance, suppose we set B = w/2. Then, the adversary can send w/4 triggers to each of four selected processors. Notice that none of these processors would send a message to the master. Thus, even though all the w triggers have been delivered by the adversary, the system will not detect the termination. We say that such a system is in a dead state. Our deterministic algorithm with message complexity O(n log w) is described next. A predetermined processor serves as the master. The algorithm works in multiple rounds. We start by setting two parameters: ŵ = w and B = ŵ/(2n). Each processor sends a message to the master for every B triggers received. The master keeps count of the triggers reported by other processors and the triggers received by itself. When the count reaches ŵ/2, it declares end-of-round and sends a message to all the processors to this effect. In return, each processor sends the number of unreported triggers to the master (namely, the triggers not reported to the master). This way, the master can compute w′, the total number of triggers received so far in the system. It recomputes ŵ = w − w′; the new ŵ is the number of triggers yet to be received. The master recomputes B = ŵ/(2n) and sends this number to every processor. The next round starts. When ŵ < 2n, we set B = 1. We now argue that the system never enters a dead state. Consider the state of the system in the middle of any round. Each processor has fewer than ŵ/(2n) unreported triggers. Thus, the total number of unreported triggers is less than ŵ/2. The master's count of reported triggers is less than ŵ/2. Thus, the total number of triggers delivered so far is less than ŵ. So, some more triggers are yet to be delivered. It follows that the system is never in a dead state and will correctly terminate upon receiving all the w triggers. Notice that in each round, ŵ decreases by at least a factor of 2. So, the algorithm terminates after log w rounds. Consider any single round. A message is sent to the master for every B triggers received, and the round completes when the master's count reaches ŵ/2. Thus, the number of messages sent to the master is ŵ/(2B) = n. At the end of each round, O(n) messages are exchanged between the master and the other processors. Thus, the number of
An Efficient Decentralized Algorithm for the DTC Problem ()*+ /.-, w; Gc GG GG ww GG ww w GG ww GG ww GG w w GG w GG ww w ()*+wKe Si S /.-, ()*+ /.-, F 0X 0 KKSSS kk95 9 F 0X kksksss 00 k 00 KKKSKSSSS k k ss k 00 S k S k K s 00 KK SSSkSkkk ss 00 KK kkk SSS ss 00 00 00 kkkkkkKKKK sssSsSSSSS 00 SSS k K s k 0 K s k s KK 0 SSSS kk 00 s k k K s SSS 00 KK kk 0 sss SSS K kkk ()*+kk /.-, ()*+s /.-, /.-, ()*+ /.-, ()*+
57
Layer 0 (root)
Layer 1
Layer 2
. . . .
()*+ /.-,
/.-, ()*+
/.-, ()*+
/.-, ()*+
/.-, ()*+
/.-, ()*+
/.-, ()*+
/.-, ()*+
Layer 3
Fig. 2. Illustration for LayeredRand
messages per round is O(n), and the total number of messages exchanged over all the rounds is O(n log w). The above algorithm is efficient in terms of message complexity. However, the master may receive up to O(n log w) messages, and so the MaxRcvLoad of the algorithm is O(n log w). In the next section, we present an efficient randomized algorithm which simultaneously achieves provably good message complexity and MaxRcvLoad bounds.
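For concreteness, the round structure above is easy to simulate sequentially. The following sketch (in Python; the code and all names in it, such as deterministic_dtc, are ours and do not appear in the original paper) plays an adversarial trigger pattern against a single master and counts messages, illustrating the batching rule B = ŵ/(2n).

    def deterministic_dtc(n, w, pattern):
        """Sequential sketch of the master-based algorithm of this section.

        pattern: list of processor ids (0..n-1) receiving the w triggers, in order.
        Returns the total number of messages sent to or by the master.
        """
        remaining = w                 # the current w-hat
        delivered = 0                 # position in the trigger pattern
        messages = 0
        while remaining > 0:
            B = max(1, remaining // (2 * n))   # batch size for this round
            unreported = [0] * n               # unreported triggers per processor
            count = 0                          # master's count of reported triggers
            # deliver triggers until the master's count reaches w-hat / 2
            while count < remaining / 2 and delivered < w:
                p = pattern[delivered]
                delivered += 1
                unreported[p] += 1
                if unreported[p] == B:         # processor p reports a full batch
                    count += B
                    unreported[p] = 0
                    messages += 1
            # end-of-round: collect unreported counts and broadcast the new B
            messages += 2 * n                  # O(n) control messages per round
            remaining -= count + sum(unreported)
        return messages

On any pattern, the count stays within a small constant factor of n log w, but every report lands on the master, which is exactly the MaxRcvLoad weakness discussed above.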
3 LayeredRand Algorithm
In this section, we present a randomized algorithm called LayeredRand. Its message complexity is O(n log n log w) and, with high probability, its MaxRcvLoad is O(log n log w). For ease of exposition, we first describe our algorithm under the assumption that the triggers are delivered one at a time; meaning, all the processing required for handling a trigger is completed before the next trigger arrives. This assumption allows us to better explain the core ideas of the algorithm. We discuss how to handle the concurrency issues in Sect. 5. For the sake of simplicity, we assume that n = 2^L − 1, for some integer L. The n processors are arranged in L layers numbered 0 through L − 1. For 0 ≤ ℓ < L, layer ℓ consists of 2^ℓ processors. Layer 0 consists of a single processor, which we refer to as the root. Layer L − 1 is called the leaf layer. The layering is illustrated in Fig. 2, for n = 15. Only processors occupying adjacent layers communicate with each other. The algorithm proceeds in multiple rounds. At the beginning of each round, the system needs to know how many triggers are yet to be received. This can be computed by keeping track of the total number of triggers received in all the previous rounds and subtracting this quantity from w. Let the term initial value of a round mean the number of triggers yet to be received at the beginning of the round. We use a variable ŵ to store the initial value of the current round. In the first round, we set ŵ = w, since all the w triggers are yet to be received.
We next describe the procedure followed in a single round. Let ŵ denote the initial value of this round. For each 1 ≤ ℓ < L, we compute a threshold τ(ℓ) for layer ℓ:
τ(ℓ) = ŵ / (4 · 2^ℓ · log(n + 1)).
Each processor x maintains a counter C(x), which is used to keep track of some of the triggers received by x and by other processors occupying the layers below that of x. The exact semantics of C(x) will become clear shortly. The counter is reset to zero at the beginning of the round. Consider any non-root processor x occupying a level ℓ. Whenever x receives a trigger, it increments C(x) by one. If C(x) reaches the threshold τ(ℓ), x chooses a processor y occupying level ℓ − 1 uniformly at random and sends a message to y. We refer to such a message as a coin. Upon receiving the coin, the processor y updates C(y) by adding τ(ℓ) to C(y). Intuitively, receipt of a coin by y means that y has evidence that processors below layer ℓ − 1 have received τ(ℓ) triggers. After the update, if C(y) ≥ τ(ℓ − 1), y picks a processor z occupying level ℓ − 2 uniformly at random and sends a coin to z. Then, processor y updates C(y) = C(y) − τ(ℓ − 1). Processor z handles the coin similarly. See Fig. 2: a directed edge from a processor u to a processor v means that u may send a coin to v. Thus, a processor may send a coin to any processor in the layer above. This is illustrated for the top three layers in the figure. We now formally describe the behavior of a non-root processor x occupying a level ℓ. Whenever x receives a trigger from the external source or a coin from level ℓ + 1, it behaves as follows:
– If a trigger is received, increment C(x) by one.
– If a coin is received from level ℓ + 1, update C(x) = C(x) + τ(ℓ + 1).
– If C(x) ≥ τ(ℓ):
• Among the 2^(ℓ−1) processors occupying level ℓ − 1, pick a processor y uniformly at random and send a coin to y.
• Update C(x) = C(x) − τ(ℓ).
The behavior of the root is similar to that of the other processors, except that it does not send coins. The root processor r also maintains a counter C(r). Whenever it receives a trigger from the external source, it increments C(r) by one. If it receives a coin from level 1, it updates C(r) = C(r) + τ(1). An important observation is that at any point of time, any trigger received by the system in the current round is accounted for in the counter C(x) of exactly one processor x. This means that the sum of C(x) over all the processors gives the exact count of the triggers received by the system so far in this round. This observation will be useful in proving the correctness of the algorithm. The crucial activity of the root is to initiate an end-of-round procedure. When C(r) reaches ŵ/2 (i.e., when C(r) ≥ ŵ/2), the root declares end-of-round. Now, the root needs to get a count of the total number of triggers received by all the processors in this round. Let this count be w′. The processors are arranged in a pre-determined binary tree formation such that each processor x
has exactly one parent in the layer above and exactly two children in the layer below. The end-of-round notification can be broadcast to all the processors in a recursive top-down manner. Similarly, the sum of C(x) over all the processors can be reduced at the root in a recursive bottom-up manner. Thus, the root obtains the value w′, i.e., the total number of triggers received in the system in this round. The root then updates the initial value for the next round by computing ŵ = ŵ − w′, and broadcasts this to all the processors, again in a recursive fashion. All the processors then update their τ(ℓ) values for the new round. This marks the start of the next round. Notice that in the end-of-round process, each processor receives at most a constant number of messages. At the end of any round, if the newly computed ŵ is zero, we know that all the w triggers have been received. So, the root can raise an alert to the user and the algorithm terminates. It is easy to derive a bound on the number of rounds taken by the algorithm. Observe that in successive rounds the initial value drops by a factor of two (meaning, the ŵ of round i + 1 is at most half the ŵ of round i). Thus, the algorithm takes at most log w rounds.
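To make the coin-forwarding rule concrete, here is a compact sequential sketch of a single round (in Python; the class and function names are ours, randomness stands in for the uniform choice in the layer above, and the end-of-round broadcast/reduce is elided). It assumes the one-at-a-time trigger delivery used in this section.

    import math
    import random

    class Processor:
        def __init__(self, layer):
            self.layer = layer               # 0 is the root
            self.C = 0                       # counter C(x), reset every round

    def make_layers(L):
        # layer l holds 2**l processors; in total n = 2**L - 1
        return [[Processor(l) for _ in range(2 ** l)] for l in range(L)]

    def tau(l, w_hat, n):
        # threshold tau(l) = w-hat / (4 * 2**l * log(n + 1))
        return w_hat / (4.0 * (2 ** l) * math.log2(n + 1))

    def deliver_trigger(layers, p, w_hat, n):
        """Deliver one trigger to processor p; returns True at end-of-round."""
        p.C += 1
        x = p
        while x.layer > 0 and x.C >= tau(x.layer, w_hat, n):
            x.C -= tau(x.layer, w_hat, n)
            y = random.choice(layers[x.layer - 1])   # uniform pick, layer above
            y.C += tau(x.layer, w_hat, n)            # a coin is worth tau(l of x)
            x = y                                    # y may forward in turn
        return layers[0][0].C >= w_hat / 2           # root declares end-of-round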
4 Analysis of the LayeredRand Algorithm
Here, we prove the correctness of the algorithm and then prove the message bounds.
4.1 Correctness of the Algorithm
We now show that the system will correctly raise an alert to the user when all the w triggers are received. The main part of the proof involves showing that, after starting a new round, the root always enters the end-of-round procedure, i.e., the system does not get stalled in the middle of the round when all the triggers have been delivered. We denote the set of all processors by P. Consider any round and let ŵ be the initial value of the round. Let x be any non-root processor and let ℓ be the layer in which x is found. Notice that at any point of time, we have C(x) ≤ τ(ℓ) − 1. Thus, we can derive a bound on the sum of C(x):
Σ_{x ∈ P−{r}} C(x) ≤ Σ_{ℓ=1}^{L−1} 2^ℓ (τ(ℓ) − 1) ≤ (L − 1)ŵ / (4 log(n + 1)) ≤ ŵ/4.
Now suppose that all the outstanding ŵ triggers have been delivered to the system in this round. We already saw that at any point of time, Σ_{x ∈ P} C(x) gives the number of triggers received by the system so far in the current round.¹ Thus, Σ_{x ∈ P} C(x) = ŵ. It follows that the counter at the root C(r) satisfies C(r) ≥ 3ŵ/4 ≥ ŵ/2. But this means that the root would initiate the end-of-round procedure. We conclude that the system will not enter a dead state.
¹ We note that C(r) is an integer, and hence this holds even when ŵ = 1.
The above argument shows that the system always makes progress by moving into the next round. As we observed earlier, the initial value ŵ drops by a factor of at least two in each round. So, eventually, ŵ must become zero and the system will raise an alert to the user.
4.2 Bound on the Message Complexity
Lemma 1. The message complexity of the algorithm is O(n log n log w).
Proof: As argued before, the algorithm takes only O(log w) rounds to terminate. Consider any round and let ŵ be the initial value of the round. Consider any layer 1 ≤ ℓ < L. Every coin sent from layer ℓ to layer ℓ − 1 means that at least τ(ℓ) triggers have been received by the system in this round. Thus, the number of coins sent from layer ℓ to layer ℓ − 1 can be at most ŵ/τ(ℓ). Summing up over all the layers, we get a bound on the total number of coins (messages) sent in this round:
Number of coins sent ≤ Σ_{ℓ=1}^{L−1} ŵ/τ(ℓ) ≤ Σ_{ℓ=1}^{L−1} 4 · 2^ℓ · log(n + 1) = 4 · (n − 1) · log(n + 1) = O(n log n).
The end-of-round procedure involves only O(n) messages in any particular round. Summing up over all log w rounds, we see that the message complexity of the algorithm is O(n log n log w).
4.3 Bound on the MaxRcvLoad
In this section, we show that with high probability the MaxRcvLoad is bounded by O(log n log w). We use the following Chernoff bound (see [8]) for this purpose.
Theorem 2 (see [8], Theorem 4.4). Let X be the sum of a finite number of independent 0-1 random variables, and let μ = E[X]. Then, for any r ≥ 6, Pr[X ≥ rμ] ≤ 2^(−rμ). Moreover, for any μ′ ≥ μ, the inequality remains true if we replace μ by μ′ on both sides.
Lemma 2. Pr[MaxRcvLoad ≥ c log n log w] ≤ n^(−47), for some constant c.
Proof: Let us first consider the number of coins received by any processor. Processors in the leaf layer do not receive any coins, so it suffices to consider the processors occupying the other layers. Consider any layer 0 ≤ ℓ ≤ L − 2 and let x be any processor found in layer ℓ. Let Mx be the random variable denoting the number of coins received by x. As discussed before, the algorithm takes at most log w rounds. In any given round, the number of coins received by layer ℓ is at most ŵ/τ(ℓ + 1) ≤ 4 · 2^(ℓ+1) log n. Thus, the total number of coins received by layer ℓ is at most 4 · 2^(ℓ+1) log n log w. Each of these coins is sent uniformly and independently at random to one of the 2^ℓ processors occupying layer ℓ. Thus, the expected number of coins received by x is
E[Mx] ≤ (4 · 2^(ℓ+1) log n log w) / 2^ℓ = 8 log n log w.
The random variable Mx is a sum of independent 0-1 random variables. Applying the Chernoff bound given by Theorem 2 (taking r = 6), we see that
Pr[Mx ≥ 48 log n log w] ≤ 2^(−48 log n log w) < n^(−48).
Applying the union bound, we see that
Pr[there exists a processor x having Mx ≥ 48 log n log w] < n^(−47).
During the end-of-round process, a processor receives at most a constant number of messages in any round. So, the total number of these messages received by any processor is O(log w).
5 Handling Concurrency
In this section, we discuss how to handle the concurrency issues. All triggers and coin messages received by a processor can be placed into a queue and processed one at a time. Thus, there is no concurrency issue related to triggers and coins received within a round. However, concurrency issues do need to be handled during an end-of-round. Towards this goal, we slightly modify the LayeredRand algorithm. The core functioning of the algorithm remains the same as before; we mainly modify the end-of-round procedure by adding some additional features (such as counters and queues). The rest of this section explains these features and the end-of-round procedure in detail. We also prove the correctness of the algorithm in the presence of concurrency.
5.1 Processing Triggers and Coins
Each processor x maintains two FIFO queues: a default queue and a priority queue. All triggers and coin messages received by a processor are placed in the default queue. The priority queue contains only the messages related to the end-of-round procedure, which are handled on a priority basis. In the main event-handling loop, a processor repeatedly checks for messages in the queues. It first examines the priority queue and handles the first message in that queue, if any. If there is no message there, it examines the default queue and handles the first message in that queue (if any). Every processor also maintains a counter D(x) that keeps a count of the triggers directly received and processed by x since the beginning of the algorithm. The triggers received by x that are still in the default queue (not yet processed) are not accounted for in D(x). The counter D(x) is incremented every time the processor processes a trigger from the default queue. This counter is never reset. It is maintained in addition to the counter C(x) (which is reset at the beginning of each round). Every processor x maintains another variable, RoundNum, that indicates the current round number for this processor. Whenever x sends a coin to some other processor, it includes its RoundNum in the message. The processing of triggers and coins is done as before (as in Sect. 3).
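A minimal sketch of this two-queue event loop (in Python; the names are of our choosing, and the Inform handling is compressed, whereas in the full protocol RoundNum is incremented only in the fourth phase below):

    from collections import deque

    class Node:
        def __init__(self):
            self.default_q = deque()     # triggers and coins, FIFO
            self.priority_q = deque()    # end-of-round control messages, served first
            self.C = 0                   # per-round counter C(x)
            self.D = 0                   # triggers processed since the start; never reset
            self.round_num = 0           # RoundNum
            self.suspended = False       # True between RoundReset and Inform

        def step(self):
            """Handle one message: the priority queue first, then the default queue."""
            if self.priority_q:
                msg = self.priority_q.popleft()
                if msg == "RoundReset":
                    self.suspended = True        # freeze D(x) during end-of-round
                elif msg == "Inform":
                    self.round_num += 1          # enter the next round
                    self.C = 0
                    self.suspended = False       # resume draining the default queue
            elif self.default_q and not self.suspended:
                kind, value, rnd = self.default_q.popleft()
                if kind == "trigger":
                    self.C += 1
                    self.D += 1                  # D(x) counts only processed triggers
                elif kind == "coin" and rnd == self.round_num:
                    self.C += value              # coins from older rounds are discarded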
5.2 End-of-Round Procedure
Here, we describe the end-of-round procedure in detail, highlighting the modifications. The procedure consists of four phases. The processors are arranged in the form of a binary tree, as before. In the first phase, the root processor broadcasts a RoundReset message down the tree to all nodes, requesting them to send their D(x) counts. In the second phase, these counts are reduced at the root using Reduce messages; the root computes the sum of D(x) over all the processors. Note that, unlike the algorithm described in Sect. 3, here the root computes the sum of the D(x) counters, rather than the sum of the C(x) counters. We shall see that this is useful in proving correctness. Using the sum of the D(x) counters, the root computes the initial value ŵ for the next round. In the third phase, the root broadcasts this value ŵ to all nodes using Inform messages. In the fourth phase, each processor sends an acknowledgement InformAck back to the root and enters the next round. We now describe the four phases in detail.
First Phase: In this phase, the root processor initiates the broadcast of a RoundReset message by sending it down to its children. A processor x, on receiving a RoundReset message, does the following:
– At this point, the processor suspends processing of the default queue until the end-of-round processing is completed. Thus all new triggers are queued up without being processed. This ensures that the D(x) value is not modified while the end-of-round procedure is in progress.
– If x is not a leaf processor, it forwards the RoundReset message to its children; if it is a leaf processor, it initiates the second phase as described below.
Second Phase: In this phase, the D(x) values are sum-reduced at the root from all the processors. The second phase starts when a leaf processor receives a RoundReset message, in response to which it initiates a Reduce message containing its D(x) value and passes it to its parent. When a non-leaf processor has received Reduce messages from all its children, it adds the values in these messages to its own D(x) and sends a Reduce message to its parent with this sum. Thus, the root collects the sum of D(x) over all the processors. This sum w′ is the total number of triggers received in the system so far. Subtracting w′ from w, the root computes the initial value ŵ for the next round. If ŵ = 0, the root raises an alert and terminates the algorithm. Otherwise, the root initiates the third phase.
Third Phase: In this phase, the root processor broadcasts the new ŵ value by sending an Inform message to its children. A processor x, on receiving the Inform message, performs the following:
– It computes the threshold τ(ℓ) value for the new round, where ℓ is the layer number of x.
– If x is a non-leaf processor, it forwards the Inform message to its children; if x is a leaf processor, it initiates the fourth phase as described below.
Fourth Phase: In this phase, the processors send an acknowledgement up to the root and enter the new round. The fourth phase starts when a leaf processor x receives an Inform message. After performing the processing for the Inform message, it performs the following actions:
– It increments RoundNum. This signifies that the processor has entered the next round. After this point, the processor does not process any coins from the previous rounds. Whenever the processor receives a coin generated in a previous round, it simply discards the coin.
– C(x) is reset to zero.
– It sends an InformAck to its parent.
– The processor x resumes processing of the default queue. This way, x starts processing the outstanding triggers (if any).
When a non-leaf node receives InformAck messages from all its children, it performs the same processing as above. When the root processor has received InformAck messages from all its children, the system enters the new round. We note that it is possible to implement the end-of-round procedure using three phases. However, the fourth phase (of sending acknowledgements) ensures that at any point of time, the processors can be in only two different (consecutive) rounds. Moreover, when the root receives the InformAck messages from all its children, all the processors in the system are in the same round. Thus, end-of-round processing for different rounds cannot be in progress simultaneously.
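Collapsing each message exchange into a recursive call gives a compact sequential sketch of the four phases (in Python; our own naming, compatible with the Node sketch above; in the real protocol every call below is a message, and the InformAck phase is what keeps two end-of-round instances from overlapping):

    def end_of_round(root, children, w):
        """Run the four phases over a tree; children maps a node to its child list.

        Returns the new w-hat (0 means the alert is raised and the algorithm ends).
        """
        def suspend(x):                      # Phase 1: broadcast RoundReset
            x.suspended = True
            for c in children[x]:
                suspend(c)

        def reduce_d(x):                     # Phase 2: sum-reduce the D(x) counters
            return x.D + sum(reduce_d(c) for c in children[x])

        def inform(x, w_hat):                # Phases 3 and 4: push w-hat down,
            for c in children[x]:            # then enter the new round bottom-up
                inform(c, w_hat)
            x.round_num += 1
            x.C = 0
            x.suspended = False              # InformAck is implicit in the return

        suspend(root)
        w_prime = reduce_d(root)             # total triggers processed so far
        w_hat = w - w_prime
        if w_hat > 0:
            inform(root, w_hat)
        return w_hat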
5.3 Correctness of the Algorithm
We now show that the system correctly raises an alert to the user when all the w triggers are delivered. The main part of the proof involves showing that, after starting a new round, the root always enters the end-of-round procedure. Furthermore, we also show that the system does not incorrectly raise an alert to the user before w triggers are delivered. We say that a trigger is unprocessed if the trigger has been delivered to a processor and is waiting in its default queue. A processor is said to be in round k if its RoundNum equals k. A trigger is said to be processed in round k if the processor that received this trigger is in round k when it processes the trigger. Consider the point in time t when the system has entered a new round k. Let ŵ be the initial value of the round. Recall that in the second phase, the root computes w′ = Σ_{x ∈ P} D(x) and sets ŵ = w − w′, where P is the set of all processors. Notice that in the first phase, all processors suspend processing triggers from the default queue. The trigger processing is resumed only in the fourth phase, after RoundNum is incremented. Therefore, no more triggers are processed in round k − 1. It follows that w′ is the total number of triggers that have been processed in the (previous) rounds k′ ≤ k − 1. Thus, any trigger processed in round k will be accounted for in the counter C(x) of some processor x. This observation leads to the following argument. We now show that the root initiates the end-of-round procedure upon receiving at most ŵ triggers. Suppose all the ŵ triggers have been delivered and
processed in this round. Furthermore, assume that all the coins generated and sent in the above process have also been received and processed. Clearly, such a state will occur at some point in time, since we assume a reliable communication network. At this point of time, we have Σ_{x ∈ P} C(x) = ŵ. At any point of time after t, we have Σ_{x ∈ P−{r}} C(x) ≤ ŵ/4, where P is the set of all processors and r is the root processor. The claim is proved using the same arguments as in Sect. 4.1 and the fact that the processors discard the coins generated in previous rounds. From the above relations, we get that C(r) ≥ 3ŵ/4 ≥ ŵ/2. The root initiates the end-of-round procedure whenever C(r) crosses ŵ/2. Thus, the root will eventually start the end-of-round procedure. Hence the system never gets stalled in the middle of a round. Clearly, the system raises an alert on receiving w triggers. We now argue that the system does not raise an alert before receiving w triggers. This follows from the fact that ŵ for a new round is calculated on the basis of the D(x) counters. The analysis of the message complexity and MaxRcvLoad is unaffected.
6 Conclusions
We have presented a randomized algorithm for the DTC problem which reduces the MaxRcvLoad of any node from O(n log w) to O(log n log w) with high probability. The ultimate goal of this line of work would be to design a deterministic algorithm with MaxRcvLoad O(log w).
References
1. Awerbuch, B.: Complexity of network synchronization. J. ACM 32(4), 804–823 (1985)
2. Chang, E.: Echo algorithms: Depth parallel operations on general graphs. IEEE Trans. Software Eng. 8(4), 391–401 (1982)
3. Cormode, G., Muthukrishnan, S., Yi, K.: Algorithms for distributed functional monitoring. In: SODA (2008)
4. Garg, R., Garg, V.K., Sabharwal, Y.: Scalable algorithms for global snapshots in distributed systems. In: 20th Int. Conf. on Supercomputing, ICS (2006)
5. Garg, V., Ghosh, J.: Repeated computation of global functions in a distributed environment. IEEE Trans. Parallel Distrib. Syst. 5(8), 823–834 (1994)
6. Huang, L., Garofalakis, M., Joseph, A., Taft, N.: Communication-efficient tracking of distributed cumulative triggers. In: ICDCS (2007)
7. Keralapura, R., Cormode, G., Ramamirtham, J.: Communication-efficient distributed monitoring of thresholded counts. In: SIGMOD Conference (2006)
8. Mitzenmacher, M., Upfal, E.: Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge Univ. Press, Cambridge (2005)
9. Segall, A.: Distributed network protocols. IEEE Transactions on Information Theory 29(1), 23–34 (1983)
10. Tel, G.: Distributed infimum approximation. In: Lupanov, O.B., Bukharajev, R.G., Budach, L. (eds.) FCT 1987. LNCS, vol. 278, pp. 440–447. Springer, Heidelberg (1987)
Deterministic Dominating Set Construction in Networks with Bounded Degree
Roy Friedman and Alex Kogan
Department of Computer Science, Technion, Israel
{roy,sakogan}@cs.technion.ac.il
Abstract. This paper considers the problem of calculating dominating sets in networks with bounded degree. In these networks, the maximal degree of any node is bounded by Δ, which is usually significantly smaller than n, the total number of nodes in the system. Such networks arise in various settings of wireless and peer-to-peer communication. A trivial approach of choosing all nodes into the dominating set yields an algorithm with an approximation ratio of Δ + 1. We show that any deterministic algorithm with a non-trivial approximation ratio requires Ω(log* n) rounds, meaning effectively that no o(Δ)-approximation deterministic algorithm with a running time independent of the size of the system may ever exist. On the positive side, we show two deterministic algorithms that achieve log Δ- and 2 log Δ-approximation in O(Δ^3 + log* n) and O(Δ^2 log Δ + log* n) time, respectively. These algorithms rely on coloring rather than node IDs to break symmetry.
1 Introduction
The dominating set problem is a fundamental problem in graph theory. Given a graph G, a dominating set of the graph is a set of nodes such that every node in G is either in the set or has a direct neighbor in the set. This problem, along with its variations, such as the connected dominating set or the k-dominating set, plays a significant role in many distributed applications, especially those running over networks that lack any predefined infrastructure. Examples include mobile ad-hoc networks (MANETs), wireless sensor networks (WSNs), peer-to-peer networks, etc. The main application of dominating sets in such networks is to provide a virtual infrastructure, or overlay, in order to achieve scalability and efficiency. Such overlays are mainly used to improve routing schemes, where only nodes in the set are responsible for routing messages in the network (e.g., [29, 30]). Other applications of dominating sets include efficient power management [11, 30] and clustering [3, 14]. In many cases, the network graph is such that each node has a limited number of direct neighbors. Such a limitation may result from several reasons. First,
This work is partially supported by the Israeli Science Foundation grant 1247/09 and by the Technion Hasso Plattner Center.
it can represent a hardware limitation, such as a bounded number of communication ports in a device [8]. Second, it can be an outcome of an inherent communication protocol limitation, as in the case of Bluetooth networks composed of units, called piconets, that include at most eight devices [10]. Finally, performance considerations, such as space complexity and network scalability, may limit the number of nodes with which each node may communicate directly. This is a common case for structured peer-to-peer networks, where each node selects a constant number of neighbors when it joins the network [17, 25]. The problem of finding a dominating set that has a minimal number of nodes is known to be NP-complete [12], and, in fact, it is also hard to approximate [9]. Although the approximation ratio of existing solutions for the dominating set problem, O(log Δ), was found to be the best possible (to within a lower-order additive factor, unless NP has an n^{O(log log n)}-time deterministic algorithm [9]), the gap between lower and upper bounds on the running time of distributed deterministic solutions remains wide. Kuhn et al. [19] showed that any distributed approximation algorithm for the dominating set problem with a polylogarithmic approximation ratio requires at least Ω(√(log n / log log n)) communication rounds. Along with that, the existing distributed deterministic algorithms incur a running time that is linear in the number of nodes [7, 23, 29]. This worst-case upper bound remains valid even when the graphs of interest are restricted to the bounded degree case, like the ones described above. The deterministic approximation algorithms [7, 23, 29] are based on the centralized algorithm of Guha and Khuller [13], which in turn is based on a greedy heuristic for the related set-cover problem [5]. Following the heuristic, these algorithms start with an empty dominating set and proceed as follows. Each node calculates its span, the number of uncovered neighbors, including the node itself. (A node is uncovered if it is not in the dominating set and does not have any neighbor in the set.) Then it exchanges the span with all nodes within a distance of 2 hops and decides whether to select itself into the dominating set based on its span and the spans of nodes within distance 2. These iterations are repeated by a node v as long as v or at least one of its neighbors is uncovered. The decision whether to join the dominating set in the above iterative process is taken based on the lexicographic order of the pair ⟨span, ID⟩ [7, 23, 29]. The use of IDs to break ties leads to long dependency chains, where a node cannot join the set because of another node having a higher ID. This, in turn, leads to a time complexity that is linear in the number of nodes. To see this, consider a ring where nodes have IDs starting from 1 and increasing clockwise. At the first iteration, only the node with the highest ID = n will join the set. At the second iteration, only the node with ID = n − 3 will join the set, since it has 3 uncovered neighbors (including itself), while nodes n − 2 and n − 1 have only 2 and 1, respectively. At the third iteration, the node with ID = n − 6 will join, and so on. Thus, such an approach requires roughly n/3 phases. In this paper, we employ coloring to reduce the length of such dependency chains. Our approach is two-phased: we first run a coloring algorithm that assigns each node a color that is different from the color of any other node within
distance 2. Then, we run the same iterative process described above, while using colors instead of IDs to break ties between nodes with equal span, shortening the length of the maximal chain (the ring example above is revisited in the simulation sketch below). This approach results in a distributed deterministic algorithm with an approximation ratio of log Δ (or, more precisely, log Δ + O(1)) and a running time of O(Δ^3 + log* n). Notice, though, that the coloring required by our algorithm can be precomputed for other purposes, e.g., time-slot scheduling for wireless channel access [15, 28]. When the coloring is given, the running time of the algorithm becomes O(Δ^3), independent of the size of the system. We also describe a modification to our algorithm that reduces its running time to O(Δ^2 log Δ + log* n) (O(Δ^2 log Δ) in case the coloring is already given), while the approximation ratio increases by a constant factor. An essential question that arises in the context of bounded degree networks is whether it is possible to construct a local approximation algorithm, i.e., an algorithm with a running time that depends solely on the degree bound. As already stated above, in the general case, Kuhn et al. [19] provide a negative answer and state that at least Ω(√(log n / log log n)) communication rounds are needed. Along with that, in several other related communication models, such as the unit disc graph, local approximation algorithms are known to exist [6]. In this paper, we show that any deterministic algorithm with a non-trivial approximation ratio requires at least Ω(log* n) rounds, thus answering the question stated above in the negative. In light of this lower bound, our modified algorithm leaves an additive gap of O(Δ^2 log Δ).
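The effect of the tie-breaking key is easy to reproduce in simulation. The following sketch (in Python; entirely our own code, run on a ring for illustration) executes the greedy ⟨span, key⟩ rule and counts iterations: with key = ID it exhibits the roughly n/3 behavior described above, while a legal 2-distance coloring collapses the dependency chain.

    def greedy_rounds(n, key):
        """Greedy dominating set on an n-node ring; key breaks ties among spans.

        A node joins when its (span, key) pair beats every node within 2 hops.
        Returns the number of synchronous iterations until all nodes are covered.
        """
        covered = [False] * n       # covered by itself (joined) or by a neighbor
        closed = lambda v: [(v - 1) % n, v, (v + 1) % n]
        two_hop = lambda v: [(v + d) % n for d in (-2, -1, 1, 2)]
        rounds = 0
        while not all(covered):
            rounds += 1
            span = [sum(not covered[u] for u in closed(v)) for v in range(n)]
            joiners = [v for v in range(n) if span[v] > 0 and
                       all((span[v], key[v]) > (span[u], key[u]) for u in two_hop(v))]
            for v in joiners:
                for u in closed(v):
                    covered[u] = True
        return rounds

    # greedy_rounds(60, list(range(60))) takes roughly 60/3 = 20 iterations, while
    # greedy_rounds(60, [v % 5 for v in range(60)]) finishes in 2 iterations
    # (v % 5 is a legal 2-distance coloring of a ring whose length divides by 5).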
2 Related Work
Due to its importance, the dominating set problem has been considered in various networking models. For general graphs, the best distributed deterministic O(log Δ)-approximation algorithms have linear running time [7, 23, 29]. In fact, these algorithms perform no better than a trivial approach in which each node collects a global view of the network by exchanging messages with its neighbors and then locally calculates a dominating set approximation by running, e.g., the centralized algorithm of Guha and Khuller [13]. The only lower bound known for general graphs is due to Kuhn et al. [19], which states that at least Ω(√(log n / log log n)) communication rounds are needed to find a constant or polylogarithmic approximation¹. Their proof relies on a construction of a special family of graphs in which the maximal node degree depends on the size of the graph. Thus, this construction cannot be realized in the bounded degree model. Another body of work considers unit-disk graphs (UDG), which are claimed to model the communication in wireless ad-hoc networks. Although the dominating set problem remains NP-hard in this model, approximation algorithms with a constant ratio are known (e.g., [6, 20]). Recently, Lenzen and Wattenhofer [22] showed that any f-approximation algorithm for the dominating set problem in the UDG model runs in g(n) time, where f(n)g(n) ∈ Ω(log* n).
¹ This work assumes unbounded local computations.
Table 1. Comparison of results on distributed deterministic O(log Δ)-approximation of optimal dominating sets

Model            Running time                               Algorithm/Lower bound
General          Ω(√(log n / log log n))                    [19]
General          O(n)                                       [7, 23, 29]
Bounded degree   Ω(log* n)                                  this paper
Bounded degree   O(log* n + Δ^3), O(log* n + Δ^2 log Δ)     this paper
In contrast, we consider a different model of graphs with bounded-degree nodes, in which Δ is not a constant number but rather an independent parameter of the problem. This enables us to obtain a more refined lower bound. Specifically, we show that while obtaining an O(Δ)-approximation for the optimal dominating set in our model is possible even without any communication, any o(Δ)-approximation algorithm requires Ω(log* n) time. Although our proof employs a similar (ring) graph, which can also be realized in the UDG model, the formalism we use allows us to obtain our lower bound in a shorter and more straightforward way. The dominating set problem in bounded degree networks was considered by Chlebik and Chlebikova [4], who derive explicit lower bounds on the approximation ratios of centralized solutions. While we are not aware of any previous work on distributed approximation of dominating sets in bounded degree networks, several related problems have been considered in this setting. Very recently, Astrand, Suomela, and colleagues provided distributed deterministic approximation algorithms for a series of such problems, e.g., vertex cover [1, 2] and set cover [2]. Panconesi and Rizzi considered maximal matchings and various colorings [26]. It is worth mentioning several randomized approaches that have been proposed for the general graph model and which can also be applied in the setting of networks with bounded degree. For instance, Jia et al. [16] propose an algorithm with O(log n log Δ) running time, while Kuhn et al. [21] achieve an even better O(log^2 Δ) running time. These solutions, however, provide only probabilistic guarantees on the running time and/or approximation ratio (for example, the former achieves an approximation ratio of O(log Δ) in expectation and O(log n) with high probability), while our approach deterministically achieves an approximation ratio of log Δ. The results of previous work, along with the contribution of this paper, are summarized in Table 1.
3 Model and Preliminaries
We model the network as a graph G = (V, E). The number of nodes is n and the degree of any node in the graph is limited by a global parameter Δ. We assume that both n and Δ are known to any node in the system. Also, we assume that each node has a unique identifier of size O(log n). In fact, both assumptions are required only by the coloring procedure we use as a subroutine [18]. Our lower bound does not require the latter assumption and, in particular, holds for anonymous networks as well.
[Figures omitted.]
Fig. 1. A (partial) 2-ring graph R(n, 2)
Fig. 2. A subgraph G′ of R(n, 2): a sequence of f(n) nodes between boundary nodes vi and vj, extended by k · o(log* n) nodes on each side
Our model of computation is a synchronous, message-passing system (denoted LOCAL in [27]) with reliable processes and reliable links. In particular, time is divided into rounds and, in every round, a node may send one message of arbitrary size to each of its direct neighbors in G, receive all messages sent to it by its direct neighbors in the same round, and perform some local computation. Consequently, for any given pair of nodes v and u at a distance of k edges in G, a message sent by v in round i may reach u no earlier than round i + k − 1. All nodes start the computation at the same round. The time complexity of the algorithms presented below is the number of rounds from the start until the last node ceases to send messages. Let Nk(v, G) denote the k-neighborhood of a node v in a graph G; that is, Nk(v, G) is the set of all nodes (not including v itself) that are at most k hops from v in G. In the following definitions, all node indices are taken modulo n.
Definition 1. A ring graph R(n) = (Vn, En) is a circle graph consisting of n nodes, where Vn = {v1, v2, ..., vn} and En = {(vi, vi+1) | 1 ≤ i ≤ n}. A k-ring graph R(n, k) = (Vn, En^k) is an extension of the ring graph, where Vn = {v1, v2, ..., vn} and En^k = {(vi, u) | u ∈ Nk(vi, R(n)) ∧ 1 ≤ i ≤ n}.
Notice that in R(n, k) each node v has exactly 2k edges, one to each of its neighbors in Nk(v, R(n)) (see Fig. 1). Given R(n, k) and two nodes vi, vj ∈ Vn, i ≤ j, let Sub(R(n, k), vi, vj) be the subgraph (V, E) where V = {vk ∈ Vn | i ≤ k ≤ j}. Thus, assuming a clockwise ordering of nodes on the ring, Sub(R(n, k), vi, vj) contains the sequence of nodes between vi and vj in the clockwise direction. The nodes vi and vj are referred to as the boundary nodes of the sequence.
Definition 2. Suppose A is an algorithm operating on R(n, k) and assigning each node vi ∈ Vn a value c(vi) ∈ {0, 1}. Let r(vi) = min_j {j ≤ i | ∀k, j ≤ k ≤ i : c(vk) = c(vi)}. Similarly, let l(vi) = max_j {i ≤ j | ∀k, i ≤ k ≤ j : c(vk) = c(vi)}. Then Seq(vi) = Sub(R(n, k), v_{r(vi)}, v_{l(vi)}) is the longest sequence of nodes
containing vi in which all nodes have the value c(vi). We call v_{l(vi)} the leftmost node in Seq(vi), its neighbor within the sequence the second leftmost node, and so on.
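A small sketch making the ring neighborhoods and Seq(vi) concrete (in Python; the helper names are ours, the indices are 0-based, and, as in the proofs below, c is assumed not to be constant around the whole ring):

    def ring_neighbors(n, k, i):
        """N_k(v_i, R(n)): the nodes at ring distance at most k from v_i."""
        return [(i + d) % n for d in range(-k, k + 1) if d != 0]

    def seq(c, i):
        """Seq(v_i): the maximal run around the ring on which c equals c[i].

        c: the 0/1 values assigned by an algorithm A; assumes c takes both
        values somewhere on the ring. The run is returned in clockwise order,
        so its last element is the leftmost node of the run.
        """
        n = len(c)
        lo, hi = i, i
        while c[(lo - 1) % n] == c[i]:      # extend counterclockwise to r(v_i)
            lo -= 1
        while c[(hi + 1) % n] == c[i]:      # extend clockwise to l(v_i)
            hi += 1
        return [j % n for j in range(lo, hi + 1)]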
4 Proof of Bounds
4.1 Lower Bound
The minimal dominating set of any bounded degree graph has a size of at least n/(Δ + 1). Thus, a simple approach of choosing all nodes of the graph into the dominating set gives a trivial (Δ + 1)-approximation for the optimal set. An essential question is whether a non-trivial approximation can be calculated deterministically in bounded degree graphs in an effective way, i.e., independent of the system size. The following theorem gives a negative answer to this question.
Theorem 1. Any distributed deterministic o(Δ)-approximation algorithm for the dominating set problem in a bounded degree graph requires Ω(log* n) time.
Proof. Assume, by way of contradiction, that there exists a deterministic algorithm A that finds an o(Δ)-approximation in o(log* n) time. Given a ring of n nodes, R(n), the following algorithm colors it with 3 colors, for any given k.
– Construct the k-ring graph R(n, k) and run A on it. For each node vi ∈ Vn, denote the value c(vi) as 1 if A selects vi into the dominating set, and as 0 otherwise.
– Every node vi ∈ Vn chooses its color according to whether or not vi and some of its neighbors are chosen into the dominating set by A. Specifically, consider the sequence Seq(vi) as defined in Def. 2.
• If vi is not in the set, the nodes in the sequence are colored with colors 2 and 1 interchangeably. That is, the leftmost node in the sequence chooses color 2, the second leftmost node chooses color 1, the third leftmost node chooses color 2, and so on.
• If vi is in the set, the nodes in the sequence are colored with colors 0 and 1 interchangeably. That is, the leftmost node in the sequence chooses color 0, the second leftmost node chooses color 1, the third leftmost node chooses color 0, and so on.
The produced coloring uses 3 colors and admits a straightforward distributed implementation. Notice that the coloring is legal (i.e., no two adjacent nodes share the same color) inside sequences of nodes chosen and not chosen into the dominating set by A. Thus, the legality of the produced coloring needs to be verified only where the sequences end. Consider two neighboring nodes (in R(n)) v and u, where v is a left neighbor of u (i.e., v appears immediately after u in the ring when considering nodes in the clockwise direction). If v is in the set and u is not, then the color of u, being the leftmost in its sequence of nodes not in the set, is 2, while the color of v is 0 or 1. Similarly, if u is in the set and v is not, then the color of u, being the leftmost in its sequence of nodes in the set, is 0, while the color of v is 2 or 1. Thus, the produced coloring is legal.
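The color-assignment step of the proof is mechanical; the following sketch (in Python, reusing seq from the listing above; our own code, not from the original paper) derives the 3-coloring from A's 0/1 output c:

    def three_color(c):
        """The coloring of Theorem 1: runs with c = 0 alternate 2,1,2,... and
        runs with c = 1 alternate 0,1,0,..., each from its leftmost node."""
        n = len(c)
        color = [None] * n
        for i in range(n):
            run = seq(c, i)                  # the maximal run containing i
            pos = run[::-1].index(i)         # distance from the leftmost (last) node
            first = 2 if c[i] == 0 else 0    # the leftmost node's color for this run
            color[i] = first if pos % 2 == 0 else 1
        return color

    # e.g. three_color([1, 0, 0, 0, 1, 1]) returns [0, 2, 1, 2, 0, 1],
    # which is legal on the 6-node ring.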
The running time of the algorithm is g(n) ∈ o(log* n) rounds spent running A, plus an additional number of rounds to decide on colors. The length of the longest sequence of nodes not in the dominating set cannot exceed 2k, since otherwise there would be a node that is not covered by any node in the selected dominating set. Thus, the implementation of the first coloring rule requires a constant number of rounds. In the following, we show that there exists a k such that the length of the longest sequence chosen into the dominating set by A is o(log* n). Thus, for this k, nodes decide on their colors in o(log* n) time. The running time of the algorithm to color a ring with 3 colors therefore sums up to o(log* n), contradicting the famous lower bound of Linial [24]. We are left with the claim that for some k, the length of the longest sequence of nodes chosen into the dominating set by A is o(log* n). Suppose, by way of contradiction, that for any k there exists a function f(n) ∈ Ω(log* n) such that A produces a sequence of length f(n). Let vi and vj be the boundary nodes of such a sequence, with i ≤ j, and construct the subgraph G′ = Sub(R(n, k), v_{i−k·g(n)}, v_{j+k·g(n)}). Notice that this subgraph contains the same f(n) nodes chosen by A into the dominating set, plus an additional 2k · g(n) nodes (see Fig. 2). Also note that a minimum dominating set in G′, OPT(G′), contains (f(n) + 2k · g(n)) / (2k + 1) nodes. When A is run on G′, the nodes in the original sequence of length f(n) cannot distinguish between the two graphs, i.e., R(n, k) and G′. This is because in our model, a node can collect information in o(log* n) rounds only from nodes at a distance of at most o(log* n) edges from it. Thus, being completely deterministic, A must select the same f(n) nodes (plus some additional nodes to ensure that all nodes in G′ are covered). Consequently, |A(G′)| ≥ f(n), where |A(G′)| denotes the size of the dominating set calculated by A for the graph G′. On the other hand, A has an o(Δ)-approximation ratio, thus for any graph G, |A(G)| ≤ o(Δ) · |OPT(G)| + c, where c is some non-negative constant. For simplicity, we assume c = 0; the proof does not change much for c > 0. In the graph R(n, k) (and G′), Δ = 2k, thus there exist Δ′ and k s.t. 2o(Δ′) = 2o(2k) < 2k + 1. In addition, since f(n) ∈ Ω(log* n) and g(n) ∈ o(log* n), there exists n′ > k s.t. 2k · g(n′) < f(n′). Thus, for Δ′, k and n′, we get:
o(Δ′) · |OPT(G′)| = o(Δ′) · (f(n′) + 2k · g(n′)) / (2k + 1) < o(Δ′) · 2f(n′) / (2k + 1) < f(n′) ≤ |A(G′)|,
contradicting the fact that A has an o(Δ)-approximation ratio.
It follows immediately from the previous theorem that no local deterministic algorithm that achieves an optimal O(log Δ)-approximation may exist. Corollary 1. Any distributed deterministic O(log Δ)-approximation algorithm for the dominating set problem in a bounded degree graph requires Ω(log∗ n) time.
4.2 Upper Bound
First, we describe an algorithm that achieves a log Δ-approximation in O(Δ^3 + log* n) time. Next, we show a modified version that runs in O(Δ^2 log Δ + log* n) time and achieves a 2 log Δ-approximation. We will use the following notion:
Definition 3. A k-distance coloring is an assignment of colors to nodes such that any two nodes within k hops of each other have distinct colors.
Our algorithm consists of two parts. The first part is a 2-distance coloring routine, implemented by means of a coloring algorithm provided by Kuhn [18]. Kuhn's distributed deterministic algorithm produces a 1-distance coloring for any input graph G using Δ + 1 colors in O(Δ + log* n) time. For our purpose, we run this algorithm on the graph G^2, created from G by (virtually) connecting each node with any of its neighbors at distance 2. This means that any message sent on such a virtual link is routed by an intermediate node to its target, increasing the running time of the algorithm by a constant factor. The second part of the algorithm is the approximation routine, which is a simple application of the greedy heuristic described in Sect. 1, where colors obtained in the first phase are used to break ties instead of IDs. That is, nodes exchange their span and color with all neighbors at distance 2 and decide to join the set if their ⟨span, color⟩ pair is lexicographically higher than any of the received pairs. The pseudo-code for the algorithm is given in Algorithm 1. It denotes the set of immediate neighbors of a node i by N1(i) and the set of neighbors of i at distance 2 by N2(i). Additionally, each node i uses the following local variables:
– color: array with the values of the colors assigned to each node j ∈ N2(i) by the 2-distance coloring routine. Initially, all values are set to ⊥.
– state: array that holds the state of each node j ∈ N1(i). The state can be uncovered, covered or marked. Initially, all values are set to uncovered. The nodes chosen into the dominating set are those that finish the algorithm with their state set to marked.
– span: array with values for each node j ∈ N2(i); span[j] holds the number of nodes in N1(j) ∪ {j} that are uncovered by any node already selected into the dominating set, as reported by j. Initially, all values are set to ⊥.
– done: boolean array that specifies for each node j ∈ N1(i) whether j has finished the algorithm. Initially, all values are set to false.
Theorem 2. The algorithm in Algorithm 1 computes a dominating set with an approximation ratio of log Δ in O(Δ^3 + log* n) time.
Proof. We start by proving the bound on the running time of the algorithm. The 2-distance coloring routine requires O(Δ^2 + log* n) time. This is because the maximal degree of nodes in the graph G^2 is bounded by Δ(Δ − 1), and each round of the coloring algorithm of Kuhn [18] in G^2 can be simulated by at most 2 rounds in the given graph G.
Algorithm 1. code for node i
 1: color[i] = calc-2-dist-coloring()        // use the coloring algorithm of [18]
 2: distribute-and-collect(color, 2)
 3: while state[j] = uncovered for any j ∈ N1(i) ∪ {i} do
 4:     span[i] := |{state[j] = uncovered | j ∈ N1(i) ∪ {i}}|
 5:     distribute-and-collect(span, 2)
 6:     if ⟨span[i], color[i]⟩ > max{⟨span[j], color[j]⟩ | j ∈ N2(i) ∧ span[j] ≠ ⊥} then
 7:         state[i] := marked
 8:     distribute-and-collect(state, 1)
 9:     if state[j] = marked for any j ∈ N1(i) then
10:         state[i] := covered
11:     distribute-and-collect(state, 1)
12: done
13: broadcast done to all neighbors

distribute-and-collect(array_i, radius):
14: foreach q in [1, 2, ..., radius] do
15:     broadcast array_i to all neighbors
16:     receive array_j from all j ∈ N1(i) s.t. done[j] = false
17:     foreach node l at distance q from i do
18:         if ∃ j ∈ N1(i) s.t. done[j] = false ∧ node l at distance q − 1 from j then
19:             array_i[l] = array_j[l]
20:     done
21: done

when done is received from j:
22: done[j] = true
23: span[j] = ⊥
The maximal value the span can take is Δ + 1, while the number of colors produced by the coloring procedure is O(Δ2). Thus, the maximal number of distinct values for all ⟨span, color⟩ pairs is O(Δ3). In every iteration of the greedy heuristic (the while-do loop in Lines 3–12 of Algorithm 1), all nodes having a maximal value of the ⟨span, color⟩ pair join the set. Thus, after at most O(Δ3) iterations, all nodes are covered, and each iteration can be implemented in O(1) synchronous communication rounds. Summing over both phases produces the required bound on the running time. Note that if coloring is not required, the running time is independent of n.

For the approximation ratio, observe that the span of a node is influenced only by its neighbors at distance of at most 2 hops. Also, notice that the dominating set problem is easily reduced to the set-cover problem (by creating a set for each node along with all its neighbors [13]). Thus, the algorithm chooses essentially the same nodes as the well-known centralized greedy heuristic for the set-cover problem [5], which picks sets based on the number of uncovered elements they contain. The approximation ratio of the algorithm therefore follows directly from the analysis of that heuristic (for details, see [5]).
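To make the correspondence concrete, here is a small Python sketch (ours, not the paper's code) of the centralized greedy heuristic [5] on the set-cover instance induced by a graph; a node's span is the number of still-uncovered nodes in its closed neighborhood:

def greedy_dominating_set(adj):
    # adj: dict mapping each node to the set of its neighbors.
    # Repeatedly pick the node whose closed neighborhood covers the
    # most still-uncovered nodes (its span), as in [5].
    uncovered = set(adj)
    dominating = []
    while uncovered:
        v = max(adj, key=lambda u: len(({u} | adj[u]) & uncovered))
        dominating.append(v)
        uncovered -= {v} | adj[v]
    return dominating

# Example: a 6-cycle; the greedy rule returns a dominating set of size 2.
ring = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}
print(greedy_dominating_set(ring))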
To reduce the running time of the algorithm (at the price of increasing the approximation ratio by a factor of 2), we modify the algorithm to work with an adjusted span for each node u. The adjusted span is the smallest power of 2 that is at least as large as the number of u's uncovered neighbors (including u itself). Thus, during the second phase of the algorithm, u exchanges its adjusted span and color with all nodes at distance 2 and decides to join the dominating set if its ⟨adjusted span, color⟩ pair is lexicographically higher than that of any node at distance 2. Note that one might round the span to a power of any other constant c > 1, slightly improving the approximation ratio but not the asymptotic running time.

Theorem 3. The modified algorithm computes a dominating set with an approximation ratio of 2 log Δ in O(Δ2 log Δ + log∗ n) time.

Proof. The adjusted span can take at most log Δ distinct values, while the number of colors produced by the coloring procedure is O(Δ2). Thus, similarly to the proof of Theorem 2, we can infer that the running time is O(Δ2 log Δ + log∗ n).

The factor 2 in the approximation ratio is due to the span adjustment. To prove this claim, consider the centralized greedy heuristic for the set-cover problem [5] with the adjusted-span modification: the number of uncovered elements in a set S is replaced (adjusted) by the smallest power of 2 that is at least as large as this number, and at each step the heuristic chooses a set that covers the largest adjusted number of uncovered elements. Following the observation in the proof of Theorem 2, establishing the approximation ratio of the centralized set-cover heuristic with the adjusted-span modification establishes the approximation ratio of the modified dominating set algorithm.

When the (modified or unmodified) greedy heuristic chooses a set S, suppose that it charges each element of S the price 1/i, where i is the number of uncovered elements in S. As a result, the total price paid by the heuristic is exactly the number of sets it chooses, while each element is charged only once. Consider a set S∗ = {ek, ek−1, . . . , e1} in the optimal set-cover solution Sopt, and assume without loss of generality that the greedy heuristic covers the elements of S∗ in the given order: ek, ek−1, . . . , e1. Consider the step at which the heuristic chooses a set that covers ei. At the beginning of that step, at least i elements of S∗ are uncovered. Thus, if the heuristic were to choose the set S∗ at that step, it would pay a price of 1/i per element. Using the adjusted-span modification, the heuristic might pay at that step at most twice the price per element covered, i.e., it pays for ei at most 2/i. Consequently, the total price paid by the heuristic to cover all elements of S∗ is at most Σ1≤i≤k 2/i = 2Hk, where Hk = Σ1≤i≤k 1/i = log k + O(1) is the k-th harmonic number. Thus, since every element is in some set of Sopt, in order to cover all elements the modified greedy heuristic pays at most ΣS∈Sopt 2Hm = 2Hm ΣS∈Sopt 1 = 2Hm |Sopt|, where m is the size of the biggest set in Sopt. In the instance of the set-cover problem produced from a graph with degree bounded by Δ, m = Δ + 1, which establishes the required approximation ratio.
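As an illustration of the modification (our sketch), the adjusted span rounds the exact span up to the next power of 2, leaving only O(log Δ) distinct values while raising the price paid per covered element by at most a factor of 2:

def adjusted_span(span):
    # Smallest power of 2 that is at least span (for span >= 1)
    return 1 << (span - 1).bit_length()

for s in (1, 2, 3, 5, 9, 17):
    a = adjusted_span(s)
    print(s, a, a / s)   # the ratio a/s never exceeds 2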
5 Conclusions
In this paper, we examined distributed deterministic solutions for the dominating set problem, one of the most important problems in graph theory, in the scope of graphs with bounded node degree. Such graphs are useful for modeling networks in many realistic settings, such as various types of wireless and peer-to-peer networks. For these graphs, we showed that no purely local (i.e., independent of the number of nodes) deterministic algorithm that calculates a non-trivial approximation can exist. This lower bound is complemented by two approximation algorithms. The first algorithm finds a log Δ-approximation in O(Δ3 + log∗ n) time, while the second one achieves a 2 log Δ-approximation in O(Δ2 log Δ + log∗ n) time. These results compare favorably to previous deterministic algorithms with running time of O(n). With regard to the lower bound, they leave an additive gap of O(Δ2 log Δ) for further improvement. In the full version of this paper, we show a simple extension of our bounds to weighted bounded degree graphs.
Acknowledgments. We would like to thank Fabian Kuhn and Jukka Suomela for fruitful discussions on the subject, and the anonymous reviewers, whose valuable comments helped to improve the presentation of this paper.
References
1. Åstrand, M., Floréen, P., Polishchuk, V., Rybicki, J., Suomela, J., Uitto, J.: A local 2-approximation algorithm for the vertex cover problem. In: Keidar, I. (ed.) DISC 2009. LNCS, vol. 5805, pp. 191–205. Springer, Heidelberg (2009)
2. Åstrand, M., Suomela, J.: Fast distributed approximation algorithms for vertex cover and set cover in anonymous networks. In: Proc. 22nd ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pp. 294–302 (2010)
3. Chen, Y.P., Liestman, A.L.: Approximating minimum size weakly-connected dominating sets for clustering mobile ad hoc networks. In: Proc. ACM Int. Symp. on Mobile Ad Hoc Networking and Computing (MobiHoc), pp. 165–172 (2002)
4. Chlebik, M., Chlebikova, J.: Approximation hardness of dominating set problems in bounded degree graphs. Inf. Comput. 206(11) (2008)
5. Chvatal, V.: A greedy heuristic for the set-covering problem. Mathematics of Operations Research 4(3), 233–235 (1979)
6. Czyzowicz, J., Dobrev, S., Fevens, T., Gonzalez-Aguilar, H., Kranakis, E., Opatrny, J., Urrutia, J.: Local algorithms for dominating and connected dominating sets of unit disk graphs with location aware nodes. In: Laber, E.S., Bornstein, C., Nogueira, L.T., Faria, L. (eds.) LATIN 2008. LNCS, vol. 4957, pp. 158–169. Springer, Heidelberg (2008)
7. Das, B., Bharghavan, V.: Routing in ad-hoc networks using minimum connected dominating sets. In: Proc. IEEE Int. Conf. on Communications (ICC), pp. 376–380 (1997)
8. Dong, Q., Bejerano, Y.: Building robust nomadic wireless mesh networks using directional antennas. In: Proc. IEEE INFOCOM, pp. 1624–1632 (2008)
9. Feige, U.: A threshold of ln n for approximating set cover. Journal of the ACM 45, 314–318 (1998)
10. Ferro, E., Potorti, F.: Bluetooth and Wi-Fi wireless protocols: a survey and a comparison. IEEE Wireless Communications 12(1), 12–26 (2005)
11. Friedman, R., Kogan, A.: Efficient power utilization in multi-radio wireless ad hoc networks. In: Abdelzaher, T., Raynal, M., Santoro, N. (eds.) OPODIS 2009. LNCS, vol. 5923, pp. 159–173. Springer, Heidelberg (2009)
12. Garey, M.R., Johnson, D.S.: Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman & Co. Ltd., New York (1979)
13. Guha, S., Khuller, S.: Approximation algorithms for connected dominating sets. Algorithmica 20, 374–387 (1998)
14. Han, B., Jia, W.: Clustering wireless ad hoc networks with weakly connected dominating set. Journal of Parallel and Distributed Computing 67(6), 727–737 (2007)
15. Herman, T., Tixeuil, S.: A distributed TDMA slot assignment algorithm for wireless sensor networks. In: Nikoletseas, S.E., Rolim, J.D.P. (eds.) ALGOSENSORS 2004. LNCS, vol. 3121, pp. 45–58. Springer, Heidelberg (2004)
16. Jia, L., Rajaraman, R., Suel, T.: An efficient distributed algorithm for constructing small dominating sets. In: Proc. ACM Symp. on Principles of Distributed Computing (PODC), pp. 33–42 (2001)
17. Kaashoek, M.F., Karger, D.R.: Koorde: A simple degree-optimal distributed hash table. In: Kaashoek, M.F., Stoica, I. (eds.) IPTPS 2003. LNCS, vol. 2735, pp. 98–107. Springer, Heidelberg (2003)
18. Kuhn, F.: Weak graph colorings: distributed algorithms and applications. In: Proc. Symp. on Parallelism in Algorithms and Architectures (SPAA), pp. 138–144 (2009)
19. Kuhn, F., Moscibroda, T., Wattenhofer, R.: What cannot be computed locally! In: Proc. ACM Symp. on Principles of Distributed Computing (PODC), pp. 300–309 (2004)
20. Kuhn, F., Moscibroda, T., Wattenhofer, R.: On the locality of bounded growth. In: Proc. ACM Symp. on Principles of Distributed Computing (PODC), pp. 60–68 (2005)
21. Kuhn, F., Moscibroda, T., Wattenhofer, R.: The price of being near-sighted. In: Proc. ACM-SIAM Symp. on Discrete Algorithms (SODA), pp. 980–989 (2006)
22. Lenzen, C., Wattenhofer, R.: Leveraging Linial's locality limit. In: Taubenfeld, G. (ed.) DISC 2008. LNCS, vol. 5218, pp. 394–407. Springer, Heidelberg (2008)
23. Liang, B., Haas, Z.J.: Virtual backbone generation and maintenance in ad hoc network mobility management. In: Proc. IEEE INFOCOM, pp. 1293–1302 (2000)
24. Linial, N.: Locality in distributed graph algorithms. SIAM Journal on Computing 21(1), 193–201 (1992)
25. Malkhi, D., Naor, M., Ratajczak, D.: Viceroy: a scalable and dynamic emulation of the butterfly. In: Proc. ACM Symp. on Principles of Distributed Computing (PODC), pp. 183–192 (2002)
26. Panconesi, A., Rizzi, R.: Some simple distributed algorithms for sparse networks. Distributed Computing 14(2), 97–100 (2001)
27. Peleg, D.: Distributed Computing: A Locality-Sensitive Approach. SIAM, Philadelphia (2000)
28. Rhee, I., Warrier, A., Min, J., Xu, L.: DRAND: distributed randomized TDMA scheduling for wireless ad-hoc networks. In: Proc. 7th ACM Int. Symp. on Mobile Ad Hoc Networking and Computing (MobiHoc), pp. 190–201 (2006)
29. Sivakumar, R., Das, B., Bharghavan, V.: Spine routing in ad hoc networks. Cluster Computing 1(2), 237–248 (1998)
30. Wu, J., Dai, F., Gao, M., Stojmenovic, I.: On calculating power-aware connected dominating sets for efficient routing in ad hoc wireless networks. Journal of Communications and Networks, 59–70 (2002)
PathFinder: Efficient Lookups and Efficient Search in Peer-to-Peer Networks

Dirk Bradler¹, Lachezar Krumov¹, Max Mühlhäuser¹, and Jussi Kangasharju²

¹ TU Darmstadt, Germany
{bradler,krumov,max}@cs.tu-darmstadt.de
² University of Helsinki, Finland
[email protected]

Abstract. Peer-to-peer networks are divided into two main classes: unstructured and structured. Overlays from the first class are better suited for exhaustive search, whereas those from the second class offer very efficient key-value lookups. In this paper we present a novel overlay, PathFinder, which combines the advantages of both classes within a single overlay for the first time. Our evaluation shows that PathFinder is comparable or even better in terms of lookup and complex-query performance than existing peer-to-peer overlays, and that it scales to millions of nodes.
1 Introduction
Peer-to-peer overlay networks can be classified into unstructured and structured networks, depending on how they construct the overlay. In an unstructured network the peers are free to choose their overlay neighbors and what they offer to the network.¹ In order to discover whether a certain piece of information is available, a peer must somehow search through the overlay. There are several implementations of such search algorithms: the original Napster used a central index server, Kazaa relied on a hybrid network with supernodes, and the original Gnutella used decentralized flooding of queries [4]. The BubbleStorm network [5] is a fully decentralized network based on random graphs and is able to provide efficient exhaustive search.

Structured networks, on the other hand, have strict rules about how the overlay is formed and where content should be placed within the network. Structured networks are also often called distributed hash tables (DHTs), and the research world has seen several examples of DHTs [3,7]. DHTs are very efficient for simple key-value lookups. Because objects are addressed by their unique names, however, searching in a DHT is hard to make more efficient [6], and wildcard searches and complex queries either impose extensive complexity and cost in terms of additional messages or are not supported at all.

Given the attractive properties of these two different network structures, it is natural to ask the question: is it possible to combine these two properties in
¹ In this paper we focus on networks where peers store and share content, e.g., files, database items, etc.
one single network? Our answer to this question is PathFinder, a peer-to-peer overlay which combines an unstructured and a structured network in a single overlay. PathFinder is based on a random graph, which gives it a short average path length and a large number of alternative paths, yielding a fault-tolerant, highly robust and reliable overlay topology. Our main contribution is the efficient combination of exhaustive search and key-value lookups in a single overlay.

The rest of this paper is organized as follows. In Section 2 we present an overview of PathFinder. Section 3 compares it to existing P2P overlays, and we conclude in Section 4. Due to space limitations, the reader is referred to [1] for technical aspects such as node join and leave, handling crashed nodes, and network size adaptation. In [1] an extensive evaluation of PathFinder under churn and attacks is also presented.
2 PathFinder Design

In this section we present the system model and preliminaries of PathFinder. We also describe how the basic key-value lookup and exhaustive search work. For further basic operations, like node join/leave and handling crashed nodes, see [1].

2.1 Challenges
We designed PathFinder to be fully compliant with the concept of BubbleStorm [5], namely an overlay structure based on random graphs. We augment the basic random graph with a deterministic lookup mechanism (see Section 2.4) to add efficient lookups to the exhaustive search provided by BubbleStorm. The challenge, and one of the key contributions of this paper, is developing a deterministic mechanism for exploiting these short paths to implement DHT-like lookups.

2.2 System Model and Preliminaries
All processes in PathFinder benefit from the properties of its underlying random graph and the routing scheme built on top of it.

PathFinder construction principle. The basic idea of PathFinder is to build a robust network of virtual nodes on top of the physical peers. Routing among peers is carried out in the virtual network; the actual data transfer still takes place directly between the physical peers. PathFinder builds a random graph of virtual nodes and then distributes them among the actual peers. At least one virtual node is assigned to each peer. From the routing point of view, the data in the network is stored on the virtual nodes. When a peer B is looking for a particular piece of information, it has to find a path from one of its virtual nodes to the virtual node containing the requested data. Then B directly contacts the underlying peer A which is responsible for the targeted virtual node and retrieves the requested data directly from A. This process is described in detail in Section 2.4.
It is known that the degree sequence of a random graph is Poisson distributed. We need two pseudorandom number generators (PRNGs) which, initialized with the same seed, always produce the same deterministic sequence of numbers. Given a number c, the first generator returns Poisson-distributed numbers with mean value c. The second PRNG, given a node ID, produces a deterministic sequence of numbers which we use as the IDs of the neighbors of the given node.

The construction principle of PathFinder is as follows. First we fix a number c (see [1] on how to choose c according to the number of peers and how to adapt it once the network becomes too small/large). Then, for each virtual node, we determine the number of neighbors with the first generator, and choose the actual node IDs to which the current virtual node should be connected with the second generator, seeded with the ID of the virtual node. The process can be summarized in the following steps:

1. The underlying peer determines how many virtual nodes it should handle. See [1] for details.
2. For every virtual node handled by the peer:
   (a) The peer uses the Poisson number generator to determine the number of neighbors of the current virtual node.
   (b) The peer then draws as many pseudorandom numbers as determined in the previous step.
   (c) The peer selects the virtual nodes with IDs matching those numbers as neighbors for its current virtual node.

This construction mechanism allows the peers to build a random graph out of their virtual nodes. It is of crucial importance that a peer only needs a PRNG to perform this operation; there is no need for network communication. Similarly, any peer can determine the neighbors of any virtual node by simply seeding the pseudorandom generator with the corresponding ID (a code sketch of this computation is given after Figure 1). Now we have both a random graph topology suited for exhaustive search and a mechanism for each node to compute the neighbor list of any other node, i.e., DHT-like behavior within PathFinder.

Routing table example of PathFinder. Figure 1 shows a small sample of PathFinder with the routing table of the peer with ID 11. The random graph has 5 virtual nodes (1 through 5) and there are 4 peers (with IDs 11 through 14). Peer 11 handles two virtual nodes (4 and 5); all other peers have one virtual node each. The arrows between the virtual nodes show the directed neighbor links. Each peer keeps track of its own outgoing links as well as incoming links from other virtual nodes. A peer learns of the incoming links when the other peers attempt to connect to it. Keeping track of the incoming links is, strictly speaking, not necessary, but it makes key lookups much more efficient (see Section 2.4). The routing table of the peer marked as 11 therefore consists of all outgoing links from its virtual nodes 4 and 5 and the incoming link from virtual node 3.
Fig. 1. A small example of PathFinder
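To make the neighbor computation concrete, here is a minimal Python sketch (ours, not the authors' code; the seed strings, the network size, and Knuth's Poisson sampler are our illustrative choices; see [1] for the real parameters):

import math
import random

NUM_VNODES = 1 << 20      # illustrative number of virtual nodes
C = 20                    # mean degree of the Poisson distribution

def poisson(rng, lam):
    # Knuth's Poisson sampler; fully deterministic for a seeded rng
    L, k, p = math.exp(-lam), 0, 1.0
    while p > L:
        k += 1
        p *= rng.random()
    return k - 1

def neighbors_of(vnode_id, n_vnodes=NUM_VNODES, c=C):
    # Both generators are seeded by the virtual node ID alone, so any
    # peer can recompute this list locally, without any network traffic.
    degree = poisson(random.Random("degree-%d" % vnode_id), c)
    nbr_rng = random.Random("neighbors-%d" % vnode_id)
    return [nbr_rng.randrange(n_vnodes) for _ in range(degree)]

# Two independent computations agree, which is what enables DHT-like routing:
assert neighbors_of(42) == neighbors_of(42)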
Fig. 2. Key lookup with local expanding ring search from source and target

Fig. 3. Distribution of complete path length, 5000 key lookups with c = 20

2.3 Storing Objects
An object is stored on the virtual node (i.e., on the peer responsible for that virtual node) whose identifier matches the object's identifier. If the hash space is larger than the number of virtual nodes, we map the object to the virtual node whose identifier matches the prefix of the object's hash (a minimal sketch follows).
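A sketch of the prefix mapping (ours; it assumes SHA-1 object identifiers and a power-of-two number of virtual nodes):

import hashlib

def responsible_vnode(key, n_vnodes):
    # Map an object to the virtual node whose ID equals the top bits
    # (the prefix) of the object's 160-bit SHA-1 hash.
    assert n_vnodes & (n_vnodes - 1) == 0, "assumes a power-of-two vnode count"
    digest = int.from_bytes(hashlib.sha1(key.encode()).digest(), "big")
    bits = n_vnodes.bit_length() - 1
    return digest >> (160 - bits)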
2.4 Key Lookup
Key lookup is the process by which a peer contacts the peer possessing a given piece of data. Using the structure of the network, the requesting peer traverses only a single, usually short, path from itself to the target peer. Key lookup is the main function of a DHT; in order to perform quick lookups, the average number of hops between peers, as well as its variance, needs to be kept small. We now show how PathFinder achieves efficient lookups and thus behaves like any other DHT.

Suppose that peer A wants to retrieve an object O. Peer A determines that the virtual node w is responsible for object O by using the hash function described above. Now A has to route in the virtual network from one of its virtual nodes to w and retrieve O directly from the peer responsible for w. Denote by V the set of virtual nodes managed by peer A. For each virtual node in V, A calculates the neighbors of those nodes. (Note that this calculation is already done, since these neighbors are the entries in peer A's routing table.) A checks whether any of those neighbors is the virtual node w. If yes, A contacts the underlying peer to retrieve O. If none of peer A's virtual node neighbors is responsible for O, A calculates the neighbors of all of its neighbors, i.e., its second neighbors. Because the neighbors of each virtual node are pre-known (see Section 2.2), this is a simple local computation. Again, peer A checks whether any of the newly calculated neighbors is responsible for O; if yes, peer A sends its request to the virtual node whose neighbor is responsible for O. If still no match is found, peer A expands its search by calculating the neighbors of the nodes from the previous step and checks again. The process continues until a match is found. A may have to calculate many neighbor lists, but a match is guaranteed. A sketch of this expanding-ring computation follows.
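The following Python sketch (ours) mirrors the expanding-ring computation, reusing the hypothetical neighbors_of from Section 2.2; incoming links and the target-side expansion described next are omitted for brevity:

def path_to(own_vnodes, w, max_depth=8):
    # Expand rings of virtual node IDs around our own virtual nodes by
    # purely local computation until the responsible vnode w appears.
    frontier = {v: [v] for v in own_vnodes}   # vnode -> path to it
    for _ in range(max_depth + 1):
        if w in frontier:
            return frontier[w]                # send the request along this path
        nxt = {}
        for v, path in frontier.items():
            for u in neighbors_of(v):
                nxt.setdefault(u, path + [u])
        frontier = nxt                        # the next, larger ring
    return None                               # give up beyond max_depth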
Because peer A is able to compute w's neighboring virtual nodes, A can expand the search rings locally from both the source and the target side, which is called forward and backward chaining. In every step, the search depth of the source and target search rings is increased by one. In this way, the rings around the source are divided between the source itself and the target, which leads to an exponential decrease in the number of IDs that have to be computed. We generated various PathFinder networks from 10^3 up to 10^8 nodes with average degree 20 and performed 5000 arbitrary key lookups in each of them. It turned out that expanding rings of depth 3 or 4 (i.e., a path length between 6 and 8) is sufficient for a successful key lookup, as shown in Figure 3. A rough sanity check of these numbers follows.
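Our back-of-envelope arithmetic, not the paper's analysis: expanding to depth d reaches about c^d IDs per side, so the two rings are expected to meet once each side has depth about log_c(N)/2:

import math

def depth_per_side(n, c=20):
    # Depth needed from each side so that the source and target rings meet
    return math.ceil(math.log(n, c) / 2)

for n in (10**4, 10**6, 10**8):
    print(n, depth_per_side(n))   # 2, 3, 4 per side, cf. Figure 3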
2.5 Searching with Complex Queries
PathFinder supports searching with complex queries with a tunable success rate almost identical to BubbleStorm [5]. In fact, since both PathFinder and BubbleStorm are based on random graphs, we implemented the search mechanism of BubbleStorm directly in PathFinder. In BubbleStorm, both data and queries are sent to some number of nodes, where the exact number of messages depends on the desired probability of finding a match. We use exactly the same algorithm in PathFinder for searching, and the reader is referred to [5] for details.
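As a back-of-envelope illustration of this rendezvous-style search (ours, not the analysis of [5]): if data is replicated to d random nodes and a query visits q random nodes, a birthday-paradox estimate of the match probability is 1 − exp(−qd/N):

import math

def match_probability(q, d, n):
    # Birthday approximation: probability that a query bubble of size q
    # intersects a data bubble of size d in a network of n nodes
    return 1 - math.exp(-q * d / n)

n = 10**6
q = d = 2000                                  # q*d = 4n
print(round(match_probability(q, d, n), 3))   # ~0.982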
3 Comparison and Analysis
Most DHT overlays provide the same functionality, since they all support the common interface for key-based routing. The main differences between various DHT implementations are the average lookup path length, resilience to failures, and load balancing. In this section we compare PathFinder to established DHTs.

The lookup path length of Chord is well studied: L_avg = log(N)/2, and the maximum path length of Chord is log(N)/log(1+d). The average path length of PathFinder is log(N)/log(c), where c is the average number of neighbors. The path length of the Pastry model can be estimated by log_{2^b}(N) [3], where b is a tunable parameter. The Symphony overlay is based on a small-world graph; this leads to key lookups in O(log^2(N)/k) hops [2], where k counts only the long-distance links; the actual number of neighbors is much higher [2]. The diameter of CAN is (1/2)·d·N^{1/d}, with a degree of 2d for each node, for a fixed d. For large d the distribution of path lengths becomes Gaussian, as in Chord. The snippet below tabulates these estimates.

We use simulations to evaluate the practical effects of the individual factors. Figure 4 shows the results for a 20,000-node network: we perform 5,000 lookups among random pairs of nodes and measure the number of hops each DHT takes to find the object. Figure 5 displays the analytically computed averages; note that the PathFinder numbers come from actual simulation, not analytical calculation.

PathFinder also inherits the exhaustive search mechanism of BubbleStorm. Hence, as an unstructured overlay it performs identically to BubbleStorm, and the reader is referred to [5] for a thorough comparison to other unstructured systems.
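The following snippet (ours) tabulates the analytical estimates quoted above; the Pastry parameter b = 4 and the PathFinder degree c = 20 are illustrative choices:

import math

for n in (10**3, 10**4, 10**5, 10**6, 10**7, 10**8):
    chord = math.log2(n) / 2          # average Chord path length
    pastry = math.log(n, 2**4)        # log base 2^b with b = 4
    pathfinder = math.log(n, 20)      # log base c with c = 20
    print("%9d  %5.1f  %5.1f  %5.1f" % (n, chord, pastry, pathfinder))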
Fig. 4. Average number of hops for 5,000 key lookups in different DHTs

Fig. 5. Average number of hops for different DHTs measured analytically. Numbers for PathFinder are simulated.
4 Conclusions
In this paper we have presented PathFinder, an overlay which combines efficient exhaustive search and efficient key-value lookups in the same overlay. Combining these two mechanisms in the same overlay is very desirable, since it allows an efficient and overhead-free implementation of natural usage patterns. PathFinder is the first overlay to combine exhaustive search and key-value lookups in an efficient manner. Our results show that PathFinder has performance comparable or superior to existing overlays. It scales easily to millions of nodes, and in large networks its key lookup performance is better than that of existing DHTs. Because PathFinder is based on a random graph, we directly benefit from existing search mechanisms (BubbleStorm) for efficient exhaustive search.
References
1. Bradler, D., Krumov, L., Kangasharju, J., Weihe, K., Mühlhäuser, M.: PathFinder: Efficient lookups and efficient search in peer-to-peer networks. Tech. Rep. TUD-CS-2010872, TU Darmstadt (October 2010)
2. Manku, G., Bawa, M., Raghavan, P.: Symphony: Distributed hashing in a small world. In: Proc. 4th USENIX Symposium on Internet Technologies and Systems (2003)
3. Rowstron, A., Druschel, P.: Pastry: Scalable, decentralized object location, and routing for large-scale peer-to-peer systems. In: Liu, H. (ed.) Middleware 2001. LNCS, vol. 2218, p. 329. Springer, Heidelberg (2001)
4. Steinmetz, R., Wehrle, K. (eds.): Peer-to-Peer Systems and Applications. LNCS, vol. 3485. Springer, Heidelberg (2005)
5. Terpstra, W., Kangasharju, J., Leng, C., Buchmann, A.: BubbleStorm: Resilient, probabilistic, and exhaustive peer-to-peer search. In: Proc. SIGCOMM, pp. 49–60 (2007)
6. Yang, Y., Dunlap, R., Rexroad, M., Cooper, B.: Performance of full text search in structured and unstructured peer-to-peer systems. In: Proc. IEEE INFOCOM (2006)
7. Zhao, B., Kubiatowicz, J., Joseph, A.: Tapestry: An infrastructure for fault-tolerant wide-area location and routing. Tech. Rep. UCB/CSD-01-1141, U.C. Berkeley (2001)
Single-Version STMs Can Be Multi-version Permissive (Extended Abstract)

Hagit Attiya¹,² and Eshcar Hillel¹

¹ Department of Computer Science, Technion
² École Polytechnique Fédérale de Lausanne (EPFL)
Abstract. We present PermiSTM, a single-version STM that satisfies a practical notion of permissiveness, usually associated with keeping many versions: it never aborts read-only transactions, and it aborts other transactions only due to a conflicting transaction (one that writes to a common item), thereby avoiding spurious aborts. It also avoids unnecessary contention on the memory, being strictly disjoint-access parallel.
1 Introduction

Transactional memory is a leading paradigm for programming concurrent applications for multicores. It is seriously considered as part of software solutions (abbreviated STMs) and as a basis for novel hardware designs, which exploit the parallelism offered by contemporary multicores and multiprocessors. A transaction encapsulates a sequence of operations on a set of data items: it is guaranteed that if a transaction commits, then all its operations appear to be executed atomically. A transaction may abort, in which case none of its operations are executed. The data items written by the transaction are its write set, the data items read by the transaction are its read set, and together they form the transaction's data set.

When an executing transaction may violate consistency, the STM can forcibly abort it. Many existing STMs, however, sometimes spuriously abort a transaction, even when the transaction could in fact commit without compromising data consistency [9]. Frequent spurious aborts can waste system resources and significantly impair performance; in particular, they reduce the chances of long transactions, which often only read the data, to complete.

Avoiding spurious aborts has been an important goal for STM design, and several conditions have been proposed to evaluate how well it is achieved [8, 9, 12, 16, 20]. A permissive STM [9] never aborts a transaction unless necessary to ensure consistency. A stronger condition, called strong progressiveness [12], further ensures that even when there are conflicts, at least one of the transactions involved in the conflict is not aborted. Alternatively, multi-version (MV) permissiveness [20] focuses on read-only transactions (whose write set is empty) and ensures they never abort; update transactions, with non-empty write sets, may abort when in conflict with other transactions writing to the same items. As its name suggests, MV-permissiveness was meant to be provided by a multi-version STM, maintaining multiple versions of each data item.
This research is supported in part by the Israel Science Foundation (grant number 953/06).
It has been suggested [20] that refraining from aborting read-only transactions mandates the overhead associated with maintaining multiple versions: additional storage, a complex implementation of a precedence graph (to track versions), as well as an intricate garbage collection mechanism to remove old versions. Indeed, MV-permissiveness is satisfied by current multi-version STMs, both practical [20, 21] and more theoretical [16, 19], which keep many versions per item. It can also be achieved by other multi-version STMs [22, 3], if enough versions of the items are maintained.

This paper shows that it is possible to achieve MV-permissiveness while keeping only a single version of each data item. We present PermiSTM, a single-version STM that is both MV-permissive and strongly progressive, indicating that multiple versions are not the only design choice when seeking to reduce spurious aborts. By maintaining a single version, PermiSTM avoids the high space complexity associated with multi-version STMs, which is often unacceptable in practice. This also eliminates the need for intricate mechanisms for maintaining and garbage collecting old versions.

PermiSTM is lock-based, like many contemporary STMs, e.g., [6, 5, 7, 23]. For each data item, it maintains a single version, as well as a lock and a read counter, counting the number of pending transactions that have read the item. Read-only transactions never abort (without having to be declared as such in advance); update transactions abort only if some data item in their read set is written by another transaction, i.e., at least one of the conflicting transactions commits. Although it is blocking, PermiSTM is deadlock-free, i.e., some transaction can always make progress. The design choices of PermiSTM offer several benefits, most notably:

– The simple lock-based design makes it easier to argue about correctness.
– Read counters avoid the overhead of incremental validation, thereby improving performance, as demonstrated in [6, 17], especially in read-dominated workloads. Read-only transactions do not require validation at all, while update transactions validate their read sets only once.
– Read counters circumvent the need for a central mechanism, like a global version clock. Thus, PermiSTM is strictly disjoint-access parallel [10], namely, processes executing transactions with disjoint data sets do not access the same base objects.

It has been proved [20, Theorem 2] that a weakly disjoint-access parallel STM [2, 14] cannot be MV-permissive. PermiSTM, which satisfies the even stronger property of strict disjoint-access parallelism, shows that this impossibility result depends on a strong progress condition: a transaction delays only due to a pending operation (by another transaction). In PermiSTM, a transaction may delay due to another transaction reading from its write set, even if no operation of the reading transaction is pending.
2 Preliminaries

We briefly describe the transactional memory model [15]. A transaction is a sequence of operations executed by a single process. Each operation either accesses a data item or tries to commit or abort the transaction. Specifically, a read operation specifies the item to read and returns the value read; a write operation specifies the item and value to be written; a try-commit operation returns an indication whether the transaction committed or aborted; an abort operation returns an indication that the transaction is aborted. While trying to commit, a transaction might be aborted, e.g., due to a conflict with another transaction.¹ A transaction is forcibly aborted if an invocation of a try-commit returns an indication that the transaction is aborted. Every transaction begins with a sequence of read and write operations. The last operation of a transaction is either an access operation, in which case the transaction is pending, or a try-commit or an abort operation, in which case the transaction is committed or aborted.

A software implementation of transactional memory (STM) provides a data representation for transactions and data items using base objects, and algorithms, specified as primitives on the base objects, which asynchronous processes follow in order to execute the operations of the transactions. An event is a computation step by a process consisting of local computation and the application of a primitive to base objects, followed by a change to the process's state, according to the results of the primitive. We employ the following primitives: READ(o) returns the value in base object o; WRITE(o, v) sets the value of base object o to v; CAS(o, exp, new) writes the value new to base object o if its value is equal to exp, and returns a success or failure indication; kCSS is similar to CAS, but compares the values of k independent base objects (see Figure 1).

boolean CAS(obj, exp, new) {          // atomically
    if obj = exp then
        obj ← new
        return TRUE
    return FALSE
}

boolean kCSS(o[k], e[k], new) {       // atomically
    if o[1] = e[1] and . . . and o[k] = e[k] then
        o[1] ← new
        return TRUE
    return FALSE
}

Fig. 1. The CAS and k-compare-single-swap primitives

2.1 STM Properties

We require the STM to be opaque [11]. Very roughly stated, opacity is similar to requiring strict view serializability applied to all transactions (including aborted ones). Restrictions on spurious aborts are stated by the following two conditions.

Definition 1. A multi-version (MV-)permissive STM [20] forcibly aborts a transaction only if it is an update transaction that has a conflict with another update transaction.

Definition 2. An STM is strongly progressive [12] if a transaction that has no conflicts cannot be forcibly aborted, and if a set of transactions have conflicts on a single item then not all of them are forcibly aborted.

These two properties are incomparable: strong progressiveness allows a read-only transaction to abort if it has a conflict with another update transaction; on the other hand, MV-permissiveness does not guarantee that at least one transaction is not forcibly aborted in case of a conflict.
¹ Two transactions conflict if they access the same data item; the conflict is nontrivial if at least one of the operations is a write. In the rest of the paper all conflicts are nontrivial conflicts.
Finally, an STM is strictly disjoint-access parallel [10] if two processes, executing transactions T1 and T2, access the same base object, at least one with a non-trivial primitive, only if the data sets of T1 and T2 intersect.
3 The Design of PermiSTM

The design of PermiSTM is very simple. The first and foremost goal is to ensure that a read-only transaction never aborts, while maintaining only a single version. This suggests that the data returned by a read operation issued by a read-only transaction T should not be overwritten until T completes. A natural way to achieve this goal is to associate a read counter with each item, tracking the number of pending transactions reading the item. Transactions that write to data items respect the read counters: an update transaction commits and updates the items in its write set only in a "quiescent" configuration, where no (other) pending transaction is reading an item in its write set. This yields read-only transactions that guarantee consistency without requiring validation, and without having to be specified as read-only in advance.

The second goal is to guarantee consistent updates of data items, by using ordinary locks to ensure that only one transaction is modifying a data item at each point. Thus, before writing its changes at commit time, an update transaction acquires locks.

Having two different mechanisms in our design, locks and counters, requires care in combining them. One question is when, during the execution, a transaction should decrement the read counters of the items in its read set. The following simple example shows how a deadlock may happen if an update transaction does not decrement its counters before acquiring locks:

T1: read(a)   write(b)   try-commit
T2: read(b)   write(a)   try-commit

T1 and T2 incremented the read counters of a and b, respectively; later, at commit time, T1 acquires a lock on b, while T2 acquires a lock on a. To commit, T1 has to wait for T2 to complete and decrement the read counter of b, while T2 has to wait for the same to happen with T1 and item a. Since an update transaction first decrements read counters, it must ensure consistency by acquiring locks also for the items in its read set. Therefore, an update transaction acquires locks for all items in its data set. Finally, read counters are incremented as the items are encountered during the execution of the transaction.

What happens if read-only transactions wait for locks to be released? The next example shows how this can create a deadlock:

T1: read(a)                read(b)
T2: write(b)   write(a)    try-commit

If T2 acquires a lock on b, then T1 cannot read b until T2 completes; T2 cannot commit, as it has to wait for T1 to complete and decrease the read counter of a; and MV-permissiveness does not allow both transactions to be forcibly aborted. Thus, read counters get preference over locks, and they can always be incremented. Prior to committing, an update transaction first decrements its read counters, and then acquires locks on all items in its
data set, in a fixed order (while validating the consistency of its read set); this avoids deadlocks due to blocking cycles, and livelocks due to repeated aborts.

Since committing a transaction and committing its updates are not done atomically, a committed transaction that has not yet completed updating all the items in its write set can yield an inconsistent view for a transaction reading one of these items. If a read operation simply read the value stored in the item, it might miss the up-to-date value of the item. Therefore, a read operation is required to read the current value of the item, which can be found either in the item or in the data of the writing transaction.²

To simplify the exposition of PermiSTM, k-compare-single-swap (kCSS) [18] is applied to commit an update transaction while ensuring that the read counters of the items in its write set are all zero. Section 4 describes how the implementation can be modified to use only CAS; the resulting implementation is (strongly) disjoint-access parallel, but not strictly disjoint-access parallel.

Data Structures. Figure 2 presents the data structures of items and transaction descriptors used in our algorithm (a Python sketch of these structures is given below, after the method descriptions).

Fig. 2. Data structures used in the algorithm: an item (left) and a transaction descriptor (right)

We associate a lock and a read counter with each item, as follows:

– A lock includes an owner field and an unbounded sequence number, seq, which are accessed atomically. The owner field is set to the id of the update transaction owning the lock, and is 0 if no transaction holds the lock. The seq field holds the sequence number of the data; it is incremented whenever new data is committed to the item, and it is used to assert the consistency of reads.
– A simple read counter, rcounter, tracks how many transactions are reading the item.
– The data field holds the value that was last written to the item, or its initial value if no transaction has yet written to the item.

The descriptor of a transaction consists of the read set, rs, the write set, ws, and the status of the transaction. The read and write sets are collections of data items:

– A data item in the read set includes a reference to an item, the data read from the item, and the sequence number of this data, seq.
² This is analogous to the notion of current version of a transactional object in DSTM [13].
– A data item in the write set includes a reference to an item, the data to be written to the item, and the sequence number of the new data, seq, i.e., the sequence number of the current data plus 1.
– A status indicates whether the transaction is COMMITTED or ABORTED; initially it is NULL.

The current data and sequence number of an item are defined as follows: if the lock of the item is owned by a committed transaction that writes to this item, then the current data and sequence number of the item appear in the write set of the owner transaction. Otherwise (the owner is 0, the owner is not committed, or the item is not in the owner's write set), the current data and current sequence number appear in the item.

The Algorithm. Next we give a detailed description of the main methods for handling the operations; the code appears in Pseudocodes 1 and 2. The reserved word self in the pseudocode is a self-reference to the descriptor of the transaction whose code is being executed.

read method: If the item is already in the transaction's read set (line 2), return the value from the read set (line 3). Otherwise, increment the read counter of the item (line 5). Then the reading transaction adds the item to its read set (line 7) with the current data and sequence number of the item (line 6).

write method: If the item is not already in the transaction's write set (line 11), add the item to the write set (line 12). Set the data of the item in the transaction's write set to the new data to be written (line 13). No lock is acquired at this stage.

tryCommit method: Decrement all the read counters of the items in the transaction's read set (line 16). If the transaction is read-only, i.e., its write set is empty (line 17), then commit (line 18); the transaction completes and returns (line 19). Otherwise, this is an update transaction and it continues: acquire locks on all items in the data set (line 20); commit the transaction (line 22) and the changes to the items (lines 23-25); release the locks on all items in the data set (line 26). The transaction may abort while acquiring locks, due to a conflict with another update transaction (line 21).

acquireLocks method: Acquire locks on all items in the data set of the transaction, in their fixed order (line 30). If the item is in the read set (line 33), check that the sequence number in the read set (line 34) is the same as the current sequence number of the item (line 32). If the sequence number has changed (line 35), then the data read has been overwritten by another committed transaction, and the transaction aborts (line 36). Use CAS to acquire the lock: set owner from 0 to the descriptor of the transaction; if the item is in the read set, this is done while asserting that seq is unchanged (line 38). If the CAS fails, then owner is non-zero because there is another owner (or seq has changed), so spin, re-reading the lock (line 38), until owner is 0. If the item is in the write set (line 39), set the sequence number of the item in the transaction's write set, seq, to the sequence number of the current data plus 1 (line 41).

commitTx method: Use kCSS to set status to COMMITTED, while ensuring that all read counters of items in the transaction's write set are 0 (line 47). If the read counter of one of these items is not 0, a pending transaction is reading from that item, so spin until all rcounters are 0.
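For concreteness, the promised sketch: a hypothetical Python rendering (ours, not the authors') of the structures of Figure 2 and of the current-value rule that getAbsVal implements. Atomicity of the lock fields and the CAS/kCSS primitives is assumed to be provided by the runtime and is not modeled.

from dataclasses import dataclass, field

@dataclass
class Lock:
    owner: object = 0        # descriptor of the owning update transaction, or 0
    seq: int = 0             # sequence number of the committed data

@dataclass(eq=False)         # identity-hashed, so items can key the rs/ws maps
class Item:
    lock: Lock = field(default_factory=Lock)
    rcounter: int = 0        # pending transactions currently reading the item
    data: object = None      # the single (current) version of the value

@dataclass
class Descriptor:
    rs: dict = field(default_factory=dict)   # item -> (seq, data) as read
    ws: dict = field(default_factory=dict)   # item -> (seq, data) to write
    status: object = None    # NULL (None) / "COMMITTED" / "ABORTED"

def current_value(item):
    # The current-value rule: if a committed owner writes this item, the
    # up-to-date (seq, data) pair is in the owner's write set; otherwise
    # it is in the item itself.
    owner = item.lock.owner
    if owner != 0 and owner.status == "COMMITTED" and item in owner.ws:
        return owner.ws[item]
    return (item.lock.seq, item.data)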
Pseudocode 1. Methods for read, write and try-commit operations

 1: Data read(Item item) {
 2:     if item in rs then
 3:         di ← rs.get(item)
 4:     else
 5:         incrementReadCounter(item)
 6:         di ← getAbsVal(item)
 7:         rs.add(item, di)
 8:     return di.data
 9: }

10: write(Item item, Data data) {
11:     if item not in ws then
12:         ws.add(item, ⟨item, 0, 0⟩)
13:     ws.set(item, ⟨item, 0, data⟩)
14: }

15: tryCommit() {
16:     decrementReadCounters()                  // decrement read counters
17:     if ws is empty then                      // read-only transaction
18:         WRITE(status, COMMITTED)
19:         return
20:     acquireLocks()                           // update transaction: lock all the data set
21:     if ABORTED = READ(status) then return
22:     commitTx()                               // commit update transaction
23:     for each item in ws do                   // commit the changes to the items
24:         di ← owner.ws.get(item)
25:         WRITE(item.data, di.data)
26:     releaseLocks()                           // release locks on all the data set
27: }

28: acquireLocks() {
29:     ds ← ws.add(rs)                          // items in the data set (read and write sets)
30:     for each item in ds by their order do
31:         do
32:             cur ← getAbsVal(item)            // current value
33:             if item in rs then               // check validity of read set
34:                 di ← rs.get(item)            // value read by the transaction
35:                 if di.seq != cur.seq then    // the data is overwritten
36:                     abort()
37:                     return
38:         while !CAS(item.lock, ⟨0, cur.seq⟩, ⟨self, cur.seq⟩)
39:         if item in ws then
40:             di ← ws.get(item)
41:             ws.set(item, ⟨item, cur.seq+1, di.data⟩)
42: }

43: commitTx() {
44:     kCompare[0] ← status                     // the location to be compared and swapped
45:     for i = 1 to k − 1 do                    // k − 1 locations to be compared
46:         kCompare[i] ← ws.get(i).item.rcounter
47:     while !kCSS(kCompare, ⟨NULL, 0 . . . 0⟩, COMMITTED) do
48:         no-op                                // until no reading transaction is pending
49: }
Pseudocode 2. Additional methods for PermiSTM

50: incrementReadCounter(Item item) {
51:     do m ← READ(item.rcounter)
52:     while !CAS(item.rcounter, m, m + 1)
53: }

54: decrementReadCounters() {
55:     for each item in rs do
56:         do m ← READ(item.rcounter)
57:         while !CAS(item.rcounter, m, m − 1)
58: }

59: releaseLocks() {
60:     ds ← ws.add(rs)
61:     for each item in ds do
62:         di ← ds.get(item)
63:         WRITE(item.lock, ⟨0, di.seq⟩)
64: }

65: DataItem getAbsVal(Item item) {
66:     lck ← READ(item.lock)
67:     dt ← READ(item.data)
68:     di ← ⟨item, lck.seq, dt⟩                  // values from the item
69:     if lck.owner != 0 then
70:         sts ← READ(lck.owner.status)
71:         if sts = COMMITTED then
72:             if item in lck.owner.ws then
73:                 di ← lck.owner.ws.get(item)   // values from the write set of the owner
74:     return di
75: }

76: abort() {
77:     ds ← ws.add(rs)
78:     for each item in ds do
79:         lck ← READ(item.lock)
80:         if lck.owner = self then              // the transaction owns the item
81:             WRITE(item.lock, ⟨0, lck.seq⟩)    // release lock
82:     WRITE(status, ABORTED)
83: }
Properties of PermiSTM. Since PermiSTM is lock-based, it is easier to argue that it preserves opacity than in implementations that do not use locks. Specifically, an update transaction holds locks on all items in its data set before committing, which allows update transactions to be serialized at their commit points. The algorithm ensures that an update transaction does not commit (leaving the value of the items in its write set unchanged) as long as there is a pending transaction reading one of the items to be written. A read operation reads the current value of the item, after incrementing its read counter. So, if an update transaction commits before the read counter is incremented, but the changes are not yet committed to the items, the reading transaction still obtains a consistent state, as it reads the value from the write set of the committed transaction, which is the up-to-date value of the item. Hence, a read-only transaction is serialized after the last-committed update transaction that writes to one of the items it reads. Since transactions do not decrement read counters until commit time, and since all read operations return the up-to-date value of the item, all transactions maintain a consistent view. As this holds for committed as well as aborted transactions, PermiSTM is opaque.
Next we discuss the progress properties of the algorithm. After an update transaction acquires locks on all items in its data set, it may wait for other transactions reading items in its write set to complete; it may even starve due to a continual stream of readers. Thus, our STM is blocking. However, the STM guarantees strong progressiveness, as transactions are forcibly aborted only due to another committed transaction with a read-after-write conflict; since read-only transactions are never forcibly aborted, PermiSTM is MV-permissive. Furthermore, read-only transactions are obstruction-free [13]: a read-only transaction may be delayed due to contention with concurrent transactions updating the same read counters, but once it is running solo it is guaranteed to commit.

Write, try-commit and abort operations access only the descriptor of the transaction and the items in the data set of the transaction; this may result in contention only between non-disjoint transactions. A read operation, in addition to accessing the read counter of the item, also reads the descriptor of the owning transaction, which again may result in contention only between non-disjoint transactions; thus, PermiSTM is strictly disjoint-access parallel. Note that disjoint transactions may concurrently read the descriptor of a transaction owning items they read; however, this does not violate strict disjoint-access parallelism. Furthermore, these disjoint transactions read from the same base object only if they all intersect the owning transaction; this property is called 2-local contention [1], and it implies (strong) disjoint-access parallelism [2].
4 CAS-Based PermiSTM
The kCSS operation can be implemented in software from CAS [18] without sacrificing the properties of PermiSTM. However, this implementation is intricate and incurs a step complexity that can be avoided in our case. This section outlines the modifications of PermiSTM needed to obtain an STM with similar properties using CAS instead of a kCSS primitive; the result is more costly read operations.

We still wish to guarantee that an update transaction commits only in a "quiescent" configuration, in which no other transaction is reading an item in its write set. If the committing update transaction does not use kCSS, then the responsibility of "notifying" the update transaction that it cannot commit is shifted to the read operations, and they pay the extra cost of preventing update transactions from committing in a non-quiescent configuration. A transaction commits by changing its status from NULL to COMMITTED; a way to prevent an update transaction from committing is to invalidate its status. For this purpose, we attach a sequence number to the transaction status. Prior to committing, an update transaction reads its status, which now includes the sequence number, and repeats the following for each item in its write set: spin on the item's read counter until it becomes zero, then annotate the zero with a reference to its descriptor and the status sequence number. The transaction changes its status to COMMITTED only if the sequence number of its status has not changed since it read it. Once it completes annotating all zero counters, and unless it is notified by some read operation that one of the counters changed and the configuration is no longer "quiescent", the update transaction can commit, using only a CAS.

A read operation basically increases the read counter and then reads the current value of the item. The only change is when it encounters a "marked" counter. If the
update transaction annotating the item has already committed, the read operation simply increases the counter. Otherwise, the read operation invalidates the status of the update transaction by increasing its status sequence number. If more than one transaction is reading an item from the write set of the update transaction, at least one of them prevents the update transaction from committing by changing its status sequence number.

The changes in the data structures used by the algorithm are as follows. The status of a transaction descriptor now includes the state of the transaction (NULL, COMMITTED, or ABORTED) as well as a sequence number, seq, that is used to invalidate the status; these fields are accessed atomically. The read counter, rcounter, of an item is a tuple consisting of a counter of the number of readers, the owner transaction of the item (holding its lock), and a seq field matching the status sequence number of the owner.

We reuse the core implementation of the operations from Pseudocodes 1 and 2. The most crucial modification is in the protocol for incrementing the read counter, which invalidates the status of the owner transaction when increasing the item's read counter. Pseudocode 3 presents the main modifications. In order to commit, an update transaction reads the read counter of every item in its write set (lines 87-88), and when the read counter is 0, the update transaction annotates the 0 with its descriptor and status sequence number, using CAS (line 89). Finally, it sets the status to COMMITTED while increasing the status sequence number, using CAS. If the status was invalidated and the last CAS fails, the transaction re-reads the status (line 86) and goes over the procedure again. A successful CAS implies that the transaction committed while no other transaction was reading any item in its write set.
Pseudocode 3. Methods for avoiding kCSS

 84: commitTx() {
 85:     do
 86:         sts ← READ(status)
 87:         for each item in ws do
 88:             do rc ← READ(item.rcounter)                                   // spin until no readers
 89:             while !CAS(item.rcounter, ⟨0, rc.owner, rc.seq⟩, ⟨0, self, sts.seq⟩)   // annotated 0
 90:     while !CAS(status, ⟨NULL, sts.seq⟩, ⟨COMMITTED, sts.seq+1⟩)           // commit in a "quiescent" configuration
 91: }

 92: incrementReadCounter(Item item) {
 93:     do
 94:         rc ← READ(item.rcounter)
 95:         if rc.owner != 0 then                                             // the read counter is "marked"
 96:             CAS(rc.owner.status, ⟨NULL, rc.seq⟩, ⟨NULL, rc.seq+1⟩)        // invalidate status
 97:     while !CAS(item.rcounter, rc, ⟨rc.counter+1, rc.owner, rc.seq⟩)       // increase counter
 98: }

 99: decrementReadCounters() {
100:     for each item in rs do
101:         do rc ← READ(item.rcounter)
102:         while !CAS(item.rcounter, rc, ⟨rc.counter−1, 0, 0⟩)               // clean and decrease counter
103: }
To ensure that an update transaction only commits in a “quiescent” configuration, a read operation that finds the read counter of the item “marked” (lines 94-95) continues as follows. It uses CAS to invalidate the status of the owner transaction by increasing its sequence number (line 96); if the status sequence number has changed, either the owner has committed or its status was already invalidated. Finally, the reader transaction simply increases the read counter using CAS (line 97). If increasing the read counter fails, the reader repeats the procedure. While decreasing the read counters, the reader transaction cleans each read counter by setting its owner and seq fields to 0 (line 102). In addition, methods such as tryCommit and abort are adjusted to handle the new structure, for example, accessing the read counter and state indicator through the new rcounter and status fields.

The resulting algorithm is not strictly disjoint-access parallel. Two transactions, T1 and T2, reading items a and b, respectively, may access the same base object when checking and invalidating the status of a third transaction, T3, updating these items. The algorithm, however, has 2-local contention [1] and is (weakly) disjoint-access parallel, as this memory contention is always due to T3, which intersects both T1 and T2.
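To make the word-level mechanics concrete, the following C++ sketch shows one possible way to pack the rcounter tuple ⟨counter, owner, seq⟩ and the status pair ⟨state, seq⟩ into single 64-bit words, so that the invalidation and increment of lines 96-97 of Pseudocode 3 each become a single CAS. The field widths, the descriptor table, and all names are our own illustration under stated assumptions, not the paper’s implementation.

#include <atomic>
#include <cstdint>

// Illustrative packing of rcounter = <counter, owner, seq> into one 64-bit
// word (24/24/16 bits; the widths are an arbitrary choice for this sketch).
struct RC {
    static uint64_t pack(uint64_t c, uint64_t o, uint64_t s) { return (c << 40) | (o << 16) | s; }
    static uint64_t counter(uint64_t w) { return w >> 40; }
    static uint64_t owner(uint64_t w)   { return (w >> 16) & 0xFFFFFF; }
    static uint64_t seq(uint64_t w)     { return w & 0xFFFF; }
};

// Transaction status = <state, seq> in one word; state in the two low bits.
// NIL stands for the NULL state of the paper (NULL is a macro in C++).
enum State : uint64_t { NIL = 0, COMMITTED = 1, ABORTED = 2 };
struct ST {
    static uint64_t pack(State st, uint64_t s) { return (s << 2) | st; }
};

struct TxDescriptor { std::atomic<uint64_t> status{ST::pack(NIL, 0)}; };
TxDescriptor g_tx[1024];                                   // hypothetical descriptor table
TxDescriptor* descriptorOf(uint64_t owner) { return &g_tx[owner]; }

// incrementReadCounter (Pseudocode 3, lines 92-98): if the counter is marked
// by an owner, invalidate the owner's status by bumping its sequence number,
// then increase the reader count, all with single-word CAS.
void incrementReadCounter(std::atomic<uint64_t>& rcounter) {
    for (;;) {
        uint64_t rc = rcounter.load();
        if (RC::owner(rc) != 0) {                          // counter is "marked" (line 95)
            uint64_t expected = ST::pack(NIL, RC::seq(rc));
            descriptorOf(RC::owner(rc))->status            // line 96: invalidate status
                .compare_exchange_strong(expected, ST::pack(NIL, RC::seq(rc) + 1));
        }
        uint64_t inc = RC::pack(RC::counter(rc) + 1, RC::owner(rc), RC::seq(rc));
        if (rcounter.compare_exchange_strong(rc, inc))     // line 97: increase counter
            return;
    }
}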
5 Discussion

This paper presents PermiSTM, a single-version STM that is both MV-permissive and strongly progressive; it is also disjoint-access parallel. PermiSTM has a simple design, based on read counters and locks, that provides consistency without incremental validation. This also simplifies the correctness argument.

The first variant of PermiSTM uses a k-compare-single-swap to commit update transactions. No architecture currently provides kCSS in hardware, but it can be supported by best-effort hardware transactional memory (cf. [4]). In PermiSTM, update transactions are not obstruction-free [13], since they may block due to other conflicting transactions. Indeed, a single-version, obstruction-free STM cannot be strictly disjoint-access parallel [10]. Read-only transactions modify the read counters of all items in their read set. This matches the lower bound for read-only transactions that never abort, for (strongly) disjoint-access parallel STMs [2].

Several design principles of PermiSTM are inspired by TLRW [6], which uses read-write locks. TLRW, however, is not permissive, as read-only transactions may abort due to a timeout while attempting to acquire a lock. We avoid this problem by tracking readers through read counters (somewhat similar to SkySTM [17]) instead of read locks.

Our algorithm improves on the multi-versioned UP-MV STM [20], which is not weakly disjoint-access parallel (nor strictly disjoint-access parallel), as it uses a global transaction set holding the descriptors of all completed transactions yet to be collected by the garbage collection mechanism. UP-MV STM requires that operations execute atomically; its progress properties depend on the precise manner in which this atomicity is guaranteed, which is not detailed. We remark that simply enforcing atomicity with a global lock or a mechanism similar to TL2 locking [5] could make the algorithm blocking.

Acknowledgements. We thank the anonymous referees for helpful comments.
References

1. Afek, Y., Merritt, M., Taubenfeld, G., Touitou, D.: Disentangling multi-object operations. In: PODC 1997, pp. 111–120 (1997)
2. Attiya, H., Hillel, E., Milani, A.: Inherent limitations on disjoint-access parallel implementations of transactional memory. In: SPAA 2009, pp. 69–78 (2009)
3. Aydonat, U., Abdelrahman, T.: Serializability of transactions in software transactional memory. In: TRANSACT 2008 (2008)
4. Dice, D., Lev, Y., Marathe, V.J., Moir, M., Nussbaum, D., Olszewski, M.: Simplifying concurrent algorithms by exploiting hardware transactional memory. In: SPAA 2010, pp. 325–334 (2010)
5. Dice, D., Shalev, O., Shavit, N.: Transactional locking II. In: Dolev, S. (ed.) DISC 2006. LNCS, vol. 4167, pp. 194–208. Springer, Heidelberg (2006)
6. Dice, D., Shavit, N.: TLRW: Return of the read-write lock. In: SPAA 2010, pp. 284–293 (2010)
7. Ennals, R.: Software transactional memory should not be obstruction-free. Technical Report IRC-TR-06-052, Intel Research Cambridge (2006)
8. Gramoli, V., Harmanci, D., Felber, P.: Towards a theory of input acceptance for transactional memories. In: Baker, T.P., Bui, A., Tixeuil, S. (eds.) OPODIS 2008. LNCS, vol. 5401, pp. 527–533. Springer, Heidelberg (2008)
9. Guerraoui, R., Henzinger, T.A., Singh, V.: Permissiveness in transactional memories. In: Taubenfeld, G. (ed.) DISC 2008. LNCS, vol. 5218, pp. 305–319. Springer, Heidelberg (2008)
10. Guerraoui, R., Kapalka, M.: On obstruction-free transactions. In: SPAA 2008, pp. 304–313 (2008)
11. Guerraoui, R., Kapalka, M.: On the correctness of transactional memory. In: PPoPP 2008, pp. 175–184 (2008)
12. Guerraoui, R., Kapalka, M.: The semantics of progress in lock-based transactional memory. In: POPL 2009, pp. 404–415 (2009)
13. Herlihy, M., Luchangco, V., Moir, M., Scherer III, W.N.: Software transactional memory for dynamic-sized data structures. In: PODC 2003, pp. 92–101 (2003)
14. Israeli, A., Rappoport, L.: Disjoint-access-parallel implementations of strong shared memory primitives. In: PODC 1994, pp. 151–160 (1994)
15. Kapalka, M.: Theory of Transactional Memory. PhD thesis, EPFL (2010)
16. Keidar, I., Perelman, D.: On avoiding spare aborts in transactional memory. In: SPAA 2009, pp. 59–68 (2009)
17. Lev, Y., Luchangco, V., Marathe, V.J., Moir, M., Nussbaum, D., Olszewski, M.: Anatomy of a scalable software transactional memory. In: TRANSACT 2009 (2009)
18. Luchangco, V., Moir, M., Shavit, N.: Nonblocking k-compare-single-swap. In: SPAA 2003, pp. 314–323 (2003)
19. Napper, J., Alvisi, L.: Lock-free serializable transactions. Technical Report TR-05-04, The University of Texas at Austin (2005)
20. Perelman, D., Fan, R., Keidar, I.: On maintaining multiple versions in STM. In: PODC 2010, pp. 16–25 (2010)
21. Perelman, D., Keidar, I.: SMV: Selective Multi-Versioning STM. In: TRANSACT 2010 (2010)
22. Riegel, T., Felber, P., Fetzer, C.: A lazy snapshot algorithm with eager validation. In: Dolev, S. (ed.) DISC 2006. LNCS, vol. 4167, pp. 284–298. Springer, Heidelberg (2006)
23. Saha, B., Adl-Tabatabai, A.-R., Hudson, R.L., Cao Minh, C., Hertzberg, B.: McRT-STM: a high performance software transactional memory system for a multi-core runtime. In: PPoPP 2006, pp. 187–197 (2006)
Correctness of Concurrent Executions of Closed Nested Transactions in Transactional Memory Systems

Sathya Peri¹,⋆ and Krishnamurthy Vidyasankar²

¹ Indian Institute of Technology Patna, India
[email protected]
² Memorial University, St John’s, Canada
[email protected]

Abstract. A generally agreed upon requirement for correctness of concurrent executions in Transactional Memory systems is that all transactions including the aborted ones read consistent values. Opacity is a recently proposed correctness criterion that satisfies the above requirement. Our first contribution in this paper is extending the opacity definition for closed nested transactions. Secondly, we define conflicts appropriate for optimistic executions which are commonly used in Software Transactional Memory systems. Using these conflicts, we define a restricted, conflict-preserving, class of opacity for closed nested transactions the membership of which can be tested in polynomial time. As our third contribution, we propose a correctness criterion that defines a class of schedules where aborted transactions do not affect consistency of the other transactions. We define a conflict-preserving subclass of this class as well. Both the class definitions and the conflict definition are new for nested transactions.
1 Introduction

In recent years, Software Transactional Memory (STM) has garnered significant interest as an elegant alternative for developing concurrent code. Importantly, transactions provide a very promising approach for composing software components. Composing simple transactions into a larger transaction is an extremely useful property which forms the basis of modular programming. This is achieved through nesting. A transaction is nested if it is invoked by another transaction. STM systems ensure that transactions are executed atomically. That is, each transaction is either executed to completion, in which case it is committed and its effects are visible to other transactions, or aborted, in which case the effects of a partial execution, if any, are rolled back.

In a closed nested transaction [2], the commit of a sub-transaction is local; its effects are visible only to its parent. When the top-level transaction (of the nested computation) commits, the effects of the sub-transaction are visible to other top-level transactions. The abort of a sub-transaction is also local; the other sub-transactions and the top-level transaction are not affected by its abort.¹

To achieve atomicity, a commonly used approach for software transactions is optimistic synchronisation (a term used in [6]).
⋆ This work was done when the author was a Post-doctoral Fellow at Memorial University.
¹ Apart from closed nesting, flat and open nesting [2] are the other means of nesting in STMs.
In this approach, each transaction has local buffers where it records the values read and written in the course of its execution. When the transaction completes, the contents of its buffers are validated. If the values in the buffers form a consistent view of the memory, then the transaction is committed and the values are merged into the memory. If the validation fails, the transaction is aborted and the buffer contents are ignored. The notion of buffers extends naturally to closed nested transactions. When a sub-transaction is invoked, new buffers are created for all the data items it accesses. The contents of the buffers are merged with its parent’s buffers when the sub-transaction commits.

A commonly accepted correctness requirement for concurrent executions in STM systems is that all transactions, including aborted ones, read consistent values. The values resulting from any serial execution of transactions are assumed to be consistent. Then, for each transaction in a concurrent execution, there should exist a serial execution of some of the transactions giving rise to the values read by that transaction. Guerraoui and Kapalka [5] captured this requirement as opacity. An implementation of opacity has been given in [8]. On the other hand, the recent understanding (Doherty et al. [3], Imbs et al. [7]) is that opacity is too strong a correctness criterion for STMs. Weaker notions have been proposed: (i) the requirement of a single equivalent serial schedule is replaced by allowing possibly different equivalent serial schedules for the committed transactions and for each aborted transaction, and these schedules need not be compatible; and (ii) the effects, namely the read steps, of aborted transactions should not affect the consistency of the transactions executed subsequently. The first point refines the consistency notion for aborted transactions. (All the proposals insist on a single equivalent serial schedule consisting of all committed transactions.) The second point is a desirable property for transactions in general and a critical point for nested transactions, where the reads of an aborted sub-transaction may prohibit committing the entire top-level transaction.

The above proposals in the literature have been made for non-nested transactions. In this paper, we define two notions of correctness and corresponding classes of schedules: Closed Nested Opacity (CNO) and Abort-Shielded Consistency (ASC). In the first notion, read steps of aborted (sub-)transactions are included in the serialization, as in opacity [5, 8]. In the second, they are discarded. These definitions turn out to be non-trivial due to the fact that an aborted sub-transaction may have some (locally) committed descendants and, similarly, some committed ancestors.

Checking opacity, like general serializability (for instance, view-serializability), cannot be done efficiently. Very much like the restricted classes of serializability that allow a polynomial membership test and facilitate online scheduling, restricted classes of opacity can also be defined. We define such classes along the lines of conflict-serializability for database transactions: Conflict-Preserving Closed Nested Opacity (CP-CNO) and Conflict-Preserving Abort-Shielded Consistency (CP-ASC). Our conflict notion is tailored for optimistic execution of the sub-transactions and is not just between any two conflicting operations. We give an algorithm for checking membership in CP-CNO, which can be easily modified for CP-ASC as well. The algorithm uses serialization graphs similar to those in [12].
Using this algorithm, an online scheduler implementing these classes can be designed.
We note that all online schedulers (implementing 2PL, timestamp, optimistic approaches, etc.) for database transactions allow only subclasses of conflict-serializable schedules. We believe, similarly, that all STM schedulers can only allow subclasses of conflict-preserving schedules satisfying opacity or any of its variants. Such schedulers are likely to use mechanisms simpler than serialization graphs, as in the database area. An example is the scheduler described by Imbs and Raynal [8].

There have been many implementations of nested transactions in the past few years [2, 10, 1, 9]. However, none of them provide precise correctness criteria for closed nested transactions that can be efficiently verified. In [2], the authors provide correctness criteria for open nested transactions which can be extended to closed nested transactions as well. Their correctness criteria also look for a single equivalent serial schedule of both (read-only) aborted transactions and committed transactions.

Roadmap: In Section 2, we describe our model and background. In Section 3, we define CNO and CP-CNO. In Section 4, we present ASC and CP-ASC. Section 5 concludes this paper.
2 Background and System Model

A transaction is a piece of code in execution. In the course of its execution, a nested transaction performs read and write operations on memory and invokes other transactions (also referred to as sub-transactions). A computation of nested transactions constitutes a computation tree. The operations of the computation are classified as simple-memory operations and transactions. Simple-memory operations are read or write on memory. In this document, when we refer to a transaction in general, it could be a top-level transaction or a sub-transaction. Collectively, we refer to transactions and simple-memory operations as nodes (of the computation tree) and denote them as nid. If a transaction tX executes successfully to completion, it terminates with a commit operation denoted as cX. Otherwise it aborts, aX. Abort and commit operations are called terminal operations.² By default, all the simple-memory operations are always considered to be (locally) committed. In our model, transactions can interleave at any level. Hence the child sub-transactions of any transaction can execute in an interleaved manner.

To perform a write operation on data item x, a closed-nested transaction tP creates an x-buffer (if it is not already present) and writes to the buffer. A buffer is created for every data item tP accesses. When tP commits, it merges the contents of its local buffers with the buffers of its parent. Any peer (sibling) transaction of tP can read the values written by tP only after tP commits. We assume that there exists a hypothetical root transaction of the computation tree, denoted as tR, which invokes all the top-level transactions. On system initialization, we assume that there exists a child transaction tinit of tR, which creates and initializes all the buffers of tR that are written or read by any descendant of tR. Similarly, we also assume that there exists a child transaction tfin of tR, which reads the contents of tR’s buffers when the computation terminates.
² A transaction starts with a begin operation. In our model we assume that the begin operation is superimposed with the first event of the transaction. Hence, we do not explicitly represent it in our schedules.
Coming to reads, a transaction maintains a read set consisting of all its read operations. We assume that for a transaction to read a data item, say x, (unlike write) it has access to the buffers of all its ancestors, apart from its own. To read x, a nested sub-transaction tN starts with its local buffers. If it does not contain an x-buffer, tN continues to read the buffers of its ancestors, starting from its parent, until it encounters a transaction that contains an x-buffer. Since tR’s buffers have been initialized by tinit, tN will eventually read a value for x. When the transaction commits, its read set is merged with its parent’s read set. We will revisit read operations a few subsections later.

2.1 Schedules

A schedule is a totally ordered sequence (in real time order) of simple-memory operations and terminal operations of transactions in a computation. These operations are referred to as events of the schedule. A schedule is represented by the tuple ⟨evts, nodes, ord⟩, where evts is the set of all events in the schedule, nodes is the set of all the nodes (transactions and simple-memory operations) present in the computation, and ord is a unary function that totally orders all the events in the order of execution. Example 1 shows a schedule, S1. In this schedule, the memory operations r2211(x) and w2212(y) belong to the transaction t221. Transactions t22 and t31 are aborted. All the other transactions are committed. It must be noted that t221 and t222 are committed sub-transactions of the aborted transaction t22.

Example 1
S1 : r111(z) w112(y) w12(z) c11 r211(b) r2211(x) w2212(y) c221 w212(y) c21 w13(y) c1 r2221(y) w2222(z) c222 a22 w23(z) r311(y) c2 w312(y) a31 r321(z) w322(z) c32 c3

The events of the schedule are the real-time representation of the leaves of the computation tree. The computation tree for schedule S1 is shown in Figure 1. The order of execution of memory operations is from left to right, as shown in the tree. The dotted edges represent terminal operations. The terminal operations are not part of the computation tree but are represented here for clarity.
Fig. 1. Computation tree for Example 1
For a closed nested transaction, all its write operations are visible to other transactions only after it commits. In S1, w212(y) occurs before w13(y). When t1 commits, it writes w13(y) onto tR’s y-buffer. But t2 commits after t1 commits. When t2 commits, it overwrites tR’s y-buffer with w212(y). Thus when transaction t31 performs the read operation r311(y), it reads the value written by w212(y) and not w13(y), even though w13(y) occurs after w212(y).

To model the effects of commits clearly, we augment a schedule with extra write operations. For each transaction that commits, we introduce a commit-write operation for each data item x that the transaction writes to or one of its children commit-writes. This writes the latest value in the transaction’s x-buffer. The commit-writes are added just before the commit operation and represent the merging of the local buffers with the parent’s buffers. Using this representation (writing the commit-write of transaction tX for the value contributed by its child nY as wX^Y), the schedule for Figure 1 is:

Example 2
S2 : r111(z) w112(y) w12(z) w11^112(y) c11 r211(b) r2211(x) w2212(y) w221^2212(y) c221 w212(y) w21^212(y) c21 w13(y) w1^12(z) w1^13(y) c1 r2221(y) w2222(z) w222^2222(z) c222 a22 w23(z) r311(y) w2^21(y) w2^23(z) c2 w312(y) a31 r321(z) w322(z) w32^322(z) c32 w3^32(z) c3

Some examples of commit-writes in S2 are w11^112(y), w21^212(y), w2^23(z), etc. The commit-write w11^112(y) represents t11’s write onto t1’s y-buffer with the value written by w112. There are no commit-writes for aborted transactions. Hence the writes of aborted transactions are not visible to their peers. Originally, in the computation tree, only the leaf nodes could write. With this augmentation of transactions, even non-leaf nodes (i.e., committed transactions) write, with commit-write operations. For the sake of brevity, we do not represent commit-writes in the computation tree. In the rest of this document, we assume that all the schedules we deal with are augmented with commit-writes. Generalizing the notion of commit-writes to any node of the tree, the commit-write of a simple-memory write is itself. It is nil for a read operation and for aborted transactions. Collectively, we refer to simple-memory operations along with commit-write operations as memory operations. With commit-write operations, we extend the definition of an operation, denoted as oX, to represent a transaction, a commit-write operation or a simple-memory operation.

It can be seen that a schedule partially orders all the transactions and simple-memory operations in the computation. The partial order is called schedule-partial-order and is denoted <S. For a transaction tX in S, we define S.tX.first and S.tX.last as the first and last operations of tX. Thus, S.tX.last denotes the terminal operation of tX. For a simple-memory operation mX, S.mX.first = S.mX.last. For two nodes nX, nY in S: (nX <S nY) ≡ (S.ord(S.nX.last) < S.ord(S.nY.first)).
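The buffer rules above can be summarized in a short schematic. The following C++ sketch is our own rendering, with illustrative names and no concurrency control: writes go to a local per-item buffer, reads climb the ancestor chain toward tR, and a closed-nested commit merges the buffers into the parent, which is precisely what the commit-writes model.

#include <map>
#include <set>
#include <string>

// Schematic closed-nested transaction: local buffers, ancestor-chain reads,
// and merge-on-commit (the "commit-writes" of Section 2). Names are illustrative.
struct Tx {
    Tx* parent = nullptr;                  // tR has parent == nullptr
    std::map<std::string, int> buffers;    // item -> buffered value
    std::set<std::string> readSet;

    void write(const std::string& item, int v) { buffers[item] = v; }

    // Read rule: start with the local buffer, then climb toward tR; tinit is
    // assumed to have initialized every item in tR's buffers, so a value is found.
    int read(const std::string& item) {
        readSet.insert(item);
        for (Tx* t = this; t != nullptr; t = t->parent)
            if (auto it = t->buffers.find(item); it != t->buffers.end())
                return it->second;
        return 0;  // unreachable if tinit initialized all items
    }

    // Local commit of a non-root transaction: merge buffers (the commit-writes)
    // and the read set into the parent. An abort simply discards the buffers.
    void commit() {
        for (auto& [item, v] : buffers) parent->buffers[item] = v;
        parent->readSet.insert(readSet.begin(), readSet.end());
    }
};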
2.2 Function Definitions

For a commit-write operation wX, we define its holder, S.holder(wX), as the transaction tX to which it belongs. Extending this function to a node (a transaction or simple-memory operation), the holder of a node is itself. For any operation oX, we define S.level(oX) as the distance of S.holder(oX) in the tree from the root. From this definition, tR is at level 0. The level of a transaction and of all its commit-write operations is the same. For instance, in Example 2, S2.level(w21^212) = S2.level(t21) = 2.
The functions on a tree, namely parent, children, ancestor, descendant, and peer (siblings), can be extended to commit-write operations by defining them for S.holder(oX) over the tree. For instance, in S2 of Example 2, S2.parent(w2^21(y)) = tR and S2.children(w2^21(y)) = {t21, t22, w23(z)}. Thus transactions and simple-memory operations are children of a transaction. Two commit-writes of the same node are not peers of each other, since they have the same holder.

For a transaction tX in a computation, we define its dSet, denoted as S.dSet(tX), as the set consisting of tX, tX’s commit-writes, tX’s begin and terminal operations, and the dSets of tX’s descendants (including its children). This set comprises all the operations in the sub-tree of tX. A simple-memory operation’s dSet is itself. A commit-write’s dSet is its holder transaction’s dSet. In Example 2, S2.dSet(t2) = S2.dSet(w2^23(z)) = {t2, t21, t22, w23(z), r211(b), w212(y), w21^212(y), c21, t221, r2211(x), w2212(y), w221^2212(y), c221, t222, r2221(y), w2222(z), w222^2222(z), c222, a22, w2^21(y), w2^23(z), c2}.

Next we define a boolean function optVis on two operations oX, oY in a schedule S, denoted as S.optVis(oY, oX). It is true if oY is a peer of oX or a peer of an ancestor of oX, i.e., oY ∈ (S.peers(oX) ∪ S.peers(S.ansc(oX))). Otherwise it is false. This definition implies that if (oX ∈ S.dSet(oY)), then S.optVis(oY, oX) is false. As a result, for any commit-write of oY, say wY, S.optVis(wY, oX) is false as well. One can see that the optVis function is not symmetric (but not asymmetric). Hence S.optVis(oY, oX) does not imply S.optVis(oX, oY). In S2, S2.optVis(w1^12(z), r211(b)) is true, as w1^12(z) is a peer of t2, which is an ancestor of r211(b). Similarly, S2.optVis(t3, r2221(y)) is true because t3 is a peer of t2, which is an ancestor of r2221(y). But S2.optVis(r2221(y), t3) is false.

We denote by S.schOps(tX) the set of operations in S.dSet(tX) which are also present in S.evts. Formally, S.schOps(tX) = (S.dSet(tX) ∩ S.evts). We define a few notations based on aborted transactions in a schedule S. For an aborted transaction tX, we denote by S.abort(tX) the set of all aborted transactions in tX’s dSet. It includes tX as well, if it is aborted. We define S.prune(tX) as all the events in the schOps of tX after removing the events of all aborted transactions in tX. Formally, S.prune(tX) = S.schOps(tX) − (⋃_{tA ∈ S.abort(tX)} S.schOps(tA)). If tX has
no aborted transaction in its dSet, then S.prune(tX) is the same as S.schOps(tX). If tX itself is an aborted transaction, then its pruned set is nil.

2.3 Writes for Read Operations and Well-Formedness

For a read operation rX(z) belonging to a transaction tP in S, we associate a write wY(z) as its lastWrite³ or S.lastWrite(rX(z)). The read operation will retrieve the value written by the lastWrite. We want the lastWrite wY(z) to satisfy the following properties: (1) wY occurs before rX in the schedule; (2) wY is optVis to rX; (3) the value written by wY is in the z-buffer of an ancestor (starting from its parent tP) closest to rX in terms of level; and (4) if there are multiple writes satisfying the above conditions, then wY is the one closest to rX in the schedule S.

The lastWrite definition ensures that all transactions read values only from committed nodes, i.e., a committed transaction or a simple-write operation.
³ This term is inspired by [2].
Having lastWrite be optVis to the read operation ensures that the buffer to which the lastWrite writes is accessible to the read operation. In S2, the lastWrites are: (r111(z) : winit(z)), (r211(b) : winit(b)), (r2211(x) : winit(x)), (r2221(y) : w221^2212(y)), (r311(y) : w1^13(y)), (r321(z) : w2^23(z)). Note that the read r2221(y) reads from w221^2212(y) even though w1^13(y) is closer to r2221(y) in the schedule. This is because w221^2212(y) is closer to it in terms of level.

For a node nP with a read operation rX in its dSet, the read is said to be an external-read if its lastWrite is not in nP’s dSet. Thus a read operation rX is an external-read of itself. It can be seen that a nested transaction interacts with its peers through external-reads and commit-writes. Thus, a nested transaction can be treated as a non-nested transaction consisting only of its external-reads and commit-writes. The external-reads and commit-writes of a transaction constitute its extOpsSet.

A schedule is called well-formed if it satisfies: (1) validity of transaction limits: after a transaction executes a terminal operation, no operation (memory or terminal) belonging to it can execute; and (2) validity of read operations: every read operation reads the value written by its lastWrite operation. We assume that all the schedules we deal with are well-formed.

2.4 Serial Schedules

For the case of non-nested transactions, a serial schedule is a schedule in which all the transactions execute serially (as the name suggests) without any interleaving. For nested transactions, we define a serial schedule SS as one in which, for every transaction tX in SS, its children (both transactions and simple-memory operations) are totally ordered. Formally, ∀tX ∈ SS.trans : ∀{nY, nZ} ⊆ SS.children(tX) : (SS.ord(nY.last) < SS.ord(nZ.first)) ∨ (SS.ord(nZ.last) < SS.ord(nY.first)). Thus, in a serial schedule, all the events in the dSet of a transaction appear contiguously.
3 Conflict-Preserving Closed Nested Opacity

3.1 Closed Nested Opacity

Guerraoui and Kapalka [5] proposed the notion of opacity as a correctness criterion for software transactions. A schedule, consisting of an execution of transactions, is said to be opaque if there is an equivalent serial schedule that respects the original schedule’s real-time ordering of the nodes, and the lastWrite of every read operation, including the reads of aborted transactions, is the same in the serial schedule as in the original schedule. Opacity ensures that all the reads are consistent. An implementation of opacity for non-nested transactions is given in [8], in which aborted transactions are treated as read-only (with read steps executed before the abort) when looking for an equivalent serial schedule consisting of all the transactions.

In the context of nested transactions, an aborted transaction can have a committed sub-transaction whose values are read by other sub-transactions. For instance, in S2, the aborted transaction t22’s sub-transactions t221 and t222 are committed. The read operation r2221(y) of t222 reads from t221. This shows that some writes of aborted (sub-)transactions should also be considered for the correctness of other sub-transactions.
On the other hand, a committed transaction can have aborted sub-transactions whose write values should be omitted. In our characterization of schedules, aborted transactions do not have commit-writes. Thus an aborted transaction’s writes do not affect any of its peers or ancestors. But committed sub-transactions of an aborted transaction can have commit-writes, and other sub-transactions can read from them. Thus, using our representation, opacity can be extended to closed nested transactions. Formally, we define a class of schedules called Closed Nested Opacity or CNO as follows. A schedule S belongs to CNO if there exists a serial schedule SS such that: (1) event equivalence: the operations of S and SS are the same; (2) schedule-partial-order equivalence: for any two nodes nY, nZ that are peers in the computation tree represented by S, if nY occurs before nZ in S then nY occurs before nZ in SS as well; (3) lastWrite equivalence: the lastWrites of all read operations in S and SS are the same.

Even though the definition of CNO is similar to opacity, the lastWrite equivalence condition captures the intricacies of nested transactions. This class ensures that the reads of all the transactions, including all the sub-transactions of aborted transactions, are consistent.

3.2 Conflict Notion: optConf

Checking opacity, like general serializability (for instance, view-serializability), cannot be done efficiently. Restricted classes of serializability (like conflict-serializability) have been defined based on conflicts; they allow a polynomial-time membership test and facilitate online scheduling. Along the same lines, we define a subclass of CNO, CP-CNO. This subclass is defined based on a new conflict notion, optConf, for closed nested transactions. It is tailored for optimistic execution of sub-transactions. This notion is similar to the idea of conflicts presented in [4] for non-nested transactions.

The conflict notion optConf is defined only between memory operations in the extOpsSets (defined in Subsection 2.3) of two peer nodes. As explained earlier, a node (or transaction) interacts with its peer nodes through its extOpsSet. Consider two peer nodes nA, nB. For two memory operations mX, mY on the same data buffer in the extOpsSets of nA, nB, S.isOptConf(mX, mY) is true if mX occurs before mY in S and one of the following conditions holds: (1) r-w conflict: mX is an external-read rX of nA and mY is a commit-write wY of nB; or (2) w-r conflict: mX is a commit-write wX of nA and mY is an external-read rY of nB; or (3) w-w conflict: mX is a commit-write wX of nA and mY is a commit-write wY of nB.

Consider a read rX that is in optConf with a write wY, and let rX’s lastWrite be wL. By defining the conflicts in this manner, we ensure that wL is in w-r conflict with rX; if wY is also in w-r conflict with rX, then the w-w conflict between wL and wY ensures that wY does not become rX’s lastWrite in any optConf-equivalent serial schedule. Similarly, if wY is in r-w conflict with rX, then it cannot become rX’s lastWrite in the equivalent serial schedule.

For S2 in Example 2, we get the set of conflicts: (r111(z), w12(z)), (r111(z), w2^23(z)), (w11^112(y), w13(y)), (w1^12(z), w2^23(z)), (w1^13(y), w2^21(y)), (w221^2212(y), r2221(y)), (w1^13(y), r311(y)), (r311(y), w2^21(y)), (r311(y), w312(y)), (w1^12(z), r321(z)), (w2^23(z), r321(z)), (r321(z), w322(z)).
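For concreteness, the three cases of S.isOptConf can be transcribed directly into code. In the sketch below (our own encoding; the paper does not prescribe a representation), an operation carries its kind, the data item it accesses, and its position in the schedule, and the two operations are assumed to belong to the extOpsSets of two peer nodes.

// The three optConf cases (r-w, w-r, w-w) over operations on the same data
// buffer; mX is required to precede mY in the schedule. The Op encoding is
// illustrative, not from the paper.
enum class Kind { ExternalRead, CommitWrite };
struct Op { Kind kind; int item; long pos; };  // pos: position in the schedule

bool isOptConf(const Op& mX, const Op& mY) {
    if (mX.item != mY.item || mX.pos >= mY.pos) return false;
    bool rw = mX.kind == Kind::ExternalRead && mY.kind == Kind::CommitWrite;
    bool wr = mX.kind == Kind::CommitWrite  && mY.kind == Kind::ExternalRead;
    bool ww = mX.kind == Kind::CommitWrite  && mY.kind == Kind::CommitWrite;
    return rw || wr || ww;  // two external-reads never conflict
}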
It must be noted that there is no optConf between w1^13(y) and r2221(y), or between w21^212(y) and r2221(y), even though w1^13(y) and w21^212(y) are optVis to r2221(y). This is because the level of w221^2212(y) (which is the lastWrite of r2221(y)) is greater than that of w1^13(y) and w21^212(y). Hence r2221(y) is not an external-read of any peer of w1^13(y) or w21^212(y).

Using optConf, we define a class of schedules called Conflict-Preserving Closed Nested Opacity or CP-CNO. It differs from CNO in condition (3) of Subsection 3.1. The lastWrite equivalence is replaced by optConf implication: if two memory operations are in optConf in S, then they are also in optConf in SS. Since optConf implication subsumes lastWrite equivalence, we have:
Theorem 1. If a schedule S is in the class CP-CNO, then it is also in CNO.

Benefits of optConf: Traditionally, two memory operations are said to be in conflict if one of them is a (simple) write operation. In STM systems that employ optimistic synchronization, a write of a transaction becomes visible only after it has committed. In this case, for conflicts to be meaningful, two memory operations are said to be in conflict if one of them is a commit-write operation (and not a simple write). Refining the conflict notion further, we define optConf only between an external-read and a commit-write operation (as well as between two commit-write operations). By defining optConf this way, the class CP-CNO is as non-restrictive as possible and yet does not compromise on any desired property.

3.3 Membership Verification Algorithm

Now, we describe the algorithm for testing membership in the class CP-CNO in polynomial time. Our algorithm is based on the graph construction algorithm by Resende and El Abbadi [12], but adapted to optConf. For a schedule S, the algorithm constructs a conflict graph based on optConfs, denoted as S.optGraph, and checks the acyclicity of that graph. We call this the optGraphCons algorithm. The graph S.optGraph is constructed as follows: (1) Vertices: the vertex set comprises all the nodes in the computation tree. The vertex for a node nX is denoted as vX. (2) Edges: consider each transaction tX starting from tR. For each pair of children nP, nQ (other than tinit and tfin) in S.children(tX), we add an edge from vP to vQ as follows: (2.1) completion edges: if nP <S nQ; (2.2) conflict edges: if, for some two memory operations mY, mZ such that mY is in nP’s dSet and mZ is in nQ’s dSet, S.isOptConf(mY, mZ) is true. Since the positions of the transactions tinit and tfin are fixed in the tree and in any schedule, we do not consider them in our graph construction algorithm. Now, we get the theorem:

Theorem 2. For a schedule S, the graph S.optGraph is acyclic if and only if S is in CP-CNO.

It must be noted that in our construction all the edges are between vertices corresponding to peer nodes. There are no edges between vertices that correspond to nodes of different levels. Thus the constructed graph consists of disjoint subgraphs. If the graph is acyclic, then an equivalent serial schedule can be constructed by executing a topological sort on all the subgraphs [11]. Using this algorithm, it can be verified that S2 is not in CP-CNO. Further, S2 is also not in CNO.
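Since every edge of S.optGraph connects peers, each transaction contributes an independent component, and Theorem 2 reduces the membership test to one acyclicity check per component. A standard way to perform that check, sketched below in C++ under the assumption that the completion and conflict edges have already been collected into an adjacency list, is Kahn’s topological sort; when it succeeds, the emitted order is an equivalent serial order of the children.

#include <vector>
#include <queue>

// Acyclicity test for one component of S.optGraph (the subgraph over the
// children of a single transaction). Edge construction is assumed done by
// the caller; adj[u] lists the targets of u's completion and conflict edges.
bool isAcyclic(const std::vector<std::vector<int>>& adj) {
    int n = static_cast<int>(adj.size());
    std::vector<int> indeg(n, 0);
    for (const auto& out : adj)
        for (int v : out) ++indeg[v];
    std::queue<int> q;
    for (int v = 0; v < n; ++v)
        if (indeg[v] == 0) q.push(v);
    int seen = 0;
    while (!q.empty()) {
        int u = q.front(); q.pop(); ++seen;          // u can be serialized next
        for (int v : adj[u])
            if (--indeg[v] == 0) q.push(v);
    }
    return seen == n;  // all vertices emitted: no cycle, and the emission
                       // order is an equivalent serial order of the children
}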
4 Abort-Shielded Consistency

Shortcoming of CNO: A single serial schedule involving all transactions, as required in CNO (and opacity), allows the reads of an aborted transaction to affect the transactions that follow it. This effect is more pronounced in nested transactions. For instance, in S2, transactions t1 and t2 write to the variables y and z. The aborted sub-transaction t31 reads y from t1. The sub-transaction t32 reads z from t2. As a result, there is no equivalent serial schedule having the same lastWrites as in S2, and hence it is not in CNO. For that matter, any sub-transaction of t3 invoked after t31’s invocation (like t33, t34, etc.) that reads any variable written by t2 that has also been written by t1 will cause this schedule not to be opaque. In the worst case, all the sub-transactions of t3 invoked after t31 may satisfy this property, and a scheduler (implementing CNO) will abort all of them. This effectively aborts t3. This shows that with CNO, an aborted sub-transaction can cause its top-level transaction to abort. This can be avoided if the read operations of the aborted transactions are ignored, as described below.

4.1 ASC Class Definition

Let tA be an aborted transaction in a schedule S. If tA should not affect the transactions following it, then tA should be dropped while considering the correctness of the remaining transactions. Generalizing this idea to all aborted transactions, we construct a sub-schedule consisting of events only from committed transactions (and committed sub-transactions whose ancestors have not been aborted). Thus, the sub-schedule consists of all the events from S.prune(tR) (prune is defined in Subsection 2.2) and is denoted as commitSubSchR. The ordering of the events is the same as in the original schedule. We check for the correctness of commitSubSchR. The sub-schedule commitSubSchR for S2 is: r111(z) w112(y) w12(z) w11^112(y) c11 r211(b) w212(y) w21^212(y) c21 w13(y) w1^12(z) w1^13(y) c1 w23(z) w2^21(y) w2^23(z) c2 r321(z) w322(z) w32^322(z) c32 w3^32(z) c3.

As explained in [5], it is necessary that an aborted transaction tA also reads consistent values. To ensure this, we construct another sub-schedule of S, denoted as pprefSubSchA (pruned prefix sub-schedule), for tA. We consider the prefix of all the events until tA’s abort operation. From this prefix we construct the sub-schedule by removing (1) events from transactions that aborted earlier and (2) events from any aborted sub-transaction of tA. Thus, the sub-schedule consists of events from transactions that committed before tA, events from the pruned sub-transactions of tA, and events from live transactions (i.e., transactions that have not yet terminated) that executed until the abort of tA. The ordering among the events is the same as in the original schedule S. Finally, for each live transaction we add a commit operation after tA’s abort operation to the sub-schedule. But we do not add the commit-writes for these transactions. Then we look for the correctness of this sub-schedule. In S2, for the aborted transaction t31, pprefSubSch31 is: r111(z) w112(y) w12(z) w11^112(y) c11 r211(b) w212(y) w21^212(y) c21 w13(y) w1^12(z) w1^13(y) c1 w23(z) r311(y) w2^21(y) w2^23(z) c2 w312(y) a31 c3.

Similarly, the sub-schedules for every aborted transaction can be constructed. Here all the sub-schedules have events from at most one aborted transaction.
One can see that the sub-schedules commitSubSchR and pprefSubSchA, for every aborted transaction tA, have the property that if any event is in the sub-schedule, then any other
event that is relevant to it is also in the sub-schedule. We call this property causality completeness. Hence the lastWrite of any read operation in a sub-schedule is the same as its lastWrite in the original schedule S. It can also be seen that the events of these sub-schedules form a valid sub-tree of the original computation tree represented by S. We verify the correctness of each of these sub-schedules by looking for an equivalent serial sub-schedule which has the same lastWrite for every read operation.

Based on these sub-schedules, Abort-Shielded Consistency or ASC is defined. A schedule S belongs to the class ASC if there exists a set of sub-schedules of S, denoted as subSchSet, such that the sub-schedules commitSubSchR and pprefSubSchA, for every aborted transaction tA in S, are in subSchSet, and for every sub-schedule subS in subSchSet there exists a serial sub-schedule ssubS such that: (1) sub-schedule event equivalence: the operations of subS and ssubS are the same; (2) schedule-partial-order equivalence: for any two peer nodes nY, nZ in the computation tree represented by subS, if nY occurs before nZ in subS then nY occurs before nZ in ssubS as well; (3) lastWrite equivalence: for all the read operations in ssubS, the lastWrites are the same as in subS. From this definition we get that CNO is a subset of ASC. The schedule S2 is in ASC.

Using optConfs with pprefSubSch, we define a class of schedules, Conflict-Preserving Abort-Shielded Consistency or CP-ASC. It differs from the definition of the class ASC only in condition (3), which becomes optConf implication: if two memory operations are in optConf in subS, then they are also in optConf in ssubS. Using the optGraphCons algorithm, we can verify whether there exists an equivalent serial sub-schedule for each sub-schedule in subSchSet. Thus checking whether a schedule is in CP-ASC can be done in polynomial time [11]. Further, it can also be proved that the class CP-CNO is a subset of CP-ASC. The schedule S2 is in CP-ASC.

Using the optGraphCons algorithm, an elegant online scheduler implementing CP-ASC can be designed [11]. The scheduler can be implemented in a completely distributed manner. The serialization graph has separate components for each (parent) sub-transaction. Each component can be maintained at a different site (the process executing the sub-transaction) autonomously, and the checking can be done in a distributed manner.
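Both kinds of sub-schedules are obtained by the same order-preserving filter. The sketch below (illustrative types; the paper leaves the representation open) computes commitSubSchR by dropping every event that lies in the dSet of an aborted transaction; pprefSubSchA is the same filter applied to the prefix ending at tA’s abort, keeping tA’s own pruned events and then appending a commit for each live transaction.

#include <vector>
#include <functional>

// Sketch of S.prune applied to a schedule: drop every event in the dSet of an
// aborted transaction, keeping the original order. Event and the membership
// predicate are illustrative placeholders.
struct Event { int node; /* operation details elided */ };

std::vector<Event> commitSubSch(
        const std::vector<Event>& schedule,
        const std::function<bool(const Event&)>& inAbortedDSet) {
    std::vector<Event> out;
    for (const Event& e : schedule)
        if (!inAbortedDSet(e))
            out.push_back(e);  // order of the surviving events is unchanged
    return out;
}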
5 Conclusion

Concurrent executions of transactions in Transactional Memory are expected to ensure that aborted transactions, like the committed ones, read consistent values. In addition, it is desirable that the aborted transactions do not affect the consistency of the other transactions. Incorporating these simple-sounding criteria has been non-trivial even for non-nested transactions, as can be seen in recent publications [5, 8, 3]. In this paper, we have considered these requirements for closed nested transactions. We have also defined new conflict-preserving classes that allow a polynomial-time membership test, by means of constructing conflict graphs and checking acyclicity. Further, a completely distributed STM scheduler can be designed using these conflict-preserving classes. Our future work includes the study of how the above two properties manifest in executions with open nested transactions and with non-transactional steps.
References

[1] Agrawal, K., Fineman, J.T., Sukha, J.: Nested parallelism in transactional memory. In: PPoPP 2008: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 163–174. ACM, New York (2008)
[2] Agrawal, K., Leiserson, C.E., Sukha, J.: Memory models for open-nested transactions. In: MSPC 2006: Proceedings of the 2006 Workshop on Memory System Performance and Correctness, pp. 70–81. ACM, New York (2006)
[3] Doherty, S., Groves, L., Luchangco, V., Moir, M.: Towards formally specifying and verifying transactional memory. In: REFINE (2009)
[4] Guerraoui, R., Henzinger, T., Singh, V.: Permissiveness in transactional memories. In: Taubenfeld, G. (ed.) DISC 2008. LNCS, vol. 5218, pp. 305–319. Springer, Heidelberg (2008)
[5] Guerraoui, R., Kapalka, M.: On the correctness of transactional memory. In: PPoPP 2008: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 175–184. ACM, New York (2008)
[6] Harris, T., Marlow, S., Peyton-Jones, S., Herlihy, M.: Composable memory transactions. In: PPoPP 2005: Proceedings of the Tenth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 48–60. ACM, New York (2005)
[7] Imbs, D., de Mendivil, J.R., Raynal, M.: Brief announcement: virtual world consistency: a new condition for STM systems. In: PODC 2009: Proceedings of the 28th ACM Symposium on Principles of Distributed Computing, pp. 280–281. ACM, New York (2009)
[8] Imbs, D., Raynal, M.: A lock-based STM protocol that satisfies opacity and progressiveness. In: Baker, T.P., Bui, A., Tixeuil, S. (eds.) OPODIS 2008. LNCS, vol. 5401, pp. 226–245. Springer, Heidelberg (2008)
[9] Moss, J.E.B.: Open nested transactions: semantics and support. In: Workshop on Memory Performance Issues (2006)
[10] Ni, Y., Menon, V.S., Adl-Tabatabai, A.-R., Hosking, A.L., Hudson, R.L., Moss, J.E.B., Saha, B., Shpeisman, T.: Open nesting in software transactional memory. In: PPoPP 2007: Proceedings of the 12th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 68–78. ACM, New York (2007)
[11] Peri, S., Vidyasankar, K.: Correctness criteria for closed nested transactions (in preparation). Technical report, Memorial University of Newfoundland (2010)
[12] Resende, R.F., El Abbadi, A.: On the serializability theorem for nested transactions. Inf. Process. Lett. 50(4), 177–183 (1994)
Locality-Conscious Lock-Free Linked Lists⋆

Anastasia Braginsky and Erez Petrank

Dept. of Computer Science, Technion – Israel Institute of Technology
{anastas,erez}@cs.technion.ac.il

⋆ Supported by THE ISRAEL SCIENCE FOUNDATION (grant No. 845/06).
Abstract. We extend state-of-the-art lock-free linked lists by building linked lists with special care for locality of traversals. These linked lists are built of sequences of entries that reside on consecutive chunks of memory. When traversing such lists, subsequent entries typically reside on the same chunk and are thus close to each other, e.g., in the same cache line or on the same virtual memory page. Such cache-conscious implementations of linked lists are frequently used in practice, but making them lock-free requires care. The basic component of this construction is a chunk of entries in the list that maintains a minimum and a maximum number of entries. This basic chunk component is an interesting tool on its own and may be used to build other lock-free data structures as well.
1 Introduction
Lock-free (a.k.a. non-blocking) data structures provide a progress guarantee. If several threads attempt to concurrently apply an operation on the structure, it is guaranteed that one of the threads will make progress in finite time [7]. Many lock-free data structures have been developed since the original notion was presented [11]. Lock-free algorithms are error-prone, and modifying existing algorithms requires care. In this paper we study lock-free linked lists and propose a design for a cache-conscious linked list.

The first design of lock-free linked lists was presented by Valois [12]. He maintained auxiliary nodes in between the list’s normal nodes, in order to resolve interference problems between concurrent operations. A different lock-free implementation of linked lists was given by Harris [6]. His main idea was to mark a node before deleting it, in order to prevent concurrent operations from changing its next-entry pointer. Harris’ algorithm is simpler than Valois’ algorithm, and it also generally performs better in his experimental results. Michael [8,10] proposed an extension to Harris’ algorithm that did not assume garbage collection but reclaimed entries of the list explicitly. To this end, he developed an underlying mechanism of hazard pointers that was later used for explicit reclamation in other data structures as well. An improvement in complexity was achieved by Fomitchev and Ruppert [3]. They use a smart retreat upon CAS failure, rather than the standard restart from scratch.

In this paper we further extend Michael’s design to allow cache-conscious linked lists. Our implementation partitions the linked list into sub-lists that
reside on consecutive areas in memory, denoted chunks. Each chunk contains several consecutive list entries. For example, setting each chunk to be one virtual page causes list traversals to form a page-oriented memory access pattern. This partition of the list into sub-lists, each residing on a small chunk of memory, is often used in practice (e.g., [1,5]), but there is no lock-free implementation for such a list. Breaking the list into chunks would be trivial if there were no restriction on the chunk size. In particular, if the size of each chunk can decrease to a single element, then each chunk can trivially reside in a single memory block and Michael’s implementation will do, but no locality improvement will be obtained for list traversals.

The sub-list chunk that our design provides maintains upper and lower bounds on the number of elements it holds. The upper bound simply follows from the size of the memory block on which the chunk is located, and the lower bound is provided by the user. If a chunk grows too much and cannot be held in a memory block, then it is split (in a lock-free manner), creating two chunks, each residing at a separate location. Conversely, if a chunk shrinks below the lower bound, then it is merged (in a lock-free manner) with the previous chunk in the list. In order for the split to create acceptable chunks, it is required that the lower bound (on the number of objects in a chunk) not exceed half of the maximum number of entries in the chunk. Otherwise, a split would create two chunks that violate the lower bound.

A natural optimization when searching such a list is to quickly jump to the next chunk (without traversing all its entries) if the desired key is not within the key range of the current chunk. This gives us an additional performance improvement, since the search progresses in skips, where the size of each skip is at least the chunk’s minimal bound. Furthermore, in the majority of cases, the retreat upon CAS failure is done by returning to the beginning of the chunk, rather than the standard restart from the beginning of the list.

To summarize, the contribution of this paper is the presentation of a lock-free linked list, based on single-word CAS instructions, where the keys are unique and ordered. The algorithm does not assume a lock-free garbage collector. The list design is locality conscious. The design poses a restriction on the key and data lengths: for a 64-bit architecture, the key is limited to 31 bits and the data to 32 bits.

Organization. In Section 2 we specify the underlying structure we use to implement the chunked linked list. In Section 3 we introduce the freeze mechanism that serves the split and join operations. In Section 4 we provide the implementation of the linked list functions. A closer look at the details of the freezing mechanism appears in Section 5, and we conclude in Section 6. More detailed explanations and pseudo-code can be found in the full version of this article [2].
2 Preliminaries and Data Structure
A linked list is a data structure that consists of a sequence of data records. Each data record contains a key by which the linked list is ordered. We denote each data record an entry. We think of the linked list as representing a set of keys,
each associated with a data part. Following previous work [4,6], a key cannot appear twice in the list. Thus, an attempt to insert a key that already exists in the list fails. Each entry holds the key and the data associated with it. Generally, this data is a pointer, or a mapping from the key to a larger piece of data associated with it. Next, we present the underlying data structure employed in the construction. We assume a 64-bit platform in this description. A 32-bit implementation can be derived by cutting each field in half, or by keeping the same structure but using a wide compare-and-swap, which writes atomically to two consecutive words.

The structure of an entry. A list entry consists of key and data fields, and the next pointer (pointing to the next entry). These fields are arranged in two words, where the key and data reside in the first word and the next pointer in the second. Three more bits are embedded in these two words. First, we embed the delete bit in the least significant bit of the next pointer, following Harris [6]. The delete bit is set to mark the logical deletion of the entry. The freeze bits are new in this design. They take a bit from each of the entry’s words, and their purpose is to indicate that the entire chunk holding the entry is about to be retired. These three flags consume one bit of the key and two bits of the next pointer. Notice that the three LSBs of a pointer do not really hold information on a 64-bit architecture. The entry structure is depicted in Figure 1.

Fig. 1. The entry structure

In what follows, we refer to the first word as the keyData word, and the second word as the nextEntry word. We further reserve one key value, denoted by ⊥, to signify that the entry is currently not allocated. This value is not allowed as a key in the data structure. As will be discussed in Section 4, an entry is available for allocation if its key is ⊥ and its other fields are zeroed.
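A possible C++ rendering of the two entry words is sketched below; the exact bit positions of the key, data, delete and freeze flags are our own choice and only illustrate the packing described above. The markDeleted helper shows Harris-style logical deletion [6] as a single CAS on the nextEntry word.

#include <atomic>
#include <cstdint>

// Illustrative layout of the two entry words: keyData holds the key in the
// high 31 bits, the data in the next 32 bits, and a freeze bit in the LSB;
// the 8-byte-aligned next pointer leaves 3 free LSBs for the delete and
// freeze bits. Bit positions are an assumption of this sketch.
struct Entry {
    std::atomic<uint64_t> keyData;    // [key:31 | data:32 | freeze:1]
    std::atomic<uint64_t> nextEntry;  // [pointer | unused:1 | freeze:1 | delete:1]

    static constexpr uint64_t kFreezeKeyData = 1;  // LSB of keyData
    static constexpr uint64_t kDelete = 1;         // low bits of nextEntry
    static constexpr uint64_t kFreezeNext = 2;

    static Entry* nextPtr(uint64_t w) { return reinterpret_cast<Entry*>(w & ~uint64_t(7)); }
    static bool isDeleted(uint64_t w) { return (w & kDelete) != 0; }

    // Harris-style logical deletion: set the delete bit so that concurrent
    // operations can no longer change this entry's next pointer.
    bool markDeleted() {
        uint64_t w = nextEntry.load();
        while (!isDeleted(w) && !(w & kFreezeNext)) {
            if (nextEntry.compare_exchange_weak(w, w | kDelete)) return true;
        }
        return false;  // already deleted, or the chunk is being frozen
    }
};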
The structure of a chunk. The main support for locality stems from the fact that consecutive entries are kept on a chunk, so that traversals of the list demonstrate better locality. In order to keep a substantial number of entries on each chunk, the linked list makes sure that the number of entries in a chunk is always between the parameters min and max. The main part of a chunk is an array that holds the entries of the chunk and may hold up to max entries of the linked list. In addition, the chunk holds some fields that help manage the chunk. First, we keep one special entry that serves as a dummy header entry, whose next pointer points to the first entry in this chunk. The dummy header is not a must, but it simplifies the algorithm’s code. To identify chunks that are too sparse, each chunk has a counter of the number of entries currently allocated in it. In the presence of concurrent mutations, this counter will not always be accurate, but it will always hold a lower bound on the number of allocated entries in the chunk. When an attempt is made to insert too many entries into a chunk, the chunk is split. When it becomes too small due to deletions, it is merged with a neighboring chunk. We require max > 2·min+1, since splitting a large chunk must create two well-formed new chunks. In practice, max will be substantially larger than 2·min, to avoid frequent splits and merges. Additional fields (new, mergeBuddy and freezeState) are needed for running the splits and the merges and are discussed in Section 5. The chunk structure is depicted in Figure 2.

Fig. 2. The chunk structure

The structure of the entire list. The entire list consists of a list of chunks. Initially we have a head pointer pointing to an empty first chunk. We let the first chunk’s min boundary be 0, to allow small lists. The list grows and shrinks due to the splitting and merging of the chunks. Every chunk has a pointer nextChunk to the next chunk, or null if it is the last chunk of the list. The keys of the entries in the chunks never overlap, i.e., each chunk contains a consecutive subset of the keys in the set, and a pointer to the next chunk, containing the next subset (with strictly higher keys) in the set. The entire list structure is depicted in Figure 3. We set the first key in a chunk as its lowest possible key. Any smaller key is inserted in the previous chunk (except for the first chunk, which can also get keys smaller than its first one).

Hazard pointers. Whole chunks and entries inside a chunk are reclaimed manually. Note that garbage collectors do not typically reclaim entries inside an array. To allow safe (and lock-free) reclamation of entries manually, we employ Michael’s hazard pointers methodology [8,10]. While a thread is processing an entry, and a concurrent reclamation of this entry could foil its actions, the thread registers the location of this entry in a special pointer called a hazard pointer. Reclamation of entries that have hazard pointers referencing them is avoided. Following Michael’s list implementation [10], each thread has two hazard pointers, denoted hp0 and hp1, that aid the processing of entries in a chunk. We further add four more hazard pointers, hp2, hp3, hp4, and hp5, to handle the operations of the chunk list. Each thread only updates its own hazard pointers, though it can read the other threads’ hazard pointers.
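Reusing the Entry sketch above, the chunk of Figure 2 can be written down as follows; the template parameters and field types are illustrative, and the static assertion encodes the max > 2·min+1 requirement.

// Schematic chunk matching Figure 2 (illustrative types; reuses the Entry
// sketch above). MIN and MAX are the per-chunk bounds discussed in the text.
template <int MIN, int MAX>
struct Chunk {
    static_assert(MAX > 2 * MIN + 1, "a split must create two well-formed chunks");

    Entry head;                               // dummy header entry
    Entry entriesArray[MAX];                  // up to MAX list entries
    std::atomic<int> counter{0};              // lower bound on allocated entries
    std::atomic<Chunk*> nextChunk{nullptr};   // next chunk, or nullptr at the tail
    std::atomic<Chunk*> newChunk{nullptr};    // fields used by freeze/split/merge
    std::atomic<Chunk*> mergeBuddy{nullptr};  //   (discussed in Section 5)
    std::atomic<int> freezeState{0};
};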
[Figure 3 shows the list structure: a head pointer leads to Chunk 1 (counter: 6, holding entries with keys such as 5 and 26), whose nextChunk pointer leads to Chunk 2 (counter: 10, holding entries with keys such as 90, 100, 123 and 159); each chunk carries its own head entry and its new, mergeBuddy and freezeState fields.]
Fig. 3. The list structure
3 Using a Freeze to Retire a Chunk
In order to maintain the minimum and maximum number of entries in a chunk, we devised a mechanism for splitting dense chunks and for merging a sparse chunk with its predecessor. The main idea in the design of the lock-free split and merge mechanisms is the freezing of chunks. When a chunk needs to be split or merged, it is first frozen. No insertions or deletions can be executed on a frozen chunk. To split a frozen chunk, two new chunks are created and the entries of the frozen chunk are copied into them. To merge a frozen chunk with a neighbor, the neighbor is first frozen, and then one or two new chunks are allocated and the relevant entries from the two merging chunks are copied into them. Details of the freezing mechanism appear in Section 5. We now review this mechanism in order to allow the presentation of the list operations. The freezing of a chunk comprises three phases:

Initiate Freeze. When a thread decides that a chunk should be frozen, it starts setting the freeze bits in all its entries, one by one. During the time it takes to set all these bits, other threads may still modify the entries not yet marked as frozen. During this phase, only part of the chunk is marked as frozen, but this freezing procedure cannot be reversed, and frozen entries cannot be reused.

Stabilizing. Once all entries in a chunk are frozen, allocations and deletions can no longer be executed. At this point, we link the non-deleted entries into a list. This includes entries that were allocated but not yet connected to the list. All entries that are marked as deleted are disconnected from the list.

Recovery. The number of entries in the stabilized list is counted and a decision is made whether to split this chunk or merge it with a neighbor. Sometimes, due to changes that happen during the first phase, the frozen chunk becomes a good one that does not require a split or a join. Nevertheless, the retired chunk is never resurrected. We always allocate a new chunk to replace it and copy the appropriate values to the new chunk. Whatever action is decided upon (split, join, or copy chunk) must be carried through. Any thread that fails to insert or delete a key due to the progress of a freeze joins in helping the freezing of the chunk. However, threads that perform a search continue to search in frozen chunks with no interference.
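As a minimal sketch of the first phase, the following C fragment marks a single word as frozen with a CAS retry loop; it assumes the FREEZE_BIT layout from the earlier sketch, and the real MarkChunkFrozen also freezes the keyData word of every entry. The key point is that the bit is only ever set, never cleared, which is what makes this phase irreversible.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Set the freeze bit of one entry word; retried until it is observed set.
 * Concurrent inserts/deletes may change the word, so each failed CAS
 * re-reads the current value and tries again on top of it.
 * (FREEZE_BIT is the assumed bit position from the earlier sketch.) */
static void freeze_word(_Atomic uint64_t *word) {
    uint64_t old = atomic_load(word);
    while (!(old & FREEZE_BIT)) {
        if (atomic_compare_exchange_weak(word, &old, old | FREEZE_BIT))
            break; /* bit is set; from now on the entry cannot be reused */
    }
}
```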
4 The List Operations: Search, Insert and Delete
We now turn to describe the basic linked list operations. The high-level code for an insertion, deletion, or search of a key is very simple. Each of these operations starts by invoking the FindChunk method to find the relevant chunk. It then calls SearchInChunk, InsertToChunk, or DeleteInChunk, according to the desired operation; finally, the hazard pointers hp2, hp3, hp4, and hp5 are nullified, to release the hazard pointers set by FindChunk and allow future reclamation. The main challenge is in the work inside the chunk and the handling of the freeze process, on which we elaborate below. More details appear in [2]. Turning to the operations inside the chunks, the delete and search methods are close to the previous design [10], except for the special treatment of the chunk bounds and the freeze status. For lack of space they are not specified in this short paper; the details appear in [2]. However, the insert method is quite different, because it must allocate an entry in shared memory (on the chunk), whereas previously it was assumed that the insert allocates local space for a new entry and privately prepares it for insertion into the list. For the purpose of handling the entries list in the chunk, we maintain five variables that are global and appear in all the code below. These variables are global for each thread's code, but are not shared between threads, and all of them follow Michael's design [10]. The first three per-thread variables are (entry** prev), (entry* cur), and (entry* next). The other two are the pointers (entry** hp0) and (entry** hp1) that point to the two hazard pointers of the thread. All other variables are local to the method that mentions them.
4.1 The Insert Operation
The InsertToChunk method inserts a key with its associated data into a chunk. It first attempts to find an available entry and allocate it with the given key. If no available entry exists, a split is executed and the operation is retried. If an entry is obtained, the InsertEntry method is invoked to insert the entry into the list. The insertion will fail if the key already exists in the chunk. In this case InsertToChunk clears the entry to free it for future allocations. The InsertToChunk code is presented in Algorithm 1. It starts with an attempt to find an available entry for allocation. A failure occurs when all entries are in use, in which case a freeze is initiated. The Freeze method receives the key and data as input, together with an indication that it was invoked by an insertion operation. This allows the Freeze method to try to insert the key into the newly created chunk. When successful, it returns a null pointer to indicate the completion of the insertion. It also sets a local variable result to indicate whether the completed insertion actually inserted the key or it completed by finding that the key already existed in the list (which is also a legitimate completion of the insertion operation). If the insertion is not completed by the Freeze method, then it returns a pointer to the chunk on which the insertion should be retried. Connecting the entry to the list is done by InsertEntry. If the entry gets allocated and linked to the list, then the chunk counter is incremented only by
Algorithm 1. Insert a key and its associated data into a chunk

Bool InsertToChunk(chunk* chunk, key, data) {
 1: current = AllocateEntry(chunk, key, data);   // Find an available entry
 2: while (current == null) {                    // No available entry. Freeze and try again
 3:   chunk = Freeze(chunk, key, data, insert, &result);
 4:   if (chunk == null) return result;          // Freeze completed the insertion
 5:   current = AllocateEntry(chunk, key, data); // Otherwise, retry allocation
 6: }
 7: returnCode = InsertEntry(chunk, current, key);
 8: switch (returnCode) {
 9:   case success_this:
10:     IncCount(chunk); result = true; break;   // Increase the chunk's counter
11:   case success_other:                        // Entry was inserted by another thread
12:     result = true; break;                    //   due to helping in a freeze
13:   case existed:                              // This key exists in the list. Reclaim entry
14:     if (ClearEntry(chunk, current))          // Attempt to clear the entry
15:       result = false;
16:     else                                     // Failure to clear implies that a freezing thread
17:       result = true;                         //   eventually inserts the entry
18:     break;
19: }                                            // end of switch
20: *hp0 = *hp1 = null; return result;           // Clear all hazard pointers and return
}
the thread that linked the entry itself. If the key already existed in the list, then ClearEntry attempts to clear the entry for future reuse. However, a rare scenario may foil clearing of the entry. This happens when the other occurrence of the key (which existed previously in the list) gets deleted before our entry gets cleared. Furthermore, a freeze occurs, in which the semi-allocated entry gets linked by other threads into the new chunk's list. At this point, clearing this entry is avoided, and ClearEntry returns false. In such a scenario, clearing the entry fails and the insert operation succeeds. At the end of InsertToChunk, all hazard pointers are cleared and we return with a code specifying whether the insert was successful or the key previously existed in the list. The allocation of an available entry is executed using the AllocateEntry method (depicted in [2]). An available entry contains ⊥ as its key and zeros in its other fields. An available entry is allocated by assigning the key and data values in the keyData word in a single atomic compare-and-swap (CAS) that assumes this word has the ⊥ symbol and zeros in it. An entry whose keyData has the freeze bit set cannot be allocated, as it is not properly zeroed. Note also that once an entry is allocated, all the information required for linking it to the list is available to all threads. Thus, if a freeze starts, then all threads may create a stabilized list of the allocated entries in a chunk. The AllocateEntry method searches for an available entry. If no free entry can be found, null is returned.
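The allocation step just described boils down to one CAS on the keyData word. A sketch, reusing the assumed layout from above (the key/data packing is ours, and the real AllocateEntry also scans the array for a candidate entry):

```c
/* Try to claim entry e for (key, data). Succeeds only if the keyData word
 * still holds the free value: key ⊥ and zeroed data, with no freeze bit.
 * A frozen word differs from the expected free value, so the CAS fails and
 * a frozen entry can never be allocated, as required by the text. */
static int try_allocate(entry_t *e, uint32_t key, uint32_t data) {
    uint64_t expected = KEY_BOTTOM;                    /* ⊥ and zeros (assumed) */
    uint64_t claimed  = ((uint64_t)key << 32) | data;  /* assumed packing */
    return atomic_compare_exchange_strong(&e->keyData, &expected, claimed);
}
```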
Algorithm 2. Connecting an allocated entry into the list

returnCode InsertEntry(chunk* chunk, entry* entry, key) {
 1: while (true) {
 2:   savedNext = entry→next;
 3:   // Find the insert location and pointers to the previous and current entries (prev, cur)
 4:   if (Find(chunk, key))                 // This key exists in the list
 5:     if (entry == cur) return success_other; else return existed;
 6:   // If the neighborhood is frozen, keep it frozen
 7:   if (isFrozen(savedNext)) markFrozen(cur);  // cur will replace savedNext
 8:   if (isFrozen(cur)) markFrozen(entry);      // entry will replace cur
 9:   // Attempt linking into the list. First attempt setting the next field
10:   if (!CAS(&(entry→next), savedNext, cur)) continue;
11:   if (!CAS(prev, cur, entry)) continue;      // Attempt linking
12:   return success_this;                       // Both CASes were successful
13: }
}
Next comes the InsertEntry method, which takes an allocated entry and attempts to link it to the linked list. The InsertEntry code is presented in Algorithm 2. The input parameter entry is a pointer to an entry that should be inserted. It is already allocated and initialized with key and data. Before searching for the location to which to connect this entry, we record this entry's next pointer. Normally, this should be null, but in the presence of concurrent executions of InsertEntry (which may happen during a freeze), we must make sure later that the entry's next pointer was not modified before we atomically write it in Line 10. After saving the current next pointer, we search for the entry's location via the Find method. If the key already exists in the list, InsertEntry checks whether the returned entry is the same as the one it is trying to insert (by address comparison). The result determines the return code: either the key existed and we failed, or the key was inserted, but not by the current thread. (This can happen during a freeze, when all threads attempt to stabilize the frozen list.) Otherwise, the key does not exist, and Find sets the global variable cur with a pointer to the entry that should follow our entry in the list, and the global variable prev with the pointer that should reference our entry. The Find method protects the entries referenced by prev and cur with the hazard pointers hp1 and hp0, respectively. There is no need to protect the newly allocated entry, because it cannot be reclaimed by a different thread. If any to-be-modified pointer is marked as frozen, we make sure that its replacement is marked as frozen as well. An allocation of an entry can never occur on a frozen entry. However, once the allocation is successful, the new entry may freeze, and still InsertEntry should connect it to the list. Finally, two CASes are used to link the entry to the list. Whenever a CAS fails, the insertion starts from scratch.
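The two CASes at Lines 10-11 of Algorithm 2 can be sketched as follows; the freeze-bit propagation of Lines 7-8 is omitted for brevity, and pointer fields are the assumed 64-bit words from the earlier sketch:

```c
/* One linking attempt: first pin down the new entry's next field, guarding
 * with a CAS against concurrent helpers that may have run InsertEntry on
 * the same entry during a freeze; then swing the predecessor's pointer.
 * Returns 0 if either CAS fails, in which case the caller restarts. */
static int link_entry(_Atomic uint64_t *prev, entry_t *e,
                      uint64_t savedNext, entry_t *cur) {
    uint64_t exp = savedNext;
    if (!atomic_compare_exchange_strong(&e->nextEntry, &exp, (uint64_t)cur))
        return 0;                       /* Line 10 failed: restart */
    uint64_t want = (uint64_t)cur;      /* prev must still reference cur */
    return atomic_compare_exchange_strong(prev, &want, (uint64_t)e); /* Line 11 */
}
```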
Algorithm 3. The main freeze method

chunk* Freeze(chunk* chunk, key, data, triggerType tgr, Bool* res) {
 1: CAS(&(chunk→freezeState), no_freeze, internal_freeze);
 2: // At this point, the freeze state is either internal_freeze or external_freeze
 3: MarkChunkFrozen(chunk);
 4: StabilizeChunk(chunk);
 5: if (chunk→freezeState == external_freeze) {
 6:   // This chunk was marked external_freeze before Line 1 executed
 7:   master = chunk→mergeBuddy;               // Get the master chunk
 8:   // Fix the buddy's mergeBuddy pointer
 9:   masterOldBuddy = combine(null, internal_freeze);
10:   masterNewBuddy = combine(chunk, internal_freeze);
11:   CAS(&(master→mergeBuddy), masterOldBuddy, masterNewBuddy);
12:   return FreezeRecovery(chunk→mergeBuddy, key, data, merge, chunk, tgr, res);
13: }
14: decision = FreezeDecision(chunk);          // The freeze state is internal_freeze
15: if (decision == merge) mergePartner = FindMergeSlave(chunk);
16: return FreezeRecovery(chunk, key, data, decision, mergePartner, tgr, res);
}
5 The Freeze Procedure
We now provide more details about the freeze procedure. The full description is presented in [2]. The freezing process is triggered when the number of entries in a chunk violates its bounds. At this point, splitting or merging happens by copying the relevant keys (and data) into a newly allocated chunk (or chunks). This process comprises three phases: initiation, stabilization and recovery. The code for the Freeze method is presented in Algorithm 3. The input parameters are the chunk that needs to be frozen, the key, the data, and the event that triggered the freeze: insert, delete, enslave (if the freeze was called to prepare the chunk for a merge with a neighboring chunk), or none (if the freeze is called while clearing an entry). The freeze will attempt to execute the insertion, deletion, or enslaving, and will return a null pointer when successful. It will also set a boolean output flag (res) to indicate the return code of the relevant operation. When unsuccessful, it will return a pointer to the new chunk on which the operation should be retried. The Freeze method starts with an attempt to atomically change the freeze state from no freeze to internal freeze. The freeze state of the chunk is normally no freeze and is switched to internal freeze when a freeze process of this chunk begins. But it can also be external freeze, when a neighbor requested a freeze on this chunk to allow a merge between the two. Thus, an external freeze can start even when no size violation is detected in this chunk. Whether or not the modification succeeds, we know that the freeze state can no longer be no freeze; it can be either internal freeze or external freeze. The Freeze method then calls MarkChunkFrozen to mark each
entry in the chunk as frozen, and StabilizeChunk to finish stabilizing the entries list in the chunk. At this point, the entries in the chunk cannot be modified anymore. Freeze then checks whether the freeze is external or internal. An external freeze can occur when a freeze is concurrently executed on the next chunk, and it has already enslaved the current chunk as its merge buddy. In this case, we cooperate with the joint freeze and joint recovery. When the state of the freeze is external, the current chunk must already have its mergeBuddy pointer pointing to the chunk that initiated the merge, denoted the master. To finish this freeze, we make sure that the master has its merge buddy properly pointing back at the current chunk. The master chunk's mergeBuddy pointer must be either null or already pointing to the buddy we found; thus it is enough to use one CAS command to ensure that it is not null. Finally, we execute the recovery phase on the master chunk and return its output. We do not need to check the decision about the freeze of the buddy: it must be a merge. If the freeze is internal, then we invoke FreezeDecision to see what should be done next (Line 14). If the decision is to merge, then we find the previous chunk and "enslave" it for a joint merge using the FindMergeSlave method. Finally, the FreezeRecovery method is called to complete the freeze process. Next, we explain each of the stages. The full details, including pseudo-code, appear in [2].

Marking the chunk as frozen. The MarkChunkFrozen method simply goes over the entries one by one and marks each one as frozen. The setting of the freeze flags is atomic and is retried repeatedly until successful. By the end of this process all entries (including the free ones) are marked as frozen.

Stabilizing the chunk. After all the entries in the chunk are marked as frozen, new entries cannot be allocated and existing entries cannot be marked as deleted. However, the frozen chunk may contain allocated entries that have not yet been linked, and entries that were marked as deleted but have not yet been disconnected. The StabilizeChunk method disconnects all deleted entries and links all allocated ones. It uses the Find method to disconnect all entries that are marked as deleted. Such entries do not need to be reclaimed (when marked as frozen), but they should not be copied to the new chunk. Next, StabilizeChunk attempts to connect entries. It goes over all entries and searches for ones that are disconnected, but neither reclaimed nor deleted. Each such entry is linked to the list by invoking InsertEntry, which will only fail if the key already exists in a different entry in the chunk's list. In that case, this entry should indeed not be connected to the stabilized list.

Reaching a decision. After stabilizing the chunk, everything is frozen, the list is completely connected, and nothing changes in the chunk anymore. At this point, we need to decide whether splitting or merging is required. To that end, a count is performed and a decision is made by comparison to min and max. It may happen that the resulting count is higher than min and lower than max, in which case no operation is required. Nevertheless, the frozen chunk is never resurrected. Instead, we copy the chunk to a new chunk in the (upcoming) recovery stage.
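The decision itself is a plain comparison of the stabilized count against min and max; a sketch (the enum names are ours, not the paper's identifiers):

```c
/* Outcomes correspond to the three recovery cases described next:
 * split (Case III), merge (Case II), or copy into a fresh chunk (Case I). */
typedef enum { DEC_SPLIT, DEC_MERGE, DEC_COPY } decision_t;

static decision_t freeze_decision(unsigned count, unsigned min, unsigned max) {
    if (count == max) return DEC_SPLIT; /* chunk is full: split into two */
    if (count == min) return DEC_MERGE; /* chunk is sparse: merge with a neighbor */
    return DEC_COPY;                    /* within bounds: still copy, never resurrect */
}
```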
Making the recovery. Once a decision is reached, a recovery starts. The recovery procedure allocates a chunk (or two) and copies the relevant information into the new chunk (or chunks). If a merge is involved, the previous chunk in the list is first frozen (under an external freeze) and both chunks bring entries to the merge. Several threads may perform the freeze procedure concurrently, but all of them will reach the same recovery decision about the freeze, as the frozen stabilized chunk looks the same to all threads. A thread that performs the recovery creates a local chunk (or chunks) into which it copies the relevant entries. At this point all threads create the same new chunk (or chunks). But now, each thread performs the operation with which it initiated the freeze on the new chunks. It can be an insert, a delete, or an enslave. Performing the operation is easy, because the new chunks are local to this thread and no race can occur. (Enslaving a chunk is simply done by modifying its freeze state from no freeze to external freeze and registering the merge buddy.) But the success of making the local operation visible in the data structure is determined by whether the thread succeeds in creating a link to its new chunks in the frozen chunk, as explained next.

After creating the new chunks locally and executing the original operation on them, the thread attempts to atomically install the address of its local chunk in a dedicated pointer of the frozen chunk (new). When two chunks are created, the second one is locally linked to the first one by the nextChunk field. If the installation is successful, then this thread has also completed the operation it was performing (insert, delete, or enslave). If the installation is unsuccessful, then a different thread has already completed the installation of new chunks, and this thread's local new chunks will not be used (i.e., they can be reclaimed). In this case, the thread must try its operation again from scratch. Depending on the number of live entries in the frozen chunk, there are three ways to recover from the freeze.

Case I: min < count < max. In this case, the required action is to allocate a new chunk and copy all of the entries from the frozen chunk to the new chunk. Next we perform the insert, delete, or enslave operation on the local new chunk and attempt to link it to the frozen one.

Case II: count == min. In this case we need to merge the frozen chunk with its previous chunk. We assume that the previous chunk has already been frozen by an external freeze before the recovery is executed, and that the freeze states in both chunks are properly set so that no thread can interfere with the freeze process. We start by checking the overall number of entries in these two chunks, to decide whether the merged entries will fit into one or two chunks. We then allocate a second new chunk, if needed, and perform the (local) copy to the new chunk or chunks. When copying into two new chunks, we split the entries evenly and return the smallest key in the second chunk as the separating key. As before, we perform the original operation that started the freeze and try to create a link from the old chunk to the new chunk or chunks.
Case III: count == max. In this case we need to split the old chunk into two new chunks. The basic operations of this case resemble those of the previous cases. We allocate two new chunks, perform the split locally, perform the original operation, and attempt to link the new chunks to the old one.
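In all three cases the final publication step is the same: each helping thread races to install its locally built replacement with a single CAS on the frozen chunk's dedicated pointer. A C sketch with assumed names (the field new_chunk stands for the pointer called "new" in the text):

```c
#include <stdatomic.h>
#include <stddef.h>

typedef struct chunk chunk_t;
struct chunk {
    _Atomic(chunk_t *) new_chunk; /* the dedicated "new" pointer */
    /* remaining chunk fields omitted */
};

/* Exactly one thread's CAS succeeds; losers reclaim their local chunks and
 * retry the insert/delete/enslave that triggered their participation. */
static int install_replacement(chunk_t *frozen, chunk_t *mine) {
    chunk_t *expected = NULL; /* the field starts out null */
    return atomic_compare_exchange_strong(&frozen->new_chunk, &expected, mine);
}
```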
6 Conclusion
We have presented chunking and freezing mechanisms that yield a cache-conscious lock-free linked list. Our list consists of chunks, each containing consecutive list entries. Thus, a traversal of the list stays mostly within a chunk's boundary (a virtual page or a cache line), and therefore enjoys a reduced number of page faults (or cache misses) compared to a traversal of randomly allocated nodes, each containing a single entry. Maintaining a linked list in chunks is often used in practice (e.g., [1,5]), but a lock-free implementation of a cache-conscious linked list has not been available heretofore. We believe that the building blocks of this list, i.e., the chunks and the freeze operation, can be used for building additional data structures, such as lock-free hash tables, among others.
References

1. Unrolled Linked Lists, http://blogs.msdn.com/devdev/archive/2005/08/22/454887.aspx
2. Full Version of Locality-Conscious Lock-Free Linked Lists, http://www.cs.technion.ac.il/~erez/Papers/lf-linked-list-full.pdf
3. Fomitchev, M., Ruppert, E.: Lock-free linked lists and skip lists. In: Proc. PODC (2004)
4. Fraser, K.: Practical lock-freedom. Technical Report UCAM-CL-TR-579, University of Cambridge, Computer Laboratory (February 2004)
5. Frias, L., Petit, J., Roura, S.: Lists revisited: Cache-conscious STL lists. J. Exp. Algorithmics 14, 3.5–3.27 (2009)
6. Harris, T.L.: A pragmatic implementation of non-blocking linked-lists. In: Proc. PODC (2001)
7. Herlihy, M.: Wait-free synchronization. TOPLAS (1991)
8. Michael, M.M.: High performance dynamic lock-free hash tables and list-based sets. In: Proc. SPAA (2002)
9. Michael, M.M.: Safe memory reclamation for dynamic lock-free objects using atomic reads and writes. In: Proc. PODC (2002)
10. Michael, M.M.: Hazard pointers: Safe memory reclamation for lock-free objects. TPDS (June 2004)
11. Herlihy, M., Shavit, N.: The Art of Multiprocessor Programming. Morgan Kaufmann, San Francisco (2008)
12. Valois, J.D.: Lock-free linked lists using compare-and-swap. In: Proc. PODC (1995)
13. Treiber, R.K.: Systems programming: Coping with parallelism. Research Report RJ 5118, IBM Almaden Research Center (1986)
Specification and Constant RMR Algorithm for Phase-Fair Reader-Writer Lock Vibhor Bhatt and Prasad Jayanti Department of Computer Science, Dartmouth College, NH, USA
Abstract. Brandenburg and Anderson [1,2] recently introduced a phase-fair readers/writers lock, where read and write phases alternate: when the writer leaves the CS, any waiting reader will enter the CS before the next writer enters the CS; similarly, if a reader is in the CS and a writer is waiting, any new reader that now enters the Try section will not enter the CS before some writer enters the CS. Thus, neither class of processes–readers or writers–has priority over the other, and no process starves. Brandenburg and Anderson [1,2] informally specify a phase-fair lock and present an algorithm to implement it with O(n) remote memory reference (RMR) complexity, where n is the number of processes in the system. In this work we give a rigorous specification of a phase-fair lock and present an algorithm that implements it with O(1) RMR complexity.
1 Introduction
Mutual exclusion [3] is a well-studied, fundamental problem in distributed computing. Here processes repeatedly cycle through four sections of code–Remainder Section, Try Section, Critical Section (CS), and Exit Section–in that order, and the problem consists of designing the code for the Try and Exit sections so that the mutual exclusion property–at most one process is in the CS at any time–is satisfied. Readers/Writers Exclusion [4] is a well-known variant of Mutual Exclusion, commonly used in operating systems and in parallel applications to implement shared data structures. In Readers/Writers Exclusion, processes are divided into two classes–readers and writers–and the exclusion property is revised to allow for more concurrency: multiple readers can be in the CS at the same time, although no process may be in the CS at the same time as a writer. Starting from the earliest paper [4], most works on Readers/Writers Exclusion studied the problem in three natural variants–one in which readers have priority over writers, one in which the writers have priority, and one in which neither class of processes–readers or writers–has priority over the other. This work deals with the third variant. When neither class has priority, Brandenburg and Anderson [1,2] suggested a desirable property, which they called the phase-fairness property, that requires read and write phases to alternate: when the writer leaves the CS, any waiting reader will enter the CS before the next writer enters the CS; similarly, if a reader is in the CS and a writer is waiting, any new reader that now enters the Try section will not enter the CS before some writer enters the CS. Their algorithm to realize this property has O(n) remote memory reference complexity (RMR complexity), where n is the number of
processes in the system. (A memory reference to a shared variable X by a processor p is considered remote in Cache Coherent (CC) machines if X is not in p's cache; it is considered remote in Distributed Shared Memory (DSM) machines if X is at a memory module of a different processor. The RMR complexity of an algorithm is the worst-case number of remote memory references that a process makes in order to execute the Try and Exit sections once.) Our paper makes two contributions. Brandenburg and Anderson stated the phase-fairness property only informally, and did not include all of the elements that one would intuitively associate with phase-fairness. Our first contribution is a more comprehensive and rigorous specification of the phase-fairness property. Our second contribution is an algorithm that achieves this property with O(1) RMR complexity on CC machines (for DSM machines, Danek and Hadzilacos' lower bound proof for 2-Session Group Mutual Exclusion implies that there is no O(1) RMR complexity algorithm for Readers/Writers exclusion), in contrast to their O(n) RMR complexity algorithm. The rest of the paper is organized as follows. Section 2 describes the model and states the basic properties required by the Reader-Writer Exclusion problem, followed by a rigorous formulation of the phase-fairness properties. Section 3 describes related work. The last two sections contain the algorithms. Section 4 describes the single-writer algorithm satisfying the phase-fairness properties. This algorithm and its description are taken almost verbatim from [5]. Section 5 shows how to transform the single-writer algorithm into a multi-writer algorithm satisfying the basic and phase-fairness properties. Proofs are omitted due to space constraints.
2 Model and Specification of the Reader-Writer Problem
The system consists of a set of n asynchronous processes {p0, . . . , pn−1}, communicating with each other through shared variables that support the atomic operations read, write, and fetch&add (F&A). Each process is labeled a reader or a writer (our algorithms work even when this labeling is not static, i.e., when the same process performs read attempts sometimes and write attempts at other times; we assume static labeling only for simplicity), and its program is a loop that consists of two sections of code – a Try section and an Exit section. We say a process is in the Remainder section if its program counter is at the first statement of the Try section, and it is in the Critical Section (CS) if its program counter is at the first statement of the Exit section. The Try section, in turn, consists of two code fragments – a doorway, followed by a waiting room – with the requirement that the doorway is bounded "straight line" code [6]. Intuitively, a process "registers" its request for the CS by executing the doorway, and then busy-waits in the waiting room until it has "permission" to enter the CS. Initially, all processes are in their Remainder section. Each execution of the Try and Exit sections by a process is called an attempt; it is a read attempt (respectively, write attempt) if the process is a reader (respectively, writer). An attempt by a process p spans from a time t when p executes the first statement of its Try section to the earliest time t′ > t when p completes the Exit section. An attempt A
is active in a configuration C of a run if A starts before C and does not complete before C. The following definitions of "precedence" and "enabled" will be useful for defining fairness properties in the next section. If A is an attempt by a process p, henceforth we write "A completes the doorway at time t" as a shorthand for "p completes the doorway at time t during the attempt A."

Definition 1. If A and A′ are any two attempts in a run (possibly by different processes), A doorway precedes A′ if A completes the doorway before A′ begins the doorway. A and A′ are doorway concurrent if neither doorway precedes the other.

Definition 2. A process p is enabled to enter the CS in configuration C if p is in the Try section in C and there is an integer b such that, in all runs from C, p enters the CS in at most b of its own steps.

The Phase-fair Reader-Writer Problem is to design the code for the Try and the Exit sections so that properties P1 through P5 stated below and the properties PF1 and PF2 of Subsection 2.2 hold in all runs of the algorithm.

– (P1) Mutual Exclusion: If a writer is in the CS at any time, then no other process is in the CS at that time.
– (P2) Bounded Exit: There is an integer b such that in every run, every process completes the Exit section in at most b of its steps.
– (P3) First-Come-First-Served (FCFS) among writers: If w and w′ are any two write attempts in a run and w doorway precedes w′, then w′ does not enter the CS before w.
– (P4) First-In-First-Enabled (FIFE) among readers: Let r and r′ be any two read attempts in a run such that r doorway precedes r′. If r′ enters the CS before r, then r is enabled to enter the CS at the time r′ enters the CS.
– (P5) Concurrent Entering: Informally, if all writers are in the remainder section, readers should be able to enter the CS in a bounded number of steps. More precisely: there is an integer b such that, if σ is any run from a reachable configuration such that all writers are in the remainder section in every configuration in σ, then every read attempt in σ executes at most b steps of the Try section before entering the CS.

Finally, we state the liveness property. When one class of processes (e.g., readers) has priority over the other class (e.g., writers), the starvation of processes belonging to the lower-priority class is unavoidable. Therefore, instead of starvation-freedom, we require the weaker livelock-freedom property, which is appropriate in all three cases of reader-priority, writer-priority, and no-priority. Livelock-freedom guarantees that, under the standard assumption that no process crashes in the middle of the Try, CS or Exit section, some process in the Try section will eventually enter the CS and some process in the Exit section will eventually enter the Remainder section.

– (P6) Livelock-freedom: If no process crashes in an infinite run, then infinitely many attempts complete in that run.
2.1 Reader-Priority and Writer-Priority Formulations
In a recent paper [5] we presented the reader- and writer-priority formulations and gave constant RMR algorithms for those cases. This submission studies the no-priority formulation, described next.

2.2 Phase-Fairness Properties
When neither readers nor writers have priority over the other, a most natural additional property to require is starvation-freedom–that no reader or writer gets stuck forever in the Try or Exit section. However, Brandenburg and Anderson have pointed out that we could demand more–that readers and writers take turns fairly, while still allowing for concurrency (by enabling multiple readers to cohabit the CS) [1,2]. Specifically, if readers are waiting when a writer leaves the CS, then all such waiting readers should be allowed to enter the CS before the next writer may enter the CS, i.e., the "session" should switch from being a "write session" to a "read session." Likewise, if a read session is in progress and one or more writers are waiting, then no new readers should be allowed into the CS. These "fair switching" properties were stated informally in Brandenburg and Anderson's work, and we formulate them rigorously below.

– (PF1) Fair switch from writer to readers: If at some time in a run a write attempt w is in the CS and a read attempt r is in the waiting room, then r enters the CS before any write attempt w′ ≠ w enters the CS in the future.
– (PF2) Fair switch from readers to writer: If at time t a read attempt is in the CS and a write attempt is in the waiting room, then some write attempt enters the CS in the future before any read attempt initiated after t enters the CS.

Our quest is to identify properties that are desirable in any algorithm that aims to ensure fairness between readers and writers. In this quest, the two properties stated above may be considered necessary, but they are surely not sufficiently strong. To see this, consider a scenario where a writer w is in the CS while a set W of writers and a set R of readers are in the waiting room. When w leaves the CS, the first property blocks writers from entering the CS until all readers in R enter the CS, but it makes no guarantee about how quickly these waiting readers enter the CS. In particular, even after w completes the Exit section and goes back to the Remainder section, the writers in W may temporarily block the readers in R from entering the CS without violating the above properties. So we state a stronger property below that guarantees that, once w's writing session is over, every reader in R will be able to enter the CS in a bounded number of its own steps. We consider w's writing session to be over as soon as either of the following two events happens: (i) w goes back to the Remainder section, or (ii) some reader or writer enters the CS after w leaves the CS.

– (PF3) Fast switch from writer to readers: Suppose that at some time t a write attempt w is in the CS and a read attempt r is in the waiting room, and t′ > t is the earliest time when w is completed or some attempt a ≠ w is in the CS. Then, at time t′, either r is in the CS or r is enabled to enter the CS.
The next lemma states that this property is stronger than the "Fair switch from writer to readers" property stated earlier.

Lemma 1. If an algorithm satisfies Mutual Exclusion and Fast switch from writer to readers, then it satisfies Fair switch from writer to readers.
3 Related Work
Courtois et al. first posed and solved the Readers/Writers problem [4]. Mellor-Crummey and Scott's algorithms [7] and their variants [8] are queue-based; they have constant RMR complexity, but do not satisfy Concurrent Entering. Anderson's algorithm [1] and Danek and Hadzilacos' algorithm [9] satisfy Concurrent Entering, but they have O(n) and O(log n) RMR complexity, respectively, where n is the number of processes. The first O(1) RMR Reader-Writer lock algorithm satisfying Concurrent Entering (P5) was designed by Bhatt and Jayanti [5]. In that work they studied all three variations of the problem–the reader-priority, writer-priority and starvation-free cases. Brandenburg and Anderson's recent work [1,2], which is most closely related to this paper, noted that Mellor-Crummey and Scott's queue-based algorithm [7] limits concurrency because of its strict adherence to a FIFO order among all waiting readers and writers. For instance, if the queue contains a writer between two readers, then the two readers cannot be in the CS together. To overcome this drawback, Brandenburg and Anderson proposed "phase-fairness", which requires readers and writers to take turns in a fair manner. The fair and fast switch properties (PF1-PF3) stated in Section 2 are intended to rigorously capture the informal requirements stated in their work. Their algorithm in [1] has O(n) RMR complexity and satisfies PF3 (and hence also PF1), but not PF2. Their phase-fair queue-based algorithm in [2] has constant RMR complexity and satisfies PF1, but not PF2 or PF3; it also fails to satisfy Concurrent Entering (P5). An algorithm in [5] has constant RMR complexity and satisfies all of PF1, PF2 and PF3 (and all of P1-P5), but it supports only a single writer. We use this algorithm as a building block to design a constant RMR algorithm that supports multiple writers and readers and satisfies all the properties from Section 2. As with Brandenburg and Anderson's algorithm in [1], our algorithm also uses fetch&add primitives. (Without the use of synchronization instructions like fetch&add, it is well known that constant RMR complexity even for the standard mutual exclusion problem is not possible [10,11,12].)

4 Single Writer Algorithm Satisfying Phase-Fair Properties
In this section, we present an algorithm that supports only a single writer and multiple readers, and satisfies the phase-fair properties ((P1)-(P6) and (PF2), (PF3)) and an additional property called writer priority, defined as follows.

(WP1) Writer Priority: If a write attempt w doorway precedes a read attempt r, then r does not enter the CS before w. (This property by itself implies PF2.)
This algorithm is very similar to the algorithm presented in Section 5 of [5], and the description given in this section is almost verbatim from Section 5.1 of [5]. We encourage a full reading of Section 5 of [5] to understand the final algorithm given in Section 5 of this paper. The overall idea is as follows. The writer can enter the CS from two sides, 0 and 1. It never changes its side during one attempt of the CS, and it toggles its side for every new attempt. To enter from a certain side, say 1, the writer sets the shared variable D to 1. Then it waits for the readers from the previous side (in this case side 0) to exit the Critical and Exit sections. The last reader to exit from side 0 lets the writer into the CS. Once the writer is done with the CS from side 1, it lets the readers waiting on side 1 into the CS, using the variable Gate described later. The readers, in their Try section, set their side d equal to D. Then they increment their count on side d and attempt to enter the CS from side d. In order to enter the CS from side d, they busy-wait on the variable Gate[d] until it is true. When the readers are exiting, they decrement their count on side d, and the last exiting reader wakes up the writer. Now we describe the shared variables used in the algorithm.

procedure Write-lock()
REMAINDER SECTION
1. prevD ← D
2. currD ← ¬prevD
3. D ← currD
4. Permit[prevD] ← false
5. if (F&A(C[prevD], [1, 0]) ≠ [0, 0])
6.   wait till Permit[prevD]
7. F&A(C[prevD], [−1, 0])
8. Gate[prevD] ← false
9. ExitPermit ← false
10. if (F&A(EC, [1, 0]) ≠ [0, 0])
11.   wait till ExitPermit
12. F&A(EC, [−1, 0])
CRITICAL SECTION
13. Gate[currD] ← true
procedure Read-lock()
REMAINDER SECTION
14. d ← D
15. F&A(C[d], [0, 1])
16. d′ ← D
17. if (d ≠ d′)
18.   F&A(C[d′], [0, 1])
19.   d ← D
20.   if (F&A(C[¬d], [0, −1]) = [1, 1])
21.     Permit[¬d] ← true
22. wait till Gate[d]
CRITICAL SECTION
23. F&A(EC, [0, 1])
24. if (F&A(C[d], [0, −1]) = [1, 1])
25.   Permit[d] ← true
26. if (F&A(EC, [0, −1]) = [1, 1])
27.   ExitPermit ← true
Fig. 1. Single-Writer Multi-Reader Algorithm satisfying Starvation-Freedom and Writer-Priority. The doorway of Write-lock comprises Lines 1-3. The doorway of Read-lock comprises Lines 14-21.
4.1 Shared Variables and Their Purpose
All shared variable names start with an upper-case letter and all local variable names start with a lower-case letter.

D: A single-bit read/write variable written only by the writer. This variable denotes the side from which the writer wants to attempt entering the CS.
Gate[d]: A Boolean read/write variable written only by the writer. Gate[d] denotes whether side d is open for the readers to enter the CS. Before entering the CS, a reader has to wait till Gate[d] = true (the gate of side d is open), where d is the side from which the reader is attempting to enter the CS. (The algorithm given in Section 5 of [5] had only one Gate ∈ {0, 1} variable, whose value at any time denoted the side opened for the readers; the change to two gates is required to make the final algorithm in Section 5 of this paper, which transforms this single-writer algorithm into a multi-writer algorithm, work correctly.)

Permit[d]: A Boolean read/write variable written and read by both readers and the writer. The writer busy-waits on Permit[d] to get permission from the readers (of side d) to enter the CS. The idea is that the last reader to exit side d wakes up the writer using Permit[d].

ExitPermit: A Boolean read/write variable written and read by both readers and the writer. It is similar to Permit, with the difference that it is used by the writer to wait for all the readers to leave the Exit section.

C[d], d ∈ {0, 1}: A fetch&add variable read and updated by both the writer and the readers. C[d] has two components, [writer-waiting, reader-count], both stored in a single word. writer-waiting ∈ {0, 1} denotes whether the writer is waiting for the readers of side d to leave the CS; it is updated only by the writer. reader-count denotes the number of readers currently registered on side d.

EC: A fetch&add variable read and updated by both the writer and the readers. Like C[d], it has two components, [writer-waiting, reader-count]. writer-waiting ∈ {0, 1} denotes whether the writer is waiting for the readers to complete the Exit section. reader-count denotes the number of readers currently in the Exit section.

The following theorem summarizes the properties of this algorithm.

Theorem 1 (Single-Writer Multi-Reader Phase-fair lock). The algorithm in Figure 1 implements a Single-Writer Multi-Reader lock satisfying the properties (P1)-(P6) and (PF2), (PF3). The RMR complexity of the algorithm in the CC model is O(1). The algorithm uses O(1) shared variables that support read, write, and fetch&add.
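Since both components of C[d] and EC live in a single word, each two-component F&A in Figure 1 is an ordinary fetch&add of a shifted constant. A C sketch with an assumed field split (reader-count in the low bits, writer-waiting above it):

```c
#include <stdint.h>
#include <stdatomic.h>

#define READER_ONE ((uint64_t)1)        /* F&A(X, [0, 1]): reader-count += 1 */
#define WRITER_ONE ((uint64_t)1 << 32)  /* F&A(X, [1, 0]): writer-waiting += 1 */

/* e.g. Line 5 of Write-lock: announce the writer on side prevD and fetch
 * the old value in the same atomic step. */
static uint64_t announce_writer(_Atomic uint64_t *c) {
    return atomic_fetch_add(c, WRITER_ONE);
}

/* e.g. Line 24 of Read-lock: leave side d, learning atomically whether a
 * writer is waiting and whether we were the last registered reader. */
static int last_reader_with_writer_waiting(_Atomic uint64_t *c) {
    uint64_t old = atomic_fetch_sub(c, READER_ONE);
    return old == (WRITER_ONE | READER_ONE); /* the old value was [1, 1] */
}
```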
5 Multi-Writer Multi-Reader Phase-Fair Algorithm
In this section we describe how we construct our multi-writer multi-reader phase-fair lock using the single-writer writer-priority lock given in Figure 1. We denote this single-writer lock by SW in the rest of this section. In all the algorithms we discuss in this section, the code of the Read-lock() procedure will be identical to the algorithm given in SW. The writers, on the other hand, will use a Mutual Exclusion lock to ensure that only one writer accesses the underlying SW. More precisely, a writer first needs to enter the CS of the Mutual Exclusion lock and then compete with the readers in the single-writer protocol from Figure 1. The Mutual Exclusion lock we use was designed by T. Anderson [13]; it is a constant RMR Mutual Exclusion lock satisfying P3 and P6. We use the procedures acquire(M) and release(M) to denote the Try and Exit sections of this lock.
Notations used in this section: We write SW-Write-try (respectively, SW-Read-try) for the Try section code of the writer (respectively, reader) in the single-writer algorithm given in Figure 1. Similarly, we use SW-Write-exit (respectively, SW-Read-exit) for the Exit section code of the writer (respectively, reader).

We first present a simple but incorrect multi-writer algorithm in Figure 2. This algorithm is exactly the same as the multi-writer starvation-free algorithm from [5]. The readers simply execute SW-Read-try followed by SW-Read-exit. The writers first obtain a mutual exclusion lock M, then execute SW-Write-try followed by SW-Write-exit, and finally release M. As far as the underlying single-writer protocol is concerned, there is only one writer executing at any time and it executes exactly the same steps as in the multi-writer version. Hence one can easily see that the algorithm satisfies (P1)-(P6). In fact, it also satisfies the fast switch from writer to readers (PF3): say a writer is in the CS (say from side d); all the readers in the waiting room are waiting for Gate[d] to open. So when the writer opens Gate[d] in SW-Write-exit(), all the waiting readers get enabled.

procedure Write-lock()
REMAINDER SECTION
1. acquire(M)
2. SW-Write-try()
CRITICAL SECTION
3. SW-Write-exit()
4. release(M)
procedure Read-lock()
REMAINDER SECTION
5. SW-Read-try()
CRITICAL SECTION
6. SW-Read-exit()
Fig. 2. Simple but incorrect Phase-Fair Multi-Writer Multi-Reader algorithm
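The layering in Fig. 2 can be sketched directly in C; the SW_* names below stand for the Try/Exit sections of the single-writer lock of Figure 1 and, like M, are our notation:

```c
/* Composition of Fig. 2: a mutual exclusion lock M (T. Anderson's
 * constant-RMR lock [13]) serializes writers in front of the single-writer
 * protocol SW; readers use SW directly. All identifiers are assumed. */
extern void acquire(void *lock);
extern void release(void *lock);
extern void SW_write_try(void);
extern void SW_write_exit(void);
extern void SW_read_try(void);
extern void SW_read_exit(void);

static void *M; /* the mutual exclusion lock */

void write_lock(void)   { acquire(M); SW_write_try(); }  /* Lines 1-2 */
void write_unlock(void) { SW_write_exit(); release(M); } /* Lines 3-4 */
void read_lock(void)    { SW_read_try(); }               /* Line 5 */
void read_unlock(void)  { SW_read_exit(); }              /* Line 6 */
```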
But this algorithm does not satisfy Fair switch from readers to writer (PF2). To see this, consider the following scenario. Say a reader is in the CS and no other processes are active. Then a writer w enters the Try section and executes the doorway of the lock M; hence w is in the waiting room of the multi-writer algorithm. At this point all the writers are still in the Remainder section of SW, because the only active writer w has not even started SW-Write-lock. This means that any new reader who begins its Try section now can go past w into the CS, thus violating (PF2). As mentioned in the previous section, SW satisfies the (WP1) property: if the writer completes the doorway before a reader starts its doorway, then the reader does not enter the CS before the writer. Also note that in the doorway of SW, the writer just flips the direction variable D, and at that point Gate[D] is closed. So a tempting idea to overcome the troubling scenario described above is to make the incoming writer execute the doorway (toggle D) before it enters the waiting room of the multi-writer algorithm (essentially the waiting room in acquire(M)). One obvious problem with this approach is that the direction variable D will keep flipping as new writers enter the waiting room, thus violating the invariants and correctness of SW. Another idea is that the exiting writer, say w, before exiting SW (which just comprises opening Gate[currD]), flips the direction variable D. This way only the readers currently waiting are enabled, but any new readers starting their Try section will be blocked. This idea works when there is some writer in the Try section at the time w is exiting. But if w is the only active writer in the system and it flips the direction
variable D in the Exit section, then Gate[D] will remain closed until the next writer flips D again. Hence, starvation-freedom and concurrent entering might be in danger. One way to prevent this might be that w, before opening Gate[d], checks whether there are any writers in the Try section, and flips the direction only if it sees a writer present. One inherent difficulty with this idea is the following: if w does not see any writers in the Try section and is poised to open Gate[d] for the waiting readers, and just then a bunch of writers enter the waiting room before w opens the gate, property PF2 might be in jeopardy. In this case, one might be tempted to think that one of the writers in the Try section should flip the direction. But which of these writers should flip it? Should the writer with the best priority (the one with the smallest token in the lock M), say w∗, flip the direction? But if w∗ sleeps before flipping the direction while many other writers enter the waiting room and some reader is in the CS, (PF2) is again in danger.

The discussion above shows the challenges in designing a multi-writer algorithm satisfying all of the phase-fairness properties; the (PF2) property, in particular, seems hard to achieve. Before we describe our correct phase-fair multi-writer algorithm, presented in Figure 3, we lay down two essential invariants for any phase-fair algorithm in which the readers simply execute the Read-lock procedure of SW, and the writers obtain a mutual exclusion lock M before executing the Write-lock procedure of SW:

i. if no writers are active, then Gate[D] = true (required for starvation-freedom and concurrent entering);
ii. if a reader is in the CS and a writer is in the waiting room, then Gate[D] = false (required for PF2).

Now we are ready to describe our correct phase-fair multi-writer algorithm, given in Figure 3. First we give the overall idea.

5.1 Informal Description of the Algorithm in Figure 3
The algorithm given in Figure 3 follows the lines of the discussion above. Both readers and writers use the underlying single-writer protocol SW. The Read-lock procedure is the same as in SW. The writers first obtain a Mutual Exclusion lock M and then execute the Write-lock procedure of SW. We make one slight but crucial change in the way processes execute SW: we replace the direction variable D with a fetch&add variable Y. When a process wants to know the current direction, it calls the procedure GetD(), which in turn reads Y to determine the appropriate direction. The crux of the algorithm lies in the use of Y, which we explain later. We denote Lines 4-12 of SW, i.e., the waiting room of Write-lock, by SW-w-waiting. Similarly, SW-r-exit denotes the Exit section of the reader in SW, i.e., Lines 23-27 of SW.

5.2 Shared Variables and Their Purpose
The shared variables Gate, C, EC, Permit and ExitPermit and the writer's local variables currD and prevD have the same type and purpose as in SW. As mentioned earlier, the direction
Shared Variables:
Y is a fetch&add variable with two components [dr ∈ {0, 1}, wcount ∈ N], initialized to [0, 0]
ExitPermit is a Boolean read/write variable
∀d ∈ {0, 1}, Permit[d] is a Boolean read/write variable
∀d ∈ {0, 1}, Gate[d] is a Boolean read/write variable; initially Gate[0] is true and Gate[1] is false
EC is a fetch&add variable with two components [writer-waiting ∈ {0, 1}, reader-count ∈ N]; initially EC = [0, 0]
∀d ∈ {0, 1}, C[d] is a fetch&add variable with two components [writer-waiting ∈ {0, 1}, reader-count ∈ N], initialized to [0, 0]

procedure Write-lock()
REMAINDER SECTION
1. F&A(Y, [0, 1])
2. acquire(M)
3. currD ← GetD()
4. prevD ← ¬currD
5. SW-w-waiting()
CRITICAL SECTION
6. F&A(Y, [1, −1])
7. Gate[currD] ← true
8. release(M)

procedure Read-lock()
REMAINDER SECTION
9. d ← GetD()
10. F&A(C[d], [0, 1])
11. d′ ← GetD()
12. if (d ≠ d′)
13.   F&A(C[d′], [0, 1])
14.   d ← GetD()
15.   if (F&A(C[¬d], [0, −1]) = [1, 1])
16.     Permit[¬d] ← true
17. wait till Gate[d]
CRITICAL SECTION
18. SW-r-exit()

procedure GetD()
19. (dr, wc) ← Y
20. if (wc = 0)
21.   return dr
22. return ¬dr

Fig. 3. Phase-Fair Multi-Writer Multi-Reader Algorithm. The doorway of Write-lock comprises Line 1 and the doorway of acquire(M). The doorway of Read-lock comprises Lines 9-16.
variable D from SW has been replaced by a new fetch&add variable Y. We now describe this variable in detail.

Y: A fetch&add variable updated only by the writers and read by both the readers and the writers. Y has two components, [dr, wcount]. The component dr ∈ {0, 1} indicates the direction of the writers (corresponding to the replaced direction variable D); intuitively, dr is the direction of the last writer to be in the CS. The wcount component counts the writers in the Try section and the CS. A writer increments the wcount component at the beginning of its Try section, and in its Exit section it decrements the wcount component and flips the dr component. We assume the dr bit is the most significant bit of the word; hence the writer only has to add 1 at the most significant bit to flip it.

How is Y able to give the appropriate direction? When a writer flips the dr component in the Exit section, there are two possibilities: either no writer is in the Try section, or some writer is present in the Try section. In the former case, this flipping of the direction should not change the "real" direction, and in the latter case it should. Here is where the wcount component comes into play. If no writer is present and some reader r reads Y to determine the direction (Lines 9, 11 or 14), r will notice that the wcount component of Y is zero, and hence it will infer the direction to be dr, i.e., the direction of the last writer to be in the CS. On the other hand, if some writer is present in the Try section or CS,
then wcount = 0, so r can infer that some writer started an attempt since the last writer has exited, hence it will take the direction as the complement of dr. This mapping of the value of Y to the appropriate direction of the writer is done by the procedure GetD() (Lines 19-22). Now we explain the algorithm in detail line by line. 5.3 Line by Line Commentary The Read-lock procedure is exactly the same as in SW with the only difference that instead of reading D at Lines 9, 11 or 14, the reader takes the return value from the procedure GetD(), which in turn extracts the direction based on the value of Y as described above. The doorway of the reader comprises of Lines 9-16. Now we explain the Write-lock procedure in detail. In the Write-lock procedure, a writer w first increments the wcount component of Y (Line 1). Note that if wcount of Y was zero just before w executes Line 1, then w has implicitly flipped the direction, and if it was non-zero already then the direction is unchanged. Now w tries to acquire the lock M (Line 2) and proceeds to Line 3 when it is in the CS of lock M . Also note that, the configuration of SW at this point is as if w has already executed its doorway (of SW ), we will get into this detail more when we explain the code of the Exit section of Write-lock (Lines 6-8). w sets its local variable currD and prevD appropriately (Lines 3-4). Then w executes the waiting room of SW (Lines 4-12 of Figure 1) to compete with the readers (Line 5). Once out of this waiting room, w enters the CS as it is assured that no process (reader or a writer) is present in the CS. Before we describe the Exit section, note that all the readers currently in waiting room are waiting on Gate[currD], and at this point both Gate[0], Gate[1] are closed. In the first statement on the Exit section, the writer flips the dr component of Y and at the same time decrements the wcount component (Line 6).7 Note that at this point (when PCw = 7), if there are no active writers other than w in the system, wcount would be zero and the direction would be exactly the same as currD. Hence the invariant i. mentioned in the previous subsection holds. Similarly, if there is some writer present in the Try section (Y.wcount > 0), then w has flipped the direction to currD, essentially w has executed the doorway of SW for the next writer. At Line 7, w enables the waiting readers by opening Gate[currD]. Note at this point if there is a writer in the waiting room, the direction is equal to currD and Gate[currD] is closed, hence the invariant ii. from previous subsection holds. Then finally w releases the lock M (Lines 8). Following theorem summarizes the properties of this algorithm. Theorem 2 (Multi-Writer Multi-Reader Phase-fair lock). The algorithm in Figure 3 implements a Multi-Writer Multi-Reader lock satisfying the properties (P1)-(P6) and (PF2), (PF3) using the lock M from [13] and the algorithm in Figure 1. The RMR complexity of the algorithm in the CC model is O(1). The algorithm uses O(m) number of shared variables that support read, write, fetch&add, where m is the number of writers in the system. 7
⁷ We assume that the dr bit is the most significant bit of the word storing Y, and any overflow bit is simply dropped. Hence w only has to fetch&add [1, −1] to Y to atomically decrement wcount and flip dr.
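To make this encoding concrete, the following Java sketch (ours, not the authors' code) models Y with an AtomicLong standing in for a hardware fetch&add word, assuming dr occupies the most significant bit of a 64-bit word and wcount the remaining bits:

import java.util.concurrent.atomic.AtomicLong;

// Sketch of the Y variable: bit 63 holds dr, the low 63 bits hold wcount.
// Java's wrap-around long arithmetic drops the overflow bit, which realizes
// the "add 1 at the most significant bit to flip dr" trick.
class YVariable {
    private static final long DR_BIT = 1L << 63;      // Long.MIN_VALUE
    private final AtomicLong y = new AtomicLong(0);

    // Writer's Line 1: increment wcount.
    void writerEnter() { y.getAndAdd(1); }

    // Writer's Line 6: fetch&add [1, -1]. Adding DR_BIT - 1 (= 0x7FF...F)
    // atomically flips dr and decrements wcount, given wcount >= 1 while
    // a writer is present.
    void writerExit() { y.getAndAdd(DR_BIT - 1); }

    // GetD() (Lines 19-22): with wcount == 0 the direction is dr itself;
    // otherwise it is the complement of dr.
    int getD() {
        long v = y.get();
        int dr = (v & DR_BIT) != 0 ? 1 : 0;
        return (v & ~DR_BIT) == 0 ? dr : 1 - dr;
    }
}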
References

1. Brandenburg, B.B., Anderson, J.H.: Reader-writer synchronization for shared-memory multiprocessor real-time systems. In: ECRTS 2009: Proceedings of the 21st Euromicro Conference on Real-Time Systems, Washington, DC, USA, pp. 184–193. IEEE Computer Society, Los Alamitos (2009)
2. Brandenburg, B.B., Anderson, J.H.: Spin-based reader-writer synchronization for multiprocessor real-time systems. Submitted to Real-Time Systems (December 2009), http://www.cs.unc.edu/~anderson/papers/rtj09-for-web.pdf
3. Dijkstra, E.W.: Solution of a problem in concurrent programming control. Commun. ACM 8(9), 569 (1965)
4. Courtois, P.J., Heymans, F., Parnas, D.L.: Concurrent control with "readers" and "writers". Commun. ACM 14(10), 667–668 (1971)
5. Bhatt, V., Jayanti, P.: Constant RMR solutions to reader-writer synchronization. In: PODC 2010: Proceedings of the 29th ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing, pp. 468–477. ACM, New York (2010)
6. Lamport, L.: A new solution of Dijkstra's concurrent programming problem. Commun. ACM 17(8), 453–455 (1974)
7. Mellor-Crummey, J.M., Scott, M.L.: Scalable reader-writer synchronization for shared-memory multiprocessors. In: PPOPP 1991: Proceedings of the Third ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 106–113. ACM, New York (1991)
8. Lev, Y., Luchangco, V., Olszewski, M.: Scalable reader-writer locks. In: SPAA 2009: Proceedings of the Twenty-First Annual Symposium on Parallelism in Algorithms and Architectures, pp. 101–110. ACM, New York (2009)
9. Hadzilacos, V., Danek, R.: Local-spin group mutual exclusion algorithms. In: Liu, H. (ed.) DISC 2004. LNCS, vol. 3274, pp. 71–85. Springer, Heidelberg (2004)
10. Cypher, R.: The communication requirements of mutual exclusion. In: SPAA 1995: Proceedings of the Seventh Annual ACM Symposium on Parallel Algorithms and Architectures, pp. 147–156. ACM, New York (1995)
11. Attiya, H., Hendler, D., Woelfel, P.: Tight RMR lower bounds for mutual exclusion and other problems. In: STOC 2008: Proceedings of the 40th Annual ACM Symposium on Theory of Computing, pp. 217–226. ACM, New York (2008)
12. Anderson, J.H., Kim, Y.-J.: An improved lower bound for the time complexity of mutual exclusion. Distrib. Comput. 15(4), 221–253 (2002)
13. Anderson, T.E.: The performance of spin lock alternatives for shared-memory multiprocessors. IEEE Trans. Parallel Distrib. Syst. 1(1), 6–16 (1990)
On the Performance of Distributed Lock-Based Synchronization

Yuval Lubowich and Gadi Taubenfeld
The Interdisciplinary Center, P.O. Box 167, Herzliya 46150, Israel
{yuval,tgadi}@idc.ac.il
Abstract. We study the relation between two classical types of distributed locking mechanisms, called token-based locking and permission-based locking, and several distributed data structures which use locking for synchronization. We have proposed, implemented and tested several lock-based distributed data structures, namely, two different types of counters, called find&increment and increment&publish, a queue, a stack and a linked list. For each of them we have determined the preferred type of lock to use as the underlying locking mechanism. Furthermore, we have determined which of the two proposed counters performs better, both as a stand-alone data structure and as a building block for implementing other high-level data structures.
Keywords: Locking, synchronization, distributed mutual exclusion, distributed data structures, message passing, performance analysis.
1 Introduction

1.1 Motivation and Objectives

Simultaneous access to a data structure shared among several processes, in a distributed message passing system, must be synchronized in order to avoid interference between conflicting operations. Distributed mutual exclusion locks are the de facto mechanism for concurrency control on distributed data structures. A process accesses the data structure only while holding the lock, and hence the process is guaranteed exclusive access. Over the years a variety of techniques have been proposed for implementing distributed mutual exclusion locks. These locks can be grouped into two main classes: token-based locks and permission-based locks. In token-based locks, a single token is shared by all the processes, and a process acquires the lock (i.e., is allowed to enter its critical section) only when it possesses the token. Permission-based locks are based on the principle that a process acquires the lock only after having received "enough" permissions from other processes. Our first objective is:

Objective one. To determine which of the two locking techniques – token-based locking or permission-based locking – is more efficient. Our strategy for achieving this objective is to implement one classical token-based lock (Suzuki-Kasami's lock [28]) and two classical permission-based locks (Maekawa's lock [10] and Ricart-Agrawala's lock [22]), and to compare their performance.
It is possible to trivially implement a lock by letting a single pre-defined process (machine) act as an "arbiter", or even by letting all the data structures reside in the local memory of a single process and letting this process impose a definite order between concurrent operations. Such a centralized solution might be preferred in some situations, although it limits the degree of concurrency, imposes an extra load on the arbiter, and is less robust. In this work, we focus on fully distributed implementations of locks. Locks are just a tool used when implementing various distributed applications; thus, our second objective has to do with implementing lock-based data structures.

Objective two. To propose, implement and test several distributed data structures, namely, two different types of counters, a queue, a stack and a linked list; and for each of the data structures to determine the preferred mutual exclusion lock to use as the underlying locking mechanism.

Furthermore, the worst-case message complexity of one of the permission-based locks (i.e., Maekawa's lock) is better than the worst-case message complexity of the other two locks. It would be interesting to find out whether this theoretical result is reflected in our performance analysis results. In a shared memory implementation of a data structure, the shared data is usually stored in the shared memory. But who should hold the data in a distributed message passing system? One process? All of them? In particular, when implementing a distributed counter, who should hold the current value of the counter? To address this question, for the case of a shared counter, we have implemented and compared two types of shared counters: a find&increment counter, where only the last process to update the counter needs to know its value; and an increment&publish counter, where everybody should know the value of the counter after each time the counter is updated. We notice that in the find&increment counter, once the counter is updated there is no need to "tell" its new value to everybody, but in order to update such a counter one has to find its current value first. In the increment&publish counter the situation is the other way around.

Objective three. To determine which of the two proposed counters performs better, both as a stand-alone data structure and as a building block for implementing other high-level data structures (such as a queue, a stack or a linked list).

We point out that, in our implementations of a queue, a stack, and a linked list, the shared data is distributed among all the processes; that is, all the items inserted by a process are kept in the local memory of that process.

1.2 Experimental Framework and Performance Analysis

In order to measure and analyze the performance of the proposed five data structures and of the three locks, we have implemented and run each data structure with each of the three implemented distributed mutual exclusion locks as the underlying locking mechanism. We have measured each data structure's performance, when using each of the locks, on a network with one, five, ten, fifteen and twenty processes, where
each process runs in a different node. A typical simulation scenario of a shared counter looked like this: use 15 processes to count up to 15 million by using a find&increment counter that employs Maekawa's lock as its underlying locking mechanism. The queue, stack, and linked list were also tested using each of the two counters as a building block, in order to determine which of the two counters performs better when used as a building block for implementing other high-level data structures. Special care was taken to make the experiments more realistic by preventing runs which would display an overly optimistic performance; for example, preventing runs where a process completes several operations while acquiring and holding the lock once. Our testing environment consisted of 20 Intel XEON 2.4 GHz machines running the Windows XP OS with 2GB of RAM and using JRE version 1.4.2_08. All the machines were located inside the same LAN and were connected using a 20-port Cisco switch.

1.3 Our Findings

The experiments, as reported in Section 5, lead to the following conclusions:

1. Permission-based locking always outperforms token-based locking. That is, each of the two permission-based locks always outperforms the token-based lock. This result about locks may suggest that, in general, when implementing distributed data structures, it is better to take the initiative and search for information (i.e., ask for permissions) when needed, instead of waiting for your turn.
2. Maekawa's permission-based lock always outperforms the Ricart-Agrawala and Suzuki-Kasami locks when used as the underlying locking mechanism for implementing the find&increment counter, the increment&publish counter, a queue, a stack, and a linked list. Put another way, for each of the five data structures, the preferred lock to use as the underlying locking mechanism is always Maekawa's lock. The worst-case message complexity of Maekawa's lock is better than the worst-case message complexity of both the Ricart-Agrawala and Suzuki-Kasami locks; thus, the performance analysis supports and confirms the theoretical analysis.
3. The find&increment counter always outperforms the increment&publish counter, both as a stand-alone data structure and as a building block for implementing a distributed queue and a distributed stack. This result about counters may suggest that, in general, when implementing distributed data structures, it is more efficient to actively search for information only when it is needed, instead of distributing it in advance.

As expected, all the data structures exhibit performance degradation as the number of processes grows.
2 Distributed Mutual Exclusion Algorithms

We consider a system which is made up of n reliable processes, denoted p1, ..., pn, which communicate via message passing. We assume that the reader is familiar with the definition of the mutual exclusion problem [4, 29]. The three mutual exclusion algorithms (i.e., locks) implemented in this work satisfy the mutual exclusion and starvation freedom requirements.
The first published distributed mutual exclusion algorithm, due to Lamport [8], is based on the notion of logical clocks. Over the years a variety of techniques have been proposed for implementing distributed mutual exclusion locking algorithms [19, 20, 25, 26]. These algorithms can be grouped into two main classes: token-based algorithms [13, 18, 27, 28] and permission-based algorithms [3, 8, 10, 12, 22, 24].

– In token-based algorithms, a single token is shared by all the processes, and a process is allowed to enter its critical section only when it possesses the token. A process continues to hold the token until its execution of the critical section is over, and then it may pass it to some other process.
– Permission-based algorithms are based on the principle that a process may enter its critical section only after having received "enough" permissions from other processes. Some permission-based algorithms require that a process receive permissions from all of the other processes, whereas other, more efficient algorithms require a process to receive permissions from a smaller group.

Below we describe the basic principles of the three known distributed mutual exclusion algorithms that we have implemented.

2.1 Suzuki-Kasami's Token-Based Algorithm

In Suzuki and Kasami's algorithm [28], the privilege to enter a critical section is granted to the process that holds the PRIVILEGE token (which is always held by exactly one process). Initially process p1 has the privilege. A process requesting the privilege sends a REQUEST message to all other processes. A process receiving a PRIVILEGE message (i.e., the token) is allowed to enter its critical section repeatedly until it passes the PRIVILEGE to some other process. A REQUEST message of process pj has the form REQUEST(j, m), where j is the process identifier and m is a sequence number which indicates that pj is requesting its (m + 1)th critical section invocation. Each process has an array RN of size n, where n is the number of processes. This array is used to record the largest sequence number ever received from each of the other processes. When a REQUEST(j, m) message is received by pi, the process updates RN by executing RN[j] = max(RN[j], m). A PRIVILEGE message has the form PRIVILEGE(Q, LN), where Q is a queue of requesting processes and LN is an array of size n such that LN[j] is the sequence number of the most recently granted request of pj. When pi finishes executing its critical section, the array LN contained in the last PRIVILEGE message received by pi is updated by executing LN[i] = RN[i], indicating that the current request of pi has been granted. Next, every process pj such that RN[j] = LN[j] + 1 is appended to Q, provided that pj is not already in Q. When these updates are completed, if Q is not empty then PRIVILEGE(tail(Q), LN) is sent to the process at the head of Q. If Q is empty then pi retains the privilege until some process requests it. The algorithm requires, in the worst case, n message exchanges per mutual exclusion invocation: (n − 1) REQUEST messages and one PRIVILEGE message. In the best case, when the process requesting to enter its critical section already holds the privilege token, the algorithm requires no messages at all.
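To make the release step concrete, here is a minimal Java sketch of the bookkeeping just described; it is our illustration, not the paper's code, and the message transport behind sendPrivilege is left abstract.

import java.util.ArrayDeque;
import java.util.Queue;

// Sketch of one Suzuki-Kasami site; only the request and release
// bookkeeping is shown, and network delivery is omitted.
class SuzukiKasamiSite {
    final int i;            // this process' identifier
    final int n;            // number of processes
    final long[] rn;        // RN: largest sequence number seen per process
    Token token;            // non-null iff this site holds the PRIVILEGE

    static class Token {
        final Queue<Integer> q = new ArrayDeque<>(); // Q: waiting processes
        final long[] ln;                             // LN: last granted request
        Token(int n) { ln = new long[n]; }
    }

    SuzukiKasamiSite(int i, int n) { this.i = i; this.n = n; rn = new long[n]; }

    // On receiving REQUEST(j, m): remember the largest request seen so far.
    void onRequest(int j, long m) { rn[j] = Math.max(rn[j], m); }

    // Exit of the critical section: record our request as granted, append
    // every process with an outstanding request, then pass the token on.
    void releaseCriticalSection() {
        token.ln[i] = rn[i];
        for (int j = 0; j < n; j++)
            if (rn[j] == token.ln[j] + 1 && !token.q.contains(j))
                token.q.add(j);
        if (!token.q.isEmpty()) {
            int next = token.q.remove();   // head of Q; remainder is tail(Q)
            sendPrivilege(next, token);    // PRIVILEGE(tail(Q), LN)
            token = null;
        }                                  // else: retain the privilege
    }

    void sendPrivilege(int dest, Token t) { /* network send, omitted */ }
}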
2.2 Ricart-Agrawala's Permission-Based Algorithm

The first permission-based algorithm, due to Lamport [8], has 3(n − 1) message complexity. Ricart and Agrawala modified Lamport's algorithm to achieve 2(n − 1) message complexity [22]. In this algorithm, when a process, say pi, wants to enter its critical section, it sends a REQUEST(m, i) message to all other processes. This message contains a sequence number m and the process' identifier i, which are then used to define a priority among requests. Process pj, upon receipt of a REQUEST message from process pi, sends an immediate REPLY message to pi if either pj itself has not requested to enter its critical section, or pj's request has a lower priority than that of pi. Otherwise, process pj defers its REPLY (to pi) until its own (higher priority) request is granted. Process pi enters its critical section when it receives REPLY messages from all the other n − 1 processes. When pi releases the critical section, it sends a REPLY message to all deferred requests. Thus, a REPLY message from process pi implies that pi has finished executing its critical section. This algorithm requires only 2(n − 1) messages per critical section invocation: n − 1 REQUEST messages and n − 1 REPLY messages.

2.3 Maekawa's Permission-Based Algorithm

In Maekawa's algorithm [10], process pi acquires permission to enter its critical section from a set of processes, denoted Si, which consists of about √n processes that act as arbiters. The algorithm uses only c√n messages per critical section invocation, where c is a constant between 3 for light traffic and 5 for heavy traffic. Each process can issue a request at any time. In order to arbitrate requests, any two requests from different processes must be known to at least one arbiter process. Since process pi must obtain permission to enter its critical section from every process in Si, the intersection of every pair of sets Si and Sj must not be empty, so that processes in Si ∩ Sj can serve as arbiters between conflicting requests of pi and pj. There are many efficient constructions of the sets S1, ..., Sn (see for example [7, 23, 15]). The construction used in our implementation is as follows: Assume that √n is a positive integer (if not, a few dummy processes can be added). Consider a matrix of size √n × √n, where the value of entry (i, j) in the matrix is (i − 1) × √n + j. Clearly, for every k ∈ {1, ..., n} there is exactly one entry, denoted (ik, jk), whose value is k. The unique entry (ik, jk) is (⌈k/√n⌉, ((k − 1) mod √n) + 1). For each k ∈ {1, ..., n}, a subset Sk is defined to be the set of values on the row and the column passing through (ik, jk). Clearly, Si ∩ Sj ≠ ∅ for all pairs i and j (and the size of each set is 2√n − 1). Thus, whenever two processes pi and pj try to enter their critical sections, the arbiter processes in Si ∩ Sj will grant access to only one of them at a time, and thus the mutual exclusion property is satisfied. By carefully designing the algorithm, deadlock is also avoided.
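A minimal Java sketch of this grid construction (ours, assuming n is a perfect square) is:

import java.util.ArrayList;
import java.util.List;

// Grid-based quorum construction: process k's set S_k is the row and the
// column of the sqrt(n) x sqrt(n) matrix passing through the entry whose
// value is k.
class MaekawaQuorums {
    static List<Integer> quorum(int k, int n) {    // k in {1,...,n}
        int r = (int) Math.sqrt(n);                // assumes r * r == n
        int ik = (k - 1) / r;                      // 0-based row of entry k
        int jk = (k - 1) % r;                      // 0-based column of entry k
        List<Integer> s = new ArrayList<>();
        for (int j = 0; j < r; j++) s.add(ik * r + j + 1);   // the row
        for (int i = 0; i < r; i++)
            if (i != ik) s.add(i * r + jk + 1);              // the column
        return s;                                  // |S_k| = 2*sqrt(n) - 1
    }
}

For example, with n = 9, quorum(5, 9) returns {4, 5, 6, 2, 8}; any two such sets share at least one arbiter, because one set's row always crosses the other set's column.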
3 Distributed Data Structures

We have proposed and implemented five distributed data structures: two types of counters, a queue, a stack and a linked list. Each one of these data structure implementations
makes use of an underlying locking mechanism. As already mentioned, we have implemented the three mutual exclusion algorithms described in the previous section, and for each of the five data structures determined the preferred mutual exclusion algorithm to be used for locking. Various lock-based data structures have been proposed in the literature, mainly for use in databases; see for example [2, 5, 9]. A distributed dictionary structure is studied in [16]. Below we describe the five distributed data structures that we have studied. All the data structures are linearizable. Linearizability means that, although several processes may concurrently invoke operations on a linearizable data structure, each operation appears to take place instantaneously at some point in time, and the relative order of non-concurrent operations is preserved [6].

Two counters. A shared counter is a linearizable data structure that supports the single operation of incrementing the value of the counter by one and returning its previous value. We have implemented and compared two types of shared counters:

1. A find&increment counter. In this type of counter, only the last process to update the counter needs to know its value. In the implementation, a single lock is used, and only the last process to increment the counter knows its current value. A process p that tries to increment the shared counter first acquires the lock. Then p sends a FIND message to all other processes. When the process that knows the value of the counter receives a FIND message, it replies by sending a message with the value of the counter to p. When p receives the message, it increments the counter and releases the lock. (We notice that p can keep on incrementing the counter's value until it gets a FIND message.)

2. An increment&publish counter. In this counter, everybody should know the value of the counter each time it is updated. In the implementation, a single lock is used. A process that tries to increment the shared counter first acquires the lock. Then, it increments the counter value, sends messages to all other processes informing them of the new counter value, gets acknowledgements, and releases the lock.

A queue. A distributed queue is a linearizable data structure that supports enqueue and dequeue operations, by several processes, with the usual queue semantics. We have implemented a distributed queue which consists of local queues residing in the individual processes participating in the distributed queue. A single lock and a shared counter are used for the implementation. Each element in the queue has a timestamp that is generated using the shared counter. An ENQUEUE operation is carried out by incrementing the counter's value by one and enqueuing the element in the local queue along with the counter's value. A DEQUEUE operation is carried out by first acquiring the lock, locating the process that holds the element with the lowest timestamp, removing this element from that process' local queue, and releasing the lock.

A stack. A distributed stack is a linearizable data structure that supports push and pop operations, by several processes, with the usual stack semantics. We have implemented a distributed stack which is similar to the distributed queue. A single lock and a shared counter are used for the implementation. It consists of local stacks residing in the
individual processes participating in the distributed stack. Each element in the stack has a timestamp that is generated by the shared counter. A PUSH operation is carried out by incrementing the counter value by one and pushing the element into the local stack along with the counter's value. A POP operation is carried out by acquiring the lock, locating the process that contains the element with the highest timestamp, removing this element from its local stack, and releasing the lock.

A linked list. A distributed linked list is a linearizable data structure that supports insertion and deletion of elements at any point in the list. We have implemented a list which consists of a sequence of elements, each containing a data field and two references ("links") pointing to the next and previous elements. Each element can reside in any process. The distributed list also supports the operations "traverse list" and "size of list". The list contains head and tail "pointers" that can be sent to requesting processes. Each pointer maintains a reference to a certain process and a pointer to a real element stored in that process. Manipulating the list requires that a lock be acquired. A process that needs to insert an element at the head of the list acquires the lock, and sends a request for the "head pointer" to the rest of the processes. Whenever the process that holds the "head pointer" receives the message, it immediately replies by sending the pointer to the requesting process. Once the requesting process has the "head pointer", inserting the new element is purely a matter of storing it locally and modifying the "head pointer" to point to the new element (the new element, of course, now points to the element previously pointed to by the "head pointer"). Deleting an element from the head of the list is done much the same way. Inserting or deleting elements from the list requires a process to acquire the (single) lock, traverse the list and manipulate the list's elements. A process is able to measure the size of the list by acquiring the lock and then querying the other processes about the sizes of their local lists.
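The contrast between the two counter protocols can be sketched schematically. In the following Java outline, the lock and the messaging layer are left abstract; broadcastFind, awaitValue, and the other helpers are hypothetical names of ours, not an API from the paper.

// Schematic sketch of the two counter protocols described above.
class FindAndIncrementCounter {
    long increment(DistLock lock, Messaging net) {
        lock.acquire();
        net.broadcastFind();            // FIND to all other processes
        long v = net.awaitValue();      // the holder of the value replies
        net.rememberLocally(v + 1);     // this process now holds the value
        lock.release();
        return v;
    }
}
class IncrementAndPublishCounter {
    long increment(DistLock lock, Messaging net) {
        lock.acquire();
        long v = net.localValue();      // everybody already knows the value
        net.broadcastNewValue(v + 1);   // publish the new value to everyone
        net.awaitAcknowledgements();    // wait until all have updated
        lock.release();
        return v;
    }
}
interface DistLock { void acquire(); void release(); }
interface Messaging {
    void broadcastFind(); long awaitValue(); void rememberLocally(long v);
    long localValue(); void broadcastNewValue(long v); void awaitAcknowledgements();
}

The sketch makes the trade-off visible: find&increment pays one round-trip to locate the value only when incrementing, whereas increment&publish pays a broadcast plus acknowledgements on every update.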
4 The Experimental Framework

Our testing environment consisted of 20 Intel XEON 2.4 GHz machines running the Windows XP OS with 2GB of RAM and using JRE version 1.4.2_08. All the machines were located inside the same LAN and were connected using a 20-port Cisco switch. We measured each data structure's performance, using each one of the distributed mutual exclusion algorithms, by running each data structure on a network with one, five, ten, fifteen and twenty processes, where each process runs in a different node of the network. For example, a typical simulation scenario of a shared counter looked like this: use 15 processes to count up to 15 million by using a find&increment counter that employs Maekawa's algorithm as its underlying locking mechanism. All tests were implemented using Coridan's messaging middleware technology called MantaRay. MantaRay is a fully distributed, server-less architecture in which processes running in the network are aware of one another and, as a result, are able to send messages back and forth directly. We have tested each of the implementations in hours-long, and sometimes days-long, executions on various numbers of processes (machines).
5 Performance Analysis and Results

All the experiments done on the data structures we have implemented start with an initially empty data structure (queue, stack, etc.) to which the processes apply a series of operations. For example, in the case of a queue, the processes performed a series of enqueue/dequeue operations. Each process enqueued an element, did "something else", and repeated this a million times. After that, the process dequeued an element, did "something else", and again repeated this a million times. The "something else" consisted of approximately 30 milliseconds of doing nothing and waiting. As with the tests done on the locking algorithms, this served to make the experiments more realistic by preventing long runs by the same process, which would display an overly optimistic performance, as a process may complete several operations while holding the lock. The time a process took to complete the "something else" is not reported in our figures. The experiments, as reported below, lead to the following conclusions:

– Maekawa's permission-based algorithm always outperforms the Ricart-Agrawala and Suzuki-Kasami algorithms when used as the underlying locking mechanism for implementing the find&increment counter, the increment&publish counter, a queue, a stack, and a linked list;
– The find&increment counter always outperforms the increment&publish counter, both as a stand-alone data structure and as a building block for implementing a distributed queue and a distributed stack.

As expected, the data structures exhibit performance degradation as the number of processes grows.

5.1 Counters

The two graphs in Figure 1 show the time one process spends performing a single count-up operation, averaged over one million operations per process, using each of the three locking algorithms implemented. As can be seen, the counters perform worst when using the Ricart-Agrawala algorithm and best when using Maekawa's algorithm. As for comparing the two counters, it is clear that the find&increment counter behaves and scales better than the increment&publish counter as the number of processes grows. The observation that the find&increment counter is better than the increment&publish counter will also become clear when examining the results for the queue and stack implementations that make use of shared counters as building blocks.

5.2 A Queue

The two graphs in Figure 2 show the time one process spends performing a single enqueue operation, averaged over one million operations per process, using each of the three locks. Similar to the performance analysis of the two counters, the queue performs worst when using the Ricart-Agrawala algorithm and best when using Maekawa's algorithm. It is clear that the queue performs better when using the find&increment counter than when using the increment&publish counter.
[Figure 1: two graphs, "Find&increment Counter" and "Increment&publish Counter", plotting mSeconds (0–450) against the number of processes (1, 5, 10, 15, 20), with one curve each for the Ricart Agrawala, Suzuki Kasami, and Maekawa locks.]

Fig. 1. The time one process spends performing a single count up operation averaged over one million operations per process, in the find&increment counter and in the increment&publish counter
[Figure 2: two graphs, "Queue - Enqueue Operation Employing Find&increment Counter" and "Queue - Enqueue Operation Employing Increment&publish Counter", plotting mSeconds (0–450) against the number of processes (1, 5, 10, 15, 20), with one curve each for the Ricart Agrawala, Suzuki Kasami, and Maekawa locks.]

Fig. 2. The time one process spends performing an enqueue operation averaged over one million operations per process, in a queue employing the find&increment counter and in a queue employing the increment&publish counter
The dequeue operation does not make use of a shared counter. Figure 3 shows the time one process spends performing a single dequeue operation, averaged over one million operations per process, using each of the three locks. Similar to the performance analysis of the enqueue operation, the dequeue operation is the slowest when using the Ricart-Agrawala algorithm and the fastest when using Maekawa's algorithm.

5.3 A Stack

As expected, the graphs of the performance analysis results for a stack are almost the same as those presented in the previous subsection for a queue, and are hence omitted from this abstract. As in all previous examples, the stack performs worst when using the Ricart-Agrawala algorithm and best when using Maekawa's algorithm. As for comparing the two counters, it is clear that the stack performs better when using the find&increment counter than when using the increment&publish counter.
[Figure 3: a single graph, "Queue - Dequeue Operation", plotting mSeconds (0–450) against the number of processes (1, 5, 10, 15, 20), with one curve each for the Ricart Agrawala, Suzuki Kasami, and Maekawa locks.]

Fig. 3. The time one process spends performing a dequeue operation averaged over one million operations per process
5.4 A Linked List

The linked list we have implemented does not make use of a shared counter. Rather, it uses the locking algorithm directly to acquire a lock before manipulating the list itself. The graphs in Figure 4 show the time one process spends performing a single insert operation or a single delete operation, averaged over one million operations per process, using each of the three locking algorithms implemented. As in all previous examples, the linked list performs worst when using the Ricart-Agrawala algorithm and best when using Maekawa's algorithm as the underlying locking mechanism.
[Figure 4: two graphs, "Linked List Insert Operation" and "Linked List Delete Operation", plotting mSeconds (0–350) against the number of processes (1, 5, 10, 15, 20), with one curve each for the Ricart Agrawala, Suzuki Kasami, and Maekawa locks.]

Fig. 4. The time one process spends performing an insert operation or delete operation averaged over one million operations per process in a linked list
6 Discussion

Data structures such as shared counters, queues and stacks are ubiquitous in programming concurrent and distributed systems, and hence their performance is a matter of concern. While the subject of data structures has been a very active research topic in recent years
in the context of concurrent (shared memory) systems, this is not the case for distributed (message passing) systems. In this work, we have studied the relation between classical locks and specific distributed data structures which use locking for synchronization. The experiments consistently revealed that the implementation of Maekawa's lock is more efficient than those of the other two locks, and that the implementation of the find&increment counter is consistently more efficient than that of the increment&publish counter. The fact that Maekawa's lock performs better is, in part, due to the fact that its worst-case message complexity is better. The results suggest that, in general, it is more efficient to actively search for information (or ask for permissions) only when it is needed, instead of distributing it to everybody in advance. Thus, we expect to find similar types of results for different experimental setups. For our investigation, it is important to implement and use the locks as completely independent building blocks, so that we can compare their performance. In practice, various optimizations are possible. For example, when implementing the find&increment counter using a lock, a simple optimization would be to store the value of the counter along with the lock. Thus, when a process requests and obtains the lock, it obtains the current value of the counter along with the lock, thereby eliminating the need for any FIND messages. Future work would entail implementing and evaluating other locking algorithms [3, 14], and fault-tolerant locking algorithms that do not assume an error-free network [1, 11, 17, 18, 21]. It would also be interesting to consider additional distributed lock-based data structures, and different experimental setups. When using locks, the granularity of synchronization is important. Our implementations are examples of coarse-grained synchronization, as they allow only one process at a time to access the data structure. It would be interesting to consider data structures which use fine-grained synchronization, in which it is possible to lock "small pieces" of a data structure, allowing several processes with non-interfering operations to access it concurrently. Coarse-grained synchronization is easier to program, but it is less efficient and less fault-tolerant than fine-grained synchronization.
References

1. Agrawal, D., El-Abbadi, A.: An efficient and fault-tolerant solution for distributed mutual exclusion. ACM Transactions on Computer Systems 9(1), 1–20 (1991)
2. Bayer, R., Schkolnick, M.: Concurrency of operations on B-trees. Acta Informatica 1(1), 1–21 (1977)
3. Carvalho, O.S.F., Roucairol, G.: On mutual exclusion in computer networks. Communications of the ACM 26(2), 146–147 (1983)
4. Dijkstra, E.W.: Solution of a problem in concurrent programming control. Communications of the ACM 8(9), 569 (1965)
5. Ellis, C.S.: Distributed data structures: A case study. IEEE Transactions on Computers C-34(12), 1178–1185 (1985)
6. Herlihy, M.P., Wing, J.M.: Linearizability: a correctness condition for concurrent objects. ACM Trans. on Programming Languages and Systems 12(3), 463–492 (1990)
7. Ibaraki, T., Kameda, T.: A theory of coteries: Mutual exclusion in distributed systems. IEEE Transactions on Parallel and Distributed Systems 4(7), 779–794 (1993)
8. Lamport, L.: Time, clocks, and the ordering of events in a distributed system. Communications of the ACM 21(7), 558–565 (1978)
9. Lehman, P.L., Yao, S.B.: Efficient locking for concurrent operations on B-trees. ACM Transactions on Database Systems 6(4), 650–670 (1981)
10. Maekawa, M.: A √N algorithm for mutual exclusion in decentralized systems. ACM Transactions on Computer Systems 3(2), 145–159 (1985)
11. Mishra, S., Srimani, P.K.: Fault-tolerant mutual exclusion algorithms. Journal of Systems and Software 11(2), 111–129 (1990)
12. Mizuno, M., Nesterenko, M., Kakugawa, H.: Lock-based self-stabilizing distributed mutual exclusion algorithm. In: Proc. 17th Inter. Conf. on Dist. Comp. Systems, pp. 708–716 (1996)
13. Naimi, M., Trehel, M.: An improvement of the log n distributed algorithm for mutual exclusion. In: Proc. 17th Inter. Conf. on Dist. Comp. Systems, pp. 371–375 (1987)
14. Neilsen, M.L., Mizuno, M.: A DAG-based algorithm for distributed mutual exclusion. In: Proc. 17th Inter. Conf. on Dist. Comp. Systems, pp. 354–360 (1991)
15. Neilsen, M.L., Mizuno, M.: Coterie join algorithm. IEEE Transactions on Parallel and Distributed Systems 3(5), 582–590 (1992)
16. Peleg, D.: Distributed data structures: A complexity-oriented structure. In: van Leeuwen, J., Santoro, N. (eds.) WDAG 1990. LNCS, vol. 486, pp. 71–89. Springer, Heidelberg (1991)
17. Rangarajan, S., Tripathi, S.K.: A robust distributed mutual exclusion algorithm. In: Toueg, S., Kirousis, L.M., Spirakis, P.G. (eds.) WDAG 1991. LNCS, vol. 579, pp. 295–308. Springer, Heidelberg (1992)
18. Raymond, K.: A tree-based algorithm for distributed mutual exclusion. ACM Transactions on Computer Systems 7(1), 61–77 (1989)
19. Raynal, M.: Algorithms for Mutual Exclusion. The MIT Press, Cambridge (1986); translation of: Algorithmique du parallélisme (1984)
20. Raynal, M.: Simple taxonomy for distributed mutual exclusion algorithms. Operating Systems Review (ACM) 25(2), 47–50 (1991)
21. Reddy, R.L.N., Gupta, B., Srimani, P.K.: New fault-tolerant distributed mutual exclusion algorithm. In: Proc. of the ACM/SIGAPP Symp. on Applied Computing, pp. 831–839 (1992)
22. Ricart, G., Agrawala, A.K.: An optimal algorithm for mutual exclusion in computer networks. CACM 24(1), 9–17 (1981); corrigendum in CACM 24(9), 578 (1981)
23. Shou, D., Wang, S.D.: A new transformation method for nondominated coterie design. Information Sciences 74(3), 223–246 (1993)
24. Singhal, M.: A dynamic information-structure mutual exclusion algorithm for distributed systems. IEEE Transactions on Parallel and Distributed Systems 3(1), 121–125 (1992)
25. Singhal, M.: A taxonomy of distributed mutual exclusion. Journal of Parallel and Distributed Computing 18(1), 94–101 (1993)
26. Singhal, M., Shivaratri, N.G.: Advanced Concepts in Operating Systems: Distributed, Database and Multiprocessor Operating Systems. McGraw-Hill, Inc., New York (1994)
27. van de Snepscheut, J.L.A.: Fair mutual exclusion on a graph of processes. Distributed Computing 2, 113–115 (1987)
28. Suzuki, I., Kasami, T.: A distributed mutual exclusion algorithm. ACM Transactions on Computer Systems 3(4), 344–349 (1985)
29. Taubenfeld, G.: Synchronization Algorithms and Concurrent Programming, 423 pages. Pearson/Prentice-Hall (2006), ISBN 0-131-97259-6
Distributed Generalized Dynamic Barrier Synchronization

Shivali Agarwal¹, Saurabh Joshi², and Rudrapatna K. Shyamasundar³

¹ IBM Research, India
² Indian Institute of Technology, Kanpur
³ Tata Institute of Fundamental Research, Mumbai
Abstract. Barrier synchronization is widely used in shared-memory parallel programs to synchronize between phases of data-parallel algorithms. With the proliferation of many-core processors, barrier synchronization has been adapted for higher level language abstractions in new languages such as X10, wherein the processes participating in barrier synchronization are not known a priori, and the processes in distinct "places" do not share memory. The challenge, then, is not only to achieve barrier synchronization in a distributed setting without any centralized controller, but also to deal with the dynamic nature of such synchronization, as processes are free to join and drop out at any synchronization phase. In this paper, we describe a solution for generalized distributed barrier synchronization wherein processes can dynamically join or drop out of barrier synchronization; that is, the participating processes are not known a priori. Using the policy of permitting a process to join only at the beginning of a phase, we arrive at a solution that ensures (i) Progress: a process executing phase k will enter phase k + 1 unless it wants to drop out of synchronization (assuming the phase executions of the processes terminate), and (ii) Starvation Freedom: a new process that wants to join a phase synchronization group that has already started does so in a finite number of phases. The above protocol is further generalized to multiple groups of processes (possibly non-disjoint) engaged in barrier synchronization.
1 Introduction

Synchronization and coordination play an important role in parallel computation. Language constructs for efficient coordination of computation on shared-memory multi-processors and multi-core processors are of growing interest. There is a plethora of language constructs used for realizing mutual exclusion, point-to-point synchronization, termination detection, collective barrier synchronization, etc. Barrier [8] is one of the important busy-wait primitives used to ensure that none of the processes proceed beyond a particular point in a computation until all have arrived at that point. A software implementation of the barrier using shared variables is also referred to as phase synchronization [1,7]. The issues of remote references while realizing barriers have been treated exhaustively in the seminal work [3]. Barrier synchronization protocols, either centralized or
distributed, have been proposed earlier for the case when the processes that have to synchronize are given a priori [7][5][13][6][14]. With the proliferation of many-core processors, barrier synchronization has been adapted for higher level language abstractions in new distributed shared memory based languages such as X10 [16], wherein the processes participating in barrier synchronization are not known a priori. Some of the recent works that address a dynamic number of processes for barrier synchronization are [20][18][21]. More details on existing work on barrier synchronization can be found in Section 6. Surprisingly, a distributed solution to the phase synchronization problem in such dynamic environments has not yet been proposed. In this paper, we describe a distributed solution to the problem of barrier synchronization used as an underlying synchronization mechanism for achieving phase synchronization where processes are dynamically created (in the context of nested parallelism). The challenge lies in arriving at the common knowledge of the processes that want to participate in phase synchronization for every phase, in a decentralized manner, such that there are some guarantees on the progress and starvation freedom properties of the processes, in addition to the basic correctness property.

1.1 Phase Synchronization Problem
The problem of phase synchronization [7] is described below: Consider a set of asynchronous processes where each process executes a sequence of phases; a process begins its next phase only upon completion of its previous phase (for the moment let us ignore what constitutes a phase). The problem is to design a synchronization scheme which guarantees the following properties:

1. No process begins its (k + 1)th phase until all processes have completed their kth phase, k ≥ 0.
2. No process will be permanently blocked from executing its (k + 1)th phase if all processes have completed their kth phase, k ≥ 0.

The set of processes that have to synchronize can either be given a priori and remain unchanged, or the set can be a dynamic set in which new processes may join as and when they want to phase synchronize, and existing processes may drop out of phase synchronization. In this paper, we describe a distributed solution for dynamic barrier synchronization in the context of phase synchronization, wherein processes can dynamically join or drop out of phase synchronization. Using the policy of permitting a process to join in a phase subsequent to the phase of registration, we arrive at a solution that ensures (i) Progress: a process executing phase k will enter phase k + 1 unless it wants to drop out of synchronization (assuming the phase executions of the processes terminate), and
(ii) Starvation Freedom: a new process that wants to join a phase synchronization group that has already started does so in a finite number of phases. Our protocol establishes a bound of at most two phases from the phase in which it registered its intention to join¹ the phase synchronization. The lower bound is one phase. The correctness of the solution is formally established. The dynamic barrier synchronization algorithm is further generalized to cater to groups of barrier synchronization processes.
2 Barrier Synchronization with a Dynamic Set of Processes

We consider a distributed system which gets initialized with a non-empty set of processes. New processes can join the system at will, and existing processes may drop out of the system when they are done with their work. The processes carry out individual computations in phases and synchronize with each other at the end of each phase. Since it is a distributed system with no centralized control and no a priori knowledge of the number of processes in the system, each process has to dynamically discover the new processes that have joined the system, in such a manner that a new process can start synchronizing with the others in a finite amount of time. The distributed barrier synchronization protocol described below deals with this issue of including new processes in the ongoing phase synchronization in a manner that ensures the progress of existing as well as newly joined processes. It also handles the processes that drop out of the system, so that the existing processes know that they do not have to wait on these for commencing the next phase. Note that there is no a priori limit on the number of processes. The abstract linguistic constructs for registration and synchronization of processes are described in the following.

2.1 Abstract Language
We base our abstract language for the barrier synchronization protocol on X10. The relevant syntax is shown in Figure 1 and explained below:

<Program>     ::= <async Proc> || <async Proc>
<Proc>        ::= <clockDec>; <stmtseq> | clocked <clock-id> <stmtseq>
<clockDec>    ::= new clock c1, c2, ...
<stmtseq>     ::= <basic-stmt> | <basic-stmt> <stmtseq>
<basic-stmt>  ::= async Proc | atomic stmt | seq stmt | c.register | c.drop | next
<clock-id>    ::= c1, c2, ...
Fig. 1. Abstract Clock Language for Barrier Synchronization
¹ Starvation Freedom is guaranteed only for processes that are registered.
• Asynchronous activities: The keyword to denote asynchronous processes is async. The async is used with an optional place expression and a mandatory code block. A process that creates another process is said to be the parent of the process it creates.

• Clock synchronization: Special variables of type clock are used for barrier synchronization of processes. A clock corresponds to the notion of a barrier. A set of processes registered with a clock synchronize with each other w.r.t. that clock. A barrier synchronization point in a process is denoted by next. If a process is registered on multiple clocks, then next denotes synchronization on all of them. This makes the barrier synchronization deadlock-free. The abstraction of phase synchronization through clocks makes it possible to form groups of processes such that groups can merge or disjoin dynamically for synchronization.

Some important points regarding the dynamic joining rule for phase synchronization are:
– A process registered on clock c can create a child process synchronizing on c via async clocked c {body}. The child process joins the phase synchronization in phase k + 1 if the parent is in phase k while executing the async.
– A process can register with a clock c using c.register. It will join the phase synchronization from phase k + 1 or k + 2 if the clock is in phase k at the time of registration.

Some important points regarding dropping out of phase synchronization are:
– A process that drops in phase k is dropped out in the same phase and is not allowed to create child processes that want to join phase synchronization in that phase. Note that this does not restrict the expressiveness of the language in any way and ensures a clean way of dropping out.
– A process that registers after dropping loses the information about its parent and is treated as a process whose parent is not clocked on c.
– An implicit c.drop is assumed when a process registered on clock c terminates.

We now provide a solution in the form of a protocol for the distributed dynamic barrier synchronization problem that provably obeys the above mentioned dynamic joining rules.
3 Distributed Barrier Synchronization Solution

The distributed barrier synchronization protocol for a single clock is given in Figure 2, which describes the protocol for barrier operations such as initialization, synchronization and drop. The notations used in the solution are described below.

Notation:
– We denote the ith process by Ai (also referred to as process i).
– Phases are tracked in terms of k − 1, k, k + 1, · · · .
– We use the guarded command notation [2] for describing our algorithm, as it is easy to capture interleaving execution and termination in a structured manner.
Assumption: The processes do not fail randomly and always call c.drop when they want to leave the phase synchronization.

3.1 Correspondence between Protocol Steps and Clock Operations
The correspondence of the clock operations with the protocol operations is given below.

• new clock c: The creation of a clock (barrier) corresponds to the creation of a special process Ac that executes as follows, where the code blocks INIT_c, SYNC and CONSOLIDATE are shown in Figure 2:

INIT_c; while (true) { SYNC; CONSOLIDATE; }

Note on Ac: It is a special process that exists until the program terminates. It acts as a point of contact for processes in the case of explicit registration through c.register, as seen below, without introducing any centralized control.

• next: A process Ai already in phase synchronization performs a next for barrier synchronization. A next corresponds to:

SYNC; CONSOLIDATE;

• Registration through clocked: A process Ai can register through clocked at the time of its creation, in which case it gets into the list Aj.registered of its parent process Aj. In this case, Ai joins from the next phase. The specific code that gets executed in the parent for a clocked process is:

INIT_i; A_j.registered := A_j.registered + A_i;

The code that gets executed in Ai is:

while (!A_i.proceed);

• Registration through c.register: If Ai registers like this, then it may join the phase synchronization within at most the next two phases. The following code gets executed in Ai:

INIT_i; A_c.registered := A_c.registered + A_i; while (!A_i.proceed);

• c.drop: Process Ai drops out of phase synchronization through c.drop (see Fig. 2). The code that gets executed is:

DROP;

Note: 1) Though we assume Ac to exist throughout the program execution, our algorithm is robust with respect to graceful termination of Ac, that is, termination after completing CONSOLIDATE when there are no processes in Ac.registered upon consolidation; the only impact on phase synchronization is that no new processes can register through c.register. 2) The assignments are done atomically.
3.2 How the Protocol Works

The solution achieves phase synchronization by ensuring that the set of processes that enter a phase is common knowledge to all the processes. Attaining common knowledge of the existence of new processes and the non-existence of dropped processes in every phase is the non-trivial part of the phase synchronization protocol in a dynamic environment. The machinery built to solve this problem is shown in Figure 2 and described below in detail.

Protocol Variables:
Ac – the special clock process for clock c.
Ai.executing – the current phase that process i is executing.
Ai.proceed – used for allowing new processes to join the active phase.
Ai.next – the next phase that process i wants to execute.
Ai.Iconcurrent – the set of processes executing the phase Ai.executing.
Ai.newIconcurrent – the set of new processes that will be part of the next phase.
Ai.registered – the set of new processes that want to enter phase synchronization with Ai.
Ai.newsynchproc – the set of new processes registered with process i that will synchronize from the next phase.
Ai.drop – when a process wants to drop (or terminates), it sets Ai.drop to true and exits.
Ai.checklist – the subset of Ai.Iconcurrent carried to the next phase for synchronization.

Ai.Iconcurrent denotes the set of processes that Ai is synchronizing with in a phase.
This set may shrink or expand after each phase, depending on whether an existing process drops or a new process joins, respectively. The variable Ai.newsynchproc is assigned the set of processes that Ai wants the other processes to include for synchronization from the next phase onwards. The variable Ai.newIconcurrent is used to accumulate the processes that will form Ai.Iconcurrent in the next phase. Ai.executing denotes the current phase of Ai, and Ai.next denotes the next phase that the process will move to.

INIT_c: This initializes the special clock process Ac that is started at the creation of a clock c. Note that the clock process is initialized with itself as the set of processes it has to synchronize with. Ac.proceed is set to true to start the synchronization.

INIT_i: When a process registers with a clock, Ai.proceed is set to false. The newly registered process waits for Ai.proceed to be made true, which is done in the CONSOLIDATE block of the process that contains Ai in its registered set. The rest of the variables are also set properly in that CONSOLIDATE block. In the following, we explain the protocol for SYNC and CONSOLIDATE.

SYNC: This is the barrier synchronization stage of a process and performs the following main functions: 1) it checks whether all the processes in the phase are ready to move to the next phase; Ai.next is used to denote the completion of the phase and to check on the others in the phase; 2) it informs the other processes about the new processes that have to join from the next phase; 3) it establishes that the processes have exchanged the relevant information, so that the process can consolidate the information required for the execution of the next phase.
The new processes that are registered with Ai form the set Ai.newsynchproc. This step is required to capture a local snapshot. Note that for processes other than the clock process, Ai.registered will be the same as Ai.newsynchproc. However, for the special clock process Ac, Ac.registered may keep changing during the SYNC execution. Therefore, we need to take a snapshot, so that a consistent set of processes that have to be included from the next phase can be conveyed to the other processes present in the synchronization. The increment of Ai.next denotes that the process has effectively completed the phase and is preparing to move to the next phase. Note that after this operation, the difference between Ai.next and Ai.executing becomes 2, denoting the transition. The second part of SYNC is a do-od loop that forms the crux of barrier synchronization. There are three guarded commands in this loop, which are explained below.

1. The first guarded command checks if there exists a process Aj in Ai.Iconcurrent that has also reached barrier synchronization. If the value of Aj.next is greater than or equal to Ai.next, then Aj has also reached the barrier point. If this guard evaluates to true, then that process is removed from Ai.Iconcurrent and the new processes that registered with Aj are added to the set Ai.newIconcurrent.
2. The second guard checks if any process in Ai.Iconcurrent has dropped out of synchronization, and accordingly the set Ai.newIconcurrent is updated.
3. The third guard is true if process Aj has not yet reached the barrier synchronization point. The statement associated with this guard is a no-op; it is this statement which forms the waiting part of barrier synchronization.

By the end of this loop, Ai.Iconcurrent contains only Ai. The current phase, denoted by Ai.executing, is incremented to denote that the process can start the next phase. However, to ensure that the local snapshot captured in Ai.newsynchproc is properly conveyed to the other processes participating in phase synchronization, another do-od loop is executed, which checks that those processes have indeed moved to the next phase by incrementing their executing variables.

CONSOLIDATE: After ensuring that Ai has synchronized on the barrier, a final round of consolidation is performed to prepare Ai for executing the next phase. This phase consolidation is described under the label CONSOLIDATE. The set of processes with which Ai needs to phase synchronize is in Ai.newIconcurrent; therefore, Ai.Iconcurrent is assigned Ai.newIconcurrent. All the new processes that will join from phase Ai.executing are signalled to proceed after being properly initialized. The set Ai.registered is updated to ensure that it contains only those new processes that got registered after the value of Ai.registered was last read in SYNC. This is possible because of the explicit registration that is allowed through the special clock process.

DROP: Ai.drop is set to true so that the corresponding guarded command in SYNC can become true. The restriction posed on a drop command ensures that Ai.registered will be empty, and thus the starvation freedom guarantee is preserved.
Protocol for Process i

INIT c: (* Initialization of clock process *)
    Ac.executing, Ac.next, Ac.Iconcurrent, Ac.registered, Ac.proceed, Ac.drop
        := 0, 1, {Ac}, ∅, true, false;

INIT i: (* Initialization of Ai that performs a registration *)
    Ai.proceed := false;

SYNC: (* Check completion of Ai.executing by the other members *)
    Ai.newsynchproc := Ai.registered;
    Ai.newIconcurrent := Ai.newsynchproc + {Ai};
    Ai.next := Ai.next + 1;
    Ai.checklist := ∅;
    do Ai.Iconcurrent ≠ ∅ ∧ Aj ∈ Ai.Iconcurrent ∧ i ≠ j ∧ Ai.next ≤ Aj.next
           → Ai.Iconcurrent := Ai.Iconcurrent − {Aj};
             Ai.newIconcurrent := Ai.newIconcurrent + Aj.newsynchproc + {Aj};
             Ai.checklist := Ai.checklist + {Aj}
    [] Ai.Iconcurrent ≠ ∅ ∧ Aj ∈ Ai.Iconcurrent ∧ Aj.drop
           → Ai.Iconcurrent := Ai.Iconcurrent − {Aj};
    [] Ai.Iconcurrent ≠ ∅ ∧ Aj ∈ Ai.Iconcurrent ∧ Ai.next > Aj.next (* no need to check i ≠ j *)
           → skip;
    od;
    Ai.executing := Ai.executing + 1; (* Set the current phase *)
    do (* Check for completion of the phase in the other processes *)
       Ai.checklist ≠ ∅ ∧ Aj ∈ Ai.checklist ∧ Ai.executing == Aj.executing
           → Ai.checklist := Ai.checklist − {Aj}
    od;

CONSOLIDATE: (* Consolidate processes for the next phase *)
    Ai.Iconcurrent := Ai.newIconcurrent;
    Ai.registered := Ai.registered − Ai.newsynchproc;
    for all Aj ∈ Ai.newsynchproc do
        Aj.executing, Aj.next, Aj.Iconcurrent, Aj.registered, Aj.drop
            := Ai.executing, Ai.next, Ai.Iconcurrent, ∅, false;
        Aj.proceed := true;

DROP: (* Code when process Ai calls c.drop *)
    Ai.proceed := false;
    Ai.drop := true;
Fig. 2. Action of processes in phase synchronization
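To illustrate one round of the protocol (a walk-through of Fig. 2, not additional material beyond it): suppose A1 and A2 are synchronizing in phase k, each appearing in the other's Iconcurrent, and a new process A3 registers with the clock process Ac during the phase, so that A3 ∈ Ac.registered and A3 waits on A3.proceed. When Ac enters SYNC, its snapshot Ac.newsynchproc contains A3; when A1 and A2 observe via the first guarded command that Ac has reached the barrier, they add Ac.newsynchproc to their newIconcurrent sets. After the barrier, Ac's CONSOLIDATE initializes A3's variables and sets A3.proceed to true, so A1, A2, A3 and Ac all start phase k+1 with consistent Iconcurrent sets.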
4
Correctness of the Solution
The proof obligations for synchronization and progress are given below. We have provided the proof in a semi-formal way, in the style of [1]; it is available in the longer version of the paper [22]. The symbol '→' denotes leads-to.

– Synchronization. We need to show that the postcondition of SYNC; CONSOLIDATE; (corresponding to barrier synchronization) for processes that have proceed set to true is:

{∀i, j ((Ai.proceed = true ∧ Aj.proceed = true) ⇒ Ai.executing = Aj.executing)}
– Progress
Property 1: The progress property for processes already in phase synchronization is given by the following (k denotes the current phase): if all the processes that do not drop out have completed phase k, then each such process moves to a phase greater than k.

P1: {∀i ((Ai.drop = false ∧ ∀j (Aj.drop = false ⇒ Aj.executing ≥ k)) → (Ai.executing ≥ k + 1))}
Property 2: The progress property for new processes that want to join the phase synchronization is given by the following: a process that gets registered with a process involved in phase synchronization will also join the phase synchronization.

P2: {∃i ((Ai.proceed = false ∧ ∃j (Ai ∈ Aj.registered)) → Ai.proceed = true)}

Complexity Analysis: In its simplest form, the protocol has a remote message complexity of O(n^2), where n is the upper bound on the number of processes that can participate in the barrier synchronization in any phase. This bound can be improved in practice by optimizing how each participating process tests for the completion of a phase. The optimization is briefly explained in the following, and sketched in the code below. When a process Ai checks for the completion of the phase in another process, say Aj, and finds that Ai.executing < Aj.executing, then it can exit the do-od loop immediately by copying Aj.newIconcurrent, which has the complete information about the processes participating in the next phase. This optimization has a best case of O(n) messages and is straightforward to embed in the proposed protocol. Note that in any case, at least n messages are always required to propagate the information.
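The following is a minimal Java-flavored sketch of this optimized completion check, under the assumption that each process's protocol variables are remotely readable; the names ProcState and awaitPhaseCompletion are illustrative and are not part of the protocol of Fig. 2.

import java.util.Iterator;
import java.util.Set;

class ProcState {
    volatile int executing;                   // current phase of this process
    volatile Set<ProcState> newIconcurrent;   // membership for the next phase
}

class PhaseCheck {
    // Busy-wait until every process in checklist has reached self's phase,
    // or exit early by copying the membership from a process that is ahead.
    static void awaitPhaseCompletion(ProcState self, Set<ProcState> checklist) {
        while (!checklist.isEmpty()) {
            Iterator<ProcState> it = checklist.iterator();
            ProcState j = it.next();
            if (self.executing == j.executing) {
                it.remove();                      // j has completed the phase
            } else if (self.executing < j.executing) {
                // j is already ahead: its newIconcurrent already holds the
                // complete next-phase membership, so copy it and stop checking.
                self.newIconcurrent = j.newIconcurrent;
                return;
            }
            // otherwise (self.executing > j.executing) keep waiting on j,
            // exactly as the no-op guard of the do-od loop does
        }
    }
}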
5
Generalization: Multi-clock Phase Synchronization
In this section, we outline the generalization of the distributed dynamic barrier synchronization to multiple clocks. 1) There is a special clock process for each clock. 2) The processes maintain protocol variables for each of the clocks that they register with. 3) A process can register with multiple clocks through C.register, where C denotes a set of clocks, as the first operation before starting the phase-synchronized computation. The notation C.register denotes that c.register is performed for every clock c ∈ C. The corresponding code is:

for each c in C:
    INIT_i_c;
    A_c.registered := A_c.registered + A_i;
    while (!A_i_c.proceed);
Some important restrictions to avoid deadlock scenarios are: i) C.register, when C contains more than one clock, can only be done by the process that creates the clocks contained in C.
ii) If a process wants to register with a single clock c that is in use for phase synchronization by other processes, it will have to drop all its clocks; it can then use c.register to synchronize on the desired clock. Note that the clock c need not be re-created. iii) Child processes should use clocked to register with any subset of the multiple clocks that the parent is registered with. Combined with (iv) below, this avoids deadlock scenarios of the kind that arise with mobile barriers [20]. iv) For synchronization, the process increments the value of Ai.next for each registered clock, and then executes the guarded loop for each of the clocks before it can move to the CONSOLIDATE stage (a sketch of this step follows below). The SYNC part of the protocol for multiple clocks is very similar to the single-clock case, except for an extra loop that runs the guarded command loop once for each of the clocks. A process clocked on multiple clocks results in the synchronization of all the processes that are registered with these clocks. This is evident from the second do-od loop in the SYNC part of the barrier synchronization protocol. For example, if a process A1 is synchronizing on c1, A2 on c1 and c2, and A3 on c2, then A1 and A3 also get synchronized, as long as A2 does not drop one or both of the clocks. These clocks can thus be thought of as forming a group, and A2 can be thought of as the pivot process. In the following, we state the synchronization guarantees provided by the protocol: 1) A process that synchronizes on multiple clocks can move to the next phase only when all the processes in the group formed by the clocks have also completed their current phase. 2) Two clock groups that do not have a common pivot process but have a common clock may differ in phase by at most one. Note that the difference cannot exceed one, because that would imply improper synchronization between processes clocked on the same clock, which, as proved above, is impossible in our protocol. 3) A new process registered with multiple clocks starts in the next phase (from the perspective of a local observer) w.r.t. each of the clocks individually.
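A minimal Java-flavored sketch of the multi-clock SYNC step of restriction (iv) is shown below; clockState, runGuardedSyncLoop and consolidate are illustrative stand-ins for the per-clock protocol variables and the do-od loops of Fig. 2, not constructs defined by the protocol itself.

import java.util.Set;

class Clock { }
class ClockState { int next; }

abstract class MultiClockSync {
    abstract ClockState clockState(Clock c);       // per-clock copy of Ai's variables
    abstract void runGuardedSyncLoop(Clock c);     // the do-od loops of Fig. 2 for clock c
    abstract void consolidate(Set<Clock> clocks);  // per-clock CONSOLIDATE stage

    // Restriction (iv): first advance `next` w.r.t. every registered clock,
    // and only then run the guarded synchronization loop once per clock.
    void sync(Set<Clock> registeredClocks) {
        for (Clock c : registeredClocks)
            clockState(c).next++;
        for (Clock c : registeredClocks)
            runGuardedSyncLoop(c);
        consolidate(registeredClocks);
    }
}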
6
Comparing with Other Dynamic Barrier Schemes
The clock syntax resembles that of X10 but differs in the joining policy. An X10 activity that registers with a clock in some phase starts the synchronization from that same phase. The advantage of our dynamic joining policy (starting from the next phase) is that when a process starts a phase, it knows exactly which processes it is synchronizing with in that phase. This makes it simpler to detect the completion of a phase in a distributed set-up. Whether to join in the same phase or the next phase is more a matter of semantics than of expressiveness. If there is a centralized manager process to manage phase synchronization, then the semantics of starting a newly registered activity in the same phase is feasible. However, for a distributed phase synchronization protocol with dynamically joining processes, the semantics of starting from the next phase is more efficient.
The other clock-related works [19], [18] are directed more towards efficient implementations of X10-like clocks than towards synchronization in a distributed setting. Barriers in JCSP [21] and occam-pi [20] do allow processes to dynamically join and resign from barrier synchronization. Because the synchronization is barrier-specific in both JCSP (using barrier.sync()) and occam-pi (using SYNC barrier), it is a burden on the programmer to write a deadlock-free program; this is not the case here, as the use of next achieves synchronization over all registered clocks. JCSP and occam-pi barriers achieve linear-time synchronization due to centralized control of barriers, which is also possible in the optimized version of our protocol. Previous work on barrier implementation has focused on algorithms that work on a pre-specified number of processes or processors. The Butterfly barrier algorithm [9], the Dissemination algorithm [10], [5] and the Tournament algorithm [5], [4] are some of the earlier algorithms. Most of them emphasized reducing the number of messages that need to be exchanged in order to know that all the processes have reached the barrier. Some of the more recent work on barrier algorithms in software is described in [6], [11], [12], [15], [14], [3]. In contrast to this literature, our focus has been on developing algorithms for barrier synchronization where processes dynamically join and drop out; thus, the processes that can take part in a barrier synchronization need not be known a priori.
7
Conclusions
In this paper, we have described a solution for distributed dynamic phase synchronization that is shown to satisfy the properties of progress and starvation freedom. To our knowledge, this is the first dynamic distributed multi-processor synchronization algorithm for which the properties of progress and starvation freedom have been established, and for which the dependence of progress on the entry strategies (captured through process registration) has been shown. A future direction is to consider fault tolerance in the context of distributed barrier synchronization for a dynamic number of processes.
References
1. Chandy, K.M., Misra, J.: Parallel Program Design: A Foundation. Addison-Wesley, Reading (1988)
2. Dijkstra, E.W.: Guarded commands, non-determinacy and formal derivation of programs. Communications of the ACM 18(8) (August 1975)
3. Mellor-Crummey, J.M., Scott, M.L.: Algorithms for scalable synchronization on shared-memory multiprocessors. ACM TOCS 9(1), 21–65 (1991)
4. Feldmann, A., Gross, T., O'Hallaron, D., Stricker, T.M.: Subset barrier synchronization on a private-memory parallel system. In: SPAA (1992)
5. Hensgen, D., Finkel, R., Manber, U.: Two algorithms for barrier synchronization. International Journal of Parallel Programming 17(1), 1–17 (1988)
6. Herlihy, M., Shavit, N.: The Art of Multiprocessor Programming. Morgan Kaufmann, San Francisco (2008)
7. Misra, J.: Phase Synchronization. Notes on UNITY 12-90, U. Texas at Austin (1990)
8. Tang, P., Yew, P.C.: Processor self-scheduling for multiple-nested parallel loops. In: Proc. ICPP, pp. 528–535 (August 1986)
9. Brooks III, E.D.: The butterfly barrier. International Journal of Parallel Programming 15(4) (1986)
10. Han, Y., Finkel, R.: An optimal scheme for disseminating information. In: Proc. of the 17th International Conference on Parallel Processing (1988)
11. Scott, M.L., Michael, M.M.: The Topological Barrier: A synchronization abstraction for regularly-structured parallel applications. Tech. Report TR605, Univ. of Rochester (1996)
12. Gupta, R., Hill, C.R.: A scalable implementation of barrier synchronization using an adaptive combining tree. International Journal of Parallel Programming 18(3) (1990)
13. Livesey, M.: A network model of barrier synchronization algorithms. International Journal of Parallel Programming 20(1) (February 1991)
14. Xu, H., McKinley, P.K., Ni, L.M.: Efficient implementation of barrier synchronization in wormhole-routed hypercube multicomputers. In: Proceedings of the 12th International Conference on Distributed Computing Systems (June 1992)
15. Yang, J.-S., King, C.-T.: Designing tree-based barrier synchronization on 2D mesh networks. IEEE Transactions on Parallel and Distributed Systems 9(6) (1998)
16. Saraswat, V., Jagadeesan, R.: Concurrent clustered programming. In: Jayaraman, K., de Alfaro, L. (eds.) CONCUR 2005. LNCS, vol. 3653, pp. 353–367. Springer, Heidelberg (2005)
17. Unified Parallel C Language, http://www.gwu.edu/~upc/
18. Shirako, J., Peixotto, M.D., Sarkar, V., Scherer, W.N.: Phasers: a unified deadlock-free construct for collective and point-to-point synchronization. In: ICS, pp. 277–288 (2008)
19. Vasudevan, N., Tardieu, O., Dolby, J., Edwards, S.A.: Compile-time analysis and specialization of clocks in concurrent programs. In: CC, pp. 48–62 (2009)
20. Welch, P., Barnes, F.: Mobile barriers for occam-pi: Semantics, implementation and application. In: Communicating Process Architectures (2005)
21. Welch, P., Brown, N., Moores, J., Chalmers, K., Sputh, B.H.C.: Integrating and extending JCSP. In: CPA, pp. 349–370 (2007)
22. Agarwal, S., Joshi, S., Shyamasundar, R.K.: Distributed Generalized Dynamic Barrier Synchronization, longer version, http://www.tcs.tifr.res.in/~shyam/Papers/dynamicbarrier.pdf
A High-Level Framework for Distributed Processing of Large-Scale Graphs

Elzbieta Krepska, Thilo Kielmann, Wan Fokkink, and Henri Bal
VU University Amsterdam
{ekr,kielmann,wanf,bal}@cs.vu.nl
Abstract. Distributed processing of real-world graphs is challenging due to their size and the inherent irregular structure of graph computations. We present HipG, a distributed framework that facilitates high-level programming of parallel graph algorithms by expressing them as a hierarchy of distributed computations executed independently and managed by the user. HipG programs are in general short and elegant; they achieve good portability, memory utilization and performance.
1 Introduction
We live in a world of graphs. Some graphs exist physically, for example transportation networks or power grids. Many exist solely in electronic form, for instance the state space of a computer program, the network of Wikipedia entries, or social networks. Graphs such as protein interaction networks in bioinformatics or airplane triangulations in engineering are created by scientists to represent real-world objects and phenomena. With the increasing abundance of large graphs, there is a need for a parallel graph processing language that is easy to use, high-level, and memory- and computation-efficient. Real-world graphs reach billions of nodes and keep growing: the World Wide Web expands, new proteins are being discovered, and more complex programs need to be verified. Consequently, graphs need to be partitioned between the memories of multiple machines and processed in parallel in such a distributed environment. Real-world graphs tend to be sparse, as, for instance, the number of links in a web page is small compared to the size of the network. This allows for efficient storage of edges with their source nodes. Because of their size, partitioning graphs into balanced fragments with a small number of edges spanning different fragments is hard [1, 2]. Parallelizing graph algorithms is challenging. The computation is typically driven by the node-edge relation in an unstructured graph. Although the degree of parallelism is often considerable, the amount of computation per graph node is generally very small, and the communication overhead immense, especially when many edges span different graph chunks. Given the lack of structure of the computation, the computation is hard to partition and locality suffers [3]. In addition, on a distributed-memory machine good load balancing is hard to obtain, because in general work cannot be migrated (part of the graph would have to be migrated and all workers informed). While a few graph libraries exist for sequential graph algorithms, notably the Boost Graph Library [4], no standards have been established for parallel graph algorithms. The current state of the art amongst users wanting to implement parallel graph algorithms
is either to use the generic C++ Parallel Boost Graph Library (PBGL) [5, 6] or, most often, to create ad-hoc implementations, which are usually structured around their communication scheme. Not only does the ad-hoc coding effort have to be repeated for each new algorithm, but it also obscures the original elegant concept. The programmer spends considerable time tuning the communication, which is prone to errors. While this may result in a highly optimized, problem-tailored implementation, the code can only be maintained or modified with substantial effort. In this paper we propose HipG, a distributed framework aimed at facilitating implementations of HIerarchical Parallel Graph algorithms that operate on large-scale graphs. HipG offers an interface to perform structure-driven distributed graph computations. Distributed computations are organized into a hierarchy and coordinated by logical objects called synchronizers. The HipG model supports, but is not limited to, creating divide-and-conquer graph algorithms. A HipG parallel program is composed automatically from the sequential-like components provided by the user. The computational model of HipG, and how it can be used to program graph algorithms, is explained in Section 2, where we present three graph algorithms in increasing order of complexity: reachability search, finding single-source shortest paths, and strongly connected components decomposition. These are well-known algorithms, explained for example in [7]. Although the user must be aware that a HipG program runs in a distributed environment, the code is high-level: explicit communication is not exposed by the API. Parallel composition is done in a way that does not allow race conditions, so that no locks or thread synchronization code are necessary from the user's point of view. These facts, coupled with the use of an object-oriented language, make for an easy-to-use, yet expressive, language for coding hierarchical parallel graph algorithms. We have implemented HipG in the Java language. We discuss this choice as well as details of the implementation in Section 3. Using HipG we have implemented the algorithms presented in Section 2, and we evaluate their performance in Section 4. We processed graphs on the order of 10^9 nodes on our cluster and obtained good performance. The HipG code of the most complex example discussed in this paper, the strongly connected components decomposition, is an order of magnitude shorter than the hand-written C/MPI version of this program and three times shorter than the corresponding implementation in PBGL (see Section 5 for a discussion of the related work in the field of distributed graph processing). HipG's current limitations and future work are discussed in the concluding Section 6.
2 The HipG Model and Programming Interface
The input to a HipG program is a directed graph. HipG partitions the graph into a number of equal-size chunks and divides the chunks between workers, which are made responsible for processing the nodes they own. A chunk consists of a number of nodes uniquely identified by pairs (chunk, index). HipG uses the object-oriented programming paradigm: nodes are objects. Each node has arbitrary data and a number of outgoing edges associated and co-located with it. The target node of an edge is called a neighbor. In the current setup, the graph cannot be modified at runtime, but new graphs can be created.
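As an aside, a (chunk, index) pair can be packed into a single machine word, which keeps global node references cheap to store and send. The sketch below shows one common encoding; it is an illustration of the idea, not HipG's actual internal representation.

// Packing a (chunk, index) node identifier into one 64-bit value: the high
// 16 bits select the chunk (its owning worker), the low 48 bits the node's
// index within that chunk. Purely illustrative; HipG may encode this differently.
final class NodeId {
    static long pack(int chunk, long index) { return ((long) chunk << 48) | index; }
    static int  chunk(long id)              { return (int) (id >>> 48); }
    static long index(long id)              { return id & ((1L << 48) - 1); }
}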
A High-Level Framework for Distributed Processing of Large-Scale Graphs interface MyNode extends Node { public void visit(); } class MyLocalNode implements MyNode extends LocalNode<MyNode> { boolean visited = false; public void visit() { if (!visited) { visited = true; for (MyNode n : neighbors()) n.visit(); } } }
Fig. 1. Reachability search in HipG
Fig. 2. Illustration of the reachability search (two panels over nodes s, p, q, r, t: node s receives visit() and sends it to its neighbors; the neighbors then forward the message to their neighbors)
Graphs are commonly processed starting at a certain graph node and following the structure of the graph, i.e., the node-edge relationship, until all reached nodes are processed. HipG supports processing graphs in this way by offering a seamless interface for executing methods on local and remote nodes. When necessary, these method calls are automatically translated by HipG into messages. In Section 2.1 we show how such methods can be used to create a distributed graph computation in HipG. More complex algorithms require managing more than one such distributed computation. In particular, the objective of a divide-and-conquer graph algorithm is to divide a computation on a graph into several sub-computations on sub-graphs. HipG enables the creation of sub-algorithms by introducing synchronizers: logical objects that manage distributed computations. The concept and API of a synchronizer are explained further in this section: in Section 2.2 we show how to use a single synchronizer, and in Section 2.3 an entire hierarchy of synchronizers is created to solve a divide-and-conquer graph problem.

2.1 Distributed Computation
HipG allows graph computations to be implemented with only regular methods executed on graph nodes. Typically, the user initiates the first method, which in turn executes methods on its neighbor nodes. In general, a node can execute methods on any node whose unique identifier is known. To implement a graph computation, the user extends the provided LocalNode class with custom fields and methods. In a local node, neighbor nodes can be accessed with neighbors(), or inNeighbors() for incoming edges. Under the hood, the methods executing on remote nodes are automatically translated by HipG into asynchronous messages. On reception of such a message, the appropriate method is executed, which thus acts as a message handler. The order of received messages cannot be predicted. Method parameters are automatically serialized, and we strive to make the serialization efficient. A distributed computation terminates when there are no more messages present in the system, which is detected automatically. Since
interface MyNode extends Node {
    public void found(SSSP sp, int d);
}

class MyLocalNode extends LocalNode<MyNode> implements MyNode {
    int dist = -1;
    public void found(SSSP sp, int d) {
        if (dist < 0) {
            dist = d;
            sp.Q.add(this);
        }
    }
    public void found0(SSSP sp, int d) {
        for (MyNode n : neighbors())
            n.found(sp, d);
    }
}

class SSSP extends Synchronizer {
    Queue<MyLocalNode> Q = new Queue();
    int localQsize;
    public SSSP(MyLocalNode pivot) {
        if (pivot != null)
            Q.add(pivot);
        localQsize = Q.size();
    }
    @Reduce public int GlobalQSize(int s) {
        return s + Q.size();
    }
    public void run() {
        int depth = 0;
        do {
            for (int i = 0; i < localQsize; i++)
                Q.pop().found0(this, depth);
            barrier();
            depth++;
            localQsize = Q.size();
        } while (GlobalQSize(0) > 0);
    }
}
Fig. 3. Single-source shortest paths (breadth-first search) implemented in HipG
messages are asynchronous, returning a value from a method can be realized by sending a message back to the source. Typically, however, a dedicated mechanism, discussed later in this section, is used to compute the result of a distributed computation.

Example: Reachability search. In a directed graph, a node s is reachable from node t if a path from t to s exists. Reachability search computes the set of nodes reachable from a given pivot. A reachability search implemented in HipG (Fig. 1) consists of an interface MyNode that represents any node and a local-node implementation MyLocalNode. The visit() method visits a node and its neighbors (Fig. 2). The algorithm is initiated by pivot.visit(). We note that, were it not for the unpredictable order of method executions, the code for visit() could be understood sequentially. In particular, no locks or synchronization code are needed.

2.2 Coordination of Distributed Computations
A dedicated layer of a HipG algorithm coordinates the distributed computations. Its main building block is the synchronizer, a logical object that manages distributed computations. A synchronizer can initiate a distributed computation and wait for its termination. After a distributed computation has terminated, the synchronizer typically computes global results of the computation by invoking a global reduction operation. For example, the synchronizer may compute the global number of nodes reached by the computation, or a globally elected pivot. Synchronizers can execute distributed computations in parallel or one after another.
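As a usage illustration, a main program might bootstrap such a computation roughly as follows. The loader and spawning calls (loadChunkAndFindPivot, spawnRoot, joinAll) are hypothetical stand-ins, since the exact bootstrap API is not shown in this paper; only the SSSP synchronizer of Fig. 3 is taken from the text.

// Hypothetical bootstrap running on every worker: load the local chunk,
// construct the root synchronizer (with the pivot on its owner, null elsewhere),
// spawn it, and wait until all synchronizers have terminated.
public final class RunSSSP {
    public static void main(String[] args) {
        MyLocalNode pivot = loadChunkAndFindPivot(args[0]); // hypothetical: null off-owner
        spawnRoot(new SSSP(pivot));                         // hypothetical spawn call
        joinAll();                                          // hypothetical: block until done
    }
    // Stubs for the hypothetical runtime calls used above.
    static MyLocalNode loadChunkAndFindPivot(String path) { throw new UnsupportedOperationException(); }
    static void spawnRoot(SSSP s) { throw new UnsupportedOperationException(); }
    static void joinAll() { throw new UnsupportedOperationException(); }
}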
FB(V):
    p := pick a pivot from V
    F := FWD(p)
    B := BWD(p)
    report (F ∩ B) as an SCC
    in parallel:
        FB(F \ B)
        FB(B \ F)
        FB(V \ (F ∪ B))

Fig. 4. FB: a divide-and-conquer algorithm to search for SCCs
To implement a synchronizer, the user subclasses Synchronizer and defines a run() method that, conceptually, executes sequentially on all processors. Termination detection is provided by barrier(). The reduce methods, annotated @Reduce, must be commutative, as the order in which they are executed cannot be predicted.

Example: Single-source shortest paths. Fig. 3 shows an implementation of a parallel single-source shortest paths algorithm. For simplicity, each edge has equal weight, so that the algorithm is in fact a breadth-first search [7]. We define an SSSP synchronizer, which owns a queue Q that represents the current layer of graph nodes. The run() method loops over all nodes in the current layer to create the next layer. The barrier blocks until the current layer is entirely processed. GlobalQSize computes the global size of Q by summing the sizes of the queues Q on all processors. The algorithm terminates when all layers have been processed.

2.3 Hierarchical Coordination of Distributed Computations
The key idea of the HipG coordination layer is that synchronizers can spawn any number of sub-synchronizers to solve graph sub-problems. The coordination layer is therefore a tree of executing synchronizers, and thus a hierarchy of distributed algorithms. All synchronizers execute independently and in parallel. The order in which synchronizers progress cannot be predicted, unless they are causally related or explicitly synchronized. The user starts a graph algorithm by spawning root synchronizers. The system terminates when all synchronizers terminate. The HipG parallel program is composed automatically from the two components provided by the user, namely node methods (message handlers) and the synchronizer code (coordination layer). Parallel composition is done in a way that does not allow race conditions. No explicit communication or thread synchronization is needed.

Example: Strongly connected components. A strongly connected component (SCC) of a directed graph is a maximal set of nodes S such that there exists a path in S between any pair of nodes in S. In Fig. 4 we describe FB [8], a divide-and-conquer graph algorithm for computing SCCs. FB partitions the problem of finding the SCCs of a set of nodes V into three sub-problems on three disjoint subsets of V. First an arbitrary pivot node is selected from V. Two sets F and B are computed as the sets of nodes that are, respectively, forward and backward reachable from the pivot. The set F ∩ B is an SCC.
interface MyNode extends Node {
    public void fwd(FB fb, int f, int b);
    public void bwd(FB fb, int f, int b);
}

class MyLocalNode extends LocalNode<MyNode> implements MyNode {
    int labelF = -1, labelB = -1;
    public void fwd(FB fb, int f, int b) {
        if (labelF == fb.ff && (labelB == b || labelB == fb.bb)) {
            labelF = f;
            fb.F.add(this);
            for (MyNode n : neighbors())
                n.fwd(fb, f, b);
        }
    }
    public void bwd(FB fb, int f, int b) {
        if (labelB == fb.bb && (labelF == f || labelF == fb.ff)) {
            labelB = b;
            fb.B.add(this);
            for (MyNode n : inNeighbors())
                n.bwd(fb, f, b);
        }
    }
}

class FB extends Synchronizer {
    Queue<MyLocalNode> V, F, B;
    int ff, bb;
    FB(int f, int b, Queue<MyLocalNode> V0) {
        V = V0;
        F = new Queue();
        B = new Queue();
        ff = f;
        bb = b;
    }
    @Reduce MyNode SelectPivot(MyNode p) {
        return (p == null && !V.isEmpty()) ? V.pop() : p;
    }
    public void run() {
        MyNode pivot = SelectPivot(null);
        if (pivot == null) return;
        int f = 2 * getId(), b = f + 1;
        if (pivot.isLocal()) {
            pivot.fwd(this, f, b);
            pivot.bwd(this, f, b);
        }
        barrier();
        spawn(new FB(f, bb, F.filterB(b)));
        spawn(new FB(ff, b, B.filterF(f)));
        spawn(new FB(f, b, V.filterFuB(f, b)));
    }
}

Fig. 5. Implementation of the FB algorithm in HipG
All SCCs remaining in V must be entirely contained within F \ B, within B \ F, or within the complement set V \ (F ∪ B). The HipG implementation of the FB algorithm is displayed in Fig. 5. FB creates the subsets F and B of V by executing forward and backward reachability searches from a global pivot. Each set is labeled with a unique pair of integers (f, b). FB spawns three sub-synchronizers to solve the sub-problems on F \ B, B \ F and V \ (F ∪ B). We note that the algorithm in Fig. 5 reflects the original elegant algorithm in Fig. 4. The entire HipG program is 113 lines of code, while a corresponding C/MPI application (see Section 4) has over 1700 lines, and the PBGL implementation has 341 lines.
3 Implementation
HipG is designed to execute in a distributed-memory environment. We chose to implement it in Java because of the portability, performance (due to just-in-time compilation) and excellent software support of the language, although Java required us to carefully ensure that memory is utilized efficiently. We used the Ibis [9] message-passing communication library and the Java 6 virtual machine implemented by Sun [10]. Partitioning an input graph into equal-size chunks means that each chunk contains a similar number of nodes and edges (currently, minimization of the number of edges spanning different chunks is not taken into account). Each worker stores one chunk in the form of an array of nodes. Outgoing edges are not stored within the node object: this would be impractical due to memory overhead (in 64-bit HotSpot this overhead is 16 B per object). As a compromise, nodes are objects but edges are not; rather, all edges are stored in a single large integer array. We note that, although this structure is not elegant, it is transparent to the user, unless explicitly requested, e.g., when the program needs to be highly optimized. In addition, as most of a worker's memory is used to store the graph, we tuned the garbage collector to use a relatively small young-generation size (5–10% of the heap size).
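The flat edge array described above amounts to a compressed adjacency layout. The following sketch shows the general idea, with illustrative field names rather than HipG's actual internals.

// Adjacency stored outside the node objects: all outgoing edges of a chunk
// live in one int array, and each node keeps only an offset into it. This
// avoids the ~16 B per-object header that per-edge objects would cost.
// Illustrative sketch only; HipG's real layout may differ in details.
final class ChunkStorage {
    final int[] edgeTargets; // concatenated neighbor indices of all local nodes
    final int[] firstEdge;   // node v's edges are edgeTargets[firstEdge[v] .. firstEdge[v+1]-1]

    ChunkStorage(int[] edgeTargets, int[] firstEdge) {
        this.edgeTargets = edgeTargets;
        this.firstEdge = firstEdge;
    }

    int degree(int v)          { return firstEdge[v + 1] - firstEdge[v]; }
    int neighbor(int v, int k) { return edgeTargets[firstEdge[v] + k]; }
}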
After reading the graph, a HipG program typically initiates root synchronizers, waits for completion, and handles the computed results. The part of the system that executes synchronizers is referred to as a worker. A worker consists of one main thread that emulates the abstraction of independently executing synchronizers by looping over an array of active synchronizers and making progress with each in turn. When all synchronizers have terminated, the worker returns control to the user's main program. We describe the implementation from the synchronizer's point of view. A synchronizer is given a unique identifier, determined on spawn. Each synchronizer can take one of three actions: it communicates while waiting for a distributed routine to finish; or it proceeds when the distributed routine is finished; or it terminates. The bulk of a synchronizer's communication consists of messages that correspond to methods executed on graph nodes. Such messages contain the identifiers of the synchronizer, the graph, the node and the executed method, followed by the serialized method parameters. The messages are combined in non-blocking buffers and flushed repeatedly. Besides communicating, synchronizers perform distributed routines. Barriers are implemented with the distributed termination detection algorithm by Safra [11]. When a barrier returns, no messages belonging to the synchronizer are present in the system. The reduce operation is also implemented by token traversal [12], and the result is announced to all workers. Before a HipG program can be executed, its Java bytecode has to be instrumented. Besides optimizing object serialization by Ibis [9], the graph program is modified: methods are translated into messages, neighbor access is optimized, and synchronizers are rewritten so that no separate thread is needed per synchronizer instance. The latter is done by translating the blocking routines into a checkpoint followed by a return. This way a worker can execute a synchronizer's run() method step by step. The instrumentation is part of the provided HipG library, and needs to be invoked before execution. No special Java compiler is necessary.

Release. More implementation details and a GPL release of HipG can be found at http://www.cs.vu.nl/~ekr/hipg.
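The checkpoint-and-return rewriting effectively turns run() into a resumable state machine. The sketch below is a hand-written analogue of what such an instrumented synchronizer could look like; the pc field and the helper methods are illustrative assumptions, not HipG's actual generated code.

// Hand-written analogue of an instrumented run(): each blocking barrier()
// becomes "record the resume point, then return", so one worker thread can
// drive many synchronizers by calling step() on each in turn.
abstract class SteppedSynchronizer {
    private int pc = 0;                   // resume point within run()

    abstract void processCurrentLayer();  // code between two barriers
    abstract void startBarrier();         // kick off termination detection
    abstract boolean barrierDone();       // has the barrier completed?
    abstract boolean moreLayers();        // loop condition of run()

    // Returns true once the synchronizer has terminated.
    boolean step() {
        switch (pc) {
            case 0:
                processCurrentLayer();
                startBarrier();
                pc = 1;                   // checkpoint, then yield to the worker
                return false;
            case 1:
                if (!barrierDone()) return false;   // still waiting: yield again
                if (moreLayers()) { pc = 0; return false; }
                return true;              // run() has completed
            default:
                return true;
        }
    }
}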
4 Memory Utilization and Performance Evaluation
In this section we report on the results of experiments conducted with HipG. The evaluation was carried out on the VU cluster of the DAS-3 system [13]. The cluster consists of 85 dual-core, dual-CPU 2.4 GHz Opteron compute nodes, each equipped with 4 GB of memory. The processors are interconnected with Myri-10G (MX) and 1G Ethernet links. The time to initialize workers and input graphs was not included in the measurements. All graphs were partitioned randomly, meaning that if a graph is partitioned into p chunks, a graph node is assigned to a chunk with probability 1/p. The portion of remote edges is thus (p−1)/p (for example, 7/8 = 87.5% for p = 8 and 63/64 ≈ 98.4% for p = 64), which is very high (87–99% in the graphs used) and realistic when modeling an unfavorable partitioning (many edges spanning different chunks). We start with the evaluation of the performance of applications that almost solely communicate (only one synchronizer spawned). Visitor, the reachability search (see Section 2.1), was started at the root node of a large binary tree directed towards the leaves. SSSP, the single-source shortest paths (breadth-first search, see Section 2.2), was started at the root node of the binary tree, and at a random node of a synthetic social
Table 1. Performance of Visitor and SSSP

Appl.    Workers  Input   Time[s]  Mem[GB]
Visitor     8     Bin-27    19.1     2.8
Visitor    16     Bin-28    21.4     2.9
Visitor    32     Bin-29    24.5     3.1
Visitor    64     Bin-29    16.9     2.1
SSSP        8     Bin-27    31.5     2.8
SSSP       16     Bin-28    38.0     3.0
SSSP       32     Bin-29    42.5     3.2
SSSP       64     Bin-29    29.8     2.4
SSSP        8     LN-80     30.8     1.3
SSSP       16     LN-160    33.7     1.5
SSSP       32     LN-320    34.6     1.7
SSSP       64     LN-640    38.5     2.0

Fig. 6. Speedup of Visitor and SSSP (curves for Visitor, SSSP and SSSP-LN against perfect speedup; x-axis: #processors, up to 64)
network. The results are presented in Tab. 1 and Fig. 6. We tested both applications on 8–64 processors on Myrinet. To obtain fairer results, rather than keeping the problem size constant and dividing the input into more chunks, we doubled the problem size when doubling the number of processors (Tab. 1; the exception is Bin-30, which should have been run on 64 processors but did not fit in memory). Thanks to this we avoid spurious improvement due to better cache behavior and keep the heap filled, but also avoid the many small messages that occur if the stored portion of a graph is small. We normalized the results for the speedup computation (Fig. 6). We used binary trees, Bin-n, of height n = 27..29, which have 0.27–1.0 · 10^9 nodes and edges. The LN-n graphs are random directed graphs with node degrees sampled from the log-normal distribution ln N(4, 1.3), aimed to resemble real-world social networks [14, 15]. An LN-n graph has n · 10^5 nodes and n · 1.27 · 10^6 edges. We used LN-n graphs of size n = 80..640, and thus up to 64 · 10^6 nodes and about 8 · 10^8 edges. In each experiment, all edges of the input graphs were visited. Both applications achieved about 60% efficiency on a binary tree graph on 64 processors, which is satisfactory for an application with little computation, O(n), compared to O(n) communication. The efficiency achieved by SSSP on LN-n graphs reaches almost 80%, as the input is more randomized and has a small diameter compared to a binary tree, which reduces the number of barriers performed.
Table 2. Performance comparison of the OBFR-MP SCC-decomposition algorithm tested on three LmLmTn graphs. OM (OpenMPI) and P4 are socket-based MPI implementations, while the MX MPI implementation directly uses the Myrinet interface. Time is given in seconds.

L487L487T5:
 p    MX     Myri-OM  Myri-HipG  Eth-P4  Eth-HipG
 4    36.6   141.4     41.1       94.8    45.7
 8    26.6    81.6     22.1       82.5    30.0
16    96.5    60.5     48.4      179.0    37.0
32    40.0    57.3     39.1      163.4    41.0
64    24.1    46.7     24.4      234.6    41.8

L10L10T16:
 p    MX     Myri-OM  Myri-HipG  Eth-P4  Eth-HipG
 4     69     255      148        302     225
 8     73     280      226        462     330
16     89     376      315        804     506
32    136     661      485       1794     851
64    128     646      277       1659     461

L60L60T11:
 p    MX     Myri-OM  Myri-HipG  Eth-P4  Eth-HipG
 4    45.1   152.9     47.3      110.8    98.8
 8    34.5    99.8     46.8      111.5   116.0
16    37.1   128.6     60.4      216.2   125.9
32    30.1    82.0     57.4      214.7   171.8
64    32.0   108.8     66.1      311.4   141.2
the interface, and OpenMPI, which goes through TCP sockets. On Ethernet we used the standard MPI implementation (P4). We tested OBFR-MP on synthetic graphs called LmLmTn, which are in essence trees of SCCs of height n, such that each SCC is an (m + 1) × (m + 1) lattice. An LmLmTn graph thus has (2^(n+1) − 1) SCCs, each of size (m + 1)^2; for L10L10T16, for instance, this gives 2^17 − 1 = 131,071 SCCs of 11^2 = 121 nodes each, i.e., about 15.9 · 10^6 nodes. The performance of the OBFR-MP algorithm strongly depends on the SCC structure of the input graph. We used three graphs: one with a small number of large SCCs, L487L487T5; one with a large number of small SCCs, L10L10T16; and one that balances the number of SCCs and their size, L60L60T11. Each graph contains a little over 15 · 10^6 nodes and 45 · 10^6 edges. The performance of the C/MPI application running over MX is the fastest, as it has the smallest software stack. The OpenMPI and P4 MPI implementations offer a more realistic comparison, as they use a deeper software stack (sockets), like HipG: HipG ran on average 2.2 times faster than the C/MPI version in this case. Most importantly, the speedup or slowdown of HipG follows the speedup or slowdown of the C/MPI application run over MX, which suggests that the overhead of HipG will not explode with further scaling of the application. The communication pattern of many graph algorithms is intensive all-to-all communication. Generally, message sizes decrease as the number of processors increases. Good performance results from balancing the size of flushed messages and the frequency of flushing: too many flushes decrease performance, while too few flushes cause other processors to stall. The throughput on 32 processors over MX for the Visitor application on Bin-29 is constant (not shown): the application sends 16 GB in 24 s. A worker's memory is divided between the graph, the communication buffers and the memory allocated by the user's code in synchronizers. On a 64-bit machine, a graph node uses 80 B in Visitor and on average 1 KB in SSSP, including the edges and all overhead. Tab. 1 presents the maximum heap size used by a Visitor/SSSP worker. As expected, it remains almost constant. SSSP uses more memory than Visitor, because it stores a queue of nodes (see Section 2.2). The results in this section do not aim to prove that we obtained the most efficient implementations of the Visitor, SSSP or OBFR-MP algorithms. When processing large-scale graphs, speedup is of secondary importance; it is of primary importance to be able to store the graph in memory and process it in acceptable time. We aimed to show that large-scale graphs can be handled by HipG and that satisfactory performance can be obtained with little coding effort, even for complex hierarchical graph algorithms.
5 Related Work
HipG is a distributed framework aimed at providing users with a way to code, with little effort, parallel algorithms that operate on partitioned graphs. An analysis of other platforms suitable for the execution of graph algorithms is provided in an inspiring paper by Lumsdaine et al. [3], which in fact advocates using massively multithreaded shared-memory machines for this purpose. However, such machines are very expensive, and software support is lacking [3]. The library in [17] realizes this concept on a Cray machine. Another interesting alternative would be to use partitioned global address space languages like UPC [18], X10 [19] or ZPL [20], but we are not aware of support for graph algorithms in these languages, except for a shared-memory solution [21] based on X10 and Cilk. The Bulk Synchronous Parallel (BSP) model of computation [22] alternates work and communication phases. We know of two BSP-based libraries that support the development of distributed graph algorithms: CGMgraph and Pregel. CGMgraph [23] uses the unified communication API and parallel routines offered by CGMlib, which is conceptually close to MPI [24]. In Google's Pregel [15] the graph program is a series of supersteps. In each superstep the Compute(messages) method, implemented by the user, is executed in parallel on all vertices. The system supports fault tolerance consisting of heartbeats and checkpointing. Impressively, Pregel is reported to be able to handle billions of nodes and use hundreds of workers. Unfortunately, it is not available for download. Pregel is similar to HipG in two respects: the vertex-centered programming and the automatic composition of the parallel program from user-provided, simple sequential-like components. However, the repeated global synchronization phase in the Bulk Synchronous Parallel model, although suitable for many applications, is not always desirable. HipG is fundamentally different from BSP in this respect, as it uses asynchronous messages with computation synchronized on the user's request. Notably, HipG can simulate the BSP model, as we did in the SSSP application (Section 2.2). The prominent sequential Boost Graph Library (BGL) [4] gave rise to a parallelization that adopts a different approach to graph algorithms. Parallel BGL [5, 6] is a generic C++ library that implements distributed graph data structures and graph algorithms. Its main focus is to reuse existing sequential algorithms, applying them to distributed data structures to obtain parallel algorithms. PBGL supports a rich set of parallel graph implementations and property maps. The system keeps information about ghost (remote) vertices, although that works well only if the number of edges spanning different processors is small. Parallel BGL offers a very general model, while both Pregel and HipG trade expressiveness (for example, neither offers any form of remote read) for more predictable performance. ParGraph [25] is another parallelization of BGL, similar to PBGL but less developed; it does not seem to be maintained. We are not aware of any work directly supporting the development of divide-and-conquer graph algorithms. To store graphs we used the SVC-II distributed graph format advocated in [26]. Graph formats are standardized only within selected communities. In the case of large graphs, binary formats are typically preferable to text-based formats, as compression is not needed. See [26] for a comparison of a number of formats used in the formal-methods community.
A popular text format is XML, which is used for example to store Wikipedia [27]. RDF [28] is used to represent semantic graphs in the form of
triples (source, edge, target). In contrast, in bioinformatics, graphs are stored in many databases, and integrating them is ongoing research [29].
6 Conclusions and Future Work
In this paper we described HipG, a model and a distributed framework that allows users to code, with little effort, hierarchical parallel graph algorithms. The parallel program is automatically composed of sequential-like components provided by the user: node methods and synchronizers, which coordinate distributed computations. We realized the model in Java and obtained short and elegant implementations of several published graph algorithms, good memory utilization and performance, as well as out-of-the-box portability. Fault tolerance has not been implemented in the current version of HipG, as the programs we have executed so far run on a cluster and were not mission-critical. A solution using checkpointing could be implemented, in which, when a machine fails, a new machine is requested and the entire computation is restarted from the last checkpoint. Such a solution is standard and similar to the one used in [15]. Creating a checkpoint takes somewhat more effort because of the lack of global synchronization phases in HipG. Creating a consistent image of the state space could be done either by freezing the entire computation or with a distributed snapshot algorithm running in the background, such as the one by Lai and Yang [12]. A distributed snapshot imposes overhead on messages, which can, however, be minimized by message combining, as done in HipG. HipG is work in progress. We would like to improve speedup by using better graph partitioning methods, e.g., [1]. If needed, we could implement graph modification at runtime, although in all the cases we have looked at, this could be solved by creating new graphs during execution, which is possible in HipG. We are currently working on providing tailored support for multicore processors and on extending the framework to execute on a grid. Currently the size of the graph that can be handled is limited by the amount of memory available. Therefore, we are interested in whether a portion of a graph could be temporarily stored on disk without completely sacrificing efficiency [30].

Acknowledgments. We thank Jaco van de Pol, who initiated this work and provided C code, and Ceriel Jacobs for helping with the implementation.
References
1. Karypis, G., Kumar, V.: A parallel algorithm for multilevel graph partitioning and sparse matrix ordering. J. of Par. and Distr. Computing 48(1), 71–95 (1998)
2. Feige, U., Krauthgamer, R.: A polylog approximation of the minimum bisection. SIAM Review 48(1), 99–130 (2006)
3. Lumsdaine, A., Gregor, D., Hendrickson, B., Berry, J.: Challenges in parallel graph processing. PPL 17(1), 5–20 (2007)
4. Siek, J., Lee, L.-Q., Lumsdaine, A.: The Boost Graph Library. Addison-Wesley, Reading (2002)
5. Gregor, D., Lumsdaine, A.: The parallel BGL: A generic library for distributed graph computations. In: Parallel Object-Oriented Scientific Computing (2005)
6. Gregor, D., Lumsdaine, A.: Lifting sequential graph algorithms for distributed-memory parallel computation. OOPSLA 40(10), 423–437 (2005)
7. Cormen, T., Leiserson, C., Rivest, R.: Introduction to Algorithms. MIT Press, Cambridge (1990)
8. Fleischer, L., Hendrickson, B., Pinar, A.: On identifying strongly connected components in parallel. In: Rolim, J.D.P. (ed.) IPDPS-WS 2000. LNCS, vol. 1800, pp. 505–511. Springer, Heidelberg (2000)
9. Bal, H.E., Maassen, J., van Nieuwpoort, R., Drost, N., Kemp, R., van Kessel, T., Palmer, N., Wrzesińska, G., Kielmann, T., van Reeuwijk, K., Seinstra, F., Jacobs, C., Verstoep, K.: Real-world distributed computing with Ibis. IEEE Computer 43(8), 54–62 (2010)
10. The Java SE HotSpot virtual machine, java.sun.com/products/hotspot
11. Dijkstra, E.W.: Shmuel Safra's version of termination detection. Circulated privately (January 1987)
12. Tel, G.: Introduction to Distributed Algorithms. Cambridge University Press, Cambridge (2000)
13. Distributed ASCI Supercomputer DAS-3, www.cs.vu.nl/das3
14. Pennock, D.M., Flake, G.W., Lawrence, S., Glover, E.J., Giles, C.L.: Winners don't take all: Characterizing the competition for links on the web. PNAS 99(8), 5207–5211 (2002)
15. Malewicz, G., Austern, M.H., Bik, A.J., Dehnert, J.C., Horn, I., Leiser, N., Czajkowski, G.: Pregel: A system for large-scale graph processing. In: SIGMOD, pp. 135–146 (2010)
16. Barnat, J., Chaloupka, J., van de Pol, J.: Improved distributed algorithms for SCC decomposition. In: PDMC 2007. ENTCS, vol. 198(1), pp. 63–77 (2008)
17. Berry, J., Hendrickson, B., Kahan, S., Konecny, P.: Graph software development and performance on the MTA-2 and Eldorado. In: 48th Cray Users Group Meeting (2006)
18. Coarfa, C., et al.: An evaluation of global address space languages: Co-array Fortran and Unified Parallel C. In: PPoPP 2005, pp. 36–47. ACM, New York (2005)
19. Charles, P., et al.: X10: An object-oriented approach to non-uniform cluster computing. In: OOPSLA, pp. 519–538. ACM, New York (2005)
20. Chamberlain, B.L., Choi, S.-E., Lewis, E.C., Snyder, L., Weathersby, W.D., Lin, C.: The case for high-level parallel programming in ZPL. IEEE Comput. Sci. Eng. 5(3), 76–86 (1998)
21. Cong, G., Kodali, S., Krishnamoorthy, S., Lea, D., Saraswat, V., Wen, T.: Solving large, irregular graph problems using adaptive work-stealing. In: ICPP, pp. 536–545. IEEE, Los Alamitos (2008)
22. Valiant, L.: A bridging model for parallel computation. Comm. ACM 33(8), 103–111 (1990)
23. Chan, A., Dehne, F., Taylor, R.: CGMgraph/CGMlib: Implementing and testing CGM graph algorithms on PC clusters and shared memory machines. J. of HPC App. 19(1), 81–97 (2005)
24. MPI Forum: MPI: A message passing interface. J. of Supercomp. Appl. 8(3/4), 169–416 (1994)
25. Hielscher, F., Gottschling, P.: ParGraph library, pargraph.sourceforge.net (2004)
26. Blom, S., van Langevelde, I., Lisser, B.: Compressed and distributed file formats for labeled transition systems. In: PDMC 2003. ENTCS, vol. 89, pp. 68–83 (2003)
27. Denoyer, L., Gallinari, P.: The Wikipedia XML corpus. SIGIR Forum 40(1), 64–69 (2006)
28. Resource Description Framework, http://www.w3.org/RDF
29. Joyce, A.R., Palsson, B.O.: The model organism as a system: Integrating 'omics' data sets. Nat. Rev. Mol. Cell Biol. 7(3), 198–210 (2006)
30. Hammer, M., Weber, M.: To store or not to store reloaded: Reclaiming memory on demand. In: Brim, L., Haverkort, B.R., Leucker, M., van de Pol, J. (eds.) FMICS 2006 and PDMC 2006. LNCS, vol. 4346, pp. 51–66. Springer, Heidelberg (2007)
Affinity Driven Distributed Scheduling Algorithm for Parallel Computations

Ankur Narang¹, Abhinav Srivastava¹, Naga Praveen Kumar¹, and Rudrapatna K. Shyamasundar²

¹ IBM Research - India, New Delhi
² Tata Institute of Fundamental Research, Mumbai
Abstract. With the advent of many-core architectures, efficient scheduling of parallel computations for higher productivity and performance has become very important. Distributed scheduling of parallel computations on multiple places¹ needs to follow affinity and to deliver efficient space, time and message complexity. Simultaneous consideration of these factors makes affinity driven distributed scheduling particularly challenging. In this paper, we address this challenge by using a low time- and message-complexity mechanism for ensuring affinity and a randomized work-stealing mechanism within places for load balancing. This paper presents an online algorithm for affinity driven distributed scheduling of multi-place² parallel computations. A theoretical analysis of the expected and probabilistic lower and upper bounds on the time and message complexity of this algorithm is provided. On well-known benchmarks, our algorithm demonstrates 16% to 30% performance gains as compared to Cilk [6] on a multi-core Intel Xeon 5570 architecture. Further, detailed experimental analysis shows the scalability of our algorithm along with efficient space utilization. To the best of our knowledge, this is the first time an affinity driven distributed scheduling algorithm has been designed and theoretically analyzed in a multi-place setup for many-core architectures.
1 Introduction
The exascale computing roadmap has highlighted efficient locality-oriented scheduling in runtime systems as one of the most important challenges (the "Concurrency and Locality" challenge [10]). Massively parallel many-core architectures have NUMA characteristics in memory behavior, with a large gap between local and remote memory latency. Unless efficiently exploited, this gap is detrimental to scalable performance. Languages such as X10 [9], Chapel [8] and Fortress [4] are based on the partitioned global address space (PGAS [11]) paradigm. They have been designed and implemented as part of the DARPA HPCS program³ for higher productivity and performance on many-core massively parallel platforms. These languages have in-built support for the initial placement of threads (also referred to as activities) and data structures in the parallel program.
¹ Place is a group of processors with shared memory.
² Multi-place refers to a group of places. For example, with each place as an SMP (Symmetric MultiProcessor), multi-place refers to a cluster of SMPs.
³ www.highproductivity.org/
Therefore, locality comes implicitly with the program. The run-time systems of these languages need to provide efficient algorithmic scheduling of parallel computations with medium- to fine-grained parallelism. For handling large parallel computations, the scheduling algorithm (in the run-time system) should be designed to work in a distributed fashion. This is also imperative for scalable performance on many-core architectures. Further, the execution of the parallel computation unfolds dynamically as an execution graph. It is difficult for the compiler to always correctly predict the structure of this graph and hence perform correct scheduling and optimizations, especially for data-dependent computations. Therefore, in order to schedule generic parallel computations and to exploit runtime execution and data access patterns, the scheduling should happen in an online fashion. Moreover, in order to mitigate the communication overheads in scheduling and in the parallel computation, it is essential to follow the affinity inherent in the computation. Simultaneous consideration of these factors, along with low time and message complexity, makes distributed scheduling a very challenging problem. In this paper, we address the following affinity driven distributed scheduling problem.

Given: (a) An input computation DAG (Fig. 1) that represents a parallel multithreaded computation with fine- to medium-grained parallelism. Each node in the DAG is a basic operation such as and/or/add etc. and is annotated with a place identifier which denotes where that node should be executed. Each edge in the DAG represents one of the following: (i) the spawn of a new thread, (ii) sequential flow of execution, or (iii) a synchronization dependency between two nodes. The DAG is a strict parallel computation DAG (a synchronization dependency edge represents an activity waiting for the completion of a descendant activity; details in Section 3). (b) A cluster of n SMPs (refer to Fig. 2) as the target architecture on which to schedule the computation DAG. Each SMP⁴, also referred to as a place, has a fixed number (m) of processors and memory. The cluster of SMPs is referred to as the multi-place setup.

Determine: An online schedule for the nodes of the computation DAG in a distributed fashion that ensures the following: (a) exact mapping of nodes onto places as specified in the input DAG; (b) low space, time and message complexity of execution.

In this paper, we present the design of a novel affinity driven, online, distributed scheduling algorithm with low time and message complexity. The algorithm assumes initial placement annotations on the given parallel computation, chosen with consideration of load balance across the places. The algorithm controls the online expansion of the computation DAG. Our algorithm employs an efficient remote-spawn mechanism across places for ensuring affinity. Randomized work stealing within a place helps in load balancing. Our main contributions are:
– We present a novel affinity driven, online, distributed scheduling algorithm. This algorithm is designed for strict multi-place parallel computations.
– Using theoretical analysis, we prove that the lower bound of the expected execution time is O(max_k T_1^k / m + T_{∞,n}) and the upper bound is O(Σ_k (T_1^k / m + T_∞^k)), where k ranges over the places from 1 to n, m denotes the number of processors per place, T_1^k denotes the execution time on a single processor for place
Symmetric MultiProcessor: group of processors with shared memory.
Affinity Driven Distributed Scheduling Algorithm for Parallel Computations
169
k, and T∞,n denotes the execution time of the computation on n places with infinite processors on each place. Expected and probabilistic lower and upper bounds for the message complexity have also been provided. – On well known parallel benchmarks (Heat, Molecular Dynamics and Conjugate Gradient), we demonstrate performance gains of around 16% to 30% over Cilk on multi-core architectures. Detailed analysis shows the scalability of our algorithm as well as efficienct space utilization.
2 Related Work

Scheduling of dynamically created tasks for shared-memory multiprocessors has been a well studied problem. The work on Cilk [6] promoted the strategy of randomized work stealing: a processor that has no work (the thief) randomly steals work from another processor (the victim) in the system. [6] proved efficient bounds on space ($O(P \cdot S_1)$) and time ($O(T_1/P + T_\infty)$) for scheduling fully-strict computations (where synchronization dependency edges go from a thread only to its immediate parent thread; see Section 3) on an SMP platform; here P is the number of processors, $T_1$ and $S_1$ are the time and space for sequential execution respectively, and $T_\infty$ is the execution time on infinite processors. We consider locality oriented scheduling in distributed environments and hence are more general than Cilk.

The importance of data locality for scheduling threads motivated work stealing with data locality [1], wherein the data locality was discovered on the fly and maintained as the computation progressed. This work also explored initial placement for scheduling and provided experimental results to show the usefulness of the approach; however, affinity was not always followed, the scope of the algorithm was limited to SMP environments, and its time complexity was not analyzed. [5] analyzed the time complexity ($O(T_1/P + T_\infty)$) of scheduling general parallel computations on SMP platforms, but does not consider locality oriented scheduling. We consider the distributed scheduling problem across multiple places (a cluster of SMPs) while ensuring affinity, and also provide time and message complexity bounds. [7] considers work-stealing algorithms in a distributed-memory environment, with adaptive parallelism and fault-tolerance. There, task migration was entirely pull-based (via a randomized work stealing algorithm), hence it ignored affinity, and it did not provide any formal proof of its resource utilization properties.

The work in [2] described a multi-place (distributed) deployment for parallel computations for which an initial-placement based scheduling strategy is appropriate. A multi-place deployment has multiple places connected by an interconnection network, where each place has multiple processors connected as in an SMP platform. It showed that online greedy scheduling of multi-threaded computations may lead to physical deadlock in the presence of bounded space and communication resources per place. However, the computation did not always respect affinity, and no time or communication bounds were provided; also, the aspect of load balancing was not addressed even within a place. We ensure affinity along with intra-place load balancing in a multi-place setup, and we show empirically that our algorithm has efficient space utilization as well.
3 System and Computation Model

The system on which the computation DAG is scheduled is assumed to be a cluster of SMPs connected by an Active Message Network (Fig. 2). Each SMP is a group of processors with shared memory; each SMP is also referred to as a place in this paper. Active Messages (AM), as defined by AM-2 (http://now.cs.berkeley.edu/AM/active_messages.html), is a low-level, lightweight RPC (remote procedure call) mechanism that supports unordered, reliable delivery of matched request/reply messages. We assume that there are n places and that each place has m processors (also referred to as workers).

The parallel computation to be dynamically scheduled on the system is assumed to be specified by the programmer in languages such as X10 and Chapel. To describe our distributed scheduling algorithm, we assume that the parallel computation has a DAG (directed acyclic graph) structure and consists of nodes that represent basic operations like and, or, not, add, and so forth. There are edges between the nodes in the computation DAG (Fig. 1) that represent creation of new activities (spawn edges), sequential flow of execution between the nodes within a thread/activity (continue edges), and synchronization dependencies between nodes (dependence edges). In this paper we refer to the parallel computation to be scheduled as the computation DAG. At a higher level, the parallel computation can also be viewed as a computation tree of activities. Each activity is a thread (as in multi-threaded programs) of execution and consists of a set of nodes (basic operations). Each activity is assigned to a specific place (the affinity specified by the programmer). Hence, such a computation is called a multi-place computation and its DAG is referred to as a place-annotated computation DAG (Fig. 1: v1..v20 denote nodes, T1..T6 denote activities and P1..P3 denote places).

Based on the structure of the dependencies between the nodes in the computation DAG, there can be the following types of parallel computations: (a) fully-strict computation: dependencies are only between the nodes of a thread and the nodes of its immediate parent thread; (b) strict computation: dependencies are only between the nodes of a thread and the nodes of any of its ancestor threads; (c) terminally strict computation (Fig. 1): dependencies arise due to an activity waiting for the completion of its descendants. Every dependency edge, therefore, goes from the last instruction of an activity to one of its ancestor activities, with the following restriction: in a subtree rooted at an activity Γr, if there exists a dependence edge from any activity in the subtree to the root activity Γr, then there cannot exist any dependence edge from the activities in the subtree to the ancestors of Γr.

The following notations are used in the paper. P = {P1, · · ·, Pn} denotes the set of places. {$W_i^1$, $W_i^2$, .., $W_i^m$} denotes the set of workers at place Pi. S1 denotes the space required by a single-processor execution schedule. Smax denotes the size in bytes of the largest activation frame in the computation. Dmax denotes the maximum depth of the computation tree in terms of the number of activities. $T_{\infty,n}$ denotes the execution time of the computation DAG over n places with infinite processors at each place. $T_\infty^k$ denotes the execution time for the activities assigned to place $P_k$ using infinite processors. Note that $T_{\infty,n} \leq \sum_{1 \leq k \leq n} T_\infty^k$. $T_1^k$ denotes the time taken by a single processor for the activities assigned to place k.
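To make the place-annotated DAG concrete, here is a minimal sketch of a node representation; it is ours, in Python, and the names (Node, EdgeKind) are illustrative assumptions, not taken from any X10 or Chapel runtime:

```python
# Minimal sketch (illustrative) of a place-annotated computation DAG node.
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Tuple

class EdgeKind(Enum):
    SPAWN = 1        # creation of a new activity (thread)
    CONTINUE = 2     # sequential flow within an activity
    DEPENDENCE = 3   # synchronization dependency between nodes

@dataclass
class Node:
    op: str                       # basic operation: "and", "or", "add", ...
    place: int                    # affinity annotation: place identifier
    activity: int                 # activity (thread) this node belongs to
    out_edges: List[Tuple[EdgeKind, "Node"]] = field(default_factory=list)

# e.g., a node of activity T1 pinned to place P1 spawning activity T2 at P2:
v1 = Node(op="add", place=1, activity=1)
v3 = Node(op="or", place=2, activity=2)
v1.out_edges.append((EdgeKind.SPAWN, v3))
```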
Fig. 1. Place-annotated Computation DAG (activities T1@P1, T2@P2, T3@P3, T4@P3, T5@P2 and T6@P1 over nodes v1–v20, with spawn, continue and dependence edges)

Fig. 2. Multiple Places: Cluster of SMPs (SMP nodes of PEs with L2 cache and memory on a system bus, connected by an Active Message Network interconnect)
4 Distributed Scheduling Algorithm

Consider a strict place-annotated computation DAG. The distributed scheduling algorithm described below schedules activities with affinity at only their respective places. Within a place, work stealing is enabled to allow load-balanced execution of the computation sub-graph associated with that place. The computation DAG unfolds in an online fashion in a breadth-first manner across places, as the affinity driven activities are pushed onto their respective remote places. For space efficiency, before a place-annotated activity is pushed onto a place, the remote place's buffer (the FAB, see below) is checked for space utilization. If the space utilization of the remote buffer (FAB) is high, then the push is delayed for a limited amount of time. This enables an appropriate space-time tradeoff for the execution of the parallel computation. Within a place, the online unfolding of the computation DAG happens in a depth-first manner to enable space- and time-efficient execution. Sufficient space is assumed to exist at each place, so physical deadlocks due to lack of space cannot happen in this algorithm.

Each place maintains a Fresh Activity Buffer (FAB), which is managed by a dedicated processor (different from the workers) at that place. An activity that has affinity for a remote place is pushed into the FAB at that place. Each worker at a place has a Ready Deque and a Stall Buffer (refer to Fig. 3). The Ready Deque of a worker contains the activities of the parallel computation that are ready to execute. The Stall Buffer contains the activities that have been stalled due to a dependency on other activities in the parallel computation. The FAB at each place, as well as the Ready Deque at each worker, uses a concurrent deque implementation. An idle worker at a place will attempt to randomly steal work from other workers at the same place (randomized work stealing). Note that an activity which is pushed onto a place can move between workers at that place (due to work stealing) but cannot move to another place, and thus obeys its affinity at all times. The distributed scheduling algorithm is given below.
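For concreteness, the following is a minimal sketch of the per-place structures (shown schematically in Fig. 3 below) and the idle-worker behaviour just described. It is our illustration in Python; the constants FAB_THRESHOLD, DELTA_T and the fixed FAB capacity are assumptions, and the paper's actual implementation (Section 5) uses pthreads:

```python
import collections, random, time

FAB_THRESHOLD = 0.9     # assumed space-utilization threshold for remote pushes
DELTA_T = 0.001         # assumed bounded retry delay (the paper's delta_t)
FAB_CAPACITY = 1024     # assumed fixed FAB capacity, used only for utilization

class Worker:
    def __init__(self, place):
        self.place = place
        self.ready_deque = collections.deque()   # ready activities
        self.stall_buffer = set()                # stalled activities

    def next_activity(self):
        if self.ready_deque:
            return self.ready_deque.pop()        # own work: bottom of deque
        victim = random.choice(self.place.workers)
        if victim is not self and victim.ready_deque:
            return victim.ready_deque.popleft()  # steal from top of a victim
        if self.place.fab:
            return self.place.fab.popleft()      # failed steal: try the FAB
        return None                              # idle; retry at the next step

class Place:
    def __init__(self, num_workers):
        self.fab = collections.deque()           # Fresh Activity Buffer
        self.workers = [Worker(self) for _ in range(num_workers)]

    def fab_utilization(self):
        return len(self.fab) / FAB_CAPACITY

def remote_spawn(activity, remote_place):
    """Affinity driven push of AM(B); delays while the remote FAB is too full."""
    while remote_place.fab_utilization() >= FAB_THRESHOLD:
        time.sleep(DELTA_T)                      # wait delta_t, then retry
    remote_place.fab.append(activity)
```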
Fig. 3. Affinity Driven Distributed Scheduling Algorithm (each place has a FAB managed by a dedicated processor and workers with Ready Deques and Stall Buffers; remote spawn requests AM(B) travel between places)
At any step, an activity A at the r-th worker $W_i^r$ (at place i) may perform the following actions:

1. Spawn: (a) A spawns activity B at place Pj, i ≠ j: A sends AM(B) (the active message for B) to the remote place. If the space utilization of FAB(j) is below a given threshold, then AM(B) is successfully inserted into FAB(j) (at Pj) and A continues execution. Otherwise, this worker waits for a limited time, δt, before retrying the spawn of activity B on place Pj (Fig. 3). (b) A spawns B locally: B is successfully created and starts execution, whereas A is pushed onto the bottom of the Ready Deque.

2. Terminates (A terminates): The worker $W_i^r$ at place Pi where A terminated picks an activity from the bottom of its Ready Deque for execution. If none is available in its Ready Deque, it steals from the top of other workers' Ready Deques. Each failed attempt to steal from another worker's Ready Deque is followed by an attempt to get the topmost activity from the FAB at that place. If there is no activity in the FAB, then another victim worker is chosen from the same place.

3. Stalls (A stalls): An activity may stall due to dependencies, in which case it is put into the Stall Buffer in a stalled state. The worker then proceeds as in Terminates (case 2) above.

4. Enables (A enables B): An activity A (after termination or otherwise) may enable a stalled activity B, in which case the state of B changes to enabled and B is pushed onto the top of the Ready Deque.

4.1 Time Complexity Analysis

The time complexity of this affinity driven distributed scheduling algorithm, in terms of the number of throws during execution, is presented below. Each throw represents an attempt by a worker (thief) to steal an activity from either another worker (victim) or the FAB at the same place.

Lemma 1. Consider a strict place-annotated computation DAG with work per place, $T_1^k$, being executed by the distributed scheduling algorithm presented in Section 4. Then, the execution (finish) time for place k is $O(T_1^k/m + Q_r^k/m + Q_e^k/m)$, where $Q_r^k$ denotes the number of throws when there is at least one ready node at place k and $Q_e^k$ denotes the number of throws when there are no ready nodes at place k. The lower bound on the execution time of the full computation is $O(\max_k (T_1^k/m + Q_r^k/m))$ and the upper bound is $O(\sum_k (T_1^k/m + Q_r^k/m))$.

Proof Sketch: (Token-based counting argument.) Consider three buckets at each place into which tokens are placed: a work bucket, where a token is placed when a worker at that place executes a node of the computation DAG; a ready-node-throw bucket, where a token is placed when a worker attempts to steal and there is at least one ready node at that place; and a null-node-throw bucket, where a token is placed when a worker attempts to steal and there are no ready nodes at that place (this models the wait time when there is no work at that place). The total finish time of a place can be computed by counting the tokens in these three buckets and by considering load-balanced execution within a place (using randomized work stealing). The upper and lower bounds on the execution time arise from the structure of the computation DAG and the structure of the online schedule generated. The detailed proof is presented in [3].

Next, we compute the bound on the number of tokens in the ready-node-throw bucket, using a potential function based analysis [5]. Our unique contribution is in proving the lower and upper bounds on time and message complexity for the multi-place affinity driven distributed scheduling algorithm of Section 4, which involves both intra-place work stealing and remote-place affinity driven work pushing. Let there be a non-negative potential associated with each ready node in the computation DAG. If the execution of node u enables node v, then edge (u, v) is called the enabling edge and u is called the designated parent of v. The subgraph of the computation DAG consisting of enabling edges forms a tree, called the enabling tree. During the execution of the scheduling algorithm, the weight of a node u in the enabling tree is defined as $w(u) = T_{\infty,n} - depth(u)$. For a ready node u, we define $\phi_i(u)$, the potential of u at timestep i, as:

$\phi_i(u) = 3^{2w(u)-1}$, if u is assigned;   (4.1a)
$\phi_i(u) = 3^{2w(u)}$, otherwise.   (4.1b)

All non-ready nodes have zero potential. The potential at step i, $\phi_i$, is the sum of the potentials of all the ready nodes at step i. When an execution begins, the only ready node is the root node, with potential $\phi_0 = 3^{2T_{\infty,n}-1}$. At the end, the potential is 0, since there are no ready nodes. Let $E_i$ denote the set of workers whose Ready Deque is empty at the beginning of step i, and let $D_i$ denote the set of all other workers, with non-empty Ready Deques. Let $F_i$ denote the set of all ready nodes present across the FABs at all places. The total potential can be partitioned into three parts as follows:

$\phi_i = \phi_i(E_i) + \phi_i(D_i) + \phi_i(F_i)$   (4.2)
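For intuition, here is a worked instance of the potential-drop computation, in the style of [5] (this calculation is ours; the paper defers the details to [3]). Suppose an assigned ready node u is executed, enabling two children v1, v2 with w(v1) = w(v2) = w(u) − 1, of which one stays assigned to the executing worker while the other is placed on its Ready Deque. Then

$\phi_{before} = 3^{2w(u)-1}$, and $\phi_{after} = 3^{2w(v_1)-1} + 3^{2w(v_2)} = 3^{2w(u)-3} + 3^{2w(u)-2} = \frac{4}{9} \cdot 3^{2w(u)-1}$,

so each such step removes at least a 5/9 fraction of the executed node's potential. Similarly, when an unassigned ready node is stolen and becomes assigned, its potential drops from $3^{2w(u)}$ to $3^{2w(u)-1}$, i.e., by a factor of 3.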
Actions such as the assignment of a node from the Ready Deque to a worker for execution, the stealing of nodes from the top of a victim's Ready Deque, and the execution of a node all lead to a decrease of potential. The idle workers at a place alternate their work-stealing attempts between the Ready Deques and the FAB; thus, 2m throws in a round consist of m throws to other workers' Ready Deques and m throws to the FAB. For randomized work stealing, one can use the balls-and-bins game [3] to compute the expected and probabilistic bounds on the number of throws. Using this, one can show that whenever m or more throws occur for getting nodes from the top of the Ready Deques of other workers at the same place, the potential decreases by a constant fraction of $\phi_i(D_i)$ with constant probability. The component of the potential associated with the FAB at place $P_k$, $\phi_i^k(F_i)$, can be shown to decrease deterministically for m throws in a round. Furthermore, at each place the potential also drops by a constant factor of $\phi_i^k(E_i)$. The detailed analysis of the decrease of potential for each component is given in [3]. Analyzing the rate of decrease of potential and using Lemma 1 leads to the following theorem.

Theorem 1. Consider a strict place-annotated computation DAG with work per place k, denoted by $T_1^k$, being executed by the affinity driven multi-place distributed scheduling algorithm (Section 4).
Let the critical-path length for the computation be $T_{\infty,n}$. The lower bound on the expected execution time is $O(\max_k T_1^k/m + T_{\infty,n})$ and the upper bound is $O(\sum_k (T_1^k/m + T_\infty^k))$. Moreover, for any $\epsilon > 0$, the lower bound on the execution time is $O(\max_k T_1^k/m + T_{\infty,n} + \log(1/\epsilon))$ with probability at least $1 - \epsilon$. A similar probabilistic upper bound exists.

Proof Sketch: For the lower bound, we analyze the number of throws (into the ready-node-throw bucket) by breaking the execution into phases of $\Theta(P = mn)$ throws ($O(m)$ throws per place). It can be shown that, with constant probability, a phase causes the potential to drop by a constant factor; more precisely, between phases i and i+1, $Pr\{(\phi_i - \phi_{i+1}) \geq \phi_i/4\} > 1/4$ (details in [3]). Since the potential starts at $\phi_0 = 3^{2T_{\infty,n}-1}$, ends at zero, and takes integral values, the number of successful phases is at most $(2T_{\infty,n}-1)\log_{4/3} 3 < 8T_{\infty,n}$. Thus, the expected number of throws per place is bounded by $O(T_{\infty,n} \cdot m)$, and the number of throws is $O(T_{\infty,n} \cdot m + \log(1/\epsilon))$ with probability at least $1-\epsilon$ (using the Chernoff inequality). Using Lemma 1, we get the lower bound on the expected execution time as $O(\max_k T_1^k/m + T_{\infty,n})$. The detailed proof and the probabilistic bounds are presented in [3].

For the upper bound, consider the execution of the subgraph of the computation at each place. The number of throws into the ready-node-throw bucket per place can be similarly bounded by $O(T_\infty^k \cdot m)$. Further, the place that finishes the execution last can end up with a number of tokens in its null-node-throw bucket equal to the number of tokens in the work buckets and ready-node-throw buckets of all other places. Hence, the finish time for this place, which is also the execution time of the full computation DAG, is $O(\sum_k (T_1^k/m + T_\infty^k))$. The probabilistic upper bound can be similarly established using the Chernoff inequality.

The following theorem bounds the message complexity of the affinity driven work stealing algorithm of Section 4.

Theorem 2. Consider the execution of a strict place-annotated computation DAG with critical-path length $T_{\infty,n}$ by the affinity driven distributed scheduling algorithm (Section 4). Then, the total number of bytes communicated across places is $O(I \cdot (S_{max} + n_d))$, and the lower bound on the number of bytes communicated within a place has expectation $O(m \cdot T_{\infty,n} \cdot S_{max} \cdot n_d)$, where $n_d$ is the maximum number of dependence edges from descendants to a parent and I is the number of remote spawns from one place to a remote place. Moreover, for any $\epsilon > 0$, the probability is at least $1 - \epsilon$ that the lower bound on the communication overhead per place is $O(m \cdot (T_{\infty,n} + \log(1/\epsilon)) \cdot n_d \cdot S_{max})$. Similar message upper bounds exist.

Proof. First consider inter-place messages. The number of affinity driven pushes to remote places is O(I), each of at most $O(S_{max})$ bytes. Further, there can be at most $n_d$ dependencies from remote descendants to a parent, each of which involves communication of a constant, O(1), number of bytes. So the total inter-place communication is $O(I \cdot (S_{max} + n_d))$. Since the randomized work stealing is within a place, the lower bound on the expected number of steal attempts per place is $O(m \cdot T_{\infty,n})$, with each steal attempt requiring $S_{max}$ bytes of communication within the place. Further, there can be communication when a child thread enables its parent and puts the parent into the child processor's Ready Deque. Since this can happen $n_d$ times for each time
the parent is stolen, the communication involved is at most $n_d \cdot S_{max}$. So the expected total intra-place communication across all places is $O(n \cdot m \cdot T_{\infty,n} \cdot S_{max} \cdot n_d)$. The probabilistic bound can be derived using the Chernoff inequality and is omitted for brevity. Similarly, expected and probabilistic upper bounds can be established for the communication complexity within the places.
5 Results and Analysis

We implemented our distributed scheduling algorithm (ADS) and a pure Cilk-style work-stealing scheduler (CWS) using the pthreads (NPTL) API. The code was compiled using gcc version 4.1.2 with options -O2 and -m64. Using well known benchmarks, the performance of ADS was compared with CWS and also with the original Cilk scheduler (http://supertech.csail.mit.edu/cilk/; referred to as CORG in this section). These benchmarks are the following. Heat: Jacobi over-relaxation that simulates heat propagation on a two-dimensional grid for a number of steps [1]; for our scheduling algorithm (ADS), the 2D grid is partitioned uniformly across the available cores (the Dmax for this benchmark is log(numCols/leafmaxcol), where numCols is the number of columns in the input two-dimensional grid and leafmaxcol is the number of columns to be processed by a single thread). Molecular Dynamics (MD): a classical Molecular Dynamics simulation using the Velocity Verlet time integration scheme; the simulation was carried out on 16K particles for 100 iterations. Conjugate Gradient (an NPB benchmark, http://www.nas.nasa.gov/NPB/Software): Conjugate Gradient (CG) approximates the largest eigenvalue of a sparse, symmetric, positive definite matrix using inverse iteration. The matrix is generated by summing outer products of sparse vectors, with a fixed number of nonzero elements in each generating vector. The benchmark computes a given number of eigenvalue estimates, referred to as outer iterations, using 25 iterations of the CG method to solve the linear system in each outer iteration.

The performance comparison between ADS and CORG was done on an Intel multi-core platform. This platform has 16 cores (2.93 GHz, Intel Xeon 5570, Nehalem architecture) with an 8MB L3 cache per chip and around 64GB memory. The Intel Xeon 5570 has NUMA characteristics even though it exposes SMP-style programming. Fig. 4 compares the performance for the Heat benchmark (matrix: 32K × 4K, number of iterations = 100, leafmaxcol = 32). Both ADS and CORG demonstrate strong scalability. Initially, ADS is around 1.9× better than CORG, but later this gap stabilizes at around 1.20×.

5.1 Detailed Performance Analysis

In this section, we analyze the performance gains obtained by our ADS algorithm vs. the Cilk-style scheduling (CWS) algorithm, and also investigate the behavior of our algorithm on the Power6 multi-core architecture. Fig. 5 demonstrates the gain in performance of ADS vs. CWS with 16 cores. For CG, a Class B matrix is chosen with parameters: NA = 75K, Non-Zero = 13M, Outer iterations = 75, SHIFT = 60.
Fig. 4. CORG vs ADS: strong scalability comparison on Heat; total time in seconds on 2, 4, 8, 16 cores (CORG: 1623, 812, 415, 244; ADS: 859, 683, 348, 195)

Fig. 5. ADS vs CWS: performance comparison on 16 cores; total time in seconds (CG: 45.7 vs 31.9; Heat: 12.2 vs 9.8; MD: 10.6 vs 8.9)

Fig. 6. ADS vs CWS: work-stealing and FAB overheads in seconds (CWS_WS_time, ADS_WS_time and ADS_Fab_Overhead for CG, Heat, MD)
For Heat, the parameter values chosen are: matrix size = 32K × 4K, number of iterations = 100 and leafmaxcol = 32. While CG has a maximum gain of 30%, MD shows a gain of 16%. Fig. 6 shows the overheads due to work stealing and FAB stealing in ADS and CWS. ADS has lower work stealing overhead because work stealing happens only within a place. For CG, the work steal time for ADS (5s) is 3.74× better than for CWS (18.7s). For Heat and MD, the ADS work steal time is 4.1× and 2.8× better, respectively, compared to CWS. ADS has FAB overheads, but this time is very small, around 13% to 22% of the corresponding work steal time. CWS has higher work stealing overhead because work stealing happens from any place to any other place; hence, the NUMA delays add up to give a larger work steal time. This demonstrates the superior execution efficiency of our algorithm over CWS.

We measured the detailed characteristics of our scheduling algorithm on a multi-core Power6 platform. This has 16 Power6 cores and a total of 128GB memory. Each core has a 64KB L1 instruction cache and a 64KB L1 data cache, along with a 4MB semi-private unified L2 cache. Two cores on a Power6 chip share an external 32MB L3 cache.

Fig. 7 plots the variation of the work stealing time, the FAB stealing time and the total time with changing configurations of the multi-place setup, for the MD benchmark. With the total number of cores held constant at 16, the configurations, in the format (number of places ∗ number of processors per place), are: (a) (16 ∗ 1), (b) (8 ∗ 2), (c) (4 ∗ 4), and (d) (2 ∗ 8). As the number of places increases from 2 to 8, the work steal time increases from 3.5s to 80s, as the average number of work steal attempts increases from 140K to 4M. For 16 places, the work steal time falls to 0, since there is only a single processor per place, so work stealing does not happen. The FAB steal time, however, increases monotonically from 0.3s for 2 places to 110s for 16 places. In the (16 ∗ 1) configuration, the processor at a place gets activities to execute only through remote pushes onto its place; hence, the FAB steal time at the place becomes high, as the number of FAB attempts (300M on average) is very large, while the number of successful FAB attempts is very low (1400 on average). With the number of places increasing from 2 to 16, the total time increases from 189s to 425s, due to the increase in work stealing and/or FAB steal overheads.

Fig. 8 plots the work stealing time and FAB stealing time variation with changing multi-place configurations for the CG benchmark (using a Class C matrix with parameter values: NA = 150K, Non-Zero = 13M, Outer Iterations = 75 and SHIFT = 60). In this case, the work steal time increases from 12.1s (for (2 ∗ 8)) to 13.1s (for (8 ∗ 2)) and then falls to 0 for the (16 ∗ 1) configuration. The FAB time initially increases slowly from 3.6s to 4.1s, but then jumps to 81s for the (16 ∗ 1) configuration. This behavior can be explained as in the case of the MD benchmark (above).
Fig. 9 plots the work stealing time and FAB stealing time variation with changing multi-place configurations for the Heat benchmark (using parameter values: matrix size = 64K × 8K, iterations = 100 and leafmaxcol = 32). The variation of the work stealing time, FAB stealing time and total time follows the same pattern as in the case of MD.
Fig. 7. Overheads - MD (ADS_WS_time, ADS_FAB_time and ADS_Total_Time in seconds vs. (Num Places ∗ Num Procs Per Place))

Fig. 8. Overheads - CG

Fig. 9. Overheads - HEAT
Fig. 10 gives the variation of the Ready Deque average and maximum space consumption across all processors, and of the FAB average and maximum space consumption across places, with changing configurations of the multi-place setup. As the number of places increases from 2 to 16, the FAB average space first increases from 4 to 7 stack frames and then decreases to 6.4 stack frames. The maximum FAB space usage increases from 7 to 9 stack frames but then returns to 7 stack frames. The maximum Ready Deque space consumption increases from 11 stack frames to 12 stack frames but returns to 9 stack frames for 16 places, while the average Ready Deque space monotonically decreases from 9.69 to 8 stack frames. The Dmax for this benchmark setup is 11 stack frames, which leads to 81% maximum FAB utilization and roughly 109% Ready Deque utilization. Fig. 12 gives the variation of FAB space and Ready Deque space with changing configurations for the CG benchmark (Dmax = 13). Here, the FAB utilization is very low and remains so with varying configurations; the Ready Deque utilization stays close to 100% with varying configurations. Fig. 11 gives the variation of FAB space and Ready Deque space with changing configurations for the Heat benchmark (Dmax = 12). Here, the FAB utilization is high (close to 100%) and remains so with varying configurations; the Ready Deque utilization also stays close to 100%. This empirically demonstrates that our distributed scheduling algorithm has efficient space utilization as well.
Fig. 10. Space Util - MD (Ready_Deque_Avg, Ready_Deque_Max, FAB_Avg and FAB_Max in stack frames vs. (Num Places ∗ Num Procs Per Place))

Fig. 11. Space Util - HEAT

Fig. 12. Space Util - CG
6 Conclusions and Future Work

We have addressed the challenging problem of affinity driven online distributed scheduling of parallel computations. We have provided theoretical analysis of the time and message complexity bounds of our algorithm. On well known benchmarks, our algorithm demonstrates around 16% to 30% performance gain over typical Cilk-style scheduling. Detailed experimental analysis shows the scalability of our algorithm along with its efficient space utilization. This is the first such work on affinity driven distributed scheduling of parallel computations in a multi-place setup. In the future, we plan to look into space-time tradeoffs and Markov-chain based modeling of the distributed scheduling algorithm.
References
1. Acar, U.A., Blelloch, G.E., Blumofe, R.D.: The data locality of work stealing. In: SPAA, New York, NY, USA, pp. 1–12 (December 2000)
2. Agarwal, S., Barik, R., Bonachea, D., Sarkar, V., Shyamasundar, R.K., Yelick, K.: Deadlock-free scheduling of X10 computations with bounded resources. In: SPAA, San Diego, CA, USA, pp. 229–240 (December 2007)
3. Agarwal, S., Narang, A., Shyamasundar, R.K.: Affinity driven distributed scheduling algorithms for parallel computations. Tech. Rep. RI09010, IBM India Research Labs, New Delhi (July 2009)
4. Allan, E., Chase, D., Luchangco, V., Maessen, J.-W., Ryu, S., Steele Jr., G.L., Tobin-Hochstadt, S.: The Fortress language specification version 0.618. Tech. rep., Sun Microsystems (April 2005)
5. Arora, N.S., Blumofe, R.D., Plaxton, C.G.: Thread scheduling for multiprogrammed multiprocessors. In: SPAA, Puerto Vallarta, Mexico, pp. 119–129 (1998)
6. Blumofe, R.D., Leiserson, C.E.: Scheduling multithreaded computations by work stealing. J. ACM 46(5), 720–748 (1999)
7. Blumofe, R.D., Lisiecki, P.A.: Adaptive and reliable parallel computing on networks of workstations. In: USENIX Annual Technical Conference, Anaheim, California (1997)
8. Chamberlain, B.L., Callahan, D., Zima, H.P.: Parallel programmability and the Chapel language. International Journal of High Performance Computing Applications 21(3), 291–312 (2007)
9. Charles, P., Donawa, C., Ebcioglu, K., Grothoff, C., Kielstra, A., von Praun, C., Saraswat, V., Sarkar, V.: X10: An object-oriented approach to non-uniform cluster computing. In: OOPSLA 2005 Onward! Track (2005)
10. Exascale Study Group, Kogge, P. (editor and study lead), Harrod, W. (program manager): Exascale computing study: Technology challenges in achieving exascale systems. Tech. rep. (September 2008)
11. Yelick, K., et al.: Productivity and performance using partitioned global address space languages. In: PASCO 2007: Proceedings of the 2007 International Workshop on Parallel Symbolic Computation, pp. 24–32. ACM, New York (2007)
Temporal Specifications for Services with Unboundedly Many Passive Clients
Shamimuddin Sheerazuddin
The Institute of Mathematical Sciences, C.I.T. Campus, Chennai 600 113, India
[email protected]

Abstract. We consider a client-server system in which an unbounded (finite but unknown) number of clients request service from the server. The system is passive, as there is no further interaction between send-request and receive-response. We give an automata-based model for such systems and a temporal logic to frame specifications. We show that the satisfiability and model checking problems for the logic are decidable.

Keywords: temporal logic, web services, client-server systems, decidability, model checking.
1
Introduction
In [DSVZ06], the authors consider a Loan Approval Service [TC03], which consists of Web Services, called peers, that interact with each other via asynchronous message exchanges. One of the peers is designated as the Loan Officer, the loan disbursal authority. It receives a loan request, say for 5000, from a customer, checks her credit rating with a third party, and approves or rejects the request according to some lending policy. The loan approval problem becomes doubly interesting when the disbursal officer is confronted with a number of customers asking for loans of different sizes, say ranging from 5000 to 500,000. In such a scenario, with a bounded pool of money to loan out, it may be possible that a high loan request is accepted when there is no other pending request, or is rejected when accompanied by pending requests of lower loan sizes.

This is an example of a system composed of unboundedly many agents: how many processes are active at any system state is not known at design time but determined only at run time. Thus, though at any point of time only finitely many agents may be participating, there is no uniform bound on the number of agents. Design and verification of such systems are becoming increasingly important in distributed computing, especially in the context of Web Services. Since services handling unknown clients need to make decisions based upon request patterns that are not pre-decided, they need to conform to specific service policies that are articulated at design time. Due to concurrency and unbounded state information, the design and implementation of such services becomes complex and hence subject to logical flaws. Thus, there is a need for formal methods for specifying service policies and verifying that systems implement them correctly.
Formal methods come in many flavours. One special technique is that of model checking [CGP00]. The system to be checked is modelled as a finite state system, and the properties to be verified are expressed as constraints on the possible computations of the model. This facilitates algorithmic tools being employed to verify that the model is indeed correct with respect to those properties. When we find violations, we re-examine the finite-state abstraction, leading to a finer model, perhaps also refine the specifications, and repeat the process. Finding good finite state abstractions forms an important part of the approach.

Modelling systems with unboundedly many clients is fraught with difficulties. Since we have no bound on the number of active processes, the state space is infinite. A finite state abstraction may kill the very feature we wish to check, since the service policies we are interested in involve unbounded numbers of clients. On the other hand, finite presentation of such systems with infinite state spaces and/or their computations comes with a considerable amount of difficulty.

Propositional temporal logics have been extensively used for specifying safety and liveness requirements of reactive systems. Backed by a set of tools with theorem proving and model checking capabilities, temporal logic is a natural candidate for specifying service policies. In the context of Web Services, they have been extended with mechanisms for specifying message exchange between agents. There are several candidate temporal logics for message passing systems, but these work with an a priori fixed number of agents, and for any message, the identity of the sender and the receiver are fixed at design time. We need to extend such logics with means for referring to agents in some more abstract manner (than by name). On the other hand, the client-server interaction needs a far simpler communication facility than what is typically considered in any peer-to-peer communication model.

A natural and direct approach to refer to unknown clients is to use logical variables: rather than work with atomic propositions p, we use monadic predicates p(x) to refer to property p being true of client x. We can quantify over such x existentially and universally to specify policies relating to clients. We are thus naturally led to the realm of Monadic First Order Temporal Logics (MFOTL) [GHR94]. In fact, it is easily seen that MFOTL is expressive enough to frame almost every requirement specification of client-server systems of the kind discussed above. Unfortunately, MFOTL is undecidable [HWZ00], and we need to limit the expressiveness so that we have a decidable verification problem. We propose a fragment of MFOTL for which satisfiability and model checking are decidable. Admittedly, this language is weak in expressive power, but our claim is that reasoning in such a logic already suffices to express a broad range of service policies in systems with unboundedly many clients.

Thus, we need to come up with a formal model that finitely presents infinite state spaces, and a specification language that involves quantification and temporal modalities, while ensuring that model checking can be done. This is the issue we address in this paper, by simplifying the setting a little. We consider the case where there is only one server dealing with unboundedly many clients that do not
communicate with each other, and propose a class of automaton models for passive clientele. A client is passive in that it simply sends a request to the server and waits for the service (or an answer that the service cannot be provided); the client has no further interaction with the server. We suggest that it suffices to specify such clients by boolean formulas over unary predicates. State formulas of the server are then monadic first order sentences over such predicates, and server properties are specified by temporal formulas built from such sentences. In the literature [CHK+01], these services with passive clientele are called discrete services. We call them Services with Passive Clients (SPS).

Before we proceed to technicalities, we wish to emphasize that what is proposed in this paper is in the spirit of a framework rather than a definitive temporal logic for services with unbounded clientele. The combination of modalities, as well as the structure of models, should finally be decided only on the basis of applications. Even though essentially abstract, our paradigm continues the research on Web Service composition [BHL+02], [NM02], work on Web Service programming languages [FGK02] and the AZTEC prototype [CHK+01].

There have been many automaton based models for Web Services but, as far as we know, none of them incorporates unboundedly many clients. [BFHS03] models Web Services as Mealy machines, and [FBS04] models Web Services as Büchi automata, focusing on message passing between them. The Roman model [BCG+03] focusses on an abstract notion of activities, and in essence models Web Services as finite state machines with transitions labelled by these activities. The Colombo model combines the elements of [FBS04] and [BCG+03], along with the OWL-S model [NM02] of Web Services, and accounts for data in messages too.

Decidable fragments of MFOTL are few and far between. As far as we know, the monodic fragment [HWZ00], [HWZ01] is the only decidable one found in the literature. The decidability crucially hinges on the fact that there is at most one free variable in the scope of temporal modalities. Later, it was found that the packed monodic fragment with equality is decidable too [Hod02]. In the realm of branching time logics with first-order extensions, it is shown in [HWZ02] that, by restricting applications of first-order quantifiers to state (path-independent) formulas, and applications of temporal operators and path quantifiers to formulas with at most one free variable, we can obtain decidable fragments.
2
The Client Server Model
Fix CN, a countable set of client names. In general, this set would be recursively generated using a naming scheme, for instance using sequence numbers and timestamps generated by processes. We choose to ignore this structure for the sake of technical simplicity. We will use a, b, etc., with or without subscripts, to denote elements of CN.
2.1
Passive Clients
Fix Γ0, a finite service alphabet. We use u, v, etc. to denote elements of Γ0, which are thought of as the types of services provided by a server. The extended alphabet is the set Γ = {req_u, ans_u | u ∈ Γ0} ∪ {τ}. These refer to requests for such services and answers to such requests, as well as the "silent" internal action τ. Elements of Γ0 represent logical types of services that the server is willing to provide. This means that when two clients ask for a service of the same type, given by an element of Γ0, the server can tell them apart only by their names. We could in fact then insist that the server's behaviour be identical towards both, but we do not make such an assumption, to allow for generality.

We define below systems of services that handle passive clientele. Servers are modelled as state transition systems which identify clients only by the type of service they are associated with. Thus, transitions are associated with client types rather than client names.

Definition 2.1. A Service for Passive Clients (SPS) is a tuple A = (S, δ, I, F) where S is a finite set of states, δ ⊆ (S × Γ × S) is a server transition relation, I ⊆ S is the set of initial states, and F is the set of final states of A.

Without loss of generality, we assume that in every SPS the transition relation δ is such that for every s ∈ S, there exist r ∈ Γ and s′ ∈ S such that (s, r, s′) ∈ δ. The use of the silent action τ makes this an easy assumption. Note that an SPS is a finite state description. A transition of the form (s, req_u, s′) refers implicitly to a new client of type u, rather than to any specific client name; the meaning of this is provided by the run generation mechanism described below.

A configuration of an SPS A is a triple (s, C, χ) where s ∈ S, C is a finite subset of CN and χ : C → Γ0. Thus a configuration specifies the control state of the server, as well as the finite set of active clients at that configuration and their types. We use the convention that when C = ∅, the graph of χ is the empty set as well. Let $\Omega_A$ denote the set of all configurations of A; note that it is this infinite configuration space that is navigated by the behaviours of A. We can extend the transition relation δ to configurations, ⇒ ⊆ ($\Omega_A$ × Γ × $\Omega_A$), as follows: (s, C, χ) ⇒^r (s′, C′, χ′) iff (s, r, s′) ∈ δ and the following conditions hold:
– when r = τ, C′ = C and χ′ = χ;
– when r = req_u, C′ = C ∪ {a}, χ′(a) = u, and χ′ restricted to C equals χ, where a is the least element of CN − C;
– when r = ans_u, X = {a ∈ C | χ(a) = u} ≠ ∅, C′ = C − {a}, where a is the least element in the enumeration of X, and χ′ equals χ restricted to C′.

A configuration (s, C, χ) is said to be initial if s ∈ I and C = ∅. A run of an SPS A is an infinite sequence of configurations ρ = c0 r1 c1 · · · rn cn · · ·, where c0 is initial and, for all j > 0, c_{j−1} ⇒^{r_j} c_j. Let $R_A$ denote the set of runs of A. Note that runs have considerable structure. For instance, A may have an infinite path generated by a self-loop of the form (s, req_u, s) in δ, which corresponds to an infinite sequence of service requests of a particular type. Thus, these systems
have interesting reachability properties. But, as we shall see, our main use of these systems is as models of a temporal logic, and since the logic is rather weak, the information present in the runs will be under-utilized.

Language of an SPS: Given a run ρ = c0 r1 c1 r2 · · · rn cn · · ·, we define inf(ρ) as the set of states which occur infinitely many times on the run; that is, inf(ρ) = {q ∈ S | $\exists^\infty$ i ∈ ω, c_i = (q, ∅, ∅)}. A run ρ is good if inf(ρ) ∩ F ≠ ∅. The language of A, Lang(A) ⊆ $\Gamma^\omega$, is then defined as follows:

Lang(A) = {r1 r2 · · · rn · · · | there is a good run ρ = c0 r1 c1 r2 · · · rn cn · · ·}.

Once we have fixed the goodness properties for the runs $R_A$ of a given system A, it is easily seen that SPS are closed under union and intersection. Also, it can be observed that once an external bound on CN is assumed, the size of the configuration set $\Omega_A$ becomes bounded and all the decision problems for A become decidable.
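The configuration semantics above is easy to animate. The following is a minimal sketch (ours; the names and the encoding of client names as natural numbers are illustrative assumptions, not from the paper):

```python
# Minimal sketch of the SPS configuration semantics. A configuration is
# (state, chi), with chi mapping live client names to their types; client
# names are natural numbers, so "least element of CN - C" is well defined.

def step(delta, state, chi, action):
    """Apply one extended transition; returns a new configuration or None."""
    kind, u = action                       # ("req","l"), ("ans","h"), ("tau",None)
    for (s, r, s2) in delta:
        if s != state or r != action:
            continue
        if kind == "tau":                  # clients unchanged
            return (s2, dict(chi))
        if kind == "req":                  # fresh client of type u joins
            a = 0
            while a in chi:                # least name outside C
                a += 1
            chi2 = dict(chi); chi2[a] = u
            return (s2, chi2)
        if kind == "ans":                  # least pending client of type u leaves
            pending = [a for a in sorted(chi) if chi[a] == u]
            if not pending:
                continue                   # this transition is not enabled
            chi2 = dict(chi); del chi2[pending[0]]
            return (s2, chi2)
    return None                            # no enabled transition

# A one-client loop of type l:
delta = {("q0", ("req", "l"), "q1"), ("q1", ("ans", "l"), "q0")}
c = step(delta, "q0", {}, ("req", "l"))    # ("q1", {0: "l"})
c = step(delta, c[0], c[1], ("ans", "l"))  # ("q0", {})
```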
Fig. 1. An SPS for the Loan Approval Web Service System: A1 (states q0–q6, with transitions labelled req_l, req_h, ans_l, ans_h)
3
Loan Approval Service
We give an example SPS for an automated Loan Approval Web Service System, which is a kind of discrete service. In this composite system, there is a designated Web Service acting as the Loan Officer, which admits loan requests of various sizes, say h depicting a high (large) amount and l depicting a low (small) amount. Depending on the number of pending loan requests (high and low), and according to an a priori fixed loan disbursal policy, the Loan Officer accepts or rejects the pending requests. The behaviour of the Loan Officer is captured as an SPS as follows. Let Γ0 = {h, l}, where h denotes a high-size loan and l denotes a low-size loan, with corresponding alphabet Γ = {req_h, req_l, ans_h, ans_l}. The Loan Approval System can then be modelled as an SPS A1 = (S1, δ1, I1, F1), as shown in Figure 1. Here we briefly describe the working of the automaton. A1, starting from q0, keeps
track of at most two low-amount requests: q1 is the state with one pending request, whereas q4 is the state with two pending requests. Whenever the system gets a high-amount request, it seeks to dispose of it at the earliest, and tries to avoid taking up a low request as long as a high one is pending. But it may not succeed all the time; i.e., when the automaton reaches q6, it is possible that it loops back to the initial state q0 with one or more high requests pending, and then takes up low requests. It is not difficult to see that there are runs of A1 which satisfy the following property, ψ1, and there are those which do not. ψ1 is expressed in English as "whenever there is a request of type low there is an answer of type low in the next instant". Now suppose there is another property ψ2, described as "there is no request of type low taken up as long as there is a high request pending". If we want to avoid ψ2 in the Loan Approval System, then we need to modify A1 and define A2 = (S2, δ2, I2, F2) as in Figure 2.
Fig. 2. A modified SPS for the Loan Approval Web Service System: A2 (states q0–q7)
Furthermore, if we want to make sure that there are no two low requests pending at any time, i.e., that our model satisfies ψ1, we modify A2 and describe A3 in Figure 3. We shall see later that these properties can be described easily in a decidable logic which we call LSPS. Notice that in an SPS the customer (user or client) simply sends a request of some particular type and waits for an answer. Things become interesting when the customer seeks to interact with the loan officer (server) between the send-request and the receive-response. In that case, the complex patterns of interaction between the client and the server have to be captured by a stronger automaton model. We shall tackle this issue in a separate paper.
Fig. 3. Another modified SPS for the Loan Approval Web Service System: A3 (states q0, q1, q2, q3, q6, q7)
4
LSPS
In this section we describe a logical language to specify and verify SPS-like systems. Such a language has two mutually exclusive dimensions: one, captured by an MFO fragment, talks about the plurality of clients asking for a variety of services; the other, captured by an LTL fragment, talks about the temporal variation of the services being rendered. Furthermore, the MFO fragment has to be multi-sorted to cover the multiplicity of service types. Keeping these issues in mind, we frame a logical language which we call LSPS. LSPS is a cross between LTL and multi-sorted MFO. In LTL, atomic formulas are propositional constants which have no further structure. In LSPS, there are two kinds of atomic formulas: basic server properties from Ps, and MFO sentences over client properties Pc. Consequently, these formulas are interpreted over sequences of MFO structures juxtaposed with LTL models.

4.1
Syntax and Semantics
At the outset, we fix Γ0, a finite set of client types. The set of client formulas is defined over a countable set of atomic client predicates Pc, which is composed of disjoint sets $P_c^u$ of predicates of type u, for each u ∈ Γ0. Also, let Var be a countable supply of variable symbols and CN a countable set of client names. CN is divided into disjoint sets of types from Γ0 via λ : CN → Γ0; similarly, Var is divided using Π : Var → Γ0. We use x, y to denote elements of Var, and a, b for elements of CN. Formally, the set of client formulas Φ is:

α, β ∈ Φ ::= p(x : u), p ∈ $P_c^u$ | x = y, x, y ∈ $Var_u$ | ¬α | α ∨ β | (∃x : u)α.
Let SΦ be the set of all sentences in Φ. Then the server formulas are defined as follows:

ψ ∈ Ψ ::= q ∈ Ps | ϕ ∈ SΦ | ¬ψ | ψ1 ∨ ψ2 | ◯ψ | ψ U ψ′.

This logic is interpreted over sequences of MFO models composed with LTL models. Formally, a model is a triple M = (ν, D, I) where
– ν = ν0 ν1 · · ·, where for all i ∈ ω, νi is a finite subset of Ps, gives the local properties of the server at instant i;
– D = D0 D1 D2 · · ·, where for all i ∈ ω, Di = $(D_i^u)_{u \in \Gamma_0}$ with $D_i^u$ a finite subset of $CN_u$, gives the identities of the clients of each type being served at instant i; and
– I = I0 I1 I2 · · ·, where for all i ∈ ω, Ii = $(I_i^u)_{u \in \Gamma_0}$ and $I_i^u : D_i^u \to 2^{P_c^u}$ gives the properties satisfied by each live agent at the i-th instant (in other words, the corresponding states of the live agents). Alternatively, and equivalently, $I_i^u$ can be given as $I_i^u : D_i^u \times P_c^u \to \{\top, \bot\}$.

Satisfiability Relations |= and |=Φ: Let M = (ν, D, I) be a valid model and π : Var → CN a partial map consistent with λ and Π. Then the relations |= and |=Φ can be defined, by induction over the structure of ψ and α respectively, as follows.
– M, i |= q iff q ∈ νi.
– M, i |= ϕ iff M, ∅, i |=Φ ϕ.
– M, i |= ¬ψ iff M, i ⊭ ψ.
– M, i |= ψ ∨ ψ′ iff M, i |= ψ or M, i |= ψ′.
– M, i |= ◯ψ iff M, i+1 |= ψ.
– M, i |= ψ U ψ′ iff there exists j ≥ i such that M, j |= ψ′ and, for all i ≤ i′ < j, M, i′ |= ψ.
– M, π, i |=Φ p(x : u) iff π(x) ∈ $D_i^u$ and $I_i(\pi(x), p) = \top$.
– M, π, i |=Φ x = y iff π(x) = π(y).
– M, π, i |=Φ ¬α iff M, π, i ⊭Φ α.
– M, π, i |=Φ α ∨ β iff M, π, i |=Φ α or M, π, i |=Φ β.
– M, π, i |=Φ (∃x : u)α iff there exists a ∈ $D_i^u$ such that M, π[x → a], i |=Φ α.

5
Specification Examples Using LSPS
As claimed in the previous section, we would like to show that our logic LSPS adequately captures many facets of SPS-like systems. We consider the Loan Approval Web Service, which has already been explained with a number of examples, and frame a number of specifications to demonstrate the use of LSPS. For the Loan Approval System we have client types Γ0 = {h, l} and client properties Pc = {req_h, req_l, ans_h, ans_l}, where h means a loan request of type high and l means a loan request of type low. For this system, we can write a few simple specifications, viz.: initially there are no pending requests; whenever there is a request of type low there is an approval of type low in the next instant; there is no request of type low taken up as long as there is a high request pending; and there is at least one request of each type pending at all times. In LSPS these can be framed as follows:
– ψ0 = ¬((∃x : h) req_h(x) ∨ (∃x : l) req_l(x)),
– ψ1 = □((∃x : l) req_l(x) ⊃ ◯(∃y : l) ans_l(y)),
– ψ2 = □((∃x : h) req_h(x) ⊃ ¬(∃y : l) req_l(y)),
– ψ3 = □((∃x : l) req_l(x) ∨ (∃y : h) req_h(y)).
Note that none of these formulas makes use of the equality (=) predicate. Using =, we can make stronger statements, like "at all times there is exactly one pending request of type high" and "at all times there is at most one pending request of type high". These can be expressed in LSPS as follows:
– ψ4 = □((∃x : h)(req_h(x) ∧ (∀y : h)(req_h(y) ⊃ x = y))),
– ψ5 = □(¬(∃x : h) req_h(x) ∨ (∃x : h)(req_h(x) ∧ (∀y : h)(req_h(y) ⊃ x = y))).

In the same vein, using =, we can count the requests of each type and say more interesting things. For example, if ϕ_{2h} asserted at a point means that there are at most 2 requests of type h pending, then we can frame the formula ψ6 = □(ϕ_{2h} ⊃ ◯(ϕ_{2h} ⊃ □ϕ_{2h})), which means that if there are at most two pending requests of type high at successive instants, then thereafter the number stabilizes.

Unfortunately, owing to the lack of provision for free variables in the scope of temporal modalities, we cannot write specifications which seek to match requests and approvals. Here is a sample: □(∀x)(req_u(x) ⊃ ◇ ans_u(x)), which means that if there is a request of type u at some point of time, then the same is approved some time in the future. If we allowed indiscriminate applications of quantification over temporal modalities, we would be led to undecidable logics; as we are aware, even two free variables in the scope of temporal modalities allow us to encode undecidable tiling problems. The challenge is to come up with appropriate constraints on specifications which allow us to express interesting properties while remaining decidable to verify.
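These policies are also easy to check mechanically on finite behaviours. As an illustration only (the paper gives no implementation; the tuple encoding, the names, and the finite-prefix reading of □ and ◯ below are our assumptions), here is a tiny evaluator for sentence-level formulas:

```python
# Illustrative evaluator for sentence-level L_SPS formulas on a *finite*
# trace prefix. Each instant maps a predicate name to the set of clients
# satisfying it (the D_i and I_i data rolled together); formulas are tuples.

def holds(trace, i, f):
    op = f[0]
    if op == "exists":                    # ("exists", p) encodes (∃x:u) p(x)
        return len(trace[i].get(f[1], set())) > 0
    if op == "not":
        return not holds(trace, i, f[1])
    if op == "or":
        return holds(trace, i, f[1]) or holds(trace, i, f[2])
    if op == "next":                      # finite-prefix reading of ◯
        return i + 1 < len(trace) and holds(trace, i + 1, f[1])
    if op == "always":                    # finite-prefix reading of □
        return all(holds(trace, j, f[1]) for j in range(i, len(trace)))
    raise ValueError(op)

# psi1 = □((∃x:l) req_l(x) ⊃ ◯(∃y:l) ans_l(y)), with A ⊃ B as ¬A ∨ B:
psi1 = ("always", ("or", ("not", ("exists", "req_l")),
                         ("next", ("exists", "ans_l"))))
trace = [{"req_l": {0}}, {"ans_l": {0}}, {}]
print(holds(trace, 0, psi1))              # True
```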
6
Satisfiability and Model Checking for LSPS
We settle the satisfiability issue for LSPS using the automata-theoretic technique first proposed by Vardi and Wolper [VW86]. Let ψ0 be an LSPS formula. We compute a formula automaton $A_{\psi_0}$ such that the following holds.

Lemma 6.1. ψ0 is satisfiable iff Lang($A_{\psi_0}$) is non-empty.

From the given LSPS formula ψ0 we can obtain $P_{\psi_0}$, the set of all MFO predicates occurring in ψ0, and $Var_{\psi_0}$, the set of all variable symbols occurring in ψ0. Using these two sets we can generate all possible MFO models at the atom level. These complex atoms, which incorporate MFO valuations as well as LTL valuations, are then used to construct a Büchi automaton in the standard manner, generating all possible models of ψ0. The following is then immediate.

Lemma 6.2. Given an LSPS formula ψ0 with |ψ0| = n, the satisfiability of ψ0 can be checked in time $2^{O(n \cdot r \cdot 2^k)}$, where r is the number of variable symbols occurring in ψ0 and k is the number of predicate symbols occurring in ψ0.
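The construction sketched above treats each MFO sentence as an opaque letter. For intuition, here is a toy enumeration (ours, and deliberately naive: the temporal consistency conditions between atoms are elided) of the atoms from which the Büchi automaton's states are built:

```python
# Toy sketch: an "atom" fixes a truth value for every letter, where letters
# are the server propositions and the MFO sentences occurring in psi0.

def atoms(letters):
    """All truth assignments over the letters: 2**len(letters) atoms."""
    out = []
    for bits in range(2 ** len(letters)):
        out.append({l: bool((bits >> i) & 1) for i, l in enumerate(letters)})
    return out

letters = ["q", "(exists x:l) req_l(x)"]  # one proposition, one MFO sentence
print(len(atoms(letters)))                # 4
```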
In order to specify SPS, in which clients do nothing but send a request of some type u and wait for an answer, the most we can say about a client x is whether a request from x is pending or not. So the set of client properties is Pc = {req_u, ans_u | u ∈ Γ0}. When req_u(x) holds at some instant i, it means that there is a pending request of type u from x at i. When ans_u(x) holds at i, it means that either there was no request from x or the request from x has already been answered; that is, ans_u(x) is defined as ¬req_u(x). For this sublogic, that is, LSPS with Pc = {req_u | u ∈ Γ0}, we assert the following theorem, which can be inferred directly from Lemma 6.2.

Theorem 6.3. Let ψ0 be an LSPS formula with $|V_{\psi_0}| = r$, $|\Gamma_0| = k$ and $|\psi_0| = n$. Then satisfiability of ψ0 can be checked in time $O(2^{n \cdot r \cdot 2^k})$.

6.1
Model Checking Problem for LSPS
The goal of this section is to formulate the model checking problem for LSP S and show that it is decidable. We again solve this problem using the so called automata-theoretic approach. In such a setting, the client-server system is modelled as an SP S, A, and the specification is given by a formula ψ0 in LSP S . The model checking problem is to check if the system A satisfies the specification ψ0 , denoted by A |= ψ0 . In order to do this we bound the SPS using ψ0 and define an interpreted version. Bounded Interpreted SPS: Let A = (S, δ, I, F ) be an SPS and ψ0 be a specification in LSP S . From ψ0 we get Vu (ψ0 ), for each u ∈ Γ0 . Now, let n = (Σu ru ) · k where |Γ0 | = k and |Vu (ψ0 )| = ru . n is the bound for SPS M . Now, for each u ∈ Γ0 CNu = {(i, u) | 1 ≤ i ≤ ru , u ∈ Γ0 } and CN = u CNu . For each u, define CNu = {{(j, u) | 1 ≤ j ≤ i} |1 ≤ i ≤ ru } ∪ {∅}. Thereafter, define CN = Πu∈Γ0 CNu . Now, we have CN = C∈CN C. Now, we are in a position to define an interepreted form of bounded SPS. The interpreted SPS is a tuple A = (Ω, ⇒, I, F , V al), where Ω = S × CN , I = {(s, C) | s ∈ I, C = ∅}, F = {(s, C) | s ∈ F, C = ∅}, V al : Ω → (2Ps × CN ) and ⇒⊆ Ω × Γ × Ω is given r as follows: (s, C)=⇒(s , C ) iff (s, r, s ) ∈ δ and the following conditions hold: – when r = τ , C = C , – when r = requ , CN − C = ∅, if a ∈ CNu − C is the least in the enumeration then C = C ∪ {a}, – when r = ansu , X = C ∩ CNu = ∅, C = C − {a} where a ∈ X is the least in the enumeration. Note, that, |CN | = Πu∈Γ0 (ru ) < rk . Now, if, |S| = l, then |Ω| = O(l·rk ). Now, we can define the language of interpreted SPS A as Lang(A) = {V al(c0 )V al(c1 ) · · · | c0 r1 c1 r2 c2 · · · is a good run in A}. We say that A satisfies ψ0 if Lang(A) ⊆ Aψ0 , where Aψ0 is the formula automaton of ψ0 . This holds when Lang(A) ∩ Lang(A¬ψ0 ) = ∅. Therefore, the
complexity to check emptiness of the product automaton is linear in the product of the sizes of A′ and A_ψ0.

Theorem 6.4. A |= ψ0 can be checked in time O(l · r^k · 2^{n·r·2^k}).
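The three transition rules above are easy to phrase operationally. The following sketch is an illustrative transcription, not code from the paper: a configuration (s, C) is a state plus the set of in-use client names (i, u), the underlying relation δ is modelled as a set of labelled triples with labels encoded as strings such as 'req_u', and names are taken in least-index order as the rules require.

```python
def step(delta, r_u, config, action):
    """One move (s, C) =r=> (s', C') of the interpreted SPS.

    config: (s, C) with C a set of client names (i, u), 1 <= i <= r_u[u].
    action: ('tau', s2), ('req', u, s2) or ('ans', u, s2).
    Returns the successor configuration, or None if the move is disabled."""
    s, C = config
    if action[0] == 'tau':
        s2 = action[1]
        return (s2, C) if (s, 'tau', s2) in delta else None
    kind, u, s2 = action
    if (s, kind + '_' + u, s2) not in delta:
        return None
    names_u = [(i, u) for i in range(1, r_u[u] + 1)]   # CN_u in order
    if kind == 'req':
        free = [a for a in names_u if a not in C]
        if not free:
            return None                 # CN_u - C is empty: no fresh name
        return (s2, C | {min(free)})    # least free name takes the request
    used = [a for a in names_u if a in C]               # kind == 'ans'
    if not used:
        return None                     # no pending request of type u
    return (s2, C - {min(used)})        # answer least pending name
```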
7 Discussion
To conclude, we gave an automaton model for unbounded-agent server-client systems for discrete services [CHK+01] and a temporal logic to specify such services, and presented an automata-based decidability argument for satisfiability and model checking of the logic. We shall extend the SPS to model session-oriented client-server systems in a subsequent paper. We shall also take up the task of framing appropriate temporal logics to specify such services. This forces us into the realm of MFOTL with free variables in the scope of temporal modalities [Hod02]. We know that too many of those are fatal [HWZ00]. The challenge is to define suitable fragments of MFOTL which are sufficiently expressive as well as decidable. As this paper lacks an automata theory of SPS, we need to explore whether infinite-state reachability techniques such as [BEM97] could be used. An extension of the work in this paper would be to define models and logics for systems with multiple servers, say n, together serving unboundedly many clients. An orthogonal exercise could be the development of tools to efficiently implement the model checking problem for the system SPS against L_SPS specifications, à la MONA [HJJ+95, KMS00] or SPIN [Hol97, RH04].
References

[BCG+03] Berardi, D., Calvanese, D., De Giacomo, G., Lenzerini, M., Mecella, M.: Automatic composition of E-services that export their behavior. In: Orlowska, M.E., Weerawarana, S., Papazoglou, M.P., Yang, J. (eds.) ICSOC 2003. LNCS, vol. 2910, pp. 43–58. Springer, Heidelberg (2003)
[BEM97] Bouajjani, A., Esparza, J., Maler, O.: Reachability analysis of pushdown automata: Application to model-checking. In: Mazurkiewicz, A., Winkowski, J. (eds.) CONCUR 1997. LNCS, vol. 1243, pp. 135–150. Springer, Heidelberg (1997)
[BFHS03] Bultan, T., Fu, X., Hull, R., Su, J.: Conversation specification: a new approach to design and analysis of e-service composition. In: WWW, pp. 403–410 (2003)
[BHL+02] Burstein, M.H., Hobbs, J.R., Lassila, O., Martin, D.L., McDermott, D.V., McIlraith, S.A., Narayanan, S., Paolucci, M., Payne, T.R., Sycara, K.P.: DAML-S: Web service description for the semantic web. In: Horrocks, I., Hendler, J. (eds.) ISWC 2002. LNCS, vol. 2342, pp. 348–363. Springer, Heidelberg (2002)
[CGP00] Clarke, E.M., Grumberg, O., Peled, D.: Model Checking. MIT Press, Cambridge (2000)
[CHK+01] Christophides, V., Hull, R., Karvounarakis, G., Kumar, A., Tong, G., Xiong, M.: Beyond discrete E-services: Composing session-oriented services in telecommunications. In: Casati, F., Georgakopoulos, D., Shan, M.-C. (eds.) TES 2001. LNCS, vol. 2193, pp. 58–73. Springer, Heidelberg (2001)
[DSVZ06] Deutsch, A., Sui, L., Vianu, V., Zhou, D.: Verification of communicating data-driven web services. In: PODS, pp. 90–99 (2006)
[FBS04] Fu, X., Bultan, T., Su, J.: Conversation protocols: a formalism for specification and verification of reactive electronic services. Theor. Comput. Sci. 328(1-2), 19–37 (2004)
[FGK02] Florescu, D., Grünhagen, A., Kossmann, D.: XL: an XML programming language for web service specification and composition. In: WWW, pp. 65–76 (2002)
[GHR94] Gabbay, D.M., Hodkinson, I.M., Reynolds, M.A.: Temporal Logic, Part 1. Clarendon Press (1994)
[HJJ+95] Henriksen, J.G., Jensen, J.L., Jørgensen, M.E., Klarlund, N., Paige, R., Rauhe, T., Sandholm, A.: Mona: Monadic second-order logic in practice. In: Brinksma, E., Steffen, B., Cleaveland, W.R., Larsen, K.G., Margaria, T. (eds.) TACAS 1995. LNCS, vol. 1019, pp. 89–110. Springer, Heidelberg (1995)
[Hod02] Hodkinson, I.M.: Monodic packed fragment with equality is decidable. Studia Logica 72(2), 185–197 (2002)
[Hol97] Holzmann, G.J.: The model checker SPIN. IEEE Trans. Software Eng. 23(5), 279–295 (1997)
[HWZ00] Hodkinson, I.M., Wolter, F., Zakharyaschev, M.: Decidable fragments of first-order temporal logics. Ann. Pure Appl. Logic 106(1-3), 85–134 (2000)
[HWZ01] Hodkinson, I., Wolter, F., Zakharyaschev, M.: Monodic fragments of first-order temporal logics: 2000–2001 A.D. In: Nieuwenhuis, R., Voronkov, A. (eds.) LPAR 2001. LNCS (LNAI), vol. 2250, pp. 1–23. Springer, Heidelberg (2001)
[HWZ02] Hodkinson, I.M., Wolter, F., Zakharyaschev, M.: Decidable and undecidable fragments of first-order branching temporal logics. In: LICS, pp. 393–402 (2002)
[KMS00] Klarlund, N., Møller, A., Schwartzbach, M.I.: Mona implementation secrets. In: Yu, S., Păun, A. (eds.) CIAA 2000. LNCS, vol. 2088, pp. 182–194. Springer, Heidelberg (2001)
[NM02] Narayanan, S., McIlraith, S.A.: Simulation, verification and automated composition of web services. In: WWW, pp. 77–88 (2002)
[RH04] Ruys, T.C., Holzmann, G.J.: Advanced SPIN tutorial. In: Graf, S., Mounier, L. (eds.) SPIN 2004. LNCS, vol. 2989, pp. 304–305. Springer, Heidelberg (2004)
[TC03] IBM Web Services Business Process Execution Language (WSBPEL) TC: Web services business process execution language version 1.1. Technical report (2003), http://www.ibm.com/developerworks/library/ws-bpel
[VW86] Vardi, M.Y., Wolper, P.: An automata-theoretic approach to automatic program verification (preliminary report). In: LICS, pp. 332–344 (1986)
Relating L-Resilience and Wait-Freedom via Hitting Sets

Eli Gafni¹ and Petr Kuznetsov²

¹ Computer Science Department, UCLA
² Deutsche Telekom Laboratories/TU Berlin
Abstract. The condition of t-resilience stipulates that an n-process program is only obliged to make progress when at least n − t processes are correct. Put another way, the live sets, the collection of process sets such that progress is required if all the processes in one of these sets are correct, are all sets with at least n − t processes. We show that the ability of an arbitrary collection of live sets L to solve distributed tasks is tightly related to the minimum hitting set of L, a minimum-cardinality subset of processes that has a non-empty intersection with every live set. Thus, finding the computing power of L is NP-complete. For the special case of colorless tasks that allow participating processes to adopt input or output values of each other, we use a simple simulation to show that a task can be solved L-resiliently if and only if it can be solved (h − 1)-resiliently, where h is the size of the minimum hitting set of L. For general tasks, we characterize L-resilient solvability of tasks with respect to a limited notion of weak solvability: in every execution where all processes in some set in L are correct, outputs must be produced for every process in some (possibly different) participating set in L. Given a task T, we construct another task TL such that T is solvable weakly L-resiliently if and only if TL is solvable weakly wait-free.
1 Introduction
One of the most intriguing questions in distributed computing is how to distinguish the solvable from the unsolvable. Consider, for instance, the question of wait-free solvability of distributed tasks. Wait-freedom does not impose any restrictions on the scope of considered executions, i.e., a wait-free solution to a task requires every correct process to output in every execution. However, most interesting distributed tasks cannot be solved in a wait-free manner [6,19]. Therefore, much research is devoted to understanding how the power of solving a task increases as the scope of considered executions decreases. For example, t-resilience considers only executions where at least n − t processes are correct (take infinitely many steps), where n is the number of processes in the system. This provides for solving a larger set of tasks than wait-freedom, since in executions in which fewer than n − t processes are correct, no correct process is required to output.
What tasks are solvable t-resiliently? It is known that this question is undecidable even with respect to wait-free solvability, let alone t-resilient solvability [9,14]. But is the question about t-resilient solvability in any sense different than the question about wait-free solvability? If we agree that we "understand" wait-freedom [16], do we understand t-resilience to a lesser degree? The answer should be a resounding no if, in the sense of solving tasks, the models can be reduced to each other. That is, if for every task T we can find a task Tt which is solvable wait-free if and only if T is solvable t-resiliently. Indeed, [2,4,8] established that t-resilience can be reduced to wait-freedom. Consequently, the two models are unified with respect to task solvability. In this paper, we consider a generalization of t-resilience, called L-resilience. Here L stands for a collection of subsets of processes. A set in L is referred to as a live set. In the model of L-resilience, a correct process is only obliged to produce outputs if all the processes in some live set are correct. Therefore, the notion of L-resilience represents a restricted class of adversaries introduced by Delporte et al. [5], described as collections of exact correct sets. L-resilience describes adversaries that are closed under the superset operation: if a correct set is in an adversary, then every superset of it is also in the adversary. We show that the key to understanding L-resilience is the notion of a minimum hitting set of L (called simply hitting set in the rest of the paper). Given a set system (Π, L) where Π is a set of processes and L is a set of subsets of Π, H is a hitting set of (Π, L) if it is a minimum-cardinality subset of Π that meets every set in L. Intuitively, in every L-resilient execution, i.e., in every execution in which at least one set in L is correct, not all processes in a hitting set of L can fail. Thus, under L-resilience, we can solve the k-set agreement task among the processes in Π, where k is the hitting set size of (Π, L). In k-set agreement, the processes start with private inputs and the set of outputs is a subset of the inputs of size at most k. Indeed, fix a hitting set H of (Π, L) of size k. Every process in H simply posts its input value in the shared memory, and every other process returns the first value it witnesses to be posted by a process in H. Moreover, using a simple simulation based on [2,4], we derive that L does not allow solving (k − 1)-set agreement or any other colorless task that cannot be solved (k − 1)-resiliently. Thus, we can decompose superset-closed adversaries into equivalence classes, one for each hitting set size, where each class agrees on the set of colorless tasks it allows for solving. Informally, colorless tasks allow a process to adopt an input or output value of any other participating process. This restriction gives rise to simulation techniques in which dedicated simulators independently "install" inputs for other, possibly non-participating processes, and then take steps on their behalf so that the resulting outputs are still correct and can be adopted by any participant [2,4]. The ability to do this is a strong simplifying assumption when solvability is analyzed. For the case of general tasks, where inputs cannot be installed independently, the situation is less trivial. We address general tasks by considering a restricted notion of weak solvability, that requires every execution where all the processes in
some set in L are correct to produce outputs for every process in some (possibly different) participating set in L. Note that for colorless tasks, weak solvability is equivalent to regular solvability that requires every correct process to output. We relate wait-free solvability and L-resilient solvability. Given a task T and a collection of live sets L, we define a task TL such that T is weakly solvable L-resiliently if and only if TL is weakly solvable wait-free. Therefore, we characterize L-resilient weak solvability, as wait-free solvability has already been characterized in [16]. Not surprisingly, the notion of a hitting set is crucial in determining TL. The simulations that relate T and TL are interesting in their own right. We describe an agreement protocol, called the Resolver Agreement Protocol (or RAP), by which agreement is immediately achieved if all processes propose the same value, and otherwise it is achieved if eventually a single correct process considers itself a dedicated resolver. This agreement protocol allows for a novel execution model of wait-free read-write protocols. The model guarantees that an arbitrary number of simulators starting with j distinct initial views appear as j independent simulators, and thus a (j − 1)-resilient execution can be simulated. The rest of the paper is organized as follows. Section 2 briefly describes our system model. Section 3 presents a simple categorization of colorless tasks. Section 4 formally defines the wait-free counterpart TL to every task T. Section 5 describes RAP, the technical core of our main result. Sections 6 and 7 present the two directions of our equivalence result: from wait-freedom to L-resilience and back. Section 8 overviews related work, and Section 9 concludes the paper by discussing implications of our results and open questions. Most proofs are delegated to the technical report [10].
2 Model
We adopt the conventional shared memory model [12], and only describe necessary details. Processes and objects. We consider a distributed system composed of a set Π of n processes {p1 , . . . , pn } (n ≥ 2). Processes communicate by applying atomic operations on a collection of shared objects. In this paper, we assume that the shared objects are registers that export only atomic read-write operations. The shared memory can be accessed using atomic snapshot operations [1]. An execution is a pair (I, σ) where I is an initial state and σ is a sequence of process ids. A process that takes at least one step in an execution is called participating. A process that takes infinitely many steps in an execution is said to be correct, otherwise, the process is faulty. Distributed tasks. A task is defined through a set I of input n-vectors (one input value for each process, where the value is ⊥ for a non-participating process), a set O of output n-vectors (one output value for each process, ⊥ for non-terminated processes) and a total relation Δ that associates each input vector with a set of possible output vectors. A protocol wait-free solves a task T
if in every execution, every correct process eventually outputs, and all outputs respect the specification of T. Live sets. The correct set of an execution e, denoted correct(e), is the set of processes that appear infinitely often in e. For a given collection of live sets L, we say that an execution e is L-resilient if for some L ∈ L, L ⊆ correct(e). We consider protocols which allow each process to produce output values for every other participating process in the system by posting the values in the shared memory. We say that a process terminates when its output value is posted (possibly by a different process). Hitting sets. Given a set system (Π, L) where L is a set of subsets of Π, a set H ⊆ Π is a hitting set of (Π, L) if it is a minimum-cardinality subset of Π that meets every set in L. We denote the set of hitting sets of (Π, L) by HS(Π, L), and the size of a hitting set of (Π, L) by h(Π, L). By (Π′, L), Π′ ⊆ Π, we denote the set system that consists of the elements S ∈ L such that S ⊆ Π′. The BG-simulation technique. In a colorless task (also called a convergence task [4]), processes are free to use each others' input and output values, so the task can be defined in terms of input and output sets instead of vectors. BG-simulation is a technique by which k + 1 processes q1, . . ., q_{k+1}, called simulators, can wait-free simulate a k-resilient execution of any asynchronous n-process protocol [2,4] solving a colorless task. The simulation guarantees that each simulated step of every process pj is either eventually agreed on by all simulators, or the step is blocked forever and one less simulator participates further in the simulation. Thus, as long as there is a live simulator, at least n − k simulated processes accept infinitely many simulated steps. The technique has been later extended to tasks beyond colorless [8]. Weak L-resilience. An execution is L-resilient if some set in L contains only correct processes. We say that a protocol solves a task T weakly L-resiliently if in every L-resilient execution, every process in some participating set L ∈ L eventually terminates, and all posted outputs respect the specification of T. In the wait-free case, when L consists of all n singletons, weak L-resilient solvability stipulates that at least one participating process must be given an output value in every execution. Weak solvability is sufficient to (strongly) solve every colorless task. For general tasks, however, weak solvability does not automatically imply strong solvability, since it only allows processes to adopt the output value of any terminated process, and does not impose any conditions on the inputs.
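Since every bound below is stated in terms of h(Π, L), it helps to see that the quantity is computable, if expensively, straight from the definition. The brute-force sketch below tries subsets of Π in order of increasing size; the exponential cost is expected, since minimum hitting set is NP-hard (this is also why finding the computing power of L is NP-complete, as noted in the abstract). All names here are ours.

```python
from itertools import combinations

def minimum_hitting_sets(Pi, L):
    """Return (h(Pi, L), HS(Pi, L)): the hitting set size and all
    minimum-cardinality subsets of Pi meeting every live set in L."""
    live = [frozenset(S) for S in L]
    for size in range(len(Pi) + 1):
        hits = [set(H) for H in combinations(sorted(Pi), size)
                if all(S & set(H) for S in live)]
        if hits:
            return size, hits
    return None  # only possible if some live set cannot be hit at all

# Pi = {1,2,3,4}, L = {{1,2},{2,3},{3,4}}: no single process meets all
# three live sets, but {2,3} does, so h = 2.
print(minimum_hitting_sets({1, 2, 3, 4}, [{1, 2}, {2, 3}, {3, 4}]))
```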
3 Colorless Tasks
Theorem 1. A colorless task T is L-resiliently solvable if and only if T is (h(Π, L) − 1)-resiliently solvable.

Theorem 1 implies that L-resilient adversaries can be categorized into n equivalence classes, class h corresponding to hitting sets of size h. Note that two
adversaries that belong to the same class h agree on the set of colorless tasks they are able to solve, and the set includes h-set agreement.
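The h-set agreement protocol sketched in the Introduction (members of a hitting set post their inputs; everyone adopts the first posted value it sees) is short enough to spell out. The toy sketch below replaces the shared memory with a dict and the adversarial scheduler with an explicit step order, so it only illustrates the data flow; in a real L-resilient execution a process would keep re-scanning until some correct member of H has posted.

```python
def h_set_agreement(H, inputs, schedule):
    """k-set agreement with k = |H|, where H is a hitting set of (Pi, L).

    Only members of H ever post, so at most |H| distinct values appear,
    and L-resilience guarantees some member of H is correct and posts."""
    shared, decision = {}, {}
    for p in schedule:                 # toy stand-in for the scheduler
        if p in H and p not in shared:
            shared[p] = inputs[p]      # post own input in shared memory
        for q in sorted(H):            # adopt first posted H-value seen
            if q in shared:
                decision[p] = shared[q]
                break
    return decision

# H = {2, 3}: all four processes decide on at most 2 distinct values.
print(h_set_agreement({2, 3}, {1: 'a', 2: 'b', 3: 'c', 4: 'd'},
                      schedule=[2, 1, 3, 4]))
```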
4 Relating L-Resilience and Wait-Freedom: Definitions
Consider a set system (Π, L) and a task T = (I, O, Δ), where I is a set of input vectors, O is a set of output vectors, and Δ is a total binary relation between them. In this section, we define the "wait-free" task TL = (I′, O′, Δ′) that characterizes L-resilient solvability of T. The task TL is also defined for n processes. We call the processes solving TL simulators and denote them by s1, . . ., sn. Let X and X′ be two n-vectors, and Z1, . . ., Zn be subsets of Π. We say that X′ is an image of X with respect to Z1, . . ., Zn if for all i such that X′[i] ≠ ⊥, we have X′[i] = {(j, X[j])}_{j∈Z_i}. Now TL = (I′, O′, Δ′) guarantees that for all (I′, O′) ∈ Δ′, there exist (I, O) ∈ Δ such that:

(1) ∃S1, . . ., Sn ⊆ Π, each containing a set in L:
(1a) I′ is an image of I with respect to S1, . . ., Sn.
(1b) |{I′[i]}_i − {⊥}| = m ⇒ h(∪_{i: I′[i]≠⊥} S_i, L) ≥ m.
In other words, every process participating in TL obtains, as an input, a set of inputs of T for some live set, and all these inputs are consistent with some input vector I of T. Also, if the number of distinct non-⊥ inputs to TL is m, then the hitting set size of the set of processes that are given inputs of T is at least m.

(2) ∃U1, . . ., Un, each containing a set in L: O′ is an image of O with respect to U1, . . ., Un.
In other words, the outputs of TL produced for input vector I′ should be consistent with O ∈ O such that (I, O) ∈ Δ.

Intuitively, every group of simulators that share the same input value will act as a single process. According to the assumptions on the inputs to TL, the existence of m distinct inputs implies a hitting set of size at least m. The asynchrony among the m groups will be manifested as at most m − 1 failures. The failures of at most m − 1 processes cannot prevent all live sets from terminating, as otherwise the hitting set in (1b) is of size at most m − 1.
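The notion of image used in (1a) and (2) is mechanical to check. Below is a small sketch of that check under our own encoding: vectors are Python lists, ⊥ is None, and X′[i], when defined, is the set of (index, value) pairs over Z_i.

```python
def is_image(Xp, X, Z):
    """True iff the n-vector Xp is an image of X w.r.t. Z[0..n-1]:
    wherever Xp[i] is defined (non-None), it equals {(j, X[j]) : j in Z[i]}."""
    return all(xp_i is None or xp_i == {(j, X[j]) for j in Z[i]}
               for i, xp_i in enumerate(Xp))
```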
5 Resolver Agreement Protocol
We describe the principal building block of our constructions: the resolver agreement protocol (RAP). RAP is similar to consensus, though it is neither always safe nor always live. To improve liveness, some process may at some point become a resolver, i.e., take the responsibility of making sure that every correct process outputs. Moreover, if there is at most one resolver, then all outputs are the same.
Shared variables: D, initially ⊥
Local variables: resolver, initially false

propose(v)
1  (flag, est) := CA.propose(v)
2  if flag = commit then
3      D := est; return(est)
4  repeat
5      if resolver then D := est
6  until D ≠ ⊥
7  return(D)

resolve()
8  resolver := true

Fig. 1. Resolver agreement protocol: code for each process
Formally, the protocol accepts values in some set V as inputs and exports operations propose(v), v ∈ V, and resolve() that, once called by a process, indicates that the process becomes a resolver for RAP. The propose operation returns some value in V, and the following guarantees are provided: (i) Every returned value is a proposed value; (ii) If all processes start with the same input value or some process returns, then every correct process returns; (iii) If a correct process becomes a resolver, then every correct process returns; (iv) If at most one process becomes a resolver, then at most one value is returned. A protocol that solves RAP is presented in Figure 1. The protocol uses the commit-adopt abstraction (CA) [7] exporting one operation propose(v) that returns (commit, v′) or (adopt, v′), for v, v′ ∈ V, and guarantees that (a) every returned value is a proposed value, (b) if only one value is proposed then this value must be committed, (c) if a process commits a value v, then every process that returns adopts v or commits v, and (d) every correct process returns. The commit-adopt abstraction can be implemented wait-free [7]. In the protocol, a process that is not a resolver takes a finite number of steps and then either returns with a value, or waits on one posted in register D by another process or by a resolver. A process that waits for an output (lines 4-6) considers the agreement protocol stuck. An agreement protocol for which a value was posted in D is called resolved. Lemma 1. The algorithm in Figure 1 implements RAP.
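A direct transcription of Figure 1 into Python follows, as a sketch only. The commit-adopt object is assumed given, since it is wait-free implementable [7]; the stub below is merely a lock-based placeholder for its interface (it is not a wait-free implementation), and the polling loop plays the role of lines 4–6.

```python
import threading
import time

class StubCommitAdopt:
    """Placeholder for the commit-adopt abstraction [7]:
    propose(v) returns ('commit', v') or ('adopt', v')."""
    def __init__(self):
        self._lock, self._val = threading.Lock(), None
    def propose(self, v):
        with self._lock:
            if self._val is None:          # first proposal gets committed
                self._val = v
                return ('commit', v)
            return ('adopt', self._val)

class RAP:
    """Shared state of one resolver agreement protocol instance."""
    def __init__(self, ca=None):
        self.ca = ca if ca is not None else StubCommitAdopt()
        self.D = None                      # shared register D, initially ⊥

class RAPProcess:
    """Per-process code of Fig. 1 against a shared RAP instance."""
    def __init__(self, rap):
        self.rap, self.resolver = rap, False
    def propose(self, v):
        flag, est = self.rap.ca.propose(v)
        if flag == 'commit':
            self.rap.D = est
            return est
        while self.rap.D is None:          # repeat ... until D != ⊥
            if self.resolver:
                self.rap.D = est           # a resolver unblocks everyone
            time.sleep(0.001)
        return self.rap.D
    def resolve(self):
        self.resolver = True
```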
6 From Wait-Freedom to L-Resilience
Suppose that TL is weakly wait-free solvable and let AL be the corresponding wait-free protocol. We show that weak wait-free solvability of TL implies weak L-resilient solvability of T by presenting an algorithm A that uses AL to solve T in every L-resilient execution.
Shared variables: R_j, j = 1, . . ., n, initially ⊥
Local variables: S_j, j = 1, . . ., h(Π, L), initially ∅; ℓ_j, j = 1, . . ., h(Π, L), initially 0

9   R_i := input value of T
10  wait until snapshot(R_1, . . ., R_n) contains inputs for some set in L
11  while true do
12      S := {p_i ∈ Π, R_i ≠ ⊥}   {the current participating set}
13      if p_i ∈ H_S then   {H_S is deterministically chosen in HS(S, L)}
14          m := the index of p_i in H_S
15          RAP_m^{ℓ_m}.resolve()
16      for each j = 1, . . ., |H_S| do
17          if ℓ_j = 0 then
18              S_j := S
19          take one more step of RAP_j^{ℓ_j}.propose(S_j)
20          if RAP_j^{ℓ_j}.propose(S_j) returns v then
21              (flag, S_j) := CA_j^{ℓ_j}.propose(v)
22              if flag = commit then
23                  return({(s, R_s)}_{p_s ∈ S_j})   {return the set of inputs of processes in S_j}
24              ℓ_j := ℓ_j + 1

Fig. 2. The doorway protocol: the code for each process p_i
First we describe the doorway protocol (DW), the only L-dependent part of our transformation. The responsibility of DW is to collect at each process a subset of the inputs of T so that all the collected subsets constitute a legitimate input vector for task TL (property (1) in Section 4). The doorway protocol does not require the knowledge of T or TL and depends only on L. In contrast, the second part of the transformation, described in Section 6.2, does not depend on L and is implemented by simply invoking the wait-free task TL with the inputs provided by DW.

6.1 The Doorway Protocol
Formally, a DW protocol ensures that in every L-resilient execution with an input vector I ∈ I, every correct participant eventually obtains a set of inputs of T such that the resulting input vector I′ of TL complies with property (1) in Section 4 with respect to I. The algorithm implementing DW is presented in Figure 2. Initially, each process p_i waits until it collects inputs for a set of participating processes that includes at least one live set. Note that different processes may observe different participating sets. Every participating set S is associated with H_S ∈ HS(S, L), some deterministically chosen hitting set of (S, L). We say that H_S is a resolver set for Π: if S is the participating set, then we initiate |H_S| parallel sequences of agreement protocols with resolvers. Each sequence of agreement protocols can
return at most one value, and we guarantee that, eventually, every sequence is associated with a distinct resolver in H_S. In every such sequence j, each process p_i sequentially goes through an alternation of RAPs and CAs (see Section 5): RAP_j^1, CA_j^1, RAP_j^2, CA_j^2, . . . . The first RAP is invoked with the initially observed set of participants, and each next CA (resp., RAP) takes the output of the previous RAP (resp., CA) as an input. If some CA_j returns (commit, v), then p_i returns v as an output of the doorway protocol.

Lemma 2. In every L-resilient execution of the algorithm in Figure 2 starting with an input vector I, every correct process p_i terminates with an output value I′[i], and the resulting vector I′ complies with property (1) in Section 4 with respect to I.

6.2 Solving T through the Doorway
Given the DW protocol described above, it is straightforward to solve T by simply invoking AL with the inputs provided by DW. Thus: Theorem 2. Task T is weakly L-resiliently solvable if TL is weakly wait-free solvable.
7 From L-Resilience to Wait-Freedom
Suppose T is weakly L-resiliently solvable, and let A be the corresponding protocol. We describe a protocol AL that solves TL by wait-free simulating an L-resilient execution of A. For pedagogical reasons, we first present a simple abstract simulation (AS) technique. AS captures the intuition that a group of simulators sharing the initial view of the set of participating simulated codes should appear as a single simulator. Therefore, an arbitrary number of simulators starting with j distinct initial views should be able to simulate a (j − 1)-resilient execution. Then we describe our specific simulation and show that it is an instance of AS, and thus it indeed generates a (j − 1)-resilient execution of A, where j is the number of distinct inputs of TL. By the properties of TL, we immediately obtain a desired L-resilient execution of A.

7.1 Abstract Simulation
Suppose that we want to simulate a given n-process protocol, with the set of codes {code_1, . . ., code_n}. Every instruction of the simulated codes (read or write) is associated with a unique position in N. E.g., we can enumerate the instructions as follows: the first instructions of each simulated code, then the second instructions of each simulated code, etc.¹

¹ In fact, only read instructions of a read-write protocol need to be simulated, since these are the only steps that may trigger more than one state transition of the invoking process [2,4].
A state of the simulation is a map from the set of positions to colors: every position has one of three colors, U (unvisited), IP (in progress), or V (visited). Initially, every position is unvisited. The simulators share a function next that maps every state to the next unvisited position to simulate. Accessing an unvisited position by a simulator results in changing its color to IP or V.

Fig. 3. State transitions of a position in AS: a U position turns IP when different states are concurrently proposed, turns V when identical states are concurrently proposed or by an adversary move, and an IP position turns V by an adversary move.
The state transitions of a position are summarized in Figure 3, and the rules the simulation follows are described below:

(AS1) Each process takes an atomic snapshot of the current state s and goes to position next(s), proposing state s. For each state s, the color of next(s) in state s is U.
– If an unvisited position is concurrently accessed by two processes proposing different states, then it is assigned color IP.
– If an unvisited position is accessed by every process proposing the same state, it may only change its color to V.
– If the accessed position is already V (a faster process accessed it before), then the process leaves the position unchanged, takes a new snapshot, and proceeds to the next position.

(AS2) At any point in the simulation, the adversary may take an in-progress (IP) position and atomically turn it into V, or take a set of unvisited (U) positions and atomically turn them into V.

(AS3) Initially, every position is assigned color U. The simulation starts when the adversary changes colors of some positions to V.
We measure the progress of the simulation by the number of positions turning from U to V . Note that by changing U or IP positions to V , the adversary can potentially hamper the simulation, by causing some U positions to be accessed with different states and thus changing their colors to IP . However, the following invariant is preserved: Lemma 3. If the adversary is allowed at any state to change the colors of arbitrarily many IP positions to V , and throughout the simulation has j chances to atomically change any set of U positions to V , then at any time there are at most j − 1 IP positions.
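The position automaton of Figure 3 is small enough to encode directly. The sketch below is our own toy encoding for experimenting with the invariant of Lemma 3: concurrency is modelled by handing a whole batch of simultaneous proposals to a position at once, and the adversary's two kinds of moves are explicit methods.

```python
U, IP, V = 'U', 'IP', 'V'

class ASPositions:
    """Toy model of position colors in the abstract simulation (Fig. 3)."""
    def __init__(self, n_positions):
        self.color = [U] * n_positions

    def concurrent_access(self, pos, proposed_states):
        """(AS1) a batch of simulators reaches `pos` together, each
        proposing its snapshot; identical proposals visit the position,
        diverging proposals leave it in progress."""
        if self.color[pos] == U:
            self.color[pos] = V if len(set(proposed_states)) == 1 else IP
        return self.color[pos]          # V positions are simply skipped

    def adversary_ip_to_v(self, pos):
        """(AS2) the adversary resolves one in-progress position."""
        if self.color[pos] == IP:
            self.color[pos] = V

    def adversary_release(self, positions):
        """(AS2/AS3) one of the adversary's j chances: a set of U
        positions becomes V atomically."""
        for p in positions:
            if self.color[p] == U:
                self.color[p] = V

    def ip_count(self):                 # the quantity bounded by Lemma 3
        return self.color.count(IP)
```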
7.2 Solving TL through AS
Now we show how to solve TL by simulating a protocol A that weakly L-resiliently solves T. First, we describe our simulation and show that it instantiates AS, which allows us to apply Lemma 3. Every simulator si ∈ {s1, . . ., sn} posts its input in the shared memory and then continuously simulates participating codes in {code_1, . . ., code_n} of algorithm A in the breadth-first manner: the first command of every participating code, the second command of every participating code, etc. (A code is considered participating if its input value has been posted by at least one simulator.) The procedure is similar to BG-simulation, except that the result of every read command in the code is agreed upon through a distinct RAP instance. Simulator si is statically assigned to be the only resolver of every read command in code_i. The simulated read commands (and associated RAPs) are treated as positions of AS. Initially, all positions are U (unvisited). The outcome of accessing a RAP instance of a position determines its color. If the RAP is resolved (a value was posted in D in line 3 or 5), then it is given color V (visited). If the RAP is found stuck (waiting for an output in lines 4-6) by some process, then it is given color IP (in progress). Note that no RAP accessed with identical proposals can get stuck (property (ii) in Section 5). After accessing a position, the simulator chooses the first not-yet-executed command of the next participating code in the round-robin manner (function next). For the next simulated command, the simulator proposes its current view of the simulated state, i.e., the snapshot of the results of all commands simulated so far (AS1). Further, if a RAP of code_i is observed stuck by a simulator (and thus is assigned color IP), but later gets resolved by si, we model it as the adversary spontaneously changing the position's color from IP to V. Finally, by the properties of RAP, a position can get color IP only if it is concurrently accessed with diverging states (AS2). We also have n positions corresponding to the input values of the codes, initially unvisited. If an input for a simulated process pi is posted by a simulator, the initial position of code_i turns into V. This is modeled as the intrusion of the adversary, and if simulators start with j distinct inputs, then the adversary is given j chances to atomically change sets of U positions to V. The simulation starts when the first set of simulators to post their inputs concurrently take identical snapshots (AS3). Therefore, our simulation is an instance of AS, and thus we can apply Lemma 3 to prove the following result: Lemma 4. If the number of distinct values in the input vector of TL is j, then the simulation above blocks at most j − 1 simulated codes. The simulated execution terminates when some simulator observes outputs of T for at least one participating live set. Finally, using the properties of the inputs to task TL (Section 4), we derive that eventually some participating live set of simulated processes obtain outputs. Thus, using Theorem 2, we obtain:
Theorem 3. T is weakly L-resiliently solvable if and only if TL is weakly wait-free solvable.
8 Related Work
The equivalence between t-resilient task solvability and wait-free task solvability was initially established for colorless tasks in [2,4], and then extended to all tasks in [8]. In this paper, we consider a wider class of assumptions than simply t-resilience, which can be seen as a strict generalization of [8]. Generalizing t-resilience, Junqueira and Marzullo [18] considered the case of dependent failures and proposed describing the allowed executions through cores and survivor sets, which roughly translate to our hitting sets and live sets. Note that the set of survivor sets (or, equivalently, cores) exhaustively describes only superset-closed adversaries. More general adversaries, introduced by Delporte et al. [5], are defined as a set of exact correct sets. It is shown in [5] that the power of an adversary A to solve colorless tasks is characterized by A's disagreement power, the highest k such that k-set agreement cannot be solved assuming A: a colorless task T is solvable with adversary A of disagreement power k if and only if it is solvable k-resiliently. Herlihy and Rajsbaum [15] (concurrently and independently of this paper) derived this result for a restricted set of superset-closed adversaries with a given core size using elements of modern combinatorial topology. Theorem 1 in this paper derives this result directly, using very simple algorithmic arguments. Considering only colorless tasks is a strong restriction, since such tasks allow for definitions that only depend on sets of inputs and sets of outputs, regardless of which processes actually participate. (Recall that for colorless tasks, solvability and our weak solvability are equivalent.) The results of this paper hold for all tasks. On the other hand, as [15], we only consider the class of superset-closed adversaries. This filters out some popular liveness properties, such as obstruction-freedom [13]. Thus, our contributions complement but do not contain the results in [5]. A protocol similar to our RAP was earlier proposed in [17].
9 Side Remarks and Open Questions
Doorways and iterated phases. Our characterization shows an interesting property of weak L-resilient solvability: to solve a task T weakly L-resiliently, we can proceed in two logically synchronous phases. In the first phase, processes wait to collect "enough" input values, as prescribed by L, without knowing anything about T. Logically, they all finish the waiting phase simultaneously. In the second phase, they all proceed wait-free to produce a solution. As a result, no process is waiting on another process that already proceeded to the wait-free phase. Such phases are usually referred to as iterated phases [3]. In [8], some processes are waiting on others to produce an output and consequently the characterization in [8] does not have the iterated structure. L-resilience and general adversaries. The power of a general adversary of [5] is not exhaustively captured by its hitting set. In a companion paper [11], we propose a simple characterization of the set consensus power of a general adversary
A based on the hitting set sizes of its recursively proper subsets. Extending our equivalence result to general adversaries and getting rid of the weak solvability assumption are two challenging open questions.
References

1. Afek, Y., Attiya, H., Dolev, D., Gafni, E., Merritt, M., Shavit, N.: Atomic snapshots of shared memory. J. ACM 40(4), 873–890 (1993)
2. Borowsky, E., Gafni, E.: Generalized FLP impossibility result for t-resilient asynchronous computations. In: STOC, pp. 91–100. ACM Press, New York (May 1993)
3. Borowsky, E., Gafni, E.: A simple algorithmically reasoned characterization of wait-free computation (extended abstract). In: PODC 1997: Proceedings of the Sixteenth Annual ACM Symposium on Principles of Distributed Computing, pp. 189–198. ACM Press, New York (1997)
4. Borowsky, E., Gafni, E., Lynch, N.A., Rajsbaum, S.: The BG distributed simulation algorithm. Distributed Computing 14(3), 127–146 (2001)
5. Delporte-Gallet, C., Fauconnier, H., Guerraoui, R., Tielmann, A.: The disagreement power of an adversary. In: Keidar, I. (ed.) DISC 2009. LNCS, vol. 5805, pp. 8–21. Springer, Heidelberg (2009)
6. Fischer, M.J., Lynch, N.A., Paterson, M.S.: Impossibility of distributed consensus with one faulty process. J. ACM 32(2), 374–382 (1985)
7. Gafni, E.: Round-by-round fault detectors (extended abstract): Unifying synchrony and asynchrony. In: Proceedings of the 17th Symposium on Principles of Distributed Computing (1998)
8. Gafni, E.: The extended BG-simulation and the characterization of t-resiliency. In: STOC, pp. 85–92 (2009)
9. Gafni, E., Koutsoupias, E.: Three-processor tasks are undecidable. SIAM J. Comput. 28(3), 970–983 (1999)
10. Gafni, E., Kuznetsov, P.: L-resilient adversaries and hitting sets. CoRR, abs/1004.4701 (2010), http://arxiv.org/abs/1004.4701
11. Gafni, E., Kuznetsov, P.: Turning adversaries into friends: Simplified, made constructive, and extended. In: OPODIS (2011)
12. Herlihy, M.: Wait-free synchronization. ACM Trans. Prog. Lang. Syst. 13(1), 123–149 (1991)
13. Herlihy, M., Luchangco, V., Moir, M.: Obstruction-free synchronization: Double-ended queues as an example. In: ICDCS, pp. 522–529 (2003)
14. Herlihy, M., Rajsbaum, S.: The decidability of distributed decision tasks (extended abstract). In: STOC, pp. 589–598 (1997)
15. Herlihy, M., Rajsbaum, S.: The topology of shared-memory adversaries. In: PODC (2010)
16. Herlihy, M., Shavit, N.: The topological structure of asynchronous computability. J. ACM 46(2), 858–923 (1999)
17. Imbs, D., Raynal, M.: Visiting Gafni's reduction land: From the BG simulation to the extended BG simulation. In: SSS, pp. 369–383 (2009)
18. Junqueira, F., Marzullo, K.: A framework for the design of dependent-failure algorithms. Concurrency and Computation: Practice and Experience 19(17), 2255–2269 (2007)
19. Loui, M., Abu-Amara, H.: Memory requirements for agreement among unreliable asynchronous processes. Advances in Computing Research 4, 163–183 (1987)
Load Balanced Scalable Byzantine Agreement through Quorum Building, with Full Information

Valerie King¹, Steven Lonargan¹, Jared Saia², and Amitabh Trehan¹

¹ Department of Computer Science, University of Victoria, P.O. Box 3055, Victoria, BC, Canada V8W 3P6
[email protected], {sdlonergan,amitabh.trehaan}@gmail.com
² Department of Computer Science, University of New Mexico, Albuquerque, NM 87131-1386
[email protected]

Abstract. We address the problem of designing distributed algorithms for large scale networks that are robust to Byzantine faults. We consider a message passing, full information model: the adversary is malicious, controls a constant fraction of processors, and can view all messages in a round before sending out its own messages for that round. Furthermore, each bad processor may send an unlimited number of messages. The only constraint on the adversary is that it must choose its corrupt processors at the start, without knowledge of the processors' private random bits. A good quorum is a set of O(log n) processors which contains a majority of good processors. In this paper, we give a synchronous algorithm which uses polylogarithmic time and Õ(√n) bits of communication per processor to bring all processors to agreement on a collection of n good quorums, solving Byzantine agreement as well. The collection is balanced in that no processor is in more than O(log n) quorums. This yields the first solution to Byzantine agreement which is both scalable and load-balanced in the full information model. The technique, which involves going from a situation where slightly more than a 1/2 fraction of processors are good and agree on a short string with a constant fraction of random bits to a situation where all good processors agree on n good quorums, can be carried out in a fully asynchronous model as well, providing an approach for extending the Byzantine agreement result to this model.
1 Introduction
The last fifteen years have seen computer scientists slowly come to terms with the following alarming fact: not all users of the Internet can be trusted. While this fact is hardly surprising, it is alarming. If the size of the Internet is unprecedented in the history of engineered systems, then how can we hope to address the challenging problem of scalability and also the challenging problem of resistance to malicious users?
This research was partially supported by NSF CAREER Award 0644058, NSF CCR0313160, AFOSR MURI grant FA9550-07-1-0532, and NSERC.
Recent work attempts to address both of these problems concurrently. In the last few years, almost everywhere Byzantine agreement, i.e., coordination between all but a o(1) fraction of processors, was shown to be possible with no more than polylogarithmic bits of communication per processor and polylogarithmic time [13]. More recently, scalable everywhere agreement was shown to be possible if a small set of processors took on the brunt of the work, each communicating Ω(n^{3/2}) bits to the remaining processors [11], or if private channels are assumed [12]. In this paper, we give the first load-balanced, scalable method for agreeing on a bit in the synchronous, full information model. In particular, our algorithm requires each processor to send only Õ(√n) bits. Our technique also yields an agreement on a collection of n good quorum gateways (referred to as quorums from now on), that is, sets of processors of size O(log n) each of which contains a majority of good processors, and a 1-1 mapping of processors to quorums. The collection is balanced in that no processor is in more than O(log n) quorums. Our usage of the quorum terminology is similar to that in the peer-to-peer literature [17,6,1,3,5,8], where quorums are of O(log n) size, each having a majority of good processors, and allow for containment of adversarial behavior via majority filtering. Quorums are useful in an environment with malicious processors as they can act as a gateway to filter messages from bad processors. For example, a bad processor x can be limited in the number of messages it sends if other processors only accept messages sent by a majority of processors in x's quorum, and the majority only agree to forward a limited number of messages from x. The number of bits of communication required per processor is polylogarithmic to bring all but o(1) processors to agreement, and Õ(√n) per processor for everywhere agreement on the composition of the n quorums. Our result is with an adversary that controls up to a 1/3 − ε fraction of processors, for any fixed ε > 0, and which has full information, i.e., it knows the content of all messages passed between good processors. However, the adversary is non-adaptive, that is, it cannot choose dynamically which processors to corrupt based on its observations of the protocol's execution. Bad processors are allowed to send an unlimited number of bits and messages, and defense against a denial of service attack is one of the features of our protocol. As an additional result, we present an asynchronous algorithm that can go from a situation where, for any positive constant γ, a 1/2 + γ fraction of processors are good and agree on a single string of length O(log n) with a constant fraction of random bits to a situation where all good processors agree on n good quorums. This algorithm is load-balanced in that it requires each processor to send only Õ(√n) bits, and the resulting quorums are balanced in that no processor is in more than O(log n) quorums.

1.1 Methodology
Our synchronous protocol builds on a previous protocol which brings all but o(1) processors to agreement on a set of s = O(log n) processors of which no more than a 1/3 − ε fraction are bad, using a sparse overlay network [14]. Being few in number, these processors can run a heavyweight protocol requiring all-
to-all communication to also agree on a string globalstr which contains a bit (or multiple bits) from each processor, such that a 2/3 + ε fraction of the bits are randomly set. This string can be communicated scalably to almost all processors using a communication tree formed as a byproduct of the protocol (see [13,14]). When a clear majority of good processors agree on a value, a processor should be able to learn that value, with high probability, by polling O(log n) processors. However, the bad processors can thwart this approach by flooding all processors with requests. Even if there are few bad processors, in the full information model, the bad processors can target processors on specific good processors' poll lists to isolate these processors. To address this problem, we use globalstr to build quorums to limit the number of effective requests. We also restrict the design of poll lists, preserving enough randomness that they are reliable, but limiting the adversary's ability to target. Key to our work here is that we show the existence of an averaging-sampler-type function, H, which is known at the start by all the processors and which, with high probability, when given an O(log n) length string with a constant fraction of random bits and a processor ID, produces a good quorum for every ID. Our protocol then uses the fact that almost all processors agree on a collection of good quorums to bring all processors to agree on the string in a load balanced manner, and hence on the collection of quorums. Similarly, to solve Byzantine agreement, a single bit agreed to by the initial small set can be agreed to by all the processors. We also show the existence of a function J which uses individually generated random strings and a processor's ID to output an O(log n) poll list, so that the distribution of poll lists has the desired properties. These techniques can be extended to the asynchronous model assuming a scalable implementation of [10]. That work shows that a set of size O(log log n) processors, of which a 2/3 + ε fraction are good, can be agreed to almost everywhere with probability 1 − o(1). Bringing these processors to agreement on a string with some random bits is trickier in the asynchronous full information model, where the adversary can prevent a fraction of the good processors from being heard based on their random bits. However, [10] shows that it is possible to bring such a set to agreement on a string with some randomness, which we show is enough to provide a good input to H.

1.2 Related Work
Several papers are mentioned above with related results. Most closely related is the algorithm in [11] which similarly starts with almost everywhere agreement on a bit and a small representative set of processors from [13,14] and produces everywhere agreement. However, it is not load balanced, and does not create quorums or require the use of the specially designed functions H and J. With private channels, load balancing in the presence of an adaptive adversary is achievable with Õ(√n) bits of communication per processor [12]. Awerbuch and Scheideler have done important work in the area of maintaining quorums [3,4,5,6]. They show how to scalably support a distributed hash table (DHT) using quorums of size O(log n), where processors are joining and leaving,
a functionality our method does not support. The adversary they consider is nonadaptive in the sense that processors cannot spontaneously be corrupted; the adversary can only decide to cause a good processor to drop out and decide if an entering processor is good or bad. A critical difference between their results and ours is that while they can maintain a system that starts in a good configuration, they cannot initialize such a system unless the starting processors are all good. This is because an entering processor must start by contacting a good processor in a good quorum. The quorum uses secret sharing to produce a random number to assign or reassign new positions in a sparse overlay network (using the cuckoo rule [15]). These numbers and positions are created using a method for secret sharing involving private channels and cryptographic hardness assumptions. In older work, Upfal, Dwork, Peleg and Pippenger addressed the problem of solving almost-everywhere agreement on a bounded degree network [16,7]. However, the algorithms described in these papers are not scalable. In particular, both algorithms require each processor to send at least a linear number of bits (and sometimes an exponential number).

1.3 Model
We assume a fully connected network of n processors, whose IDs are common knowledge. Each processor has a private coin. Communication channels are authenticated, in the sense that whenever a processor sends a message directly to another, the identity of the sender is known to the recipient, but we otherwise make no cryptographic assumptions. We assume a nonadaptive (sometimes called static) adversary. That is, the adversary chooses the set of tn bad processors at the start of the protocol, where t is a constant fraction, namely, 1/3 − ε for any positive constant ε. The adversary is malicious: bad processors can engage in any kind of deviations from the protocol, including false messages and collusion, or crash failures, and bad processors can send any number of messages. Moreover, the adversary chooses the input bits of every processor. The good processors are those that follow the protocol. We consider both synchronous and asynchronous models of communication. In the synchronous model, communication proceeds in rounds; messages are all sent out at the same time at the start of the round, and then received at the same time at the end of the same round; all processors have synchronized clocks. The time complexity is given by the number of rounds. In the asynchronous model, each communication can take an arbitrary and unknown amount of time, and there is no assumption of a joint clock as in the synchronous model. The adversary can determine the delay of each message and the order in which they are received. We follow [2] in defining the running time of an asynchronous protocol as the time of execution, where the maximum delay of a message between the time it is sent and the time it is processed is assumed to be one unit. We assume full information: in the synchronous model, the adversary is rushing, that is, it can view all messages sent by the good processors in a round before the bad processors send their messages in the same round. In the case of the asynchronous model, the adversary can view any sent message before its delay is determined.
1.4 Results
We use the phrase with high probability to mean that an event happens with probability at least 1 − 1/n^c, for any constant c and sufficiently large n. We show:

Theorem 1 (Synchronous Byzantine Agreement). Let n be the number of processors in a synchronous full information message passing model with a nonadaptive, rushing adversary that controls less than a 1/3 − ε fraction of processors. For any positive constant ε, there exists a protocol which w.h.p. computes Byzantine agreement, runs in polylogarithmic time, and uses Õ(√n) bits of communication per processor.

This result follows from the application of the load balanced protocol in [14], followed by the synchronous protocol introduced in Section 3 of this paper.

Theorem 2 (Almost everywhere to everywhere–asynchronous). Let n be the number of processors in a fully asynchronous full information message passing model with a nonadaptive adversary. Assume that (1/2 + γ)n good processors agree on a string of length O(log n) which has a constant fraction of random bits, and where the remaining bits are fixed by a malicious adversary after seeing the random bits. Then for any positive constant γ, there exists a protocol which w.h.p. brings all good processors to agreement on n good quorums; runs in polylogarithmic time; and uses Õ(√n) bits of communication per processor. Furthermore, if we assume that the same set of good processors have agreed on an input bit (to the Byzantine agreement problem), then this same protocol can bring all good processors to agreement on that bit.

A scalable implementation of the protocol in [10] following the lines of [14] would create the conditions in the assumptions of this theorem with probability 1 − O(1/ log n) in polylogarithmic time and bits per processor, with an adversary that controls less than a 1/3 − ε fraction of processors. Then this theorem would yield an algorithm to solve asynchronous Byzantine agreement with probability 1 − O(1/ log n). The protocol is introduced in Section 4 of this paper.
2 Combinatorial Lemmas
Before presenting our protocol, we discuss here the properties of some combinatorial objects we shall use in our protocol. Let [r] denote the set of integers {1, . . ., r}, and [s]^d the multisets of size d consisting of elements of [s]. Let H : [r] → [s]^d be a function assigning multisets of size d to integers. We define the intersection of a multiset A and a set B to be the number of elements of A which are in B. H is a (θ, δ) sampler if at most a δ fraction of all inputs x have |H(x) ∩ S|/d > |S|/s + θ. Let r = n^{c+1}. Let i ∈ [n^c] and j ∈ [n]. Then we define H(i, j) to be H(in + j) and H(i, ∗) to be the collection of subsets H(in + 1), H(in + 2), . . ., H(in + n).
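For explicit parameters, both the random construction of Corollary 1 (below) and the sampler condition itself are easy to put in code. The following sketch is a brute-force illustration with names of our own choosing; checking every S ⊆ [s] is of course infeasible beyond tiny s, so the test here only samples a few witness sets.

```python
import random

def random_H(r, s, d, seed=0):
    """H : [r] -> [s]^d built by sampling with replacement, as in
    Corollary 1; elements of [s] are 0..s-1, H[x] is a list (multiset)."""
    rng = random.Random(seed)
    return [[rng.randrange(s) for _ in range(d)] for _ in range(r)]

def bad_inputs(H, S, s, theta):
    """Inputs x whose multiset over-represents S: |H(x) ∩ S|/d > |S|/s + theta."""
    d = len(H[0])
    return [x for x, hx in enumerate(H)
            if sum(e in S for e in hx) / d > len(S) / s + theta]

def sampler_smoke_test(H, s, theta, delta, trials=20, seed=1):
    """A (theta, delta) sampler must have at most a delta fraction of bad
    inputs for EVERY S ⊆ [s]; here we only spot-check random sets S."""
    rng = random.Random(seed)
    for _ in range(trials):
        S = set(rng.sample(range(s), rng.randrange(1, s + 1)))
        if len(bad_inputs(H, S, s, theta)) > delta * len(H):
            return False
    return True

H = random_H(r=2000, s=20, d=60)
print(sampler_smoke_test(H, s=20, theta=0.3, delta=0.05))
```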
Lemma 1 ([9, Lemma 4.7], [18, Proposition 2.20]). For every s, θ, δ > 0 and r ≥ s/δ, there is a (θ, δ) sampler H : [r] → [s]^d with d = O(log(1/δ)/θ²).

A corollary of the proof of this lemma shows that if one increases the constant in the expression for d by a factor of c, we get the following:

Corollary 1. Let H : [r] → [s]^d be constructed by randomly selecting, with replacement, d elements of [s] for each input. For every s, θ, δ, c > 0 and r ≥ s/δ, for d = O(log(1/δ)/θ²), H is a (θ, δ) sampler with probability 1 − 1/n^c.

Lemma 2. Let r = n^{c+1} and s = n. Let H : [r] → [s]^d be constructed by randomly selecting, with replacement, d elements of [s]. Call an element y ∈ [s] overloaded by H if its inverse image under H contains more than a·d elements, for some fixed a ≥ 6. The probability that any y ∈ [s] is overloaded by any H(i, ∗) is less than 1/2, for d = O(log n) and a = O(1).

Proof. Fix i. The probability that the size of the inverse image of y ∈ [s] under H(i, ∗) is a times its expected size of d is less than 2^{−ad}, for a ≥ 6, by a standard Chernoff bound. The probability that, for any i, any y ∈ [s] is overloaded is less than n · n^c · 2^{−ad} < 1/2, by a union bound over all y ∈ [s] and all i, for d = O(log n).

Let S be any subset of [n]. A quorum or poll list is a subset of [n] of size O(log n), and a good quorum (resp., poll list) with respect to S ⊆ [n] is a quorum (resp., poll list) with greater than 1/2 of its elements in S. Taking the union bound over the probabilities of the events given in the preceding corollary and lemma, and applying the probabilistic method, yields the existence of a mapping with the desired properties:

Lemma 3. For any constant c, there exists a mapping H : [n^{c+1}] → [n]^d such that for every i the inverse image of every element under H(i, ∗) has size O(log n), and for any choice of a subset S ⊂ [n] of size at least (1/2 + ε)n, with probability 1 − 1/n^c over the choice of a random number i ∈ [n^c], H(i, ∗) contains all good quorums.

The following lemma is needed to show results about the poll lists, which are subsets of size O(log n), just like quorums, but are used for a different purpose in the protocol.

Lemma 4. There exists a mapping J : [n^{c+1}] → [n]^d such that for any set of a (1/2 + ε) fraction of good processors in [1 . . . n]:
1. At least n^{c+1} − n elements of [n^{c+1}] are mapped to a good poll list.
2. For any L′ ⊂ [n^{c+1}] with |L′| ≤ n, let R′ be any subset of [n] with |R′| ≤ ε|L′|/e², and let L″ be the inverse image of R′ under J. Then Σ_{x∈L′} |J(x) ∩ R′| < εd|L′|/2. Hence at least |L′|/2 poll lists contain fewer than εd elements in R′.

Proof. Part 1: The probability that a randomly constructed J has this property with probability greater than 1/2 follows from Lemma 3.
Part 2: Let J be constructed randomly as in the previous proofs. Fix L′, fix R′.
Pr[Σ_{x∈L′} |J(x) ∩ R′| ≥ εd|L′|] = \binom{d|L′|}{εd|L′|} (|R′|/n)^{εd|L′|} ≤ [(e/ε)(|R′|/n)]^{εd|L′|} ≤ e^{−εd|L′|}, for |R′| ≤ εn/e². The number of ways of choosing a subset of size x and y from [n^c] and [n], resp., is bounded above by (en^c/x)^x · (en/y)^y = e^{x(c log n − log x + 1) + y(log n − log y + 1)} < e^{2|L′|c log n}. The union bound over all sizes x ≤ n and y is less than 1/2 for d > (2c log n)/ε + 1/|L′|. Hence with probability less than 1/2, Σ_{x∈L′} |J(x) ∩ R′| > εd|L′| for some subset L′ of size n or less in [n^c] and some subset R′ of size ε|L′|/e². Finally, by the union bound, a random J has both properties (1) and (2) with probability greater than 0. By the probabilistic method, there exists a function J with properties (1) and (2).
2.1 Using the Almost-Everywhere Agreement Protocol in [13,14]
We observe that this protocol, which uses polylogarithmic bits of communication, generates a representative set S of O(log n) processors which is agreed upon by all but an O(1/log n) fraction of good processors, and any message agreed upon by the processors in S is learned by all but an O(1/log n) fraction of good processors. Hence we start in our current work from the point where there is a b log n bit string globalstr agreed upon by all but an O(1/log n) fraction of good processors, such that a 2/3 + ε fraction of good processors in S have each generated c′/b random bits (see below), and the remaining bits are generated by bad processors after seeing the bits of good processors. The ordering of the bits is independent of their value and is given by processor ID. globalstr is random enough:

Lemma 5. With probability at least 1 − 1/n^c, for a sufficiently large constant c′ and d = O(log n), there is an H : [n^{c′+1}] → [n]^d such that H(globalstr, ∗) is a collection of all good quorums.

Proof. By Lemma 3 there are n^{c′} good choices for globalstr and n bad choices. We choose c′ to be a multiple of b which is greater than (3/2)c. Fix one bad choice of string. The probability of the random bits matching this string is less than 2^{−(2/3)c′ log n}, and by a union bound, the probability of it matching any of the n bad strings is less than 1/n^c.
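As a quick sanity check of this union bound, one can tabulate n · 2^{−(2/3)c′ log n} against the 1/n^c target. The arithmetic below is our own illustration, with log taken base 2 and parameters chosen by us:

```python
import math

def union_bound(n, c_prime):
    # n bad strings, each matched with probability < 2^{-(2/3) c' log n}
    return n * 2.0 ** (-(2.0 / 3.0) * c_prime * math.log2(n))

for n in (2 ** 10, 2 ** 16, 2 ** 20):
    c, c_prime = 2, 6          # illustrative: c' chosen comfortably above (3/2)c
    print(n, union_bound(n, c_prime), "target:", n ** (-c))
```

For n = 2^10 the bound is roughly 2^{−30}, well below the 2^{−20} target, and the gap only widens with n.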
3 Algorithm
In this section, we describe the protocol (Protocol 3.1) that reaches everywhere agreement from almost everywhere agreement.
3.1 Description of the Algorithm
Precondition: Each processor p starts with a hypothesis of the global string, candstr_p; this hypothesis may or may not equal globalstr. However, we make
Given: Functions H (as described in Lemma 3) and J (as described in Lemma 4).
Part I: Setting up Candidate Lists.
1: for each processor p do
2:   Select uniformly at random a subset, samplelist_p, of processor IDs where |samplelist_p| = c√n log n.
3:   p.send(samplelist_p, ⟨candstr_p⟩).
4:   Set candlist_p ← {candstr_p}.
5:   For each processor r that sent ⟨candstr_r⟩ to p, add candstr_r to candlist_p with probability 1/√n.
Part II: Setting up Requests through quorums.
1: for each processor p do
2:   p generates a random string rstr_p.
3:   For each candidate string s ∈ candlist_p, p.send(H(s, p), ⟨rstr_p⟩).
4:   Let polllist_p ← J(rstr_p, p).
5:   if processor z ∈ H(candstr_z, p) and z.accept(p, ⟨rstr_p⟩) then
6:     for each processor y ∈ polllist_p do
7:       z.send(H(candstr_z, y), ⟨p → y⟩)
8:   for processor t ∈ H(candstr_t, y) for any processor y do
9:     Requests_t(y) = {⟨p → y⟩ | received from p's quorum H(candstr_t, p)}
Part III: Propagating globalstr to every processor.
1: for log n rounds in parallel do
2:   if 0 < |Requests_t(y)| < c′ log n then
3:     for ⟨p → y⟩ ∈ Requests_t(y) do
4:       t.send(y, ⟨p → y⟩)
5:     set Requests_t(y) ← ∅.
6:   if y.accept(H(candstr_y, y), ⟨p → y⟩) then
7:     y.send(p, ⟨candstr_y⟩)
8:     y.send(H(candstr_y, p), ⟨candstr_y⟩)
9:   when for processor p the count of processors in polllist_p sending candidate string s over all rounds reaches a majority: set candstr_p ← s.
10:  if for processor z ∈ H(candstr_z, p), the count of processors in polllist_p sending string s over all rounds reaches a majority then
11:    for processor y ∈ polllist_p such that y has not yet responded do
12:      z.send(H(candstr_y, z), ⟨Abort, p⟩)
13:  if t ∈ H(candstr_t, y) and t.accept(H(candstr_t, p), ⟨Abort, p⟩) then
14:    remove ⟨p → y⟩ from Requests_t(y).
Protocol 3.1. Load balanced almost everywhere to everywhere
a critical assumption that at least a 1/2 + γ fraction of processors are good and knowledgeable, i.e., their candstr equals globalstr. Actually, we can ensure that a 2/3 + ε − O(1/log n) fraction of processors are good and knowledgeable using the almost-everywhere protocol from [13,14], but we need only a 1/2 + γ fraction for our protocol to work. Let candlist_p be the list of candidate strings that p collects during the algorithm. Further, we call H(candstr_q, p) a quorum of p (or p's quorum) according to q. If
a processor p is sending to a quorum for x, then it is assumed to mean that this is the quorum according to p, unless otherwise stated. Similarly, if t is sending conditional on its being in a particular quorum, then we mean this quorum according to t. Often, we shall denote a message within angle brackets ⟨ ⟩; in particular, ⟨p → y⟩ is the message that p has requested information from y. We call a quorum a legitimate quorum of p if it is generated by globalstr, i.e., H(globalstr, p). We also define the following primitives:

v.send(X, m): Processor v sends message m to all processors in set X.

v.accept(X, m): Processor v accepts the message m received from a majority of the processors in the set X (which could be a singleton set); otherwise it rejects it.

Rejection of excess: Every processor will reject messages received in excess of the number of those messages dictated by the protocol in that round or stage of execution of the protocol.

We assume each processor knows H and J. The key to achieving reliable communication channels through quorums is to use globalstr. To begin, each processor p sends its candidate string candstr_p directly to c√n log n randomly selected processors (the samplelist_p). It then generates its own list of candidates candlist_p for globalstr, including candstr_p and every received string with probability 1/√n. This ensures that, w.h.p., p has at least one copy of globalstr in its list.

The key to everywhere agreement is to be able to poll enough processors reliably so as to be able to learn globalstr. Part II sets up these polling requests. Each processor p generates a random string rstr_p, which is used to generate p's poll list polllist_p via the function J, by both p and its quorums. All the processors in the poll list are then contacted by p for their candidate string. In line 3, p determines its quorum for each of the strings in its candlist_p and sends rstr_p to the processors in those quorums. To prevent the adversary from targeting groups of processors, the quorums do not accept the poll list but rather the random string, and then generate the poll list themselves. The important thing to note here is that even if p sent a message to its quorum, the processors in the quorum will not accept the message unless, according to their own candidate string, they are in p's quorum. Hence, it is important to note that w.h.p. at least one of these quorums is a legitimate quorum. Since p sends to at least one legitimate quorum, and the processors in this quorum will accept p's request, this request will be forwarded. p's quorum in turn contacts processor y's quorum for each y in p's poll list. The processors in y's legitimate quorum gather all the requests meant for y in preparation for the next part of the protocol.

Part III proceeds in log n rounds. The processors in y's quorum only forward the received requests if they number less than c′ log n for some fixed constant c′. This prevents any processor from being overloaded. Once y accepts the requests (in accordance with y.accept), it will send its candidate string directly to p and also to p's quorum. When p gets the same string from a majority of processors in
its poll list, it sets its own candidate string to this string. This new string is w.h.p. globalstr. There may be overloaded processors which have not yet answered p's requests. To relieve the congestion, p will send abort messages to these quorums, which will then take p's request off their list. In each round, the number of unsatisfied processors falls by at least half, so that no more than log n rounds are needed. In this way, w.h.p. each processor decides on globalstr.
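Before turning to the proofs, the messaging primitives used above (send, majority accept, and rejection of excess) can be stated compactly in code. The following is a minimal sketch of our own; class and field names and the quota bookkeeping are illustrative, not the authors' implementation:

```python
from collections import defaultdict

class Processor:
    def __init__(self, pid, network, quota):
        self.pid = pid
        self.network = network      # pid -> Processor
        self.quota = quota          # messages the protocol allows this round
        self.inbox = defaultdict(list)
        self.received = 0

    def send(self, X, m):
        """v.send(X, m): send m to every processor in set X."""
        for q in X:
            self.network[q].deliver(self.pid, m)

    def deliver(self, sender, m):
        # Rejection of excess: drop anything beyond the protocol-dictated count.
        if self.received < self.quota:
            self.received += 1
            self.inbox[sender].append(m)

    def accept(self, X, m):
        """v.accept(X, m): True iff a majority of X delivered m this round."""
        return sum(m in self.inbox[q] for q in X) > len(X) / 2
```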
3.2 Proof of Correctness
The conditions for the correctness of the protocol given in Protocol 3.1 are stated as Lemma 10. To prove that, we first show the following supporting lemmas.

Lemma 6. W.h.p., at least one string in the candlist_p of processor p is globalstr.

Proof. The proof follows from the birthday paradox. If there are n possible birthdays and O(√n) children, two will likely share a birthday. Adding an O(log n) factor increases the probability so that this happens for all n processors w.h.p.

Lemma 7. For processor p and its random string rstr_p, a majority of the processors y in polllist_p are good and knowledgeable, and they receive the request ⟨p → y⟩.

Proof. The poll list for processor p, polllist_p, is generated by the sampler J using p's random string rstr_p and p's ID. By Lemma 4, a majority of polllist_p is good and knowledgeable. From Lemmas 5 and 6, processor p will send its message for its poll lists to at least one legitimate quorum. Since a majority of these are good and knowledgeable, they will forward the message ⟨p → y⟩ for each processor y ∈ polllist_p = J(rstr_p, p) to at least one legitimate quorum of y. By Lemma 9, y shall accept the message.

Observation 1. The messages sent by the bad processors, or by good but not knowledgeable processors (having candstr ≠ globalstr), do not affect the outcome of the protocol.

Proof. All communication in Parts II and III is verified by a processor against its quorums or poll list. Any communication received through the quorum or poll list is influential only if a majority of processors in them have sent it (either using the accept primitive or by counting total messages received). By Lemmas 6 and 7, a majority of these lists are good and knowledgeable.

Lemma 8. For the protocol, any processor sends no more than Õ(√n) bits.

Proof. Consider a good and knowledgeable processor p. In Part I, line 3, p sends c√n log n messages. For Part II of the algorithm, suppose p is in the quorum of a processor z; p forwards O(log² n) messages to the quorums of z's poll list. In Part III, p forwards only O(log n) requests to z. The cost of aborting is no more than the cost of sending. In addition, z answers no more than the number of requests that its quorum forwards. By the rejection of excess primitive, no extra messages are sent. Thus, p sends at most Õ(√n) bits over a run of the whole protocol.
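Lemma 6's birthday-paradox mechanism is easy to probe numerically. The following Monte Carlo sketch is our own simplification (half the processors assumed knowledgeable, independent sampling); it estimates how often a fixed processor p ends up with globalstr in candlist_p:

```python
import math, random

def trial(n, c, frac_knowledgeable=0.5, rng=random):
    """One node p: does globalstr land in its candlist? (Lemma 6 mechanism)"""
    senders = int(c * math.sqrt(n) * math.log(n))   # size of each samplelist
    # Each processor is knowledgeable w.p. frac and includes p w.p. senders/n:
    hits = sum(1 for _ in range(n)
               if rng.random() < frac_knowledgeable * senders / n)
    # p keeps each received string independently with probability 1/sqrt(n):
    return any(rng.random() < 1 / math.sqrt(n) for _ in range(hits))

n, c, trials = 10_000, 2, 200
ok = sum(trial(n, c) for _ in range(trials))
print(f"globalstr present in candlist in {ok}/{trials} trials")
```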
Lemma 9. By the end of Part III, for each p, a majority of p's poll list have received p's request to respond from their legitimate quorums.

Proof. Quorums will forward requests provided their processors are not overloaded. We show by induction that if in round i there were x processors making requests to overloaded processors, there are no more than x/2 requests to overloaded processors in round i + 1; thus in log n rounds there shall be no overloaded processors. Hence every processor will answer its requests. Refer to Lemma 4: let R_i be the set of overloaded processors in round i (those that have more than (4/ε)d requests). Consider the set L_i of processors which made these requests; |L_i| ≥ (8/ε)|R_i|. By part 2 of the lemma, half the processors in L_i have less than an ε fraction of their poll lists in R_i, and their requests will be satisfied in the current round by a majority of good processors. Thus, there are now no more than |L_i|/2 such processors making requests to processors in R_i, and hence to overloaded processors in round i + 1.

Lemma 10. Let n be the number of processors in a synchronous full information message passing model with a nonadaptive rushing malicious adversary which controls less than a 1/3 − ε fraction of processors, and where more than a 1/2 + γ fraction of processors are good and knowledgeable. For any positive constants ε, γ there exists a protocol such that w.h.p.: 1) at the end of the protocol, each good processor is also knowledgeable; 2) the protocol takes no more than O(log n) rounds in parallel, using no more than Õ(√n) messages per processor.

Proof. Part 1 follows from Lemma 7 and Observation 1; processor p hears back from its poll list and becomes knowledgeable. Part 2 follows directly from Lemmas 9 (the protocol completes in O(log n) rounds) and 8.
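The halving argument of Lemma 9 can be visualized with a toy simulation. The sketch below is entirely our own and models only the counting (random poll lists, a fixed overload capacity), not the full protocol:

```python
import random

def rounds_to_drain(n, d, capacity, seed=0):
    """Requesters each poll d random targets; targets with more than
    `capacity` pending requests are 'overloaded' and defer. Count rounds
    until every requester sees a majority of non-overloaded targets."""
    rng = random.Random(seed)
    polls = {p: [rng.randrange(n) for _ in range(d)] for p in range(n)}
    pending, rounds = set(range(n)), 0
    while pending:
        load = {}
        for p in pending:
            for t in polls[p]:
                load[t] = load.get(t, 0) + 1
        overloaded = {t for t, c in load.items() if c > capacity}
        # A requester stays pending if at least half its poll list is overloaded.
        pending = {p for p in pending
                   if sum(t in overloaded for t in polls[p]) >= d / 2}
        rounds += 1
        if rounds > 10 * n:      # safety valve for this toy model
            break
    return rounds

print(rounds_to_drain(n=1000, d=20, capacity=60))
```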
4 Asynchronous Version
The asynchronous protocol for Byzantine agreement relies on globalstr being generated by a scalable version of [10]. Such a string would have a reduced constant fraction of random bits, but there would still be sufficient randomness to guarantee the properties needed. Note that the reduction in the fraction of random bits in the string can be compensated for by increasing the length of the string in the proof of Lemma 5. The asynchronous protocol to bring all processors to agreement on globalstr can be constructed from the synchronous protocol by using the primitive asynch_accept instead of accept and by changes to Part III. The primitive v.asynch_accept(X, m) is defined as: processor v waits until |X|/2 + 1 messages which agree on m are received and then takes their value. In Part III, since there are no rounds, there is instead an end-of-round signal for each "round", which is determined when enough processors have decided. The quorums are organized in a tree structure which allows them to simulate the synchronous rounds by explicitly counting the number of processors that become knowledgeable. The round number is determined by the count of quorums which have received n/2 + 1 answers to requests of their
processor. The quorum of a processor monitors the number of requests received and only forwards the requests to a processor when the current number of requests received in a round is sufficiently small. The asynchronous protocol incurs an additional overhead of a log n factor in the number of messages.
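A minimal sketch of the asynch_accept primitive just described, in our own notation (an inbox is assumed to be a thread-safe queue of (sender, message) pairs; this is an illustration, not the authors' code):

```python
import queue

def asynch_accept(inbox, X):
    """v.asynch_accept(X, m): block until |X|/2 + 1 distinct processors
    in X agree on some message m, then adopt that value."""
    votes = {}                        # message -> set of senders seen so far
    needed = len(X) // 2 + 1
    while True:
        sender, m = inbox.get()       # blocks until a message arrives
        if sender in X:
            votes.setdefault(m, set()).add(sender)
            if len(votes[m]) >= needed:
                return m

# Example: a majority of X = {1, 2, 3} agreeing on "globalstr".
box = queue.Queue()
for s, m in [(1, "globalstr"), (9, "noise"), (2, "globalstr")]:
    box.put((s, m))
print(asynch_accept(box, {1, 2, 3}))   # -> "globalstr"
```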
References

1. Aspnes, J., Shah, G.: Skip graphs. In: SODA, pp. 384–393 (2003)
2. Attiya, H., Welch, J.: Distributed Computing: Fundamentals, Simulations and Advanced Topics. John Wiley & Sons, Chichester (2004)
3. Awerbuch, B., Scheideler, C.: Provably secure distributed name service. In: Albers, S., Marchetti-Spaccamela, A., Matias, Y., Nikoletseas, S., Thomas, W. (eds.) ICALP 2009. LNCS, vol. 5556. Springer, Heidelberg (2009)
4. Awerbuch, B., Scheideler, C.: Robust distributed name service. In: Voelker, G.M., Shenker, S. (eds.) IPTPS 2004. LNCS, vol. 3279, pp. 237–249. Springer, Heidelberg (2005)
5. Awerbuch, B., Scheideler, C.: Towards a scalable and robust DHT. In: SPAA, pp. 318–327 (2006)
6. Awerbuch, B., Scheideler, C.: Towards a scalable and robust DHT. Theory Comput. Syst. 45(2), 234–260 (2009)
7. Dwork, C., Peleg, D., Pippenger, N., Upfal, E.: Fault tolerance in networks of bounded degree. In: STOC, pp. 370–379 (1986)
8. Fiat, A., Saia, J., Young, M.: Making Chord robust to Byzantine attacks. In: Brodal, G.S., Leonardi, S. (eds.) ESA 2005. LNCS, vol. 3669, pp. 803–814. Springer, Heidelberg (2005)
9. Gradwohl, R., Vadhan, S.P., Zuckerman, D.: Random selection with an adversarial majority. In: Dwork, C. (ed.) CRYPTO 2006. LNCS, vol. 4117, pp. 409–426. Springer, Heidelberg (2006)
10. Kapron, B.M., Kempe, D., King, V., Saia, J., Sanwalani, V.: Fast asynchronous Byzantine agreement and leader election with full information. In: SODA, pp. 1038–1047 (2008)
11. King, V., Saia, J.: From almost everywhere to everywhere: Byzantine agreement with Õ(n^{3/2}) bits. In: Keidar, I. (ed.) DISC 2009. LNCS, vol. 5805, pp. 464–478. Springer, Heidelberg (2009)
12. King, V., Saia, J.: Breaking the O(n²) bit barrier: Scalable Byzantine agreement with an adaptive adversary. In: PODC, pp. 420–429 (2010)
13. King, V., Saia, J., Sanwalani, V., Vee, E.: Scalable leader election. In: SODA, pp. 990–999 (2006)
14. King, V., Saia, J., Sanwalani, V., Vee, E.: Towards secure and scalable computation in peer-to-peer networks. In: FOCS, pp. 87–98 (2006)
15. Scheideler, C.: How to spread adversarial nodes? Rotate! In: STOC, pp. 704–713 (2005)
16. Upfal, E.: Tolerating a linear number of faults in networks of bounded degree. In: PODC, pp. 83–89 (1992)
17. Young, M., Kate, A., Goldberg, I., Karsten, M.: Practical robust communication in DHTs tolerating a Byzantine adversary. In: ICDCS, pp. 263–272. IEEE, Los Alamitos (2010)
18. Zuckerman, D.: Randomness-optimal oblivious sampling. Random Struct. Algorithms 11(4), 345–367 (1997)
A Necessary and Sufficient Synchrony Condition for Solving Byzantine Consensus in Symmetric Networks

Olivier Baldellon¹, Achour Mostéfaoui², and Michel Raynal²

¹ LAAS-CNRS, 31077 Toulouse, France
² IRISA, Université de Rennes 1, 35042 Rennes, France
[email protected], {achour,raynal}@irisa.fr
Abstract. Solving the consensus problem requires in one way or another that the underlying system satisfies synchrony assumptions. Considering a system of n processes where up to t < n/3 may commit Byzantine failures, this paper investigates the synchrony assumptions that are required to solve consensus. It presents a corresponding necessary and sufficient condition. Such a condition is formulated with the notions of a symmetric synchrony property and property ambiguity. A symmetric synchrony property is a set of graphs, where each graph corresponds to a set of bi-directional eventually synchronous links among correct processes. Intuitively, a property is ambiguous if it contains a graph whose connected components are such that it is impossible to distinguish a connected component that contains correct processes only from a connected component that contains faulty processes only. The paper then connects the notion of a symmetric synchrony property with the notion of an eventual bi-source, and shows that the existence of a virtual ◊[t + 1]bi-source is a necessary and sufficient condition to solve consensus in the presence of up to t Byzantine processes in systems with bi-directional links and message authentication. Finding necessary and sufficient synchrony conditions when links are timely in one direction only, or when processes cannot sign messages, remains an open (and very challenging) problem.

Keywords: Asynchronous message system, Byzantine consensus, Eventually synchronous link, Lower bound, Signature, Symmetric synchrony property.
1 Introduction

Byzantine consensus. A process has a Byzantine behavior when it behaves arbitrarily [15]. This bad behavior can be intentional (malicious behavior, e.g., due to intrusion) or simply the result of a transient fault that altered the local state of a process, thereby modifying its behavior in an unpredictable way. We are interested here in the consensus problem in distributed systems prone to Byzantine process failures, whatever their origin. Consensus is an agreement problem in which each process first proposes a value and then decides on a value [15]. In a Byzantine failure context, the consensus problem is defined by the following properties: every non-faulty process decides a value (termination), no two non-faulty processes decide different values (agreement), and if all non-faulty processes propose the same
value, that value is decided (validity). (See [14] for a short introduction to Byzantine consensus.)

Aim of the paper. A synchronous distributed system is characterized by the fact that both processes and communication links are synchronous (or timely) [2,13,16]. This means that there are known bounds on process speed and message transfer delays. Let t denote the maximum number of processes that can be faulty in a system made up of n processes. In a synchronous system, consensus can be solved (a) for any value of t (i.e., t < n) in the crash failure model, (b) for t < n/2 in the general omission failure model, and (c) for t < n/3 in the Byzantine failure model [12,15]. Moreover, these bounds are tight. On the contrary, when all links are asynchronous (i.e., when there is no bound on message transfer delays), it is impossible to solve consensus even if we consider the weakest failure model (namely, the process crash failure model) and assume that at most one process may be faulty (i.e., t = 1) [7]. It trivially follows that Byzantine consensus is impossible to solve in an asynchronous distributed system. As Byzantine consensus can be solved in a synchronous system and cannot be in an asynchronous system, a natural question that comes to mind is the following: "When considering the synchrony-to-asynchrony axis, which is the weakest synchrony assumption that allows Byzantine consensus to be solved?" This is the question addressed in the paper. To that end, the paper considers the synchrony captured by the structure and the number of eventually synchronous links among correct processes.

Related work. Several approaches to solve Byzantine consensus have been proposed. We consider here only deterministic approaches. One consists in enriching the asynchronous system (hence the system is no longer fully asynchronous) with a failure detector, namely, a device that provides processes with hints on failures [4]. Basically, in one way or another, a failure detector encapsulates synchrony assumptions. Failure detectors suited to Byzantine behavior have been proposed and used to solve Byzantine consensus (e.g., [3,8,9]). Another approach proposed to solve Byzantine consensus consists in directly assuming that some links satisfy a synchrony property ("directly" means that the synchrony property is not hidden inside a failure detector abstraction). This approach relies on the notion of a ◊[x + 1]bi-source (read "◊" as "eventual") that has been introduced in [1]. Intuitively, this notion states that there is a correct process that has x bi-directional input/output links with other correct processes and these links eventually behave synchronously [5,6]. (Our definition of a ◊[x + 1]bi-source is slightly different from the original definition introduced in [1]. The main difference is that it considers only eventually synchronous links connecting correct processes. It is precisely defined in Section 6.¹)
¹ We consider eventually synchronous links connecting correct processes only because, due to Byzantine behavior, a synchronous link connecting a correct process and a Byzantine process can always appear to the correct process as being an asynchronous link.
Considering asynchronous systems with Byzantine processes without message authentication, it is shown in [1] that Byzantine consensus can be solved if the system has a ◊[n − t]bi-source (all other links being possibly fully asynchronous). Moreover, the ◊[n − t]bi-source can never be explicitly known. This result has been refined in [11], which presents a Byzantine consensus algorithm for an asynchronous system that has a ◊[2t + 1]bi-source. Considering systems with message authentication, a Byzantine consensus algorithm is presented in [10] that requires a ◊[t + 1]bi-source only. As for Byzantine consensus in synchronous systems, all these algorithms assume t < n/3.
[Figure 1 (diagram): a cycle of implications among four statements: "∃ G ∈ S with no virtual ◊[t + 1]bi-source" (Lemma 3, Section 6), "∃ G ∈ S and ∀ C ∈ G : |C| ≤ t" (Theorem 2, Section 5), "Synchrony property S is ambiguous" (Theorem 1, Section 4), and "Consensus cannot be solved with property S" (contrapositive of Ref. [10]).]

Fig. 1. The proof of the necessary and sufficient condition (Theorem 3)
Content of the paper. The contribution of the paper is the definition of a symmetric synchrony property that is necessary and sufficient to solve Byzantine consensus in asynchronous systems with message authentication. From a concrete point of view, this property is the existence of what we call a virtual ◊[t + 1]bi-source. A symmetric synchrony property S is a set of communication graphs, such that (a) each graph specifies a set of eventually synchronous bi-directional links connecting correct processes and (b) this set of graphs satisfies specific additional properties that give S a particular structure. A synchrony property may or may not be ambiguous. Intuitively, it is ambiguous if it contains a graph whose connected components are such that there are executions in which it is impossible to distinguish a component with correct processes only from a connected component with faulty processes only. (These notions are formally defined in the paper.) A synchrony property S for a system of n processes where at most t processes may be faulty is called an (n, t)-synchrony property. The paper shows first that, assuming a property S, it is impossible to solve consensus if S is ambiguous. It is then shown that, if consensus can be solved when the actual communication graph is any graph of S (we then say "S is satisfied"), then any graph of S has at least one connected component whose size is at least t + 1. The paper then relates the ambiguity of an (n, t)-synchrony property S to the size x of a virtual ◊[x]bi-source. These results are schematically represented in Figure 1, from which follows the fact that a synchrony property S allows Byzantine consensus to be solved despite up to t Byzantine processes in a system with message authentication if and only if S is not ambiguous.

Road map. The paper is made up of 7 sections. Section 2 presents the underlying asynchronous Byzantine computation model. Section 3 defines the notion of a synchrony property S and the associated notion of ambiguity. As already indicated, a synchrony
property bears on the structure of the eventually synchronous links connecting correct processes. Then, Section 4 shows that an ambiguous synchrony property S does not allow consensus to be solved (Theorem 1). Section 5 relates the size of the connected components of the graphs of an (n, t)-synchrony property S to the ambiguity of S (Theorem 2). Section 6 establishes the main result of the paper, namely, a necessary and sufficient condition for solving Byzantine consensus in systems with message authentication. Finally, Section 7 concludes the paper.
2 Computation Model

Processes. The system is made up of a finite set Π = {p1, . . . , pn} of n > 1 processes that communicate by exchanging messages through a communication network. Processes are assumed to be synchronous in the sense that local computation times are negligible with respect to message transfer delays. Local processing times are considered as being equal to 0.

Failure model. Up to t < n/3 processes can exhibit a Byzantine behavior. A Byzantine process is a process that behaves arbitrarily: it can crash, fail to send or receive messages, send arbitrary messages, start in an arbitrary state, perform arbitrary state transitions, etc. Moreover, Byzantine processes can collude to "pollute" the computation. Yet, it is assumed that they do not control the network. This means that they cannot corrupt the messages sent by non-Byzantine processes, and the schedule of message delivery is uncorrelated to Byzantine behavior. A process that exhibits a Byzantine behavior is called faulty. Otherwise, it is correct or non-faulty.

Communication network. Each pair of processes pi and pj is connected by a reliable bi-directional link denoted (pi, pj). This means that, when a process receives a message, it knows which process is its sender. A link can be fully asynchronous or eventually synchronous. The bi-directional link connecting a pair of processes pi and pj is eventually synchronous if there is a finite (but unknown) time τ after which there is an upper bound on the time duration that elapses between the sending and the reception of a message sent on that link (hence an eventually synchronous link is eventually synchronous in both directions). If such a bound does not exist, the link is fully asynchronous. If τ = 0 and the bound is known, then the link is synchronous.

Message authentication. When the system provides the processes with message authentication, a Byzantine process can only fail to relay messages or send bad messages of its own. When it forwards a message received from another process, it cannot alter its content.

Notation. Given a set of processes that defines which are the correct processes, let H ⊆ Π × Π denote the set of eventually synchronous bi-directional links connecting these correct processes. (This means that this communication graph has no edge incident to a faulty process; moreover, it is possible that some pairs of correct processes are not connected by an eventually synchronous bi-directional link.)
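For intuition only, an eventually synchronous link can be modeled as a delay function that is unconstrained before some unknown stabilization time τ and bounded by δ afterwards. The following toy model is our own, not part of the paper's formalism:

```python
import random

class EventuallySynchronousLink:
    """Delay model: arbitrary finite delay before tau, bounded by delta after tau."""
    def __init__(self, tau, delta, rng=random):
        self.tau, self.delta, self.rng = tau, delta, rng

    def delivery_time(self, send_time):
        if send_time >= self.tau:
            return send_time + self.rng.uniform(0, self.delta)
        # Before stabilization: any finite delay, possibly huge.
        return send_time + self.rng.expovariate(1e-3)

link = EventuallySynchronousLink(tau=100.0, delta=2.0)
print(link.delivery_time(5.0), link.delivery_time(150.0))
```

Setting tau = 0 with a known delta recovers a synchronous link; removing the bound entirely gives a fully asynchronous one.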
Given a set of correct processes and an associated graph H as defined above, the previous system model is denoted AS_{n,t}[H]. More generally, let S = {H1, . . . , Hℓ} be a set of sets of eventually synchronous bi-directional links connecting correct processes. AS_{n,t}[S] denotes the system model (set of runs, see below in Section 3.2) in which the correct processes and the eventually synchronous bi-directional links connecting them are defined by H1, or H2, etc., or Hℓ.
3 Definitions

We consider only undirected graphs in the following. The aim of this section is to state a property that will be used to prove an impossibility result. Intuitively, a vertex represents a process, while an edge is used to represent an eventually synchronous bi-directional link. Hence the set of vertices of a graph G is Π and its set of edges is included in Π × Π.

3.1 (n, x)-Synchrony Property and Ambiguity

The formal definitions given in this section will be related to processes and links of a system in the next section.

Definition 1. Let G = (Π, E) be a graph. A permutation π on Π defines a permuted graph, denoted π(G) = (Π, E′), i.e., ∀ a, b ∈ Π : (a, b) ∈ E ⇔ (π(a), π(b)) ∈ E′.

All permuted graphs of G have the same structure as G; they differ only in the names of vertices.

Definition 2. Let G1 = (Π, E1) and G2 = (Π, E2). G1 is included in G2 (denoted G1 ⊆ G2) if E1 ⊆ E2.

Definition 3. An (n, x)-synchrony property S is a set of graphs with n vertices such that ∀ G1 ∈ S we have:
– Permutation stability. If G2 is a permuted graph of G1, then G2 ∈ S.
– Inclusion stability. ∀ G2 such that G1 ⊆ G2, then G2 ∈ S.
– x-Resilience. ∃ G0 ∈ S such that G0 ⊆ G1 and G0 has at least x isolated vertices (an isolated vertex is a vertex without neighbors).

The aim of an (n, x)-synchrony property is to capture a property on eventually synchronous bi-directional links. It is independent of process identities (permutation stability). Moreover, adding eventually synchronous links to a graph of an (n, x)-synchrony property S does not falsify it (inclusion stability). Finally, the fact that up to x processes are faulty cannot invalidate it (x-resilience). As an example, assuming n − t ≥ 3, "there are 3 eventually synchronous bi-directional links connecting correct processes" is an (n, x)-synchrony property. It includes all the graphs G of n vertices that have 3 edges and x isolated vertices plus, for every such G, all graphs obtained by adding any number of edges to G. Given a graph G = (Π, E) and a set of vertices C ⊂ Π, G \ C denotes the graph from which the edges (pi, pj) with pi or pj ∈ C have been removed.
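Definitions 1–3 translate directly into set operations. Here is a small sketch of our own (graphs represented as sets of 2-element frozensets over vertices 0 . . . n−1; the same representation is reused in later sketches):

```python
def permuted(edges, pi):
    """Definition 1: apply a vertex permutation pi (a dict) to an edge set."""
    return {frozenset({pi[a], pi[b]}) for a, b in map(tuple, edges)}

def included(e1, e2):
    """Definition 2: G1 ⊆ G2 on the same vertex set."""
    return e1 <= e2

def minus_component(edges, C):
    """G \\ C: remove every edge incident to a vertex of C."""
    return {e for e in edges if not (e & C)}

G = {frozenset({0, 1}), frozenset({1, 2}), frozenset({3, 4})}
print(minus_component(G, frozenset({1})))   # -> {frozenset({3, 4})}
```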
Definition 4. Let S be an (n, x)-synchrony property. S is ambiguous if it contains a graph G = (Π, E) whose every connected component C is such that (i) |C| ≤ x and (ii) G \ C ∈ S. Such a graph G is said to be S-ambiguous.

Intuitively, an (n, x)-synchrony property S is ambiguous if it contains a graph G that satisfies the property S in all runs where all processes of any one connected component of G could be faulty (recall that at most x processes are faulty).

3.2 Algorithm and Runs Satisfying an (n, x)-Synchrony Property

Definition 5. An n-process algorithm A is a set of n automata, such that a deterministic automaton is associated with each correct process. A transition of an automaton defines a step. A step corresponds to an atomic action. During a step a correct process may send/receive a message and change its state according to its previous steps and the current state of its automaton. The steps of a faulty process can be arbitrary.

Definition 6. A run of an algorithm A in AS_{n,t}[G] (a system of n processes with at most t faulty processes and for which G defines the graph of eventually timely channels among correct processes) is a triple ⟨I, R, T⟩_{t,G} where I defines the initial state of each correct process, R is a (possibly infinite) sequence of steps of A (where at most t processes are faulty), and T is the increasing sequence of time values indicating the time instants at which the steps of R occurred. The sequence R is such that, for any message m, the reception of m occurs after its sending, the steps issued by every process occur in R in their issuing order, and for any correct process pi the steps of pi are as defined by its automaton.

Definition 7. E_{t,G}(A) denotes the set of runs of algorithm A in AS_{n,t}[G].

Definition 8. Given an (n, x)-synchrony property S, let E_S(A) = ∪_{t≤x, G∈S} E_{t,G}(A) (i.e., E_S(A) is the set of runs r = ⟨I, R, T⟩_{t,G} of A such that t ≤ x and G ∈ S).

Let us recall that a synchrony property S is a set of graphs on Π.

Definition 9. An (n, x)-synchrony property S allows an algorithm A to solve the consensus problem in AS_{n,x}[S] if every run in E_S(A) satisfies the validity, agreement and termination properties that define the Byzantine consensus problem.
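Definition 4's ambiguity condition can be checked mechanically when S is given as a membership predicate over edge sets. A brute-force sketch of our own, using the same graph representation as the previous sketch:

```python
def components(n, edges):
    """Connected components of a graph on vertices 0..n-1 (union-find)."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for e in edges:
        a, b = tuple(e)
        parent[find(a)] = find(b)
    comps = {}
    for v in range(n):
        comps.setdefault(find(v), set()).add(v)
    return list(comps.values())

def is_S_ambiguous(n, t, edges, in_S):
    """Definition 4, with S supplied as a membership predicate in_S(edges)."""
    return all(len(C) <= t and
               in_S({e for e in edges if not (e & frozenset(C))})
               for C in components(n, edges))

# Toy check with a toy predicate (not a full synchrony property):
edges = {frozenset({0, 1}), frozenset({2, 3})}
in_S = lambda es: len(es) >= 1            # "at least one edge present"
print(is_S_ambiguous(5, 2, edges, in_S))  # -> True
```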
4 An Impossibility Result

Given an (n, t)-synchrony property S, this section shows that there is no algorithm A that solves the consensus problem in AS_{n,t}[H] if H is an S-ambiguous graph of S. This means that the synchrony assumptions captured by S are not powerful enough to allow consensus to be solved despite up to t faulty processes. There is no algorithm A that would solve consensus for any underlying synchrony graph of an ambiguous synchrony property S.
4.1 A Set of Specific Runs

This section defines the set of runs in which the connected components (as defined by the eventually synchronous communication graph H) are asynchronous with respect to one another, and (if any) the set of faulty processes corresponds to a single connected component. The corresponding set of runs, denoted F(A, H), will then be used to prove the impossibility result.

Definition 10. Let A be an n-process algorithm and H be a graph whose n vertices are processes and whose every connected component contains at most t processes. Let F(A, H) be the set of runs of A that satisfy the following properties:
– If pi and pj belong to the same connected component of H, then the bi-directional link (pi, pj) is eventually synchronous.
– If pi and pj belong to the same connected component of H, then both are correct or both are faulty.
– If pi and pj belong to distinct connected components of H, then, if pi is faulty, pj is correct.
Lemma 2. Let S be an ambiguous (n, t)-synchrony property and H an S-ambiguous graph. Whatever the algorithm A, there is a run r ∈ F (A, H) that does not solve consensus. Proof. The proof is a reduction to the FLP impossibility result [7] (impossibility to solve consensus despite even only one faulty process in a system in which all links are asynchronous). To that end, let us assume by contradiction that there is an algorithm
A that solves consensus among n processes p1 , . . . , pn despite the fact that up to t of them may be faulty, when the underlying eventually synchronous communication graph belongs to S (for example an S-ambiguous graph H). This means that, by assumption, all runs r ∈ ES (A) satisfy the validity, agreement and termination properties that define the Byzantine consensus problem.
[Figure 2 (diagram): the processes p1, . . . , pn, partitioned into the connected components C1, C2, . . . , Cm, are simulated (⇔) by the m processes q1, q2, . . . , qm, one simulator per component.]

Fig. 2. A reduction to the FLP impossibility result
Let C1, . . . , Cm be the connected components of H and q1, . . . , qm a set of m processes (called simulators in the following). The proof consists in constructing a simulation in which the simulators q1, . . . , qm solve consensus despite the fact that they are connected by asynchronous links and one of them may be faulty, thereby contradicting the FLP result (Figure 2). To that end, each simulator qj, 1 ≤ j ≤ m, simulates the processes of the connected component Cj it is associated with. Moreover, without loss of generality, let us assume that, for every component Cj made up of correct processes, these processes propose the same value vj. Such a simulation² of the processes p1, . . . , pn (executing the Byzantine consensus algorithm A) by the simulators q1, . . . , qm results in a run r ∈ F(A, H) (from the point of view of the processes p1, . . . , pn). As (by definition) the algorithm A is correct, the correct processes decide in run r. As (a) H ∈ S, (b) S is ambiguous, and (c) r ∈ F(A, H), it follows from Lemma 1 that r ∈ E_S(A), which means that r is a run in which the correct processes pi decide the same value v (and, if they all have proposed the very same value w, we have v = w). It follows that, by simulating the processes p1, . . . , pn that execute the consensus algorithm A, the m asynchronous processes q1, . . . , qm (qj proposing value vj) solve consensus despite the fact that one of them is faulty (the one associated with the faulty component Cj, if any). Hence, the simulators q1, . . . , qm solve consensus despite the fact that one of them may be faulty, contradicting the FLP impossibility result, which concludes the proof of the lemma.
² The simulation, which is only sketched, is a very classical one. A similar simulation is presented in [15], in the context of synchronous systems, that extends the impossibility to solve Byzantine consensus from a set of n = 3 synchronous processes where one (t = 1) is a Byzantine process to a set of n ≤ 3t processes. A similar simulation is also described in [16].
The following theorem is an immediate consequence of Lemmas 1 and 2.

Theorem 1. No ambiguous (n, t)-synchrony property allows Byzantine consensus to be solved in a system of n processes where up to t processes can be faulty.

Remark 1. Let us observe that the proof of the previous theorem does not depend on whether messages are signed or not. Hence, the theorem is valid for systems with or without message authentication.

Remark 2. The impossibility to solve consensus despite one faulty process in an asynchronous system [7] corresponds to the case where S is the (n, 1)-synchrony property that contains the edge-less graph.
5 Relating the Size of Connected Components and Ambiguity

Assuming a system with message authentication, let S be an (n, t)-synchrony property that allows consensus to be solved despite up to t Byzantine processes. This means that consensus can be solved for any eventually synchronous communication graph in S. It follows from Theorem 1 that S is not ambiguous. This section shows that if an eventual synchrony property S allows consensus to be solved, then any graph of S contains at least one connected component C whose size is greater than t (|C| > t).

Theorem 2. Let S be an (n, t)-synchrony property. If there is a graph G ∈ S such that none of its connected components has more than t vertices, then S is ambiguous.

Proof. Let G ∈ S be such that no connected component of G has more than t vertices. It follows from the t-resilience property of S that there is a graph G′ included in G (i.e., both have the same vertices and the edges of G′ are also in G) that has at least t isolated vertices. Let us observe that G′ can be decomposed into m + t connected components C1, . . . , Cm, γ1, . . . , γt, where each Ci contains at most t vertices and each γi contains a single vertex (top of Figure 3). Let us construct a graph G″ as follows. G″ is made up of the m connected components C1, . . . , Cm plus another connected component, denoted Cm+1, including the t vertices γ1, . . . , γt (bottom of Figure 3). Moreover, G″ contains all edges of G′ plus
[Figure 3 (diagram): top, G′ with connected components C1, C2, . . . , Cm and isolated vertices γ1, . . . , γt; bottom, G″ with the same components C1, C2, . . . , Cm plus the component Cm+1 grouping γ1, . . . , γt.]

Fig. 3. Construction of the graph G″
the new edges needed in order that the connected component Cm+1 be a clique (i.e., a graph in which any pair of distinct vertices is connected by an edge). As G′ ∈ S and G′ ⊆ G″, it follows from the inclusion stability property of S that G″ ∈ S. The rest of the proof consists in showing that G″ is S-ambiguous (from which the ambiguity of S follows).
– Let us first observe that, due to its very construction, each connected component C of G″ contains at most t vertices.
– Let us now show that for any connected component C of G″, we have G″ \ C ∈ S. (Let us recall that G″ \ C is G″ from which all edges incident to vertices of C have been removed.) We consider two cases.
• Case C = Cm+1. We then have G″ \ C = G′. The fact that G′ ∈ S concludes the proof of the case.
[Figure 4 (diagram): top, G″ with components C1, . . . , Ci, . . . , Cm (where Ci has vertices δ1, . . . , δd) and the clique Cm+1 on γ1, . . . , γd, γd+1, . . . , γt; bottom, Gi = G″ \ Ci, whose vertices are permuted so that part of the clique covers the positions of δ1, . . . , δd.]

Fig. 4. Using a permutation
• Case C = Ci for 1 ≤ i ≤ m. Let δ1, . . . , δd be the vertices of Ci and let Gi = G″ \ Ci. According to the permutation stability property of S, there is a permutation π of the vertices of Gi such that G′ ⊆ π(Gi) (Figure 4). It then follows from the fact that S is a synchrony property that π(Gi) ∈ S and consequently Gi ∈ S, which concludes the proof of the case and the proof of the theorem.
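The construction used in this proof is mechanical. Here is a sketch of our own that builds G″ from G′ by adding a clique on the isolated vertices, using the same edge-set representation as the earlier sketches:

```python
from itertools import combinations

def build_G2(edges, isolated):
    """Theorem 2: G'' = G' plus a clique C_{m+1} on the t isolated vertices."""
    clique = {frozenset(pair) for pair in combinations(sorted(isolated), 2)}
    return set(edges) | clique

G1 = {frozenset({0, 1}), frozenset({2, 3})}      # components C1, C2 of G'
print(sorted(map(sorted, build_G2(G1, {4, 5, 6}))))
```

Since only edges are added, inclusion stability immediately gives G″ ∈ S whenever G′ ∈ S, exactly as the proof argues.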
Taking the contrapositive of Theorem 1 and Theorem 2, we obtain the following corollary. Corollary 1. If an (n, t)-synchrony property S allows consensus to be solved, then any graph of S contains at least one connected component whose size is at least t + 1.
6 A Necessary and Sufficient Condition

This section introduces the notion of a virtual ◊[x + 1]bi-source and shows that the existence of a virtual ◊[t + 1]bi-source is a necessary and sufficient condition to solve
the consensus problem in a system with message signatures and where up to t processes can commit Byzantine failures.

Definition 11. A ◊[x + 1]bi-source is a correct process that has an eventually synchronous bi-directional link with x correct processes (not including itself).

From a structural point of view, a ◊[x + 1]bi-source is a star made up of correct processes. (As already noticed, this definition differs from the usual one, in the sense that it considers only correct processes.)

Lemma 3. If a graph G has a connected component C of size at least x + 1, a ◊[x + 1]bi-source can be built inside C.

Proof. Given a graph G that represents the eventually synchronous bi-directional links connecting correct processes, let us assume that G has a connected component C such that |C| ≥ x + 1. A star (◊[x + 1]bi-source) can be easily built as follows. When a process p receives a message for the first time, it forwards it to all. Let us remember that, as messages are signed, a faulty process cannot corrupt the content of the messages it forwards; it can only omit to forward them. Let λ be the diameter of C and δ the eventual synchrony bound for message transfer delays. This means that, when we consider any two processes p, q ∈ C, λ × δ is an eventual synchrony bound for any message communicated inside the component C. Moreover, considering any process p ∈ C, the processes of C define a star structure centered at p, such that, for any q ∈ C \ {p}, there is a virtual eventually synchronous link (with bound λ × δ) that is made up of eventually synchronous links and correct processes of C, which concludes the proof of the lemma.
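The virtual links of Lemma 3 are just flooding along the component: per destination, the delay bound is (hops to that destination) × δ, which is at most λ × δ. A small sketch of our own, computing these bounds by breadth-first search over an adjacency-list graph:

```python
from collections import deque

def virtual_delay_bounds(adj, center, delta):
    """Lemma 3: with forward-on-first-receipt, the virtual link from `center`
    to each node of its component is synchronous with bound hops * delta."""
    dist = {center: 0}
    q = deque([center])
    while q:
        u = q.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return {v: h * delta for v, h in dist.items() if v != center}

adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}     # a path; diameter lambda = 3
print(virtual_delay_bounds(adj, 0, delta=1.0))   # {1: 1.0, 2: 2.0, 3: 3.0}
```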
The following definition generalizes the notion of a ◊[x + 1]bi-source.

Definition 12. A communication graph G has a virtual ◊[x + 1]bi-source if it has a connected component C of size at least x + 1.

Theorem 3. An (n, t)-synchrony property S allows consensus to be solved in an asynchronous system with message authentication, despite up to t Byzantine processes, if and only if any graph of S contains a virtual ◊[t + 1]bi-source.

Proof. The sufficiency side follows from the algorithm described in [10], which presents and proves correct a consensus algorithm for asynchronous systems made up of n processes where (a) up to t processes may be Byzantine, (b) messages are signed, and (c) there is a ◊[t + 1]bi-source (a ◊[t + 1]bi-source in our terminology is a ◊[2t]bi-source in the parlance of [1,10]). For the necessity side, let S be a synchrony property such that none of its graphs contains a virtual ◊[t + 1]bi-source. It follows from the contrapositive of Corollary 1 that S does not allow Byzantine consensus to be solved.
The following corollary is an immediate consequence of the previous theorem.

Corollary 2. The existence of a virtual ◊[t + 1]bi-source is a necessary and sufficient condition to solve consensus (with message authentication) in the presence of up to t Byzantine processes.
7 Conclusion

This paper has presented a synchrony condition that is necessary and sufficient for solving consensus despite Byzantine processes in systems equipped with message authentication. This synchrony condition is symmetric in the sense that some links have to be eventually timely in both directions. Last but not least, finding necessary and sufficient synchrony conditions when links are timely in one direction only, or when processes cannot sign messages, remains an open (and very challenging) problem.
References

1. Aguilera, M.K., Delporte-Gallet, C., Fauconnier, H., Toueg, S.: Consensus with Byzantine Failures and Little System Synchrony. In: Int'l Conference on Dependable Systems and Networks (DSN 2006), pp. 147–155. IEEE Computer Press, Los Alamitos (2006)
2. Attiya, H., Welch, J.: Distributed Computing: Fundamentals, Simulations and Advanced Topics, 2nd edn., 414 pages. Wiley-Interscience, Hoboken (2004)
3. Cachin, C., Kursawe, K., Shoup, V.: Random Oracles in Constantinople: Practical Asynchronous Byzantine Agreement using Cryptography. In: Proc. 19th ACM Symposium on Principles of Distributed Computing (PODC 2000), pp. 123–132 (2000)
4. Chandra, T., Toueg, S.: Unreliable Failure Detectors for Reliable Distributed Systems. Journal of the ACM 43(2), 225–267 (1996)
5. Delporte-Gallet, C., Devismes, S., Fauconnier, H., Larrea, M.: Algorithms for Extracting Timeliness Graphs. In: Patt-Shamir, B., Ekim, T. (eds.) SIROCCO 2010. LNCS, vol. 6058, pp. 127–141. Springer, Heidelberg (2010)
6. Dwork, C., Lynch, N., Stockmeyer, L.: Consensus in the Presence of Partial Synchrony. Journal of the ACM 35(2), 288–323 (1988)
7. Fischer, M.J., Lynch, N.A., Paterson, M.S.: Impossibility of Distributed Consensus with One Faulty Process. Journal of the ACM 32(2), 374–382 (1985)
8. Friedman, R., Mostéfaoui, A., Raynal, M.: ◊P_mute-Based Consensus for Asynchronous Byzantine Systems. Parallel Processing Letters 15(1-2), 162–182 (2005)
9. Friedman, R., Mostéfaoui, A., Raynal, M.: Simple and Efficient Oracle-Based Consensus Protocols for Asynchronous Byzantine Systems. IEEE Transactions on Dependable and Secure Computing 2(1), 46–56 (2005)
10. Hamouna, M., Mostéfaoui, A., Trédan, G.: Byzantine Consensus with Few Synchronous Links. In: Tovar, E., Tsigas, P., Fouchal, H. (eds.) OPODIS 2007. LNCS, vol. 4878, pp. 76–89. Springer, Heidelberg (2007)
11. Hamouna, M., Mostéfaoui, A., Trédan, G.: Byzantine Consensus in Signature-free Systems. Submitted for publication
12. Lamport, L., Shostak, R., Pease, M.: The Byzantine Generals Problem. ACM Transactions on Programming Languages and Systems 4(3), 382–401 (1982)
13. Lynch, N.A.: Distributed Algorithms, 872 pages. Morgan Kaufmann, San Francisco (1996)
14. Okun, M.: Byzantine Agreement. Springer Encyclopedia of Algorithms, pp. 116–119 (2008)
15. Pease, M., Shostak, R., Lamport, L.: Reaching Agreement in the Presence of Faults. Journal of the ACM 27, 228–234 (1980)
16. Raynal, M.: Fault-tolerant Agreement in Synchronous Message-passing Systems, 167 pages. Morgan & Claypool (September 2010)
GoDisco: Selective Gossip Based Dissemination of Information in Social Community Based Overlays

Anwitaman Datta and Rajesh Sharma

School of Computer Engineering, Nanyang Technological University, Singapore
{Anwitaman,raje0014}@ntu.edu.sg
Abstract. We propose and investigate GoDisco, a gossip based decentralized mechanism inspired by social principles and behavior, to disseminate information in online social community networks, using exclusively social links and exploiting semantic context to keep the dissemination process selective to relevant nodes. Such a designed dissemination scheme using gossiping over an egocentric social network is unique and is arguably a concept whose time has arrived: it emulates word-of-mouth behavior and can have interesting applications like probabilistic publish/subscribe, decentralized recommendation and contextual advertisement systems, to name a few. Simulation based experiments show that despite using only local knowledge and contacts, the system has good global coverage and behavior.

Keywords: gossip algorithm, community networks, selective dissemination, social network, egocentric.
1 Introduction
Many modern internet-scale distributed systems are projections of real-world social relations, inheriting also the semantic and community contexts from the real world. This is explicit in some scenarios like online social networks, while implicit in others, such as systems derived from interactions among individuals (such as email exchanges) and traditional file-sharing peer-to-peer systems, where people with similar interests self-organize into a peer-to-peer overlay which is semantically clustered according to the tastes of these people [6,13]. Recently, various efforts to build peer-to-peer online social networks (P2P OSNs) [3] are also underway. Likewise, virtual communities are formed in massively multiplayer online games (MMOGs). In many of these social information systems, it is often necessary to have mechanisms to disseminate information (i) effectively - reaching the relevant people who would be interested in the information while not bothering others who won't be, doing so (ii) efficiently - avoiding duplication or latency, in a (iii) decentralized environment - scaling without global infrastructure, knowledge and coordination, and in a (iv) reliable manner - dealing with temporary failures of a subset of the population, and ensuring information quality.
Consider a hypothetical scenario involving the circulation of a call for papers (CFP) to academic peers. One often posts such a CFP to relevant mailing lists. Such an approach is essentially analogous to a very simplistic publish/subscribe (pub/sub) system. If the scopes of the mailing lists are very restricted, then many such mailing lists would be necessary. However, people who have not explicitly subscribed to such a mailing list will miss out on the information. Likewise, depending on the scope of topics discussed in the mailing list, there may be too many posts which are actually not relevant. A CFP is also propagated by personal email among colleagues and collaborators one knows. The latter approach is unstructured and requires neither any specific infrastructure (unlike the mailing list) nor any explicit notion of subscription. Individuals make autonomous decisions based on their perception of what their friends' interests may be. Likewise, individuals may build local trust metrics to decide which friends typically forward useful or useless information, providing a personal, subjective context to determine the quality of information. On the downside, such an unstructured, word of mouth (gossip/epidemic) approach does not guarantee full coverage, and may also generate many duplicates. Such redundancy can, however, also make the dissemination process robust against failures. Consequently, epidemic algorithms are used in many distributed systems and applications [9,2]. Despite the well recognized role of word of mouth approaches in disseminating information in a selective manner - in real life or over the internet, such as by email or in online social networks - there has not been any algorithmic (designed) information dissemination mechanism leveraging the community structures and semantics available in social information systems. This is arguably partly because of the novelty of the systems, such as P2P OSNs [3], where the mechanism can be naturally applied. We propose and investigate a gossip based decentralized mechanism, inspired by social principles and behavior, to disseminate information in a distributed setting, using exclusively social links and exploiting semantic context to keep the dissemination process selective to relevant nodes. We explore the trade-offs of coverage, spam and message duplicates, and evaluate our mechanisms over synthetic and real social communities. These designed mechanisms can be useful not only for the distributed social information systems for which we develop them, but may also have wider impact in the longer run - such as in engineering word of mouth marketing strategies, as well as in understanding the naturally occurring dissemination processes which inspired our algorithm designs in the first instance. Experiments on synthetic and real networks show that our proposed mechanism is effective and moderately efficient, though the performance deteriorates in sparse graphs.

The rest of the paper is organized as follows. In Section 2, we present related work. We present our approach, including various concepts and the algorithm, in Section 3. Section 4 evaluates the approach using synthetic and real networks. We conclude with several future directions in Section 5.
2 Related Works
Targeted dissemination of information can be carried out efficiently using a structured approach, as is the case with publish/subscribe systems [10] that rely on infrastructure like overlay based application layer multicasting [12] or gossiping techniques [2] to spread the information within a well-defined group. Well defined groups are not always practical, and alternative approaches to propagate information in unstructured environments are appealing. Broadcasting information to everyone ensures that relevant people get it (high recall), but is undesirable, since it spams many others who are uninterested (poor precision). In mobile ad-hoc and delay tolerant networking scenarios, selective gossiping techniques relying on user profile and context to determine whether to propagate the information have been used. This is analogous to how multiple diseases can spread in a viral manner - affecting only the "susceptible" subset of the population. Autonomous gossiping [8] and SocialCast [7] are two such approaches, which rely on the serendipity of interactions with like-minded users, and depend on users' physical mobility to produce such interactions. There are also many works studying naturally occurring (not specifically designed) information spread in community networks - including on the blogosphere [15,11] as well as in online social network communities [5] - and simulation based models to understand such dissemination mechanisms [1] and human behavior [19]. Our work, GoDisco, lies at the crossroads of these works: we specifically design algorithms to effectively (ideally with high recall and precision, and low latency) and efficiently (with low spam and duplication) disseminate information in collaborative social network communities, by using some simple and basic mechanisms motivated by naturally occurring human behaviors (similar to word of mouth) and using only local information. Our work belongs to the family of epidemic and gossip algorithms [18]. However, unlike GoDisco, where we carry out targeted and selective dissemination, traditional approaches [14,2,9,16] are designed to broadcast information within a well defined group, and typically assume a fully connected graph - so any node can communicate directly with any other node. In GoDisco, nodes can communicate directly only with other nodes with whom they have a social relation. While this constraint seems restrictive, it actually helps direct the propagation within a community of users with shared interest, as well as limit the spread to uninterested parties. We specifically aim to use only locally available social links, rather than forming semantic clusters, because we want to leverage the natural trust and disincentives to propagate bogus messages that exist among neighbors in an acquaintance/social graph. Similar to GoDisco, the design of JetStream [16] is also motivated by social behavior - but of a different kind, that of reciprocity. JetStream is designed to enhance the reliability and relative determinism of information propagation using gossip, but it is not designed for selective dissemination of information to an interest sub-group. A recent work, GO [20], also performs selective dissemination, but in explicit and well defined subgroups, and again assuming a fully connected underlying graph. The emphasis of GO is to piggyback multiple
230
A. Datta and R. Sharma
messages (if possible) in order to reduce the overall bandwidth consumption of the system. Some of the ideas from these works [16,20] may be applicable as optimizations in GoDisco.
3 Approach
We first summarize various concepts which are an integral part of the proposed solution, followed by a detailed description of the GoDisco algorithm.

3.1 Concepts
System Model: Let G(N, E) be a graph, where N represents the nodes (or users) of the network and E represents the set of edges between nodes. An edge e_{ij} ∈ E exists in the network if nodes n_i and n_j know each other. We define the neighborhood of node n_i as the set ℵ_i, where n_j ∈ ℵ_i iff e_{ij} ∈ E.

Community and Information Agent: Let I be the set of interests, representing a collection of all the different interests of all the nodes in the network. We assume each node has at least one interest, and I_{n_j} represents the collection of all the different interests of node n_j. We consider a community to be a set of people with some shared interest. The social links between these people define the connectivity graph of the community. This graph may be connected, or it may comprise several connected components. Such a graph, representing a specific community, is a subgraph G′(N′, E′), where N′ ⊆ N and E′ ⊆ E, such that ∃x ∈ I, ∀n_j ∈ N′: x ∈ I_{n_j}. According to this definition, a node can (and, real data sets show, often does) belong to different communities at the same instance if it has varied interests.

Since the subgraph G′(N′, E′) may not be connected, it may be necessary to rely on some nodes from outside the community to pass messages between these mutually isolated subsets. We however need to minimize messages to such nodes outside the community. To that end, we should try to find potential forwarders who can help in forwarding the message to such isolated but relevant nodes. We identify suitable forwarders based on several factors: (i) interest similarity, (ii) the node's history as a forwarder, (iii) the degree of the node, and (iv) the activeness of the node. These forwarders, which we call information agents (IAs), help spread a message faster, even if they are not personally interested in the message.

History as Forwarder: This parameter measures how good a forwarder a node is. The rationale behind this approach is the principle of reciprocity from social sciences. Reciprocity is also used in JetStream [16], and we reuse the same approach.

Activeness of a Node: An active node in the network can play a crucial role in the quick dissemination of information, making it a good potential information agent. One way to measure user activeness is in terms of the frequency of visits to an online social network, which can be inferred from any kind of activity the user does - e.g., posting messages, updating status, etc.
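The definitions above translate directly into code. The following Python sketch is illustrative only - the names SocialGraph and community_subgraph are assumptions, not from the paper - and extracts the possibly disconnected community subgraph for a given interest:

# A minimal sketch of the system model: an undirected social graph with
# per-node interest sets.
class SocialGraph:
    def __init__(self):
        self.adj = {}        # node -> set of neighbors (the set aleph_i)
        self.interests = {}  # node -> set of interests (the set I_n)

    def add_edge(self, u, v):
        self.adj.setdefault(u, set()).add(v)
        self.adj.setdefault(v, set()).add(u)

def community_subgraph(g, x):
    """Return the subgraph G'(N', E') induced by the nodes with interest x.

    As noted above, the result may well be disconnected.
    """
    nodes = {n for n, ints in g.interests.items() if x in ints}
    edges = {(u, v) for u in nodes for v in g.adj.get(u, set())
             if v in nodes and u < v}  # u < v lists each undirected edge once
    return nodes, edges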
Duplication Avoidance: We use the concept of social triads [17] to avoid message duplication. In a social network, a node can typically be aware of all the neighbors of each of its neighboring nodes. This knowledge can be exploited to reduce duplication.

Interest Classification and Node Categorization: Interests can typically be classified using a hierarchical taxonomy. For simplicity, as well as to avoid too sparse and poorly connected communities, we restrict it to two levels, which we name (i) Main Category (MC) and (ii) Subcategory (SC). For example, web mining and data mining are closer to machine learning than to communication networks. So data mining, web mining, and machine learning would belong within one main category, while network-related topics would belong to a different main category. Within a main category, the similarity of interest among different topics can again vary by degree; for example, web mining and data mining are relatively similar as compared to machine learning. So at the next level of categorization we put data mining and web mining under one subcategory and machine learning under another subcategory. To summarize, two different main categories are more distant than two different subcategories within the same main category.

Information Agent Classification: Based on interest similarity, we categorize different levels of IAs with respect to their tolerance for spam messages. If a message's interest falls under SC_{11} of MC_1, we classify the levels of IAs in the following manner:

Level 1: Nodes having an interest in a different subcategory under the same main category (e.g., a node having an interest in MC_1 SC_{12}). Ideally such nodes should not be considered spammed nodes, as they have highly similar interests and might well be interested in this kind of message.

Level 2: For nodes having a good history as forwarders, an irrelevant message can be considered spam. However, they have a high tolerance for such messages - that is why they have a good forwarding history. These nodes will typically be cooperating with others based on the principle of reciprocity.

Level 3: Highly active nodes with no common interest can play an important role in quick dissemination, but their high level of activity does not imply that they would be tolerant of spam, and such information agents should be chosen with the lowest priority.

While selecting IAs from non-similar communities, as many nodes as possible should be chosen from Level 1, since they are more likely to be able to connect back to other interested users, and then from Levels 2 and 3 (see the sketch below).
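To make the three-level classification concrete, the following illustrative sketch assigns a candidate neighbor to an IA level from the message's main category and the control-phase statistics; the function name and the two thresholds are assumptions, not values prescribed by the paper:

def ia_level(msg_mc, node_mcs, history, activeness,
             good_history=0.5, highly_active=0.5):
    """Classify a non-relevant neighbor into an IA level (1, 2, 3, or None).

    msg_mc: main category of the message's interest (e.g., 'MC1').
    node_mcs: set of main categories covered by the node's interests.
    history, activeness: normalized [0, 1] values from the control phase.
    """
    if msg_mc in node_mcs:
        return 1  # same main category, different subcategory: near-relevant
    if history >= good_history:
        return 2  # good forwarder; tolerant to spam by reciprocity
    if activeness >= highly_active:
        return 3  # highly active but least spam-tolerant: lowest priority
    return None   # not a suitable information agent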
3.2 GoDisco
GoDisco consists of two logically independent modules: the control phase runs in the background to determine system parameters and self-tune the system, while the dissemination phase carries out the actual dissemination of the messages.
1. Control Phase. Each node regularly transmits its interest and degree information to its neighbors. Each node also monitors (infers) its neighbors' activeness and forwarding behavior. The latter two parameters are updated after each dissemination phase. Neighboring nodes that spread/forward a message further to more neighbors are rewarded in the future (based on the reciprocity principle). Also, nodes that are more active are considered better IAs (potential forwarders) than less active nodes. Every node maintains a tuple ⟨h, d, a⟩ for each of its neighbors, reflecting their history as forwarder, degree, and activeness, respectively. During the dissemination phase, non-relevant nodes are ranked according to a weighted sum hα + dβ + aγ to determine potential information agents (IAs), where α, β, γ are parameters that set the priority of the three variables such that α + β + γ = 1.

2. Dissemination Phase. We assume that the originator of the message provides the necessary meta-information - a vector of interest categories that the message fits (msgprofile), as well as a tuple of dissemination parameters (their use is explained below). The message payload is thus: <message, msgprofile, parameters, dampingflag>. dampingflag is a flag used to dampen the message propagation when no more relevant nodes are found. Time-to-live could also be used instead.

Algorithm 1 illustrates the logic behind sending the message to relevant nodes and collecting non-relevant nodes, ranking their suitability, and using them as information agents (IAs) for disseminating a message based on multiple criteria. We have adopted an optimistic approach for forwarding a message to neighboring nodes. A node forwards a message to all its neighbors with at least some matching interest in the message. For example, a node with interests in data mining and peer-to-peer systems will get a message intended for people with interests in data mining or web mining. Of the nodes that are not directly relevant, we determine the nodes sharing main categories between the message's and the users' profiles (using MC(.)), helping identify Level 1 IAs, and forward the message to these Level 1 IAs probabilistically with probability p_1, which is a free parameter. For Level 1 IAs, the message, though possibly not directly relevant, is still somewhat relevant, and the "perception of spam" is expected to be weaker.

If all the neighbors of a node are totally non-relevant, then the dissemination stops, since such a node is at the boundary of the community. However, the existence of some isolated islands, or even a large community with the same interests, is possible. To alleviate this, some boundary nodes can probabilistically send random walkers. We implement such random walkers with preference to neighbors with a higher score (hα + dβ + aγ) and limit the random walks with a timeout (in the experiments we chose a time-out equal to the network's diameter). If a random walker revisits a specific node, then it is forwarded to a different neighbor than in the past, with a ceiling on the number of times such revisits are forwarded (in the experiments, this was set to two).
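The control-phase ranking can be captured in a few lines of Python; this is an illustrative sketch, assuming h, d, and a are already normalized to [0, 1], with the paper's default parameter values:

def rank_information_agents(candidates, alpha=0.5, beta=0.3, gamma=0.2):
    """Rank non-relevant neighbors by score = h*alpha + d*beta + a*gamma.

    candidates: dict node -> (h, d, a), the per-neighbor tuple maintained
    by the control phase.
    """
    assert abs(alpha + beta + gamma - 1.0) < 1e-9  # alpha + beta + gamma = 1
    def score(n):
        h, d, a = candidates[n]
        return h * alpha + d * beta + a * gamma
    return sorted(candidates, key=score, reverse=True)

# The top X% of this ranking (X = 10% by default) become the IAs.
ranked = rank_information_agents({'u': (0.9, 0.2, 0.4), 'v': (0.1, 0.8, 0.9)})
top_x = ranked[:max(1, int(len(ranked) * 0.10))]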
4 Evaluation
We evaluate GoDisco-based information dissemination on both real and synthetically generated social networks. User behavior vis-à-vis the degree of activeness and history of forwarding was assigned uniformly at random from some arbitrary scale (and normalized to a value between 0 and 1).

1. Synthetic graphs. We used a preferential-attachment-based Barabási graph generator (http://www.cs.ucr.edu/~ddreier/barabasi.html) to generate a synthetic network of ∼20,000 nodes and a diameter of 7 to 8 (calculated using ANF [4]).
Algorithm 1. GoDisco: Actions of node n_i (which has a relevant profile, i.e., msgprofile ∩ I_{n_i} ≠ ∅) with neighbors ℵ_i upon receiving message payload <message, msgprofile, parameters, dampingflag> from node n_k
1: for ∀n_j s.t. n_j ∈ ℵ_i ∧ n_j ∉ ℵ_k do
2:   if msgprofile ∩ I_{n_j} ≠ ∅ then
3:     DelvMsgTo(n_j); {Forward message to neighbors with some matching interest in the message}
4:   else
5:     if MC(msgprofile) ∩ MC(I_{n_j}) ≠ ∅ then
6:       With probability p_1, DelvMsgTo(n_j); {Message is forwarded with probability p_1 to Level 1 IAs.}
7:     else
8:       NonRelv ← n_j; {Append to a list of nodes with no (apparent) common interest}
9:     end if
10:   end if
11: end for
12: if |NonRelv| == |ℵ_i| then
13:   if dampingflag == TRUE then
14:     Break; {Stop and do not execute any of the steps below. Note: In the experimental evaluation, the "no damping" scenario corresponds to not having this IF statement/not bothering about the damping flag at all.}
15:   else
16:     dampingflag := TRUE; {Set the damping flag}
17:     Sort NonRelv in descending order of score hα + dβ + aγ; {⟨h, d, a⟩ obtained from the control phase; ⟨α, β, γ⟩ from parameters set by the message originator.}
18:     IANodes := top X% of the sorted nodes;
19:     for ∀n_j ∈ IANodes do
20:       DelvMsgTo(n_j); {Carry out control phase tasks in the background}
21:     end for
22:   end if
23: else
24:   dampingflag := FALSE {Reset the damping flag}
25: end if
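For concreteness, the following is an executable Python rendering of Algorithm 1, reusing the SocialGraph sketch from Section 3.1. It is an interpretation rather than a reference implementation: the helpers deliver and mc are stubs with assumed semantics, the random-walker extension is omitted, and the boundary test compares against the neighbors actually considered after triad filtering:

import random

P1, TOP_X = 0.5, 0.10  # the paper's default values of p1 and X

def deliver(nj, msg, msgprofile, dampingflag):
    # Stub: a real deployment would transmit the payload to n_j here.
    print("deliver to", nj, "(dampingflag =", dampingflag, ")")

def mc(interests):
    # Assumed encoding of the two-level taxonomy: 'MC1.SC11' -> 'MC1'.
    return {i.split('.')[0] for i in interests}

def on_receive(g, node, sender, msg, msgprofile, dampingflag, score):
    # Actions of a relevant node n_i on receiving the payload from n_k.
    non_relv, considered = [], 0
    # Social-triad duplication avoidance: skip the sender and its neighbors.
    for nj in g.adj[node] - g.adj.get(sender, set()) - {sender}:
        considered += 1
        if g.interests[nj] & msgprofile:                  # lines 2-3
            deliver(nj, msg, msgprofile, dampingflag)
        elif mc(g.interests[nj]) & mc(msgprofile):        # lines 5-6
            if random.random() < P1:
                deliver(nj, msg, msgprofile, dampingflag)
        else:                                             # line 8
            non_relv.append(nj)
    if non_relv and len(non_relv) == considered:          # boundary node
        if dampingflag:                                   # lines 13-14: stop
            return
        non_relv.sort(key=score, reverse=True)            # line 17
        for nj in non_relv[:max(1, int(len(non_relv) * TOP_X))]:
            deliver(nj, msg, msgprofile, dampingflag=True)  # lines 16-21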
Random cliques: People often share some common interests with their immediate friends and form cliques, but do not form a contiguous community. For example, all soccer fans in a city do not necessarily form one community; instead, smaller bunches of buddies pursue the interest together. To emulate such behavior, we pick random nodes together with a relatively small number of their neighbors (between 50 and 200) and assign these cliques some common interest.

Associativity-based interest assignment: People often form a contiguous subgraph within the population where all (or most) members of the subgraph share some common interest, and form a community. This is particularly true in professional social networks. To emulate such a scenario, we randomly picked a node (the center node), applied a restricted breadth-first algorithm covering ∼1000 nodes, and assigned the interests of these nodes to be similar to that of the center node, fading the associativity with distance from the center node.

Totally random: Interests of the nodes were assigned uniformly at random.

2. Real network - DBLP network: We use the giant connected component of the co-authorship graph from DBLP records of papers published in 1,169 unique conferences between 2004 and 2008, which comprises 284,528 unique authors. We classified the categories of the conferences (e.g., data mining, distributed systems, etc.) to determine the interest profiles of the authors.
4.1 Results
In this section we describe the various metrics we observed in our experiments. The default parameter values were p_1 = 0.5, X = 10%, α = 0.50, β = 0.30, and γ = 0.20. Other choices provided similar qualitative results. We also expose results from a limited exploration of the parameters in a brief discussion later (see Figure 3).

1. Message dissemination: Figure 1 shows the spread of dissemination over time in the various networks. The plots compare three different mechanisms of propagation - (i) with damping (D), (ii) without damping (ND), and (iii) with the additional use of random walkers (RW) in the case with damping - and plot the number of relevant nodes (R) and total nodes, including non-relevant nodes (T), who receive the message. With the use of the damping mechanism, the number of non-relevant nodes dropped sharply, but with only a small loss in coverage of relevant nodes. This shows the effectiveness of the damping mechanism in reducing spam. Using random walkers provides better coverage of relevant nodes, while only marginally more non-relevant nodes receive the message. This effect is most pronounced in the case of the DBLP graph.

The associativity-based synthetic graph best resembles real networks. A gossiping-based mechanism is also expected to work better in communities which are not heavily fragmented. So, due to space constraints, we will mostly confine our results to the associativity-based and DBLP graphs, even though we used all the networks for all the experiments described next. If qualitatively different results are observed for the other graphs, we mention them as and when necessary.

2. Recall: Recall is the ratio of the number of relevant nodes who get a message to the total number of relevant nodes in the graph. We compare the recall for the damping (D) vs. non-damping (ND) mechanisms, shown in Figures 2(a) and 2(b)
Fig. 1. Message dissemination (a) Random Cliques (Rand. Cliq.) (b) Total Random (Tot. Rand.) (c) Associativity (Asst.) (d) DBLP
Fig. 2. Recall (R) & Precision (P) for DBLP and Associativity (Asst)
for the associativity-based and DBLP networks, respectively. The use of the damping mechanism leads to a slight decrease in recall. In the associativity-based interest network, the recall value reaches very close to 1. Even in random cliques, the recall value reached very close to one, though relatively more slowly than in the associativity-based network. In the totally random assignment of individuals' interests, however, a recall value of only up to 0.9 could be achieved (non-damping), while random walkers provide relatively more improvement (recall of around 0.8) than the case of using damping but no random walkers (recall of roughly 0.7), demonstrating the limitations of a gossip-based approach if the audience is highly fragmented, as well as the effectiveness of random walkers in reaching some of these isolated groups at a low overhead. In the case of the DBLP network (Figure 2(b)), the recall value is reasonably good, given that the DBLP graph comprises an order of magnitude more total nodes (around fourteen times more) than the synthetic graphs. Consequently, the absolute numbers observed in the DBLP graph cannot directly be compared to the results observed in the synthetic graphs. The use of random walkers provides significant improvements in the dissemination process.

3. Precision: Precision is the ratio of the number of relevant nodes who get a message to the total number of nodes (relevant plus irrelevant) who get the message. Figures 2(c) and 2(d) show the precision values measured for the associativity-based and DBLP networks, respectively. We notice that in the DBLP network, with real semantic information, the precision is in fact relatively better than what is achieved in the synthetic associativity-based network. From
Fig. 3. α and γ comparison on the random-clique-based network ((a) message dissemination, (b) precision & recall), and effect of feedback on (c) recall (R) and (d) precision (P) for DBLP
this we infer that in real networks, the associated semantic information in fact enables a more targeted dissemination. We also observe the usual trade-off with the achieved recall.

4. Parameter exploration: In the associativity-based network, because of a tightly coupled contiguous community, the quality of dissemination is not very sensitive to the parameter choice, while in scattered networks like random cliques it is. To evaluate the effect of γ in the network, we perform experiments on the random-cliques-based network with γ = 0.60, β = 0.30, and α = 0.10, and compare these with the scenarios with the default values of α = 0.50, β = 0.30, and γ = 0.20. A greater value of γ puts a greater emphasis on highly active users as potential IAs, who can help improve the speed of dissemination, but at the cost of spamming more uninterested users. The results shown in Figures 3(a) and 3(b) confirm this intuition. Figure 3(a) shows the total nodes (T) and relevant nodes (R) receiving a message. Figure 3(b) compares the recall and precision for the two choices of parameters.

5. Message duplication: Nodes may receive duplicates of the same message. We use proximity to reduce such duplicates. Figures 4(a) and 4(c) show, for the associativity-based and DBLP networks respectively, the number of duplicates avoided during the dissemination process, both for nodes for whom the message is relevant (Relv) and for nodes for whom it is irrelevant (IrRelv). It is interesting to note that with damping, the absolute number of irrelevant nodes getting the message is already low, so the savings in duplication are also naturally low. The results show that the use of proximity is an effective mechanism to significantly reduce such unnecessary network traffic.

6. Volume of duplicates: Figures 4(b) and 4(d) measure the volume of duplicates received by individual nodes for the associativity-based and DBLP networks, respectively, during the dissemination process. The observed trade-offs of using random walkers with damping, or of not using damping, are intuitive.

7. Effect of feedback: Figure 3(c) shows the effect of feedback on recall for the DBLP network. In the case of the associativity-based network, community members are very tightly coupled, with few exceptions, so there is not much visible improvement with feedback (not shown). However, in the case of DBLP, for the non-damping
Fig. 4. Duplication saved using proximity (DS) & Duplicates received (DR) for DBLP and Associativity (Asst)
as well as the random-walk-based schemes, where non-relevant nodes are leveraged for disseminating the information, we observe a significant improvement in recall when comparing the spread of the first message in the system with that of the thousandth message (identified with an 'F' in the legends to indicate the scenarios with feedback), as the feedback mechanism helps the individual nodes self-organize and choose better information agents over time, which accelerates the process of connecting fragmented communities. Interestingly, this improvement in recall does not compromise the precision, which in fact even improves slightly, further confirming that the feedback steers the dissemination process toward more relevant information agents.
5 Conclusion and Future Work
We have described and evaluated a selective-gossiping-based information dissemination mechanism (namely GoDisco), which is constrained to communication only among nodes socially connected to each other, and which leverages users' interests and the fact that users with similar interests often form a community, in order to perform directed dissemination of information. GoDisco is, nevertheless, a first work of its kind, leveraging interest communities for directed information dissemination using exclusively social links. We have a multidirectional plan for future extensions of this work. We plan to apply GoDisco in various kinds of applications, including information dissemination in peer-to-peer online social networks [3] - for example, for probabilistic publish/subscribe systems, or to contextually advertise products to other users of a social network. We also plan to improve the existing schemes in various ways, such as by incorporating security mechanisms and disincentives for antisocial behavior.
References

1. Apolloni, A., Channakeshava, K., Durbeck, L., Khan, M., Kuhlman, C., Lewis, B., Swarup, S.: A study of information diffusion over a realistic social network model. In: Int. Conf. on Computational Science and Engineering (2009)
2. Birman, K.P., Hayden, M., Ozkasap, O., Xiao, Z., Budiu, M., Minsky, Y.: Bimodal multicast. ACM Trans. Comput. Syst. 17(2) (1999)
3. Buchegger, S., Schiöberg, D., Vu, L.H., Datta, A.: PeerSoN: P2P social networking - early experiences and insights. In: Proc. of the 2nd ACM Workshop on Social Network Systems (2009)
4. Palmer, C.R., Gibbons, P.B., Faloutsos, C.: ANF: A fast and scalable tool for data mining in massive graphs. In: KDD (2002)
5. Cha, M., Mislove, A., Gummadi, K.P.: A measurement-driven analysis of information propagation in the Flickr social network. In: WWW (2009)
6. Cholvi, V., Felber, P., Biersack, E.W.: Efficient search in unstructured peer-to-peer networks. In: SPAA (2004)
7. Costa, P., Mascolo, C., Musolesi, M., Picco, G.P.: Socially-aware routing for publish-subscribe in delay-tolerant mobile ad hoc networks. IEEE Journal on Selected Areas in Communications 26(5) (June 2008)
8. Datta, A., Quarteroni, S., Aberer, K.: Autonomous gossiping: A self-organizing epidemic algorithm for selective information dissemination in wireless mobile ad-hoc networks. Semantics of a Networked World (2004)
9. Demers, A., Greene, D., Hauser, C., Irish, W., Larson, J., Shenker, S., Sturgis, H., Swinehart, D., Terry, D.: Epidemic algorithms for replicated database maintenance. In: PODC. ACM, New York (1987)
10. Eugster, P.T., Felber, P.A., Guerraoui, R., Kermarrec, A.-M.: The many faces of publish/subscribe. ACM Comput. Surv. 35(2) (2003)
11. Gruhl, D., Guha, R., Liben-Nowell, D., Tomkins, A.: Information diffusion through blogspace. In: WWW (2004)
12. Hosseini, M., Ahmed, D.T., Shirmohammadi, S., Georganas, N.D.: A survey of application-layer multicast protocols. IEEE Communications Surveys & Tutorials 9(3) (2007)
13. Iamnitchi, A., Ripeanu, M., Santos-Neto, E., Foster, I.: The small world of file sharing. IEEE Transactions on Parallel and Distributed Systems
14. Karp, R., Schindelhauer, C., Shenker, S., Vöcking, B.: Randomized rumor spreading. In: FOCS. IEEE Computer Society, Los Alamitos (2000)
15. Kumar, R., Novak, J., Raghavan, P., Tomkins, A.: On the bursty evolution of blogspace. World Wide Web 8(2) (2005)
16. Patel, J.A., Gupta, I., Contractor, N.: JetStream: Achieving predictable gossip dissemination by leveraging social network principles. In: Proceedings of the Fifth IEEE Int. Symposium on Network Computing and Applications (2006)
17. Rapoport, A.: Spread of information through a population with socio-structural bias: I. Assumption of transitivity. Bulletin of Mathematical Biophysics 15 (1953)
18. Shah, D.: Gossip algorithms. Found. Trends Netw. 3(1) (2009)
19. Song, X., Lin, C.-Y., Tseng, B.L., Sun, M.-T.: Modeling and predicting personal information dissemination behavior. In: KDD 2005: Proceedings of the Eleventh ACM SIGKDD Int. Conf. on Knowledge Discovery in Data Mining. ACM, New York (2005)
20. Vigfusson, Y., Birman, K., Huang, Q., Nataraj, D.P.: GO: Platform support for gossip applications. In: P2P (2009)
Mining Frequent Subgraphs to Extract Communication Patterns in Data-Centres

Maitreya Natu¹, Vaishali Sadaphal¹, Sangameshwar Patil¹, and Ankit Mehrotra²
¹ Tata Research Development and Design Centre, Pune, India
² SAS R & D India Pvt Limited, Pune, India
{maitreya.natu,vaishali.sadaphal,sangameshwar.patil}@tcs.com, [email protected]

Abstract. In this paper, we propose to use graph-mining techniques to understand the communication patterns within a data-centre. We present techniques to identify frequently occurring subgraphs within a temporal sequence of communication graphs. We argue that identification of such frequently occurring subgraphs can provide many useful insights into the functioning of the system. We demonstrate how existing frequent-subgraph discovery algorithms can be modified for the domain of communication graphs in order to provide computationally lightweight and accurate solutions. We present two algorithms for extracting frequent communication subgraphs and a detailed experimental evaluation to prove the correctness and efficiency of the proposed algorithms.
1 Introduction
With the increasing scale and complexity of today's data-centres, it is becoming more and more difficult to analyze the as-is state of the system. Data-centre operators observe a dire need for insights into the as-is state of the system, such as inter-component dependencies, heavily used resources, heavily used communication patterns, occurrence of changes, etc. Modern enterprises support two broad types of workloads and applications: transactional and batch. The communication patterns in these systems are dynamic and keep changing over time. For instance, in a batch system, the set of jobs executed on a day depends on several factors, including the day of the week, the day of the month, the addition of new reporting requirements, etc. The communication patterns for both transactional and batch systems can be observed as a temporal sequence of communication graphs.

We argue that there is a need to extract and analyze frequently occurring subgraphs derived from a sequence of communication graphs to answer various analysis questions. In scenarios where the discovered frequent subgraph is large in size, it provides a representative graph of the entire system. Such a representative graph becomes extremely useful in scenarios where the communication graphs change dynamically over time. The representative graph can
provide a good insight into the recurring communication pattern of the system. The discovered representative graph can be used in a variety of ways. For instance: (a) it can be used to predict future communication patterns; (b) it can be used to perform what-if analysis; (c) given a representative graph, various time-consuming analysis operations - such as dependency discovery, building workload/resource-utilization models, performing slack analysis, etc. - can be done a priori in an off-line manner to aid quicker online analysis in the future. In scenarios where the discovered frequent subgraph is small in size, such a subgraph can be used to zoom into the heavily used communication patterns. The components in this graph can be identified as critical components and further analyzed for appropriate resource provisioning, load balancing, etc.

In this paper we present techniques to identify frequently occurring subgraphs within a set of communication graphs. The existing graph-mining solutions [5,6,4] for frequent-subgraph discovery address a more complex problem. Most of the existing techniques assume that graph components can have non-unique identifiers. For instance, consider the problem of mining chemical compounds to find recurrent substructures. The presence of non-unique identifiers results in an explosion in the number of possible combinations of subgraphs. Further, operations such as graph-isomorphism checks become highly computation-intensive. The problem of finding frequent subgraphs in a set of communication graphs is simpler, as each graph component can be assigned a unique identifier. In this paper, we argue that simpler graph-mining solutions can be developed for this problem.

We present two techniques for discovering frequent subgraphs. (1) We first present a bottom-up approach where we incrementally take combinations of components and compute their support. The key idea behind the proposed modification is that the support of subgraphs at the next level can be estimated from the support of the subgraphs at the lower levels using probability theory. (2) We then present a top-down approach where, instead of analyzing components in a graph, we consider the entire graph as a whole and analyze the set of graphs. We use simple matrix operations to mine frequent subgraphs. The key idea behind this algorithm is that the property of unique component identifiers can be used to assign a common structure to all graphs. This common structure can be used to process each entire graph as a whole in order to identify frequently occurring subgraphs.

The main contributions of this paper are as follows: (1) We present a novel application of frequent-subgraph discovery to extract communication patterns. (2) We present a modification to the existing bottom-up Apriori algorithm to improve efficiency. We also present a novel top-down approach for frequent-subgraph discovery in communication graphs. (3) We apply the proposed algorithms on a real-world batch system. We also present a comprehensive experimental evaluation of the techniques and discuss the effective application areas of the proposed algorithms.
2 Related Work
Graph-mining techniques have been applied to varied domains, e.g., for extracting common patterns in chemical compounds and genetic formulae, and for extracting common structures in the Internet and social networks [3,2]. In this paper, we consider extracting commonality in communication patterns, given that the communication patterns change dynamically. Our work is different in that we take advantage of the presence of unique node identifiers and propose new algorithms to identify frequent subgraphs. Frequent-subgraph mining techniques proposed in the past [5], [6], [1] use a bottom-up approach. In this paper, we propose a modification to the traditional bottom-up approach and present a novel top-down approach for frequent-subgraph discovery. Most of the past research assumes non-unique identifiers of components and is directed towards solving the issues related to the explosion of the candidate space due to the presence of multiple vertices and edges with the same identifier. When applied to the domain of communication graphs, the problem of mining frequent subgraphs translates to a much simpler problem, due to the fact that no two nodes or links in a communication network have the same identifier. Valiente [7] proposes algorithms for testing graph isomorphism and computing the largest common subgraph in trees or graphs with unique node labels. In this paper, we address the problem of mining frequent subgraphs in a set of graphs with unique node and link identifiers.
3 Problem Description
In this section, we first define various terms used in this paper. We then systematically map the addressed problem to the problem of frequent-subgraph discovery. Finally, we present the problem definition and introduce the proposed solution approaches.
3.1 Terms Used
We first define various terms used in this paper.

Size(G): The size of a graph G is defined as the number of edges present in the graph.

Subgraph(G): A graph G′(V′, E′) is a subgraph of a graph G(V, E) if and only if V′ ⊆ V and E′ ⊆ E.

Support(G′, G): Given a set of n graphs G = {G_1, G_2, ..., G_n}, a subgraph G′ has a support of s/n if G′ is a subgraph of s graphs out of the set {G_1, G_2, ..., G_n}.

Frequent Subgraphs(min support, G): Given a set of graphs G = {G_1, G_2, ..., G_n} and the required minimum support min support, a subgraph G′ is a frequent subgraph if and only if Support(G′, G) ≥ min support. Note that the resulting frequent subgraph can be a disconnected graph.
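Because nodes and edges carry unique identifiers, each graph can be represented simply as a set of edges, and the above definitions translate directly into code. In the illustrative Python sketch below, the example edge sets are hypothetical (Figure 1 itself is not reproduced here):

def support(candidate, graphs):
    """Support(G', G): the fraction s/n of graphs that contain G'.

    candidate: edge set of the subgraph G'; graphs: list of edge sets.
    """
    s = sum(1 for edges in graphs if candidate <= edges)
    return s / len(graphs)

def is_frequent(candidate, graphs, min_support):
    return support(candidate, graphs) >= min_support

# Hypothetical example with min_support = 2/3:
G1 = {('a', 'b'), ('b', 'c'), ('c', 'd')}
G2 = {('a', 'b'), ('b', 'c'), ('b', 'e')}
G3 = {('a', 'b'), ('c', 'e')}
print(is_frequent({('a', 'b'), ('b', 'c')}, [G1, G2, G3], 2/3))  # True: 2/3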
3.2 Problem Definition
We now map the problem of identifying communication patterns to the problem of frequent-subgraph discovery. Consider a network topology T represented by a graph G_T(V_T, E_T), where the vertices V_T represent network nodes and the edges E_T represent network links. Each vertex (and each edge) in the graph can be identified using a unique identifier. The problem of identifying communication patterns in the network from a set of communication traces can be mapped to the problem of frequent-subgraph discovery as follows.

1. A communication trace C consists of the links (and nodes) being used by the system over time.
2. The communication trace C_t, at time t, can be represented as a graph G_t(V_t, E_t), where the sets V_t and E_t represent the network nodes and links being used in time-window t. (Note that V_t ⊆ V_T and E_t ⊆ E_T.)
3. A set of communication traces C_1, C_2, ..., C_n can be represented by a set of graphs G_1(V_1, E_1), G_2(V_2, E_2), ..., G_n(V_n, E_n).
4. The problem of identifying frequent communication patterns in a trace C can be mapped to the problem of identifying frequent subgraphs in the set of graphs G_1(V_1, E_1), G_2(V_2, E_2), ..., G_n(V_n, E_n).

The problem can then be defined as follows: Given a set of graphs G_1, G_2, ..., G_n and a required support min support, compute the set F of all frequently occurring subgraphs F_1, F_2, ..., F_m.
Fig. 1. (a) Example graphs, (b) Frequent subgraphs with minimum support of 2/3
Fig. 2. (a) Execution of the Apriori algorithm, (b) Execution of the Approx-Apriori algorithm
4 Proposed Bottom-Up Approach: Algorithm Approx-Apriori
In the past, the problem of identifying frequent subgraphs has been solved using bottom-up approaches. Apriori [4] is a representative algorithm of this category. The traditional Apriori algorithm broadly involves two steps:

Step 1: Generate subgraphs of size (k + 1) by identifying pairs of frequent subgraphs of size k that can be combined.
Step 2: Compute the support of the subgraphs of size (k + 1) to identify frequent subgraphs.

We use a running example to explain the Apriori algorithm and the other algorithms presented in the next section.

Running example: We present an example consisting of the set of three graphs shown in Figure 1a. We consider min support = 2/3. We show the execution of the traditional Apriori algorithm on this example. Figure 2a shows the subgraphs of sizes 1, 2, 3, and 4 generated by the iterations of the Apriori algorithm. Consider the subgraphs of size 1. The subgraph with the single edge (c − e) has a support less than 2/3 and is discarded from further analysis. The remaining subgraphs are used to build subgraphs of size 2. A similar process is continued in subsequent iterations to identify subgraphs of sizes 3 and 4.

This approach results in two kinds of overheads: (1) joining two size-k frequent subgraphs to generate one size-(k + 1) subgraph, and (2) counting the frequency of these subgraphs. Step (1) involves making all possible combinations of subgraphs of level k, resulting in a large number of candidates and making the approach computation-intensive. The problem becomes more acute when the minimum support required for a subgraph to be identified as a frequent subgraph is smaller.

In this paper, we present a modification of the first step of the Apriori algorithm that prunes the search space by intelligently selecting the size-k subgraphs. In this step, we propose to estimate the support of the generated subgraphs of size k using the support of the constituent size-(k − 1) subgraphs. We use the support of a subgraph of size (k − 1) as the probability of its occurrence in the given set of graphs. Thus, given the probabilities of occurrence of two subgraphs of size (k − 1), their product is used as the probability of occurrence of the combined subgraph of size k. This is used as an estimate of the support of the subgraph of size k. We prune a size-k subgraph if its estimated support is less than the desired support, pruning min support.

Note that the above computation assumes that the occurrences of the two constituent subgraphs are independent. Thus, the estimated support tends to move away from the actual support in situations where the independence assumption does not hold. Furthermore, the error propagates in subsequent iterations as the graphs grow in size, which may result in larger inaccuracy. We hence propose to relax pruning min support with every iteration. pruning min support is equal to min support in the first iteration. We decrease pruning min support by a constant REDUCTION FACTOR in every iteration. The pruning thus performed narrows the search space and decreases the execution time. In Section 7, we show through experimental evaluation that
with appropriate use of the REDUCTION FACTOR, the Approx-Apriori algorithm gives reasonably accurate results. We next present the various steps involved in the proposed approach.

Input:
1. G = {G_1(V_1, E_1), ..., G_n(V_n, E_n)}: set of graphs
2. min support: minimum support required to declare a subgraph as frequent

Output: Set of frequently occurring subgraphs.

Initialization:
1. Generate a set G^{S_1} of graphs of size 1, one for each edge in E, where E = E_1 ∪ ... ∪ E_n.
2. Remove graph G^{S_1}_i from G^{S_1} if Support(G^{S_1}_i, G) < min support.
3. Set k = 2, where k = size of subgraphs; k = k + 1 with every iteration.
4. pruning min support = min support.

Step 1 - Generate subgraphs of size k by identifying pairs of frequent subgraphs of size k − 1 that can be combined: Two subgraphs G^{S_k}_i and G^{S_k}_j of size k are combined to create a subgraph G^{S_{k+1}}_{ij} of size k + 1 if and only if k − 1 edges are common to G^{S_k}_i and G^{S_k}_j; in other words, |E^{S_k}_i − E^{S_k}_j| = 1.
- Estimate the support of the subgraph G^{S_{k+1}}_{ij}:
  Estimated Support(G^{S_{k+1}}_{ij}) = Prob(G^{S_{k+1}}_{ij}), where Prob(G^{S_{k+1}}_{ij}) = Prob(G^{S_k}_i) × Prob(G^{S_k}_j).
- Prune the subgraph G^{S_{k+1}}_{ij} if Estimated Support(G^{S_{k+1}}_{ij}) < pruning min support.
- Decrease pruning min support for the next iteration:
  pruning min support = pruning min support − REDUCTION FACTOR.

Step 2 - Compute the support of the subgraphs of size k to identify frequent subgraphs: Repeat Step 1 and Step 2 until subgraphs of all sizes are explored.

Running example: For the running example of Figure 1a, Figure 2b shows the subgraphs, with estimated supports, of sizes 1, 2, 3, and 4 generated by the iterations of the Approx-Apriori algorithm. Unlike Apriori, Approx-Apriori performs an intelligent selection of pairs using the estimated support. If the estimated support of a size-k subgraph is less than 2/3, then the constituent pair of size-(k − 1) subgraphs are never combined. For example, consider the pair of size-3 subgraphs (a − b)(b − c)(c − d) and (a − b)(b − c)(b − e), both having a support of 2/3. The estimated support of the resulting size-4 subgraph is 4/9, which is less than 2/3. Hence the size-4 subgraph (a − b)(b − c)(c − d)(b − e) is not constructed for analysis.
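The pruning rule reduces to a few lines of code. The illustrative Python sketch below makes the independence assumption explicit and reproduces the running example's numbers; the REDUCTION_FACTOR value shown is hypothetical (the experiments in Section 7 default it to 0):

REDUCTION_FACTOR = 0.05  # hypothetical; the evaluation's default is 0

def estimated_support(sup_i, sup_j):
    """Estimate the support of the size-(k+1) subgraph built from two
    size-k subgraphs, treating their supports as independent occurrence
    probabilities (the paper's independence assumption)."""
    return sup_i * sup_j

def prune(sup_i, sup_j, pruning_min_support):
    return estimated_support(sup_i, sup_j) < pruning_min_support

# Running example: two size-3 subgraphs, each with support 2/3.
# Estimated support of their size-4 combination is 4/9 < 2/3 -> pruned.
print(prune(2/3, 2/3, 2/3))  # True

# Per iteration, the threshold is relaxed:
# pruning_min_support -= REDUCTION_FACTOR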
5 Proposed Top-Down Approach: Matrix-ANDing
While building the top-down approach, we use the entire graph as a whole and exploit the following two properties of communication graphs: (1) each node in a network can be identified with a unique identifier; (2) the network topology is known a priori. These properties can be exploited to assign a common structure to all graphs. This common structure can be used to process entire graphs as wholes in order to identify frequently occurring subgraphs. In the following algorithm, we first explain this structure and show how all graphs can be represented in a similar manner. We then present a technique to process these structures to extract frequent subgraphs.

Fig. 3. (a) Matrix representation of graphs X, Y, and Z: M_X, M_Y, M_Z. (b) Consolidated matrix M_c.

Input: (1) G = {G_1(V_1, E_1), ..., G_n(V_n, E_n)}: set of graphs. (2) min support: minimum support required to declare a subgraph as frequent.

Output: Set of frequently occurring subgraphs.

Initialization: (1) We first identify the maximal set of vertices V_T (the union of the vertex sets of all the graphs) and order these vertices lexicographically. (2) We then represent each graph G_t ∈ G as a |V_T| × |V_T| matrix M_t using the standard binary adjacency-matrix representation. Note that the nodes in all matrices are in the same order; thus, a cell M_t[i, j] represents the same edge E_{ij} in all the matrices.

Processing:
1. Assign a unique prime number p_t as the identifier of each graph G_t ∈ G and multiply all the values in its representative matrix M_t by p_t.
2. Given the matrices M_1, ..., M_n, compute a consolidated matrix M_c such that, for all i, j ∈ V_T, M_c[i, j] is the product of all nonzero entries M_t[i, j], t = 1, ..., n.
3. Given the set of n graphs G = {G_1, ..., G_n} and min support, compute the C(n, n · min support) combinations of the graphs. Compute an identifier for each combination as the product of the identifiers of its constituent graphs. Thus, the identifier for a combination of graphs G_1, ..., G_k is computed as p_1 ∗ ... ∗ p_k. The set G_comb consists of the identifier of each of the C(n, n · min support) combinations.
   - Given the identifier G_comb[m] of a combination of graphs, we can determine whether an edge represented by M_c[i, j] is present in all the graphs of that combination. By the property of prime numbers, this can be done by simply checking whether M_c[i, j] is divisible by G_comb[m].
4. For each identifier G_comb[m], identify the cells M_c[i, j] in the consolidated matrix M_c that are divisible by G_comb[m]. The edges identified by these cells represent a frequently occurring subgraph with a support greater than or equal to min support.
Note that each element in the consolidated matrix holds a product of prime numbers. Even if only a small number of graphs in the database contain the same edge, the element corresponding to that edge in the consolidated matrix can take a very large value, resulting in overflow. We propose to use compression techniques to avoid such scenarios; bit vectors can also be used in such cases.

Running example: We assign the prime numbers 3, 5, and 7 to the graphs X, Y, and Z from Figure 1a. As the maximal set of nodes in these graphs is {a, b, c, d, e, f, g}, we represent all graphs in the form of a 7 × 7 matrix. Figure 3a shows the matrix representations M_X, M_Y, and M_Z of graphs X, Y, and Z, where the matrix of each graph has been multiplied by its identifier prime number. The consolidated matrix built from these matrices is shown in Figure 3b. From the given set of 3 graphs X, Y, and Z, in order to identify frequent subgraphs with min support = 2/3, we compute C(3, 2) = 3 combinations, viz., (X, Y), (X, Z), and (Y, Z). The identifiers for the combinations (X, Y), (X, Z), and (Y, Z) are computed as 3 ∗ 5 = 15, 3 ∗ 7 = 21, and 5 ∗ 7 = 35, respectively. With the identifier of (X, Y) equal to 15, the cells in M_c that are divisible by 15 indicate the edges that are present in both graphs X and Y. This in turn provides a frequent subgraph with a minimum support of 2 (out of 3, i.e., 2/3). Similarly, the other frequent subgraphs can be identified by checking the cells for divisibility by 21 and 35.
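To make the prime-number encoding concrete, the following is a small self-contained Python sketch. The consolidated matrix is stored sparsely as a dictionary of edge products; the example edge sets are hypothetical, chosen only to be consistent with the facts stated in the running example (the actual Figure 1 graphs may differ):

from itertools import combinations
from math import prod

def matrix_anding(graphs, primes, min_support):
    """graphs: list of edge sets; primes: one distinct prime per graph.

    Returns a dict mapping each combination of graph indices to the
    frequent edge set shared by all graphs in that combination.
    """
    n = len(graphs)
    k = round(n * min_support)  # graphs per combination; C(n, k) in total
    # Consolidated "matrix" as a dict: edge -> product of the primes of
    # the graphs that contain the edge (nonzero cells only).
    cons = {}
    for p, edges in zip(primes, graphs):
        for e in edges:
            cons[e] = cons.get(e, 1) * p
    result = {}
    for combo in combinations(range(n), k):
        ident = prod(primes[i] for i in combo)  # combination identifier
        # Divisibility by ident <=> the edge occurs in every graph of combo.
        result[combo] = {e for e, v in cons.items() if v % ident == 0}
    return result

# Hypothetical graphs X, Y, Z with primes 3, 5, 7 and min_support = 2/3:
X = {('a','b'), ('b','c'), ('c','d'), ('b','e')}
Y = {('a','b'), ('b','c'), ('b','e'), ('e','f')}
Z = {('a','b'), ('b','c'), ('c','d'), ('c','e')}
print(matrix_anding([X, Y, Z], [3, 5, 7], 2/3))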
Fig. 4. (a,b) Sample snapshots of the real-life mainframe batch processing system. (c) Identified frequent subgraph.
6 Application of Proposed Technique on a Real-Life Example
In this section, we apply the proposed technique on a real-life mainframe batch-processing system. This system is used at a leading financial service provider for a variety of end-of-day trade-result processing. We have 6 months of data about the per-day graph of dependences among these jobs. Over the period of 6 months, we observed a total of 516 different jobs and 72,702 different paths.

Analyzing the per-day job-precedence graphs brings out the fact that the jobs and their dependencies change over time. Figures 4(a,b) show two sample snapshots of this changing graph. Figure 4(c) shows one of the frequent subgraphs detected by our algorithm (min support = 0.7). On average, there are about 156 processes and 228 dependence links per graph, whereas the frequently occurring subgraph in the set of these graphs consists of 98 processes and 121 dependence links.
Fig. 5. Execution time for experiments with changing (a) number of nodes, (b) average node degree, (c) level of activity, (d) min support
The frequent subgraphs discovered on this system are found to be large in size, covering more than 65% of the graph at any time instance. This insight is typically true for most back-office batch-processing systems, where a large portion of the communication takes place on a regular basis and few things change seasonally. This communication graph can then be used as a representative graph, and various analyses can be performed a priori on this representative graph in an off-line manner. This off-line analysis on the representative graph can then be used to quickly answer various analysis questions on-line about the entire system.
7 Experimental Evaluation
In this section, we present the experiment design for systematic evaluation of the algorithms proposed in this paper. We simulate systems with temporally changing communication patterns and execute the proposed algorithms to identify frequently occurring communication patterns. We generate various network topologies based on the desired number of nodes and average node degree and model each topology as a graph. For each topology, we generate a set of graphs that represents temporally changing communication patterns. We model the level of change in system-activity, c, by controlling the amount of change in links across the graphs. The default values of various experiment parameters are set as follows: Number of nodes = 10; Average node degree = 5; Change in level of activity = 0.5; Number of graphs = 10; min support = 0.3; REDUCTION FACTOR = 0; Each point plotted on the graphs is an average of the results of 10 runs.
Fig. 6. False negatives for experiments with changing (a) number of nodes, (b) average node degree, (c) min support
7.1 Comparative Study of Different Algorithms to Observe the Effect of Various System Properties
We performed experiments to observe the effect of four system properties, viz., the number of nodes, the average node degree, the level of system activity, and the desired min support. Figure 5 presents the effect on the time taken by the three algorithms. Figure 6 presents the effect on the accuracy of the Approx-Apriori algorithm.

Number of nodes: We evaluate the algorithms on graphs with 10 to 18 nodes in Figure 5a and Figure 6a. It can be seen that the execution time of the Apriori algorithms is more sensitive to the number of nodes than that of the Matrix-ANDing algorithm. The execution time of the Apriori algorithm is significantly larger than that of the Approx-Apriori algorithm. The false negatives of the Approx-Apriori algorithm increase as the size of the graph increases. This is because, with the increase in the size of the graph, the number of candidates increases and the possibility of missing a frequent subgraph increases.

Average node degree: We evaluate the algorithms on graphs with node degrees 3 to 7 in Figure 5b and Figure 6b. The execution time of the Matrix-ANDing algorithm is independent of the average node degree of the network. This is because its execution time mainly depends on the matrix size (number of nodes) and depends very little on the node degree or the number of links. The execution time of the Apriori algorithms increases as the node degree increases, because of the increased number of candidate subgraphs. A larger node degree results in a larger candidate space; as a result, an increase in node degree results in an increase in false negatives.

Change in the level of activity in the system: Figure 5c shows the effect of the amount of change in the level of activity in the system, c, on the execution time of the algorithms. There is no significant effect of the amount of change in the level of activity on the false negatives. The execution time of the Matrix-ANDing algorithm is independent of the level of activity in the system. The execution time of the Apriori algorithms decreases with an increase in the level of activity: as c increases, the number of frequent subgraphs decreases, resulting in a decrease in the execution time of the Apriori algorithms.
Desired min support: Figure 5d shows the effect of min support on the execution time of the algorithms. Figure 6c shows the effect of min support on false negatives. The execution time of the Apriori algorithms decreases with an increase in the min support value: a small min support results in a large number of frequent subgraphs, and hence a large execution time. The Matrix-ANDing algorithm behaves in an interesting manner. The Matrix-ANDing algorithm processes C(n, n · min support) combinations of graphs. Note that this number is maximum when min support = 0.5; as a result, Matrix-ANDing takes the maximum time to execute when min support = 0.5. The false negatives decrease with increasing min support, since the number of frequent subgraphs decreases.

This experiment brings out the strengths and weaknesses of the three algorithms. (1) It is interesting to note that the execution time of the Apriori and Approx-Apriori algorithms is controlled by graph properties such as the number of nodes, the average node degree, and the level of activity, whereas the behaviour of Matrix-ANDing mainly depends on application parameters such as min support. (2) Note that both Apriori and Matrix-ANDing provide optimal solutions. For large values of support, the Apriori algorithm requires a smaller execution time, and for small values of support, Matrix-ANDing requires a smaller execution time. (3) In cases where some false negatives can be tolerated, the Approx-Apriori algorithm performs fastest, with near-optimal solutions for larger values of support.
7.2 Properties of Identified Frequent Subgraphs
We study the effect of different parameters on the properties of the identified frequent subgraphs, viz., the number and size of the subgraphs. The number and size of the frequent subgraphs identified in a given set of graphs provide insight into the dynamic nature of the system and the size of the critical components in the graph.
Fig. 7. Number of frequent subgraphs identified in the network with changing (a) number of nodes, (b) average node degree, (c) level of activity, (d) min support
Figures 7a and 7b show the effect of an increase in the number of nodes and the node degree of the graph, respectively. As the number of nodes or the node degree in the graph increases, the number and size of frequent subgraphs increase.
Figure 7c shows the effect of the minimum support, min support, on the number and size of the identified frequent subgraphs. The number and size of the frequent subgraphs decrease as the minimum support increases. This is because the larger the value of the minimum support, the more stringent the requirements for a subgraph to be declared a frequent subgraph. Figure 7d shows the effect of the amount of change in the level of activity in the system on the number and size of the frequent subgraphs. With an increase in the level of activity of the system, the number and size of frequent subgraphs decrease.
8 Conclusion
In this paper, we propose to use graph-mining techniques to understand the communication patterns within a data-centre. We present techniques to identify frequently occurring subgraphs within a temporal sequence of communication graphs. The main contributions of this paper are as follows: (1) We present a novel application of frequent-subgraph discovery to extract communication patterns. (2) We present a modification to the existing bottom-up Apriori algorithm to improve efficiency, and a novel top-down approach for frequent-subgraph discovery in communication graphs. (3) We apply the proposed algorithms on a real-world batch system, present a comprehensive experimental evaluation of the techniques, and discuss the effective application areas of the proposed algorithms.
References

1. Bernecker, T., Kriegel, H.-P., Renz, M., Verhein, F., Zuefle, A.: Probabilistic frequent itemset mining in uncertain databases. In: KDD (2009)
2. Chen, C., Yan, X., Zhu, F., Han, J.: gApprox: Mining frequent approximate patterns from a massive network. In: Perner, P. (ed.) ICDM 2007. LNCS (LNAI), vol. 4597, pp. 445–450. Springer, Heidelberg (2007)
3. Faloutsos, C., Sun, J.: Incremental pattern discovery on streams, graphs and tensors. Technical report, CMU (2007)
4. Mannila, H., Toivonen, H., Verkamo, A.: Efficient algorithms for discovering association rules. In: AAAI Workshop on KDD (1994)
5. Inokuchi, A., Washio, T., Motoda, H.: Complete mining of frequent patterns from graphs: Mining graph data. Machine Learning 50(3) (2003)
6. Kuramochi, M., Karypis, G.: An efficient algorithm for discovering frequent subgraphs. IEEE Transactions on Knowledge and Data Engineering (2004)
7. Valiente, G.: Efficient Algorithms on Trees and Graphs with Unique Node Labels. In: Studies in Computational Intelligence, vol. 52, pp. 137–149. Springer, Heidelberg (2007)
On the Hardness of Topology Inference

H.B. Acharya¹ and M.G. Gouda²
¹ The University of Texas at Austin, USA
[email protected]
² The National Science Foundation, USA
[email protected]

Abstract. Many systems require information about the topology of networks on the Internet, for purposes like management, efficiency, testing of new protocols, and so on. However, ISPs usually do not share the actual topology maps with outsiders; thus, in order to obtain the topology of a network on the Internet, a system must reconstruct it from publicly observable data. The standard method employs traceroute to obtain paths between nodes; next, a topology is generated such that the observed paths occur in the graph. However, traceroute has the problem that some routers refuse to reveal their addresses, and appear as anonymous nodes in traces. Previous research on the problem of topology inference with anonymous nodes has demonstrated that it is at best NP-complete. In this paper, we improve upon this result. In our previous research, we showed that in the special case where nodes may be anonymous in some traces but not in all traces (so all node identifiers are known), there exist trace sets that are generable from multiple topologies. This paper extends our theory of network tracing to the general case (with strictly anonymous nodes), and shows that the problem of computing the network that generated a trace set, given the trace set, has no general solution. The weak version of the problem, which allows an algorithm to output a "small" set of networks - any one of which is the correct one - is also not solvable. Any algorithm guaranteed to output the correct topology outputs at least an exponential number of networks. Our results are surprisingly robust: they hold even when the network is known to have exactly two anonymous nodes, and every node as well as every edge in the network is guaranteed to occur in some trace. On the basis of this result, we suggest that exact reconstruction of network topology requires more powerful tools than traceroute.
1 Introduction
Knowledge of the topology of a network is important for many design decisions. For example, the architecture of an overlay network - how it allocates addresses, etc. - may be significantly optimized by knowledge of the distribution and connectivity of the nodes on the underlay network that actually carries the traffic. Several important systems, such as P4P [9] and RMTP [7], utilize information about the topology of the underlay network for optimization as well as management. Furthermore, knowledge of network topology is useful in research; for
example, in evaluating the performance of new protocols. Unfortunately, ISPs do not make maps of the true network topology publicly available. Consequently, a considerable amount of research effort has been devoted to the development of systems that reconstruct the topology of networks in the Internet from publicly available data - [10], [6], and [4]. The usual mechanism for generating the topology of a network is by the use of Traceroute [3]. Traceroute is executed on a node, called the source, by specifying the address of a destination node. This execution produces a sequence of identifiers, called a trace, corresponding to the route taken by packets traveling from the source to the destination. A trace set T is generated by repeatedly executing Traceroute over a network N , varying the terminal nodes, i.e. the source and destination. If T contains traces that identify every instance when an edge is incident on a node, it is possible to reconstruct the network exactly. However, practical trace sets do not have this property. The most common problems are incomplete coverage, anonymity (where a node can be detected, but will not state its unique identifier, i.e. its address), and aliasing (nodes may have multiple unique identifiers). The situation is further complicated by load balancing, which may cause incorrect traces; tools such as Paris Traceroute [8] attempt to correct this problem. In this paper, we deal with the problem of inferring the correct network topology in the presence of anonymous nodes. The problem posed by anonymous nodes in a trace is that a given anonymous node may or may not be identical to any other anonymous node. Clearly, a topology in which these nodes are distinct is not identical to one in which they are merged into a single node. Thus, there may be multiple topologies for the computed network. Note that all these candidate topologies can generate the observed trace set; no algorithm can tell, given the trace set as input, which of these topologies is correct. To solve this problem, Yao et al. [10] have suggested computing the minimal topology - the topology of the network with the smallest number of anonymous nodes (subject to some constraints - trace preservation and distance preservation) from which the given trace set is generable. They conclude that the problem of computing a minimal network topology from a given set of traces is NP-complete. Accordingly, most later research in the area, such as [6] and [4], has focused on heuristics for the problem. We attack this problem from a different direction. In our earlier papers [1] and [2], we introduced a theory of network tracing, i.e. reconstruction of network topology from trace sets. In these papers, we made the problem theoretically tractable by assuming that no node is strictly anonymous. In this theory, a node can be irregular, meaning it is anonymous in some traces, but there must exist at least one trace in which it is not anonymous. This simplifying assumption clearly does not hold in practice; in fact, an anonymous node is almost always consistently anonymous, not irregular. (In practical cases, anonymous nodes correspond to routers that do not respond to ping; irregular nodes are routers that drop ping due to excessive load. Clearly, the usual case is for nodes to be
consistently anonymous, rather than irregular.) However, it enabled us to develop a theory for the case when the number of nodes in the network is clearly known (equal to the number of unique identifiers). In this paper, we develop our theory of network tracing for networks with strictly anonymous nodes. Our initial assumption was that, as irregular nodes are "partially" anonymous, the hardness results in [1] should hold for anonymous nodes. To our surprise, this turned out not to be true; in Theorem 3, we show that networks with one anonymous node are completely specified by their trace sets, while networks with one irregular node are not [1]. Consequently, we constructed a completely new proof for network tracing in the presence of strict anonymity, presented in Section 3. We show that, even under the assumption that the minimal topology is correct, the network tracing problem with anonymous nodes is in fact much harder than NP-complete; it is not just intractable, but unsolvable. Even if we weaken the problem and allow an algorithm to return a "small" number of topologies (one of which is correct), the problem remains unsolvable: an algorithm guaranteed to return the correct topology returns a number of topologies that is at least exponential in the total number of nodes (anonymous and non-anonymous). A very surprising fact is that this result holds even if the number of anonymous nodes is restricted to two. We demonstrate how to construct a trace set which is generable from an exponential number of networks with two anonymous nodes, but not generable from any network with one anonymous node or fewer. (It is interesting to note that our results are derived under a network model with multiple strong assumptions - stable and symmetric routing, no aliasing, and complete coverage. The reason we choose such friendly conditions for our model is to demonstrate that the problem cannot be made easier using advanced network tracing techniques, such as Paris Traceroute to detect artifact paths, and inference of missing links [5]. We would like to thank Dr. Stefan Schmid for this observation.) We would like to clarify our claim that the problem of identifying the network from which a trace set was generated, given only the trace set, is unsolvable. Our proof does not involve a reduction to a known uncomputable problem, such as the halting problem. Instead, we demonstrate that there are many minimal networks - an exponential number of them - that could have generated a given trace set; so, given only the trace set, it is impossible to state with certainty that one particular topology (or even one member of a small set of topologies) represents the network from which the trace set was in fact generated. The earlier proof of NP-completeness (by a reduction to graph coloring) provided by Yao et al. holds for constructing a minimal topology, not the minimal topology from which the trace set was generated. It is NP-complete to find a single member of the exponential-sized solution set. Thus, even under the assumption that the true network is minimal in the number of anonymous nodes, trying to reconstruct it is much harder than previously thought. In the next section, we formally define terms such as network, trace, and trace set, so as to be able to develop our mathematical treatment of the problem.
2 Minimal Network Tracing
In this section, we present formal definitions of the terms used in the paper. We also explain our network model and the reasoning underlying our assumptions. Finally, we provide a formal statement of the problem studied.
2.1 Term Definitions
A network N is a connected graph where nodes have unique identifiers. However, a node may or may not be labeled with its unique identifier. If a node is labeled with its unique identifier, it is non-anonymous; otherwise, it is anonymous. Further, non-anonymous nodes are either terminal or non-terminal. (These terms are used below.)
A trace is a sequence of node identifiers. A trace t is said to be generable from a network N iff the following four conditions are satisfied:
1. t represents a simple path in N.
2. The first and last identifiers in t are the unique identifiers of terminal nodes in N.
3. If a non-anonymous node "a" in N appears in t, then it appears as "a".
4. If an anonymous node "∗" in N appears in t, then it appears as "∗i", where i is an integer unique within t, used to distinguish anonymous nodes from each other.
A trace set T is generable from a network N iff the following conditions are satisfied:
1. Every trace in T is generable from N.
2. For every pair of terminal nodes x, y in N, T has at least one trace between x and y.
3. Every edge in N occurs in at least one trace in T.
4. Every node in N occurs in at least one trace in T.
5. T is consistent: for every two distinct nodes x and y, exactly the same nodes must occur between x and y in every trace in T where both x and y occur.
We now discuss why we assume the above conditions. The first condition is obviously necessary. The third and fourth conditions are also clearly necessary, as we are interested in the problem of node anonymity, not incomplete coverage. However, the second and fifth conditions are non-trivial; we explain them as follows. In practice, routing may be inconsistent or asymmetric, so the fifth condition need not hold. Furthermore, it is possible, using tools such as source routing and public traceroute pages, to ensure that a trace set contains traces between every possible pair of terminals, as the second condition requires. As our primary results are negative, we show their robustness by assuming the case that is worst for us: we develop our theory assuming the best possible conditions for the inference algorithm, and prove that the results are still valid.
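To make the fifth condition concrete, the following sketch (our own illustration, not part of the paper's theory) checks consistency of a trace set represented as tuples of identifier strings. Since the identities of anonymous nodes are unknown, the check is necessarily simplified: for each pair of non-anonymous identifiers, the named intermediates are compared as a set and anonymous labels "∗i" only by count. The function name and representation are assumptions.

```python
# Simplified consistency check for condition 5 (hypothetical representation:
# traces are tuples of identifier strings; anonymous nodes appear as "*1", ...).
def is_consistent(traces):
    seen = {}  # unordered pair of named ids -> summary of nodes between them
    for t in traces:
        named = [(i, u) for i, u in enumerate(t) if not u.startswith("*")]
        for a in range(len(named)):
            for b in range(a + 1, len(named)):
                (i, x), (j, y) = named[a], named[b]
                mid = t[i + 1:j]
                # named intermediates compared as a set, anonymous ones by count
                summary = (frozenset(u for u in mid if not u.startswith("*")),
                           sum(u.startswith("*") for u in mid))
                if seen.setdefault(frozenset((x, y)), summary) != summary:
                    return False
    return True

print(is_consistent([("a", "*1", "b1"), ("b1", "*2", "a")]))  # True
```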
In our earlier work, [1] and [2], we developed our theory using another strong condition: no node was anonymous. For a trace set to be generable from a network, we required that the unique identifier of every node in the network appear in at least one trace. However, on further study we learned that routers in a network appear anonymous because they are configured either to never send ICMP responses, or to use the destination addresses of the traceroute packets instead of their real addresses [10]. Thus, if a node is anonymous in a single trace, it is usually anonymous in all traces in a trace set. This fact reduces our earlier study of network tracing to a theoretical exercise, as clearly its assumptions cannot be satisfied. Accordingly, in this paper, we have discarded this condition, and updated our theory of network tracing to include networks with anonymous nodes.
The introduction of strictly anonymous nodes leads to a complication in our theory: we no longer have all unique identifiers, and cannot be sure of the total number of nodes in the network. Hence we will adopt the same approach as Yao et al. in [10] and attempt to reconstruct a topology with the smallest possible number of anonymous nodes. Accordingly, we adopt a new definition. A minimal network N from which trace set T is generable is a network with the following properties:
1. T is generable from N.
2. T is not generable from any network N′ that has fewer nodes than N.
Note that, if there are multiple minimal networks from which a trace set T is generable, then they all have the same number of nodes. Further, as all such networks contain every non-anonymous node seen in T, it follows that all minimal networks from which a trace set T is generable also have the same number of anonymous nodes.
2.2 The Minimal Network Tracing Problem
We can now state a formal definition of the problem studied in this paper. The minimal network tracing problem can be stated as follows: "Design an algorithm that takes as input a trace set T, that is generable from a network, and produces a network N such that T is generable from N and, for any network N′ ≠ N, at least one of the following conditions holds:
1. T is not generable from N′.
2. N′ has more anonymous nodes than N."
The weak minimal network tracing problem can be stated as follows: "Design an algorithm that takes as input a trace set T, that is generable from a network, and produces a small set S = {N1, ..., Nk} of minimal networks such that T is generable from each network in this set and, for any network N′ ∉ S, at least one of the following conditions holds:
1. T is not generable from N′.
2. N′ has more anonymous nodes than any member of S."
The minimal network tracing problem is clearly a special case of the weak minimal network tracing problem, where we consider only singleton sets to be small. In Section 3, we show that the weak minimal network tracing problem is unsolvable in the presence of anonymous nodes, even if we consider only sets of exponential size to be “not small”; of course, this means that the minimal network tracing problem is also unsolvable.
3 The Hardness of Minimal Network Tracing
In this section, we begin by constructing a very simple trace set with only one trace, T0,0 = {(a, ∗1, b1)}, which, of course, corresponds to the network in Figure 1.
Fig. 1. Minimal topology for T0,0
We now define two operations to grow this network, Op1 and Op2. In Op1, we introduce a new non-anonymous node and a new anonymous node; the non-anonymous nodes introduced by Op1 are b-nodes. In Op2, we introduce a non-anonymous node, but may or may not introduce an anonymous node; if we only consider minimal networks, then in Op2 we only introduce non-anonymous nodes. To execute Op1, we introduce a new b-node (say bi) which is connected to a through a new anonymous node ∗i. We will now explain how we ensure that ∗i is a new anonymous node. Note that our assumption of consistent routing ensures that there are no loops in traces. Thus, we can ensure that ∗i is a "new" anonymous node (and not an "old", i.e., previously-seen anonymous node) by showing that it occurs on a trace with every old anonymous node. To achieve this, we add traces from bi to each pre-existing b-node bj. These traces are of the form (bi, ∗ii, a, ∗jj, bj). We then use consistent routing to show that ∗i = ∗ii and ∗j = ∗jj, and (as we intended) ∗i ≠ ∗j. We denote the trace set produced by applying Op1 k times to T0,0 by Tk,0. For example, after one application of Op1 to T0,0, we obtain trace set T1,0:
T1,0 = {(a, ∗1, b1), (a, ∗2, b2), (b1, ∗3, a, ∗4, b2)}
As we assume consistent routing, ∗1 = ∗3 and ∗2 = ∗4. Furthermore, as ∗3 and ∗4 occur in the same trace, ∗1 ≠ ∗2.
Fig. 2. Minimal topology for T1,0
There is exactly one possible network from which this trace set is generable; we present it in Figure 2. We now define operation Op2. In Op2, we introduce a new non-anonymous node (ci). We add traces such that ci is connected to a through an anonymous node, and is directly connected to all b- and c-nodes. We denote the trace set produced by applying Op2 l times to Tk,0 by Tk,l. For example, one application of Op2 to the trace set T1,0 produces trace set T1,1 given below.
T1,1 = {(a, ∗1, b1), (a, ∗2, b2), (b1, ∗3, a, ∗4, b2), (a, ∗5, c1), (b1, c1), (b2, c1)}
From Figure 3 we see that three topologies are possible: (a) ∗5 is a new node, with ∗1 ≠ ∗5 and ∗2 ≠ ∗5; (b) ∗1 = ∗5; (c) ∗2 = ∗5. But network N1,1.1 is not minimal; it has one more anonymous node than the networks N1,1.2 and N1,1.3. Hence, in the future we discard such topologies and only consider the cases where the anonymous nodes introduced by Op2 are "old" (previously-seen) anonymous nodes. A mechanical illustration of both operations appears after Figure 3.
Fig. 3. Topologies for T1,1: (a) Network N1,1.1; (b) Network N1,1.2; (c) Network N1,1.3
We are now in a position to prove the following theorem:
Theorem 1. For every pair of natural numbers (k, l), there exists a trace set Tk,l that is generable from (k + 1)^l minimal networks, and the number of nodes in every such network is 2k + l + 3.
Proof. Consider the following construction. Starting with T0,0, apply Op1 k times successively. This constructs the trace set Tk,0, which has k + 1 distinct anonymous nodes. Finally, apply Op2 l times in succession to get Tk,l. Now, we show that Op2 indeed has the properties claimed. Note that every time Op2 is applied, it introduces an anonymous identifier. This identifier can correspond to a new node or to a previously-seen anonymous node. As we are considering only minimal networks, we know that this is a previously-seen anonymous node. There are k + 1 distinct anonymous nodes, and the newly-introduced identifier can correspond to any one of these. There is no information in the trace set to decide which one to choose. Furthermore, each of these nodes is distinct - it is connected to a different (non-anonymous) b-node. In other words, each choice produces a distinct topology from which the constructed trace set is generable. Hence the number of minimal networks from which the trace set Tk,l is generable is (k + 1)^l. Further, there are 3 nodes to begin with. Every execution of Op1 adds two new nodes (the b-node and the new ∗-node), and every execution of Op2 adds one new node (the c-node). As the total number of nodes in a minimal network is n, we also have n = 3 + 2k + l.
We can see that n grows linearly with k and l. The number of candidate networks from which Tk,l is generable grows as (k + 1)^l. So, for example, if we take k = l = (n − 3)/3, the number of candidate networks is (n/3)^((n/3)−1), which is obviously exponential. In fact, this expression is so strongly exponential that it remains exponential even in the special case where we restrict the number of anonymous nodes to exactly two. Note that, if we execute Op1 exactly once and Op2 l times, then by the formula above the number of minimal networks is 2^l = 2^(n−5), which is O(2^n) - exponential. We have proved the following theorem:
Theorem 2. For any n ≥ 6, there exists a trace set T such that: (a) n is the number of nodes in a minimal network from which T is generable; (b) every such minimal network has exactly two anonymous nodes; (c) the number of such minimal networks is O(2^n).
As an example, Figure 4 shows all 2^3 = 8 possible networks from which the trace set T1,3 is generable. We are now in a position to state our result about the minimal network tracing problem.
Theorem 3. Both the minimal network tracing problem and the weak minimal network tracing problem are unsolvable in general, but solvable for the case where the minimal network N , from which trace set T is generable, has exactly one anonymous node. Proof. Consider any algorithm that can take a trace set and return the correct network. If the algorithm is given as input one of the trace sets shown in Theorems 1 and 2, it must return an exponentially large number of networks in the worst case. (If it does not return all networks from which the trace set is generable, it may fail to return the topology of the actual network from which the trace set was generated.) In other words, no algorithm that always returns a “small” number of networks can be guaranteed to have computed the correct network from the trace set; the weak minimal network tracing problem is unsolvable in general. As the minimal network tracing problem is a stricter version of this problem, it is also unsolvable. The case where the minimal network has only one anonymous node is special. If there is only one anonymous node, there is no need to distinguish between anonymous nodes. We assign it some identifier (say x) that is not the unique identifier of any non-anonymous node, and replace all instances of “∗” by this identifier. Now the problem reduces to finding a network from a trace set with no anonymous (or irregular) nodes, which is of course solvable [1]. As the minimal network tracing problem is solvable in this case, the weak minimal network tracing problem (which is easier) is solvable also.
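The reduction used in the solvable case of Theorem 3 is simple enough to sketch in a few lines. The following is our own illustration (the identifier "x" and the function name are illustrative): every anonymous label is rewritten to one fresh identifier, after which the trace set has no anonymous nodes and the methods of [1] apply.

```python
# Sketch of the one-anonymous-node reduction in Theorem 3: rewrite every
# anonymous label "*i" to a single fresh identifier not used by any
# non-anonymous node, reducing to tracing with no anonymous nodes.
def deanonymize(trace_set, fresh_id="x"):
    return [tuple(fresh_id if u.startswith("*") else u for u in trace)
            for trace in trace_set]

print(deanonymize([("a", "*1", "b1"), ("b1", "*2", "a")]))
# [('a', 'x', 'b1'), ('b1', 'x', 'a')]
```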
4 Unsolvable, or NP-Complete?
In Section 3, we demonstrated the hardness of the minimal network tracing problem in the presence of anonymous nodes, and concluded that both the strict and the weak versions of the problem are unsolvable in general. It is natural to ask how we can claim a problem to be unsolvable, unless we reduce it to the halting problem or some other such uncomputable problem. Also, it seems on first observation that our findings conflict with the earlier results of Yao et al., who had found the problem of minimal topology inference to be NP-complete; an NP-complete problem lies in the intersection of NP-hard and NP, so it lies in NP and is definitely not unsolvable! In this section, we will answer these questions and resolve this apparent conflict. The problem we study is whether it is possible to identify the true network from which a given trace set T was generated in practice - in other words, to find a single network N such that T is generable from N and only from N . As there is not enough information in T to uniquely identify N (because T is generable from many minimal networks), the minimal network tracing problem is not solvable. In fact, even the weak minimal network tracing problem is not solvable, as T only provides enough information for us to identify that N is one member of an exponential-sized set (which is clearly not a small set). Thus, our statement that the problem is not solvable does not depend on proving uncomputability, but on
Fig. 4. Minimal topologies for T1,3 (with two anonymous nodes): networks N1,3.1 through N1,3.8
the fact that no algorithm can identify the correct solution out of a large space of solutions, all of which are equally good.
We now consider how our work relates to the proof of Yao et al. [10]. The resolution of our apparent conflict is that Yao et al. claim NP-completeness for the decision problem TOP-INF-DEC, which asks "Does there exist a network, from which trace set T is generable, and which has at most k anonymous nodes?" This decision problem is equivalent to the problem of demonstrating any one network from which T is generable, with k or fewer anonymous nodes.
Yao et al. implicitly assume that the space of networks, from which a trace set T is generable, is a search space; identifying the smallest network in this space will yield the true network from which T was generated in practice. This is simply not true - the number of minimal networks from which T is generable is at least exponentially large, and as these are all minimal networks we cannot search for an optimum among them (they are all equally good solutions; in fact, they satisfy a stronger equivalence condition than having the same number of nodes - our construction produces networks with the same number of nodes and the same number of edges). Finding one minimal network N from which T is generable does not guarantee that N is actually the network from which T was generated! We say nothing about the difficulty of finding a random minimal network from which a trace set is generable (without regard to whether it is actually the network that generated the trace set). Hence, there is no conflict between our results and the results in [10].
5 Conclusion
In our previous work, we derived a theory of network tracing under the assumption that nodes were not consistently anonymous. As we later learned that this assumption is impossible to satisfy, we updated our theory to include networks with strictly anonymous nodes, which we present in this paper. As the introduction of irregularity - a limited form of anonymity - caused the problem to become hard in our previous study, we had expected that it would be even harder when we introduced strict anonymity. To our great surprise, we found a counterexample. Networks with a single anonymous node are completely specified by their trace sets (Theorem 3), while networks with a single irregular node are not (Figure 1 of [1]). We feel that this example is very interesting, as it disproves the intuition that introducing anonymous nodes should cause more trouble to a network tracing algorithm than introducing irregular (partly anonymous) nodes. In the general case, however, we prove in this paper that both the strict version and the weak versions of the minimal network tracing problem are unsolvable: no algorithm can do better than reporting that the required network is a member of an exponentially large set of networks. This result holds even when the number of anonymous nodes is restricted to two. The question of identifying the particular classes of networks, with the property that any such network can be uniquely identified from any trace set generable from it (even if the network contains anonymous nodes), is an open problem we will attack in future research.
References
1. Acharya, H.B., Gouda, M.G.: A theory of network tracing. In: 11th International Symposium on Stabilization, Safety, and Security of Distributed Systems (November 2009)
2. Acharya, H.B., Gouda, M.G.: The weak network tracing problem. In: International Conference on Distributed Computing and Networking (January 2010)
3. Cheswick, B., Burch, H., Branigan, S.: Mapping and visualizing the internet. In: Proceedings of the USENIX Annual Technical Conference, pp. 1–12. USENIX Association, Berkeley (2000)
4. Gunes, M., Sarac, K.: Resolving anonymous routers in internet topology measurement studies. In: INFOCOM 2008: The 27th Conference on Computer Communications, pp. 1076–1084. IEEE, Los Alamitos (April 2008)
5. Gunes, M.H., Sarac, K.: Inferring subnets in router-level topology collection studies. In: Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement, pp. 203–208. ACM, New York (2007)
6. Jin, X., Yiu, W.-P.K., Chan, S.-H.G., Wang, Y.: Network topology inference based on end-to-end measurements. IEEE Journal on Selected Areas in Communications 24(12), 2182–2195 (2006)
7. Paul, S., Sabnani, K.K., Lin, J.C., Bhattacharyya, S.: Reliable multicast transport protocol (RMTP) (1996)
8. Viger, F., Augustin, B., Cuvellier, X., Magnien, C., Latapy, M., Friedman, T., Teixeira, R.: Detection, understanding, and prevention of traceroute measurement artifacts. Computer Networks 52(5), 998–1018 (2008)
9. Xie, H., Yang, Y.R., Krishnamurthy, A., Liu, Y.G., Silberschatz, A.: P4P: provider portal for applications. SIGCOMM Computer Communications Review 38(4), 351–362 (2008)
10. Yao, B., Viswanathan, R., Chang, F., Waddington, D.: Topology inference in the presence of anonymous routers. In: Twenty-Second Annual Joint Conference of the IEEE Computer and Communications Societies, vol. 1, pp. 353–363. IEEE, Los Alamitos (March–April 2003)
An Algorithm for Traffic Grooming in WDM Mesh Networks Using Dynamic Path Selection Strategy
Sukanta Bhattacharya¹, Tanmay De¹, and Ajit Pal²
¹ Department of Computer Science and Engineering, NIT Durgapur, India
² Department of Computer Science and Engineering, IIT Kharagpur, India
Abstract. In wavelength-division multiplexing (WDM) optical networks, the bandwidth request of a traffic stream is generally much lower than the capacity of a lightpath. Therefore, to utilize the network resources (such as bandwidth and transceivers) effectively, several low-speed traffic streams can be groomed or multiplexed into high-speed lightpaths; this improves the network throughput and reduces the network cost. The traffic grooming problem for a static demand is considered as an optimization problem. In this work, we propose a traffic grooming algorithm that maximizes the network throughput and reduces the number of transceivers used in wavelength-routed mesh networks, together with a dynamic path selection strategy for routing requests, which selects paths so that the load gets distributed throughout the network. The efficiency of our approach has been established through extensive simulation on different sets of traffic demands with different bandwidth granularities for different network topologies, and the approach has been compared with an existing algorithm.
Keywords: Lightpath, WDM, Transceiver, Grooming.
1 Introduction
Wavelength division multiplexing (WDM) technology is now being widely used for expanding the capacity of optical networks. It has provided vast bandwidth to the optical fiber by allowing simultaneous transmission of traffic on many nonoverlapping channels (wavelengths). In a wavelength routed optical network, a lightpath may be established to carry traffic from a source node to a destination node. A lightpath is established by selecting a path of physical links between the source and destination nodes, and taking a particular wavelength on each of these links for the path. A lightpath must use the same wavelength on all of its links, if there is no wavelength converter at intermediate nodes, and this restriction is known as wavelength continuity constraint [3], [5]. An essential functionality of WDM networks, referred to as traffic grooming, is to aggregate low speed traffic connections onto high speed wavelength channels in a resource-efficient way, that is, to maximize the network throughput when the resources are given or to minimize the resource consumption when the M.K. Aguilera et al. (Eds.): ICDCN 2011, LNCS 6522, pp. 263–268, 2011. c Springer-Verlag Berlin Heidelberg 2011
traffic requests to be satisfied are given. Efficient traffic grooming techniques (algorithms) can reduce network cost by reducing the number of transceivers, increase the throughput, and save time by accommodating a larger number of low-speed traffic streams on a single lightpath. The work in [4] and [6] investigates the static traffic grooming problem with the objective of maximizing network throughput. Zhu and Mukherjee [6] investigate the traffic grooming problem in WDM mesh networks with the objective of improving network throughput. They present an integer linear programming (ILP) formulation of the traffic grooming problem and propose two heuristics, namely Maximizing Single-Hop Traffic (MST) and Maximizing Resource Utilization (MRU), to solve the GRWA problem. Subsequently, a global approach for designing reliable WDM networks and grooming the traffic was presented in [1], and an approach to traffic grooming, routing, and wavelength assignment in optical WDM mesh networks based on clique partitioning [2] motivated us to use the concept of reducing the total network cost to solve the GRWA problem, which is presented in the following sections.
The grooming problem consists of two interconnected parts: (a) designing lightpaths, which includes specifying the physical route of each path; (b) assigning each packet stream to a sequence of lightpaths. The work proposed in this paper is based on the static GRWA problem in WDM mesh networks with a limited number of wavelengths and transceivers, and the proposed approach allows single-hop and multi-hop grooming, similar to [6]. The objective of this work is to maximize the network throughput in terms of total successfully routed traffic and to reduce the number of transceivers used. The performance of our proposed approach is evaluated through extensive simulation on different sets of traffic demands with different granularity for different network topologies. The results show that the proposed approach gives better performance than the existing traffic grooming algorithm, Maximizing Single-Hop Traffic (MST). The problem formulation is presented in Section 2. Section 3 gives a detailed description of the proposed algorithm. Section 4 contains experimental results and comparison with previous works. Finally, conclusions are presented in Section 5.
2 General Problem Statement
Given a network topology, which is a directed connected graph G(V, E), where V and E are the sets of optical nodes and bi-directional links (edges) of the network, respectively, a number of transceivers at each node, a number of wavelengths on each fiber, the capacity of each wavelength, and a set of connection requests with different bandwidth granularities, our objective is to set up lightpaths and multiplex low-speed connection requests on the same lightpath such that the network throughput, in terms of total successfully routed low-speed traffic, is maximized and the number of transceivers used to satisfy the requests is minimized. Since the traffic grooming problem is NP-complete, an efficient heuristic approach is a practical way to obtain good solutions. In the next section we propose a heuristic approach to solve the traffic grooming problem.
3 Proposed Approach
In this section, we propose a Traffic Grooming 2 (TG2) algorithm based on a dynamic path selection strategy for the GRWA problem. Our proposed approach has two steps, similar to [6]. In the first step, we construct a virtual topology, trying to satisfy the given requests (in decreasing order of request size) in a single hop using a single lightpath. In the second step, we try to satisfy the leftover blocked requests through multiple hops (in decreasing order of request size), running multi-hop grooming on the spare capacity of the virtual topology created in the first step. The leftover requests are sorted, and we try to satisfy them one by one with a single hop. As soon as one request is satisfied by a single hop, we try to satisfy all leftover requests by multiple hops, so that with the new lightpath created in the single-hop step some requests may get satisfied by multi-hop grooming; thus we can reduce the number of transceivers used. The process is repeated on the leftover requests until all resources are exhausted, all requests are satisfied, or no leftover request can be satisfied by a single hop.
3.1 Alternate Path Selection Strategy
In this work we have used a variant of adaptive routing. Each time a request between an SD pair is to be satisfied, we compute all possible paths between the source (S) and destination (D) and calculate the cost of each path using the cost function

C = (1/W) α + L β,    (1)

where α and β are constants and C, W, and L are, respectively, the cost of the path, the total number of common free wavelengths on the physical path, and the length of the path (i.e., the distance between the SD pair). The first term dominates the second in determining the cost. This is done so that the traffic load on the network gets distributed and no single path becomes congested.
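The following is a minimal sketch of this path selection step, written by us with illustrative data structures (not the authors' implementation). W is the number of wavelengths free on every link of the path, and hop count is used as a stand-in for the path length L; α = 10 and β = 0.3 are the values later used in Section 4.

```python
# Path cost per Eq. (1) and min-cost path selection (illustrative structures:
# adjacency as dict of sets; free wavelengths keyed by frozenset edges).
ALPHA, BETA = 10, 0.3

def simple_paths(adj, s, d, path=None):
    path = path or [s]
    if s == d:
        yield path
        return
    for nxt in adj[s]:
        if nxt not in path:
            yield from simple_paths(adj, nxt, d, path + [nxt])

def path_cost(path, free_wl):
    links = [frozenset(e) for e in zip(path, path[1:])]
    common = set.intersection(*(free_wl[e] for e in links))
    if not common:
        return float("inf")                 # no common free wavelength: unusable
    return ALPHA / len(common) + BETA * (len(path) - 1)

def min_cost_path(adj, free_wl, s, d):
    return min(simple_paths(adj, s, d), key=lambda p: path_cost(p, free_wl))

adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}}
free_wl = {frozenset(e): {0, 1, 2} for e in [(1, 2), (2, 3), (1, 3)]}
print(min_cost_path(adj, free_wl, 1, 3))    # [1, 3]: same W, fewer hops
```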
3.2 Traffic Grooming Algorithm
The proposed algorithm selects the minimum cost path dynamically such that the traffic load on the network gets distributed and no particular path gets congested.
Algorithm TG2
1. Sort the requests in descending order of OC-x.
2. Select the sorted requests one by one, build the virtual topology, and satisfy single-hop traffic.
(a) Find all possible paths from source (S) to destination (D) for a request R.
(b) Calculate the cost of each path and thus find the minimum cost path for the SD pair using equation (1).
(c) Find the lowest-index wavelength (w) among the common free wavelengths on the edges present in the minimum cost path.
(d) Update the virtual topology, assigning a lightpath from node S to node D using wavelength w.
(e) Update the request matrix: if the request is fully satisfied, set the SD pair request in the request matrix to zero (0); otherwise, update the SD pair request in the request matrix with the leftover (unsatisfied) request.
(f) Update the wavelength status of the corresponding physical edges present in the lightpath.
(g) Update the transceiver-used status at the nodes: if the node is the starting node, reduce the count of transmitters by 1; if the node is the ending node, reduce the count of receivers by 1.
3. Repeat Step 2 for all SD pair requests.
4. Sort the blocked requests in descending order.
5. Select the sorted requests one by one and try to satisfy them using multiple lightpaths on the virtual topology (VT) created in Step 2.
(a) Update the request matrix: if the request is fully satisfied, set the SD pair request in the request matrix to zero (0); otherwise, update the SD pair request in the request matrix with the leftover (unsatisfied) request.
(b) Update the wavelength status of the corresponding physical edges present in the lightpaths.
(c) Update the virtual topology.
6. Sort the blocked requests again in descending order.
7. Try to satisfy the requests one by one with a single hop, in descending order, until one of the requests is satisfied.
8. Update the system as described in Steps 2(d) to 2(g).
9. When one request is satisfied with a single hop, try to satisfy all remaining requests with multiple hops.
10. Update the system as described in Steps 5(a) to 5(c).
11. Repeat Steps 6 to 10 until all resources are exhausted, all requests are satisfied, or no leftover request can be satisfied by a single hop.
End Algorithm TG2
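Steps 2(c)-2(f) are the heart of the single-hop case; the fragment below is our illustrative rendering of them (the data structures and function name are hypothetical, matching the sketch after Eq. (1)).

```python
# Illustrative fragment for steps 2(c)-2(f): take the lowest-index wavelength
# free on every edge of the chosen path, reserve it, and record the lightpath.
def assign_lightpath(path, free_wl, virtual_topology):
    links = [frozenset(e) for e in zip(path, path[1:])]
    common = sorted(set.intersection(*(free_wl[e] for e in links)))
    if not common:
        return None                         # blocked: no common free wavelength
    w = common[0]                           # step 2(c): lowest-index wavelength
    for e in links:
        free_wl[e].discard(w)               # step 2(f): update wavelength status
    virtual_topology.append((path[0], path[-1], w))   # step 2(d): new lightpath
    return w
```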
4 Experimental Results
We have evaluated the performance of our proposed heuristic TG2 for the GRWA problem using simulation and compared the results with the well-known MST algorithm [6]. We conducted our experiments on different network topologies, but due to page limitations we present results only for the 14-node NSFNET shown in Fig. 1. The values of α and β in equation (1) are taken to be 10 and 0.3, respectively. We assume that each physical link is bidirectional with the same length. During simulation we assumed that the capacity of each wavelength is OC-48; the allowed traffic bandwidth requests were assumed to be OC-1, OC-3, and OC-12, and are generated randomly.
Fig. 1. Node architectures used in simulation (14-node NSFNET topology)
Fig. 2. Relationship between throughput and requested bandwidth (TxRx = 6, W = 7; x-axis: Number of Requests (OC-1 unit); y-axis: Throughput (OC-1 unit); curves: MST, TG2)
Fig. 3. Relationship between throughput and number of wavelengths per fiber link (Req = OC-3000, TxRx = 7; x-axis: Number of Wavelengths per Link; y-axis: Throughput (OC-1 unit); curves: MST, TG2)
Fig. 4. Relationship between throughput and number of transceivers per node (Req = OC-3000, W = 7; x-axis: Number of Transceivers per Node; y-axis: Throughput (OC-1 unit); curves: MST, TG2)
Figure 2 shows the relationship between the network throughput and the total requested bandwidth for the 14-node network (Fig. 1). Initially, the performance (in terms of throughput) of both algorithms is similar, but subsequently TG2 returns a better throughput than MST.
The relationship between the network throughput and the number of wavelengths per link for the two algorithms is shown in Fig. 3. We observe that the proposed algorithm TG2 provides a higher network throughput than the existing MST algorithm. The throughput increases with the number of wavelengths, and due to the transceiver constraint there is no significant change in throughput after the number of wavelengths reaches a certain limit, for both algorithms.
The relationship between the network throughput and the number of transceivers per node for the proposed and existing algorithms is shown in Fig. 4. We observe that initially the throughput increases with the number of transceivers, and that there is no significant change in the throughput as the number of transceivers is increased beyond a certain value, because the capacity of the wavelengths is exhausted. However, the proposed TG2 algorithm performs better in terms of network throughput compared to the existing MST algorithm.
5 Conclusions
This study was aimed at the traffic grooming problem in a WDM mesh network. We have studied the problem of static single-hop and multi-hop GRWA with the objective of maximizing the network throughput for wavelength-routed mesh networks. We have proposed an algorithm, TG2, using the concept of single-hop and multi-hop grooming in the static GRWA problem [6]. The performance of our proposed algorithm is evaluated through extensive simulation on different sets of traffic demands with different bandwidth granularities under different network topologies.
References
1. Bahri, A., Chamberland, S.: A global approach for designing reliable WDM networks and grooming the traffic. Computers & Operations Research 35(12), 3822–3833 (2008)
2. De, T., Pal, A., Sengupta, I.: Traffic grooming, routing, and wavelength assignment in an optical WDM mesh networks based on clique partitioning. Photonic Network Communications (February 2010)
3. Mohan, G., Murthy, C.S.: WDM optical networks: concepts, design and algorithms. Prentice Hall, India (2001)
4. Yoon, Y., Lee, T., Chung, M., Choo, H.: Traffic grooming based on shortest path in optical WDM mesh networks. In: Sunderam, V.S., van Albada, G.D., Sloot, P.M.A., Dongarra, J. (eds.) ICCS 2005. LNCS, vol. 3516, pp. 1120–1124. Springer, Heidelberg (2005)
5. Zang, H., Jue, J., Mukherjee, B.: A review of routing and wavelength assignment approaches for wavelength-routed optical WDM networks. SPIE Opt. Netw. Mag. 1(1), 47–60 (2000)
6. Zhu, K., Mukherjee, B.: Traffic grooming in an optical WDM mesh network. IEEE J. Sel. Areas Commun. 20(1), 122–133 (2002)
Analysis of a Simple Randomized Protocol to Establish Communication in Bounded Degree Sensor Networks
Bala Kalyanasundaram and Mahe Velauthapillai
Department of Computer Science, Georgetown University, Washington DC, USA
[email protected], [email protected]
Abstract. Co-operative computations in a network of sensor nodes rely on established, interference-free, and repetitive communication between adjacent sensors. This paper analyzes a simple randomized and distributed protocol to establish a periodic communication schedule S where each sensor broadcasts once to communicate to all of its neighbors during each period of S. The result obtained holds for any bounded degree network. The existence of such randomized protocols is not new. Our protocol reduces the number of random bits and the number of transmissions by individual sensors from Θ(log² n) to O(log n), where n is the number of sensor nodes. These reductions conserve power, which is a critical resource. Both protocols assume an upper bound on the number of nodes n and the maximum number of neighbors B. For a small multiplicative (i.e., a factor ω(1)) increase in the resources, our algorithm can operate without an upper bound on B.
1 Introduction
A wireless sensor network (WSN) is a network of devices called sensor nodes that communicate wirelessly. The WSN is used in many applications including environment monitoring, traffic management, wild-life monitoring, etc. [1,2,4,7,8,9,5]. Depending on the application, a WSN can consist of a few nodes to millions of nodes. The goal of the network is to monitor the environment continuously to detect and/or react to certain predefined events or patterns. When an application requires millions of nodes, individually programming each node is impractical. When deployed, it is often difficult to control the exact location of each sensor. Even if we succeed in spreading the sensors evenly, it is inevitable that some nodes will fail and the resulting topology is no longer uniform. It may be reasonable to assume that the nodes may know an upper bound on the number of nodes in the network but nothing else about the network. This paper analyzes the performance of a randomized and distributed protocol that establishes communication among neighbors of a bounded degree network of sensors. We assume that B is a constant.
Supported in part by Craves Family Professorship. Supported in part by McBride Chair.
The following wireless transmission model is considered in this paper. A node cannot transmit and receive information simultaneously. Each node has a transmission range r, and any node within that range can receive information from this node. Also, each node A has an interference range r⁺: a transmission from a node C to any node B within range r⁺ of A can interfere with the transmission from node A. In general r⁺ ≥ r. For ease of presentation, we assume r⁺ = r. However, the proofs can easily be extended to r⁺ ≥ r as long as the number of nodes in the interference range is big-oh of the number of nodes in the transmission range.
In the literature on wireless/sensor networks, there are many different media access control (MAC) protocols. In general, the protocols fall into the following three categories [6]: fixed assignment, demand assignment, and random access. The protocol which we will present is a fixed assignment media access protocol. However, the protocol that we use to derive the fixed assignment media access protocol is a random access protocol. In the random access protocol, time is divided into slots of equal length. Intuitively, the sensors, using randomization, will first establish a schedule for pair-wise communication with their neighbors. The sensors then run the second phase of the protocol, where the schedule is compressed such that each sensor broadcasts once to communicate to all of its neighbors. After the compression, the resultant protocol is a fixed assignment protocol that can be used by the sensor network to communicate and detect patterns.
We consider a uniform transmission range for the sensors. One can view the network of sensors as a graph where each node is a sensor and there is an edge between two nodes if the nodes are within the transmission range. The resultant graph is often called a disk graph (DG), and a unit-disk graph (UDG) when the transmission range is the same for all sensors. The problem addressed in this paper can be thought of as the problem of finding an interference-free communication schedule for a given UDG where the graph is unknown to the individual nodes. Gandhi and Parthasarathy [3] considered this problem and proposed a natural distance-2 coloring based randomized and distributed algorithm to establish an interference-free transmission schedule. Each node in the network ran Θ(log² n) rounds of transmissions and used Θ(log² n) random bits to establish the schedule. Comparing these two protocols, it is interesting to note that our protocol exhibits better performance. The major difference between the two approaches is the way we split the protocol into two phases, where pair-wise communication is established in the first phase and compression takes place in the second. Our protocol reduces the number of transmissions as well as the number of random bits to O(log n). Moreover, the number of bits transmitted by our protocol is O(log n) per node, whereas the protocol by Gandhi and Parthasarathy uses O(log² n) bits per node. These reductions conserve power, a critical resource in sensor networks. It is worth noting that the running time of both protocols is O(log² n), where each transmission is considered to be an O(1) step. Let b be the maximum number of neighbors of any node. The CDSColor protocol explicitly uses an upper bound on b and the number of nodes in the graph.
Table 1. Comparing Algorithms
Case                        CDSColor (see [3])   Our Alg.
Random Bits                 O(log² n)            O(log n)
# of Transmissions          O(log² n)            O(log n)
Bits Transmitted per Node   O(log² n)            O(log n)
Number of Steps             O(log² n)            O(log² n)
After the first phase of our protocol, each node will know the exact number of its neighbors with high probability. In order to increase the confidence/probability of total communication, the length of the first phase will be set to O(log n), where the constant in the big-oh depends on b. If we do not have a clear upper bound on b, the length of the first phase can be ω(log n) (e.g., O(log n log log n)). By increasing the number of transmissions in the first phase, our protocol will establish communication with high probability (i.e., 1 − 1/n^c for any given constant c).
Our analysis for the first phase of the algorithm uses a recurrence relation to derive the exact probability of establishing pair-wise communication. One can write a simple program to calculate this probability accurately. Using the probability of pair-wise communication, we can find the expected number of transmissions needed to establish communication. The bound obtained using the recurrence relation closely matches the value observed in simulation. For instance, the number of transmissions needed to establish communication between every pair of neighbors in an entire network with a million nodes is around 400. For an incredibly large network, n = 10^50, the bound on the number of transmissions is less than 3000. From this observation, we can safely say that our protocol does not need to know n for any real-life sensor network.
Definition 1. Given a sensor network G = (V, E), we define H1(v) to be the set of nodes in V that are either immediate neighbors of v or neighbors' neighbors (i.e., 1 hop away) of node v. For ease of presentation, we refer to {v} ∪ H1(v) as the 1-hop neighborhood of v.
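The following is a minimal rendering of Definition 1, written by us for concreteness (the function name and the dict-of-sets adjacency representation are our choices).

```python
# Sketch of Definition 1: H1(v) = immediate neighbors of v plus their neighbors.
def h1(adj, v):
    out = set(adj[v])                      # immediate neighbors (adj maps node -> set)
    for u in adj[v]:
        out |= adj[u]                      # neighbors' neighbors
    out.discard(v)
    return out                             # {v} | h1(adj, v) is the 1-hop neighborhood
```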
2 First Phase - Establishing Pair-Wise Communication
Each sensor node v selects an id which should be unique among the ids of nodes in {v} ∪ H1(v). This can be accomplished with high probability by selecting c1 log n random bits as the id, where c1 ≥ 1 is a carefully chosen constant. This is captured in the following lemma, which is trivial to establish.
Lemma 1. Suppose there are n nodes in the network and for each node v we have |H1(v)| ≤ c, a fixed constant. Each node chooses c1 log n random bits as its id, where c1 ≥ 1. The probability that every node in the network will choose a unique id in its 1-hop neighborhood is at least 1 − c/n^(c1−1).
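A sketch of this id-selection step appears below; it is ours, and the default c1 = 2 is an arbitrary illustrative choice.

```python
# Each node independently draws c1*log2(n) random bits as its id (Lemma 1).
import math
import random

def choose_id(n, c1=2):
    bits = max(1, math.ceil(c1 * math.log2(n)))
    return random.getrandbits(bits)

ids = [choose_id(10 ** 6) for _ in range(10)]   # ten sample ids for n = 10^6
```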
After establishing an id, each node executes the following simple protocol for c2 log n steps, where c2 ≥ 1 is a constant. The choice of c2 depends on the confidence parameter in the high probability argument; this will become clear when we present the analysis of the protocol.
TorL(p): (one step) Toss a biased coin where the probability of heads is p and of tails is (1 − p). Transmit the node's id if the outcome is heads, and listen if the outcome is tails.
We could present the analysis of this protocol for an arbitrary bounded degree network now, but we choose to consider both the line topology and the grid topology before presenting the arbitrary case. There are two reasons for this choice: the analysis is a bit more exact for the simpler cases, and we ran simulations for the grid topology to see the effectiveness of the protocol for a reasonably large network. Our simulation and calculation showed that c2 log n is only around 3000 for a network of size n = 10^50.
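The first phase is easy to simulate. The toy simulator below is ours (not the paper's simulation code); following the model above with r⁺ = r, a listening node hears neighbor y in a slot iff y transmits and no other neighbor of the listener transmits in the same slot.

```python
# Toy simulation of TorL(p): one slot, and the whole first phase.
import random

def torl_round(neighbors, p):
    """neighbors: dict node -> set of adjacent nodes. Returns (sender, hearer) pairs."""
    heads = {v for v in neighbors if random.random() < p}   # transmitters this slot
    heard = set()
    for v in neighbors:
        if v in heads:
            continue                                        # v transmits, cannot listen
        talking = [u for u in neighbors[v] if u in heads]
        if len(talking) == 1:                               # no collision at v
            heard.add((talking[0], v))
    return heard

def first_phase(neighbors, p, rounds):
    got = {v: set() for v in neighbors}
    for _ in range(rounds):
        for u, v in torl_round(neighbors, p):
            got[v].add(u)
    return all(got[v] == neighbors[v] for v in neighbors)   # every node successful?

line = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}               # a 4-node line
print(first_phase(line, p=1/3, rounds=200))
```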
2.1 Line Topology
The analysis of the first phase of the protocol contains both a high probability argument and an expected case argument. The bounds we get from the two arguments do not differ in the asymptotic sense. However, our simulation results for the grid network show that the expectation value given by the recurrence relation matches very closely the value we observed in simulation. Hence, we choose to present both arguments.
Definition 2. After running a protocol to establish communication, we say a sensor node is unsuccessful if it failed to receive information (that is, the id of the neighbor) from one of its neighbor nodes.
Theorem 1. Let p be a positive real in the range (0, 1/2] and let b = 1/(1 − p(1 − p)²). Suppose n sensor nodes are uniformly distributed on a line. Assume omnidirectional transmission where the signal from each node reaches both neighbors on the line.
1. After c2 log n steps, the probability that there exists an unsuccessful sensor is at most 1/n^d, where d ≥ 1 is any fixed constant.
2. The number of steps of TorL(p) needed so that the expected number of unsuccessful sensor nodes is less than 1 is 1 + log_b(2n).
Proof. Consider a sensor node with a neighbor on both sides. For i = 0, 1, and 2, let R(i, k) be the probability that the sensor node has successfully received information from exactly i neighbors on or before k rounds. According to protocol TorL(p), p is the probability that a sensor node transmits at each time step. So, in order to receive information at time t, a sensor node must not transmit at t, and one neighbor must be transmitting while the other is not. Let α = 2p(1 − p)² be the probability that a sensor node successfully receives information from one of its neighbors. Since coin tosses are independent, we can express R(i, k) in the form of the recurrence relation shown below:
R(0, k) = (1 − α)^k
R(1, 0) = 0
R(1, k) = R(0, k−1) α + R(1, k−1)(1 − α/2) = (1 − α)^(k−1) α + R(1, k−1)(1 − α/2)
R(2, 1) = 0
R(2, k) = R(1, k−1)(α/2) + R(2, k−1)
k−1 1− α2 i i=0
1−α
= 2 [(1 − α2 )k − (1 − α)k ]. By expanding recursively, the function R(2, k) and substituting for R(1, j) results in: k−1 R(2, k) = α2 j=1 R(1, j) =
α 2
=α
k−1 j=1
2 [(1 − α2 )j − (1 − α)j ]
k−1
j=1 (1
− α2 )j −
k−1
j=1 (1
− α)j ]
= 2[1 − (1 − α2 )k ] − [1 − (1 − α)k ] = 1 − [2(1 − α2 )k − (1 − α)k ]. The probability that a node is not successful after k steps is at most 1−R(2, k) = 2(1 − α2 )k − (1 − α)k ≤ 2(1 − α2 )k ). We now find a bound on k such that 1 2 2(1 − α2 )k ≤ nd+1 for the given d. Simplifying, we get 2nd+1 ≤ ( 2−α )k . Simple algebra show that it holds for k ≥ c2 such that c2 log n ≥
(d+1) 2 log( 2−α )
(d+1) 2 log( 2−α )
log2 (2n). Choose a smallest constant
log2 (2n). So, if each node runs the protocol for
c2 log n steps, the probability that it will be unsuccessful is at most 1/nd+1 . There are n nodes in the network. Therefore, the probability that there exists 1 an unsuccessful node is at most n × nd+1 = n1d . We define a random variable 1 if node i received info from all of its neighbors on or before step k β(i, k) = 0 if otherwise. Suppose there are n sensor nodes in the network. Using R(2, k), we get that the expected value of β(i, k) is E[β(i, k)] = R(2, k) for each 1 ≤ i ≤ n. Observe
Observe that whether a sensor node receives information from a neighbor depends not only on the random bits of the sensor node but also on the random bits of its neighbor nodes. As a result, the random variables β(i, k) are not independent random variables. But, applying linearity of expectation, we get

E[Σ_{i=1}^{n} β(i, k)] = Σ_{i=1}^{n} E[β(i, k)] = n R(2, k).

Therefore, the number of steps k needed to reach E[Σ_{i=1}^{n} β(i, k)] > n − 1 is given by the inequality n R(2, k) > n − 1. By substituting the bound on R(2, k), it suffices to satisfy the inequality 1 − [2(1 − α/2)^k − (1 − α)^k] > (n − 1)/n. Therefore it suffices to show that 2(1 − α/2)^k − (1 − α)^k < 1/n. For α ≥ 0, we have (1 − α/2) > (1 − α). Hence, it suffices to satisfy (1 − α/2)^k < 1/(2n), i.e., 2n < [2/(2 − α)]^k. This happens for k = 1 + log_b(2n) where b = 2/(2 − α). Observe that if α increases then k decreases.
2.2 Technical Lemmas
The following results and recurrence relation will help us establish the expected time needed to establish pair-wise communication between all adjacent sensors in a general bounded degree network.
Lemma 2. Let 0 < p < 1, integer b ≥ 1, and α = bp(1 − p)^b. The maximum value of α is (b/(b+1))^(b+1) ≤ 1/2, and it occurs when p = 1/(b+1).
Proof. Differentiating the function bp(1 − p)^b with respect to p and setting it equal to 0, we get zeros at p = 0, 1, and 1/(b + 1). Differentiating again, it is not hard to verify that the maximum occurs at p = 1/(b + 1) and the maximum value is (b/(b + 1))^(b+1).
Lemma 3. Let 0 < p < 1, integer b ≥ 2, α = bp(1 − p)^b < 1, and integers i, k ≥ 0. Given the following recurrence relation and boundary conditions:

R(0, k) = (1 − α)^k    (k ≥ 0)
R(i, k) = 0    (i > k)
R(i, k) = [((b+1) − i)/b] α R(i−1, k−1) + [1 − ((b−i)/b) α] R(i, k−1)    ((0 ≤ i ≤ b − 1) ∧ (k ≥ i))

the following hypothesis I(i, k) is true:

R(i, k) ≤ C(k, i) · [(b−1)!/(b−i)!] · [α^i/b^(i−1)] · [1 − ((b−i)/b) α]^(k−i)    ((0 ≤ i ≤ b − 1) ∧ (k ≥ i)),

where C(k, i) denotes the binomial coefficient "k choose i".
Proof. We will prove this by double induction (i.e., on i and k) where the desired inequality is the inductive hypothesis.
The base case where i = 0 is easy to verify. Observe that

C(k, 0) · [(b−1)!/b!] · [α^0/b^(−1)] · [1 − (b/b) α]^k = (1 − α)^k.

The base case condition R(0, k) ≤ (1 − α)^k is true since R(0, k) is defined to be equal to (1 − α)^k.
Assume that the hypothesis holds for all i such that i ≤ x ≤ (b − 1) and for all k ≥ i. We will show that the hypothesis holds for i = x + 1 ≤ (b − 1) and for all k ≥ i. This is again proved by induction on k ≥ i = x + 1. The base case of this induction is when k = i = x + 1:

R(x+1, x+1) ≤ C(x+1, x+1) · [(b−1)!/(b−(x+1))!] · [α^(x+1)/b^x] · [1 − ((b−(x+1))/b) α]^((x+1)−(x+1))
            = [(b−1)!/(b−(x+1))!] · [α^(x+1)/b^x].

Now from the recurrence relation:

R(x+1, x+1) = [((b+1)−(x+1))/b] α R(x, x) + [1 − ((b−(x+1))/b) α] R(x+1, x)
            = [(b−x)/b] α R(x, x) + [1 − ((b−(x+1))/b) α] R(x+1, x).

Substituting R(x, x) ≤ [(b−1)!/(b−x)!] · [α^x/b^(x−1)] and R(x+1, x) = 0, we get

R(x+1, x+1) ≤ [(b−x)/b] α · [(b−1)!/(b−x)!] · [α^x/b^(x−1)] = [(b−1)!/(b−(x+1))!] · [α^(x+1)/b^x].

Hence the base case is true. For the inductive step, we assume that the hypothesis I(i, k) holds for (i ≤ x and k ≥ i) or (i = x + 1 and x + 1 ≤ k ≤ y). We will prove that the hypothesis holds for i = x + 1 and k = y + 1, that is, I(x + 1, y + 1) is also true. Hypothesis I(x + 1, y) states that

R(x+1, y) ≤ C(y, x+1) · [(b−1)!/(b−(x+1))!] · [α^(x+1)/b^x] · [1 − ((b−(x+1))/b) α]^(y−(x+1)).

Now consider the recurrence relation with i = x + 1 and k = y + 1:

R(x+1, y+1) = [(b−x)/b] α R(x, y) + [1 − ((b−(x+1))/b) α] R(x+1, y).

Substituting for R(x, y) from hypothesis I(x, y) and for R(x+1, y) from hypothesis I(x+1, y), we have

R(x+1, y+1) ≤ [(b−x)/b] α · C(y, x) · [(b−1)!/(b−x)!] · [α^x/b^(x−1)] · [1 − ((b−x)/b) α]^(y−x)
            + C(y, x+1) · [(b−1)!/(b−(x+1))!] · [α^(x+1)/b^x] · [1 − ((b−(x+1))/b) α]^((y+1)−(x+1))
            = C(y, x) · [(b−1)!/(b−(x+1))!] · [α^(x+1)/b^x] · [1 − ((b−x)/b) α]^(y−x)
            + C(y, x+1) · [(b−1)!/(b−(x+1))!] · [α^(x+1)/b^x] · [1 − ((b−(x+1))/b) α]^(y−x).

Note that [1 − ((b−x)/b) α] ≤ [1 − ((b−(x+1))/b) α]; using this in the above expression:

R(x+1, y+1) ≤ [C(y, x) + C(y, x+1)] · [(b−1)!/(b−(x+1))!] · [α^(x+1)/b^x] · [1 − ((b−(x+1))/b) α]^(y−x)
            = C(y+1, x+1) · [(b−1)!/(b−(x+1))!] · [α^(x+1)/b^x] · [1 − ((b−(x+1))/b) α]^((y+1)−(x+1)).
Hence the result.
Lemma 4. Let 0 < p < 1, integer b ≥ 2, and α = bp(1 − p)^b. Given the recursive definition of R(i, k) for 0 ≤ i < b and k ≥ 0, let R(b, k) = 1 − Σ_{i=0}^{b−1} R(i, k). There exist constants c > 0 and 1 − α/b < ε < 1 such that for integer k ≥ c, we have R(b, k) ≥ 1 − ε^k. That is, lim_{k→∞} R(b, k) = 1, and the convergence rate is exponential in k. The constants ε and c depend on the constant b.
Proof. From Lemma 2, we have α ≤ 1/2. Applying Lemma 3 we get, for (0 ≤ i ≤ b − 1) ∧ (k ≥ i),

R(i, k) ≤ C(k, i) · [(b−1)!/(b−i)!] · [α^i/b^(i−1)] · [1 − ((b−i)/b) α]^(k−i)
        ≤ α^i k^i (1 − α/b)^(k−i)
        ≤ k^(b−1) (1 − α/b)^k    (since α ≤ 1/2 and i ≤ b − 1)
        = (1 − α/b)^(k − (b−1) log k / (log b − log(b−α))).

Recall that R(0, k) = (1 − α)^k. Hence, lim_{k→∞} R(i, k) = 0 for 0 ≤ i ≤ b − 1. Substituting the bound for R(i, k), we get R(b, k) ≥ 1 − b(1 − α/b)^(k−O(log k)), which converges to 1 exponentially. Observe that there exist constants 1 − α/b < ε < 1 and c > 0 such that for k ≥ c, we have b(1 − α/b)^(k−O(log k)) ≤ ε^k. As a consequence we get R(b, k) ≥ 1 − ε^k.
2.3 Arbitrary Bounded Degree Topology
Looking carefully at the protocol and the proof techniques of the previous section, it becomes clear that they can be extended to any arbitrary bounded degree sensor network.
Theorem 2. Let G be an arbitrary sensor network with n nodes in which each node has at most b neighbors, where b is a constant. Let p be a positive real in the range (0, 1/2]. Each node repeatedly runs TorL(p).
1. After c3 log n steps, the probability that there exists an unsuccessful sensor is at most 1/n^d, where d ≥ 1 is any fixed constant and c3 is a positive constant that depends on d.
2. The number of steps needed so that the expected number of unsuccessful sensors is less than 1 is c4 log n, where c4 is another constant.
3. The number of bits transmitted by a node is O(log² n) and the number of random bits used by a node is O(log n).
Proof. Suppose a node $x$ has $a$ neighbors, where $1 \le a \le b$. Let $R(i, k)$ for $0 \le i \le a$ be the probability that the sensor node has received information from exactly $i$ neighbors at the end of $k$ rounds. This probability obeys the following recurrence relation:
$$R(0, k) = (1-\alpha)^{k} \quad (k \ge 0)$$
$$R(i, k) = 0 \quad (i > k)$$
$$R(i, k) = \tfrac{(a+1)-i}{a}\,\alpha R(i-1, k-1) + \left(1 - \tfrac{a-i}{a}\alpha\right) R(i, k-1) \quad \text{when } (0 \le i \le a-1) \wedge (k \ge i).$$
Applying Lemma 4, we know $R(a, k) \ge 1 - \epsilon^{k}$, where $\epsilon$ is a positive constant less than 1. Therefore, the probability that the sensor is unsuccessful after $k$ rounds is at most $\epsilon^{k} = 2^{-k \log_2(1/\epsilon)}$. Substituting $k = c_3 \log n$, we get $2^{-k \log_2(1/\epsilon)} = \frac{1}{n^{c_3 \log_2(1/\epsilon)}}$.
Choose $c_3$ such that $c_3 \log_2(1/\epsilon) \ge d + 1$. For this choice of $c_3$, the probability that the node is unsuccessful is at most $1/n^{d+1}$. Since there are $n$ nodes, the probability that any node is unsuccessful is at most $1/n^{d}$.
Assume that we number the nodes $i = 1$ through $n$. Let $a_i \le b$ be the number of neighbors of node $i$. In order to calculate the expected number of steps of TorL(p) needed to have less than one unsuccessful node, we define a random variable
$$\beta_p(i, k) = \begin{cases} 1 & \text{if node } i \text{ received info from all of its neighbors on or before step } k \\ 0 & \text{otherwise.} \end{cases}$$
The expected value of $\beta_p(i, k)$, denoted by $E[\beta_p(i, k)]$, is equal to $R(a_i, k)$. Observe that $R(a, k) \ge R(b, k)$ for all $a \le b$. The expected number of nodes in the entire network that receive communication from all of their neighbors after $k$ rounds is $E[\sum_{i=1}^{n} \beta_p(i, k)]$. Applying linearity of expectation, we get
$$E\left[\sum_{i=1}^{n} \beta_p(i, k)\right] = \sum_{i=1}^{n} E[\beta_p(i, k)] = \sum_{i=1}^{n} R(a_i, k).$$
Applying Lemma 4, the number of steps $k$ needed to reach the bound $\sum_{i=1}^{n} E[\beta_p(i, k)] > n - 1$ is given by the inequality $n(1 - \epsilon^{k}) > n - 1$. Simplifying, we get $\epsilon^{k} < 1/n$, or $2^{k \log_2(1/\epsilon)} > n$. The result follows if we choose $k = c_4 \log_2 n$, where $c_4 \ge \frac{1}{\log_2(1/\epsilon)}$.
Finally, observe that each node uses random bits to select its id and $O(1)$ random bits per step of TorL(p). Since the id is $O(\log n)$ bits, and the number of steps of TorL(p) is also $O(\log n)$, the total number of random bits is $O(\log n)$. It is easy to see that the number of transmissions per node is $O(\log n)$.
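The recurrence above is easy to evaluate numerically. The following sketch is ours, not the authors' code; it computes $R(i, k)$ bottom-up with $\alpha = bp(1-p)^{b}$ and returns the success probability $R(a, k) = 1 - \sum_{i<a} R(i, k)$ for a node with $a$ neighbors:

def success_probability(a, k, p, b):
    """R(a,k) = 1 - sum_{i<a} R(i,k): probability that a node with a <= b
    neighbors has heard all of them within k rounds (Theorem 2 recurrence)."""
    alpha = b * p * (1 - p) ** b
    # prev[i] = R(i, t) for the current round t, for i in 0..a-1.
    prev = [1.0] + [0.0] * (a - 1)        # t = 0: R(0,0)=1, R(i,0)=0 for i>0
    for t in range(1, k + 1):
        cur = [0.0] * a
        cur[0] = (1 - alpha) ** t
        for i in range(1, a):
            cur[i] = ((a + 1 - i) / a) * alpha * prev[i - 1] \
                     + (1 - (a - i) / a * alpha) * prev[i]
        prev = cur
    return 1.0 - sum(prev)

For example, 1 - success_probability(8, 501, 1/9, 8) should reproduce a per-node failure probability of the same order as the Table 3 entries in the next subsection (about 1.9e-09 at 501 steps).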
2.4 Simulation and Practical Bounds for Grid Network
We ran simulations to estimate the number of steps needed for a large sensor network in practice. For a network with one million sensors, we needed approximately 373 time slots to establish communication between every pair of adjacent nodes. This is an average over 20 random runs. For three million nodes, the number of rounds is approximately 395 time slots. Based on our recurrence relation, our calculations for a network of size $10^{12}$ show that communication will be established in 650 time slots or steps. The table below gives the average number of rounds needed to establish communication for different network sizes. Here the probability of transmission is set to $p = 1/(B+1) = 1/(8+1) = 1/9$, where $B = 8$ is the number of neighbors of any node in the grid.

Table 2. Simulation On Grid Network

Network Size   360,000  640,000  1,000,000  1,440,000  1,960,000  2,560,000  3,240,000
Avg. # Steps   342      359      373        385        394        392        395

Table 3. Probability Bounds for Grid - Based on Recurrence Relation

Steps                        501       1001       1501      2001      2501      3001
Prob. of Failure of a Node   1.86e-09  4.541e-19  1.10e-28  2.69e-38  6.57e-48  1.601e-57
However, when we set the probability of transmission to $p = \frac{1}{2}$, the number of rounds needed to establish communication exceeds 3000 even for a small network of size one hundred. So it is critical to set $p$ close to $1/(B+1)$.
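To make the simulation setup concrete, here is a minimal Monte Carlo sketch of one run on an m x m grid torus with 8 neighbors per node. It is our illustration, not the authors' simulator, and it assumes the transmit-or-listen slot behavior used in the analysis: a node hears a neighbor in a slot only when that neighbor is the sole transmitter in the node's neighborhood while the node itself listens.

import random

def rounds_until_all_heard(m, p, rng=random.Random(0)):
    """Slots until every node on an m x m torus has heard all 8 neighbors."""
    nbrs = {}
    for r in range(m):
        for c in range(m):
            nbrs[(r, c)] = [((r + dr) % m, (c + dc) % m)
                            for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                            if (dr, dc) != (0, 0)]
    heard = {v: set() for v in nbrs}
    t = 0
    while any(len(heard[v]) < len(nbrs[v]) for v in nbrs):
        t += 1
        tx = {v for v in nbrs if rng.random() < p}   # transmit w.p. p
        for v in nbrs:
            if v in tx:
                continue                             # transmitting, not listening
            talkers = [u for u in nbrs[v] if u in tx]
            if len(talkers) == 1:                    # no collision at v
                heard[v].add(talkers[0])
    return t

For instance, rounds_until_all_heard(100, 1/9) runs a 10,000-node grid; larger grids should show step counts of the same order as Table 2.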
3 Second Phase: Compression Protocol
Given any schedule that establishes pair-wise communication between neighbors (e.g., the schedule of length $O(\log n)$ from the last section), we will show how to compress the schedule to a small constant length in which each node broadcasts once to communicate to its neighbors. After the first run of the protocol TorL(p) for $c_3 \log n$ steps, each node has communicated its id to its neighbors with high probability. However, a node does not know when it succeeds in its communication attempts. Let $T(x)$ (resp. $L(x)$) be the set of transmission (resp. listening) steps of the sensor $x$. Each sensor $x$ runs another $c_3 \log n$ steps to communicate to its neighbors. In this case, no random bits are used and each sensor $x$ transmits only during steps in $T(x)$. Each sensor $x$ transmits the following two pieces of information when it transmits during this iteration:
1. The list of ids of its neighbors.
2. For each neighbor $y$ of $x$, the pair (id of neighbor $y$, earliest time in $L(x)$ at which $x$ listens to a transmission from $y$).
At the end of this second iteration, each sensor knows its $b$ neighbors and its neighbors' neighbors. Each sensor also knows at most $b$ transmission times that it must use from now on to communicate to its neighbors during $c_3 \log n$ steps.
It is important to observe that no more random bits are used to run the transmission schedule after the first round. During each round, each sensor must listen at most $b$ times and transmit at most $b$ times. This conserves power. However, the biggest drawback is that the communication between neighbors takes place once in $c_3 \log n$ steps. Let us call such a long one-to-one communication schedule by the name Long1to1. After the compression, each node transmits once, and listens $b$ times. Communication between neighbors takes place once in every $O(1)$ steps.

Compressor: Protocol For a Node
1. Let $b$ be the number of neighbors and $\ell$ be the number of neighbors' neighbors.
2. Let $T = \{1, 2, 3, \ldots, (b + \ell + 1)\}$.
3. Maintain a set $AV$ of available slots; initially $AV = T$.
4. Repeat the following until a slot is chosen for the node:
   Choose a random number $x$ in $AV$.
   Run one round of schedule Long1to1 to communicate $(id, x)$ to the neighbors. Let $N$ be the set of pairs $(id, x)$ received from the $b$ neighbors.
   Run one round of schedule Long1to1 to communicate $(id, N)$ to the neighbors. Let $M$ be the set of all random numbers chosen by the neighbors or neighbors' neighbors during this iteration. If $x$ is not in $M$ then $x$ is set to be the chosen slot for the sensor.
   Run one round of schedule Long1to1 to communicate to the neighbors: transmit $(id, x)$ if $x$ is the chosen slot and empty otherwise. Let $C$ be the set of pairs $(id, x)$ received from the neighbors.
   Run again one round of schedule Long1to1 to communicate $(id, C)$ to the neighbors. Let $P$ be the set of slot numbers chosen during this round by its neighbors or neighbors' neighbors. Update $AV = AV - P$.
End Compressor Protocol

Theorem 3. Suppose there are $n$ nodes in the network. For any $d > 0$, the probability that a node does not choose a slot after $c_5 \log_e n$ iterations of the loop at step 4 is at most $1/n^{d+1}$, where $c_5 = 2^{b+\ell}(d+1)$. With probability at least $1 - \frac{1}{n^d}$, every node in the network will successfully choose a number.

Proof. Consider an arbitrary node $z$ and one iteration of the loop at step 4. Without loss of generality, let $0 \le k \le b + \ell$ be the number of neighbors or neighbors' neighbors of $z$ without a chosen slot for communication. Observe that if there is only one choice then the node will choose the only remaining slot. Otherwise, each node has at least two choices; hence, the probability of its choosing any particular number is at most $\frac{1}{2}$. So, when a node chooses a slot, the probability that the $k$ other nodes in the neighborhood do not choose that number is at least $(1 - \frac{1}{2})^{k} = \frac{1}{2^{k}}$. Therefore the probability that $z$ will succeed in choosing a number is at least $\frac{1}{2^{b+\ell}}$, since $k \le b + \ell$. The probability that a node $z$ fails
to succeed in choosing a number after $c_5 \log_e n$ iterations is at most $(1 - \frac{1}{2^{b+\ell}})^{c_5 \log_e n}$. Set $c_5 = 2^{b+\ell}(d+1)$. Observe that $(1 - \frac{1}{2^{b+\ell}})^{2^{b+\ell}} \le 1/e$. Therefore, the probability that $z$ fails to succeed in choosing a number after $c_5 \log_e n$ iterations is at most $\frac{1}{e^{(d+1)\log_e n}} = \frac{1}{n^{d+1}}$. The result follows since there are at most $n$ nodes.
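A compact way to see the slot-election dynamics is to simulate the loop at step 4 centrally. The sketch below is hypothetical and ours (the real protocol runs distributedly over Long1to1 rounds); it models each exchange as free information flow within two hops, and a node keeps its slot exactly when its random pick is unique in its two-hop neighborhood. It terminates with probability 1.

import random

def compress_schedule(nbrs, rng=random.Random(0)):
    """nbrs: dict node -> set of neighbors. Returns node -> chosen slot."""
    two_hop = {v: set().union(*(nbrs[u] for u in nbrs[v]), nbrs[v]) - {v}
               for v in nbrs}
    # |T| = b + l + 1 available slots per node, as in step 2.
    avail = {v: set(range(1, len(two_hop[v]) + 2)) for v in nbrs}
    chosen = {}
    while len(chosen) < len(nbrs):
        picks = {v: rng.choice(sorted(avail[v]))
                 for v in nbrs if v not in chosen}
        for v in list(picks):
            # v keeps its pick iff no conflicting undecided node picked it.
            if all(picks.get(u) != picks[v] for u in two_hop[v]):
                chosen[v] = picks[v]
        for v in nbrs:  # remove slots taken within two hops (set P, step 4)
            avail[v] -= {chosen[u] for u in two_hop[v] if u in chosen}
    return chosen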
4 Conclusion
This paper provides a tight analysis of a randomized protocol for establishing a single interference-free broadcast schedule for the nodes of any bounded-degree network. Our protocol is simple, and it reduces the number of random bits and the number of broadcasts from $O(\log^2 n)$ to $O(\log n)$. Experimental results show that the bounds predicted by the analysis are reasonably accurate.
Reliable Networks with Unreliable Sensors

Srikanth Sastry¹, Tsvetomira Radeva², Jianer Chen¹, and Jennifer L. Welch¹

¹ Department of Computer Science and Engineering, Texas A&M University, College Station, TX 77840, USA
{sastry,chen,welch}@cse.tamu.edu
² Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
[email protected]

Abstract. Wireless sensor networks (WSNs) deployed in hostile environments suffer from a high rate of node failure. We investigate the effect of such failure rates on network connectivity. We provide a formal analysis that establishes the relationship between node density, network size, failure probability, and network connectivity. We show that as network size and density increase, the probability of network partitioning becomes arbitrarily small, and that large networks can maintain connectivity despite a significantly high probability of node failure. We derive mathematical functions that provide lower bounds on network connectivity in WSNs, and we compute these functions for some realistic values of node reliability, area covered by the network, and node density to show that, for instance, networks with over a million nodes can maintain connectivity with a probability exceeding 99% despite node failure probability exceeding 57%.
1 Introduction
Wireless Sensor Networks (WSNs) [2] are being used in a variety of applications ranging from volcanology [21] and habitat monitoring [18] to military surveillance [10]. Often, in such deployments, premature uncontrolled node crashes are common. The reasons for this include, but are not limited to, hostility of the environment (extreme temperature, humidity, soil acidity, and such), node fragility (especially if the nodes are deployed from the air onto the ground), and variable quality control in the manufacturing of the sensors. Consequently, crash fault tolerance becomes a necessity (not just a desirable feature) in WSNs. Typically, a sufficiently dense node distribution with redundancy in connectivity and coverage provides the necessary fault tolerance. In this paper, we analyze the connectivity fault tolerance of such large-scale sensor networks and show how, despite high unreliability, flaky sensors can build robust networks. The results in this paper address the following questions: given a static WSN deployment (of up to a few million nodes) where (a) the node density is D nodes
This work was supported in part by NSF grant 0964696 and Texas Higher Education Coordinating Board grant NHARP 000512-0130-2007.
per unit area, (b) the area of the region is Z units, and (c) each node can fail¹ with an independent and uniform probability ρ: what is the probability P that the network is connected (that is, the network is not partitioned)? What is the relationship between P, ρ, D, and Z?
Motivation. The foregoing questions are of significant practical interest. A typical specification for designing a WSN is the area of coverage, an upper bound on the (financial) cost, and QoS guarantees on connectivity (and coverage). High-reliability sensor nodes offer better guarantees on connectivity but also increase the cost. An alternative is to reduce the costs by using less reliable nodes, but the requisite guarantees on connectivity might necessitate greater node density (that is, a greater number of nodes per unit area), which again increases the cost. As a network designer, it is desirable to have a function that accepts, as input, the specifications of a WSN and outputs feasible and appropriate design choices. We derive the elements of such a function in Sect. 6 and demonstrate the use of the results from Sect. 6 in Sect. 7.
Contribution. This paper has three main contributions. First, we formalize and prove the intuitive conjecture that as the node reliability and/or node density of a WSN increases, the probability of connectivity also increases. We provide a probabilistic analysis of the relationship between node reliability (ρ), node density (D), area of the WSN region (Z), and the probability of network connectivity (P); we provide lower bounds for P as a function of ρ, D, and Z. Second, we provide concrete lower bounds on the expected connectivity probability for various reasonable values of ρ, D, and Z. Third, we use a new technique of hierarchical network analysis to derive the lower bounds on a non-hierarchical WSN. To our knowledge, we are the first to utilize this approach in wireless sensor networks. The approach, model, and proof techniques themselves may be of independent interest.
Organization. The rest of this paper is organized as follows. Related work is described next, in Section 2. The system model assumptions are discussed in Section 3. The methodology includes tiling the plane with regular hexagons; the analysis and results in this paper use a topological object called a level-z polyhex that is derived from a regular hexagon. The level-z polyhex is introduced in Section 4. Section 5 introduces the notion of level-z connectedness of an arbitrary WSN region. Section 6 uses this notion to formally establish the relationship between P, ρ, D, and Z. Finally, Section 7 provides lower bounds on connectivity for various values of ρ, D, and Z.
2 Related Work
There is a significant body of work on static analysis of topological issues associated with WSNs [12]. These issues are discussed in the context of coverage [13], connectivity [19], and routing [1].

¹ A node is said to fail if it crashes prior to its intended lifetime. See Sect. 3 for details.
The results in [19] focus on characterizing the fault tolerance of sensor networks by establishing the k-connectivity of a WSN. However, such characterization results in a poor lower bound of k − 1 on the fault tolerance, which corresponds to the worst-case behavior of faults. It fails to characterize the expected probability of network partitioning in practical deployments. In other related results, Bhandari et al. [5] focus on the optimal node density (or degree) for a WSN to be connected w.h.p., and Kim et al. [11] consider connectivity in randomly duty-cycled WSNs in which nodes take turns being active to conserve power. A variant of network connectivity, called partial connectivity, is explored in [6], which derives the relationship between node density and the percentage f of the network expected to be connected. Our research addresses a different, but related, question: given a fixed WSN region with a fixed initial node density (and hence, degree) and a fixed failure probability, what is the probability that the WSN will remain connected?
The results in [16,4,22,20,3] establish and explore the relationship between coverage and connectivity. The results in [22] and [20] show that in large sensor networks, if the communication radius $r_c$ is at least twice the coverage radius $r_s$, then coverage of a convex area implies connectivity among the non-faulty nodes. In [4], Bai et al. establish optimal coverage and connectivity in regular patterns, including square grids and hexagonal lattices where $r_c/r_s < 2$, by deploying additional sensors at specific locations. Results from [16] show that even if $r_c = r_s$, large networks in a square region can maintain connectivity despite high failure probability; however, connectivity does not imply coverage. Ammari et al. extend these results in [3] to show that if $r_c/r_s = 1$ in a k-covered WSN, then the network fault tolerance is given by $4r_c(r_c + r_s)k/r_s^2 - 1$ for a sparse distribution of node crashes. Another related result [17] shows that in a uniform random deployment of sensors in a WSN covering the entire region, the probability of maintaining connectivity approaches 1 as $r_c/r_s$ approaches 2.
Our work differs from the works cited above in three aspects: (a) we focus exclusively on maintaining total connectivity, (b) while the results in [16,4,22,20] apply to specific deployment patterns or shapes of a region, our results and methodology can be applied to any arbitrary region and any constant node density, and (c) our analysis is probabilistic insofar as node crashes are assumed to be independent random events, and we focus on the probability of network connectivity in the average case instead of the worst case.
The tiling used in our model induces a hierarchical structure which can be used to decompose the connectivity property of a large network into connectivity properties of constituent smaller sub-networks of similar structure. This approach was first introduced in [9], and subsequently used to analyze the fault tolerance of hypercube networks [7] and mesh networks [8]. Our approach differs from those in [7] and [8] in that we construct higher-order polyhex tilings from the underlying hexagons to derive a recursive function that establishes a lower bound on network connectivity as a function of ρ and D.
3 System Model
We make the following simplifying assumptions:
– Node. The WSN has a finite fixed set of n nodes. Each node has a communication radius R.
– Region and tiles. A WSN region is assumed to be a finite plane tiled by regular hexagons whose sides are of length l, such that nodes located in a given hexagon can communicate reliably² with all the nodes in the same hexagon and in adjacent hexagons. We assume that each hexagon contains at least D nodes.
– Faults. A node can fail only by crashing before the end of its intended lifetime. Faults are independent and each node has a constant probability ρ of failing.
– Empty tile. A hexagon is said to be empty if it contains only faulty nodes.
We say that two non-faulty nodes p and p′ are connected if either p and p′ are in the same or neighboring hexagons, or there exists some sequence of non-faulty nodes $p_i, p_{i+1}, \ldots, p_j$ such that p and $p_i$ (respectively, p′ and $p_j$) are in adjacent hexagons, and $p_k$ and $p_{k+1}$ are in adjacent hexagons for $i \le k < j$. We say that a region is connected if every pair of non-faulty nodes p and p′ in the region is connected.
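Under these assumptions, whether a region is connected depends only on which hexagons contain non-faulty nodes. The following sketch is ours (axial hexagon coordinates are an assumed representation, not part of the paper); it checks region connectivity by a breadth-first search over non-empty tiles:

from collections import deque

AXIAL_DIRS = [(1, 0), (-1, 0), (0, 1), (0, -1), (1, -1), (-1, 1)]

def region_connected(nonempty):
    """nonempty: set of axial (q, r) coordinates of hexagons containing at
    least one non-faulty node. True iff they form a single component."""
    if not nonempty:
        return True
    start = next(iter(nonempty))
    seen, frontier = {start}, deque([start])
    while frontier:
        q, r = frontier.popleft()
        for dq, dr in AXIAL_DIRS:
            nxt = (q + dq, r + dr)
            if nxt in nonempty and nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return len(seen) == len(nonempty)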
4 Higher Level Tilings: Polyhexes
For the analysis of WSNs in an arbitrary region, we use the notion of higher-level tilings obtained by grouping sets of contiguous hexagons into 'super tiles' such that some specific properties (like the ability to tile the Euclidean plane) are preserved. Such 'super tiles' are called level-z polyhexes. Different values of z specify different level-z polyhexes. In this section we define a level-z polyhex and specify its properties.
The following definitions are borrowed from [14]: A tiling of the Euclidean plane is a countable family of closed sets called tiles, such that the union of the sets is the entire plane and such that the interiors of the sets are pairwise disjoint. We are concerned only with monohedral tilings, that is, tilings in which every tile is congruent to a single fixed tile called the prototile. In our case, a regular hexagon is a prototile. We say that the prototile admits the tiling. A patch is a finite collection of non-overlapping tiles such that their union is a closed topological disk³. A translational patch is a patch such that the tiling consists entirely of a lattice of translations of that patch.

² We assume that collision resolution techniques are always successful in ensuring reliable communication.
³ A closed topological disk is the image of a closed circular disk under a homeomorphism. Roughly speaking, a homeomorphism is a continuous stretching and bending of the object into a new shape (you are not allowed to tear or 'cut holes' into the object). Thus, any two-dimensional shape that has a closed boundary, finite area, and no 'holes' is a closed topological disk. This includes squares, circles, ellipses, hexagons, and polyhexes.
[Figure] Fig. 1. Examples of Polyhexes: (a) the gray tiles form a level-2 polyhex; (b) a level-3 polyhex formed by 7 level-2 polyhexes A–F.
We now define a translational patch of regular hexagons called a level-z polyhex, for $z \in \mathbb{N}$, as follows:
– A level-1 polyhex is a regular hexagon: a prototile.
– A level-z polyhex for $z > 1$ is a translational patch of seven level-$(z-1)$ polyhexes that admits a hexagonal tiling.
Note that each level-z polyhex is made of seven level-$(z-1)$ polyhexes. Therefore, the total number of tiles in a level-z polyhex is $size(z) = 7^{z-1}$. Figure 1(a) illustrates the formation of a level-2 polyhex from seven regular hexagons, and Fig. 1(b) illustrates how seven level-2 polyhexes form a level-3 polyhex. A formal proof that such level-z polyhexes exist for arbitrary values of z (in an infinite plane tessellated by regular hexagons) is available in [15].
5 Level-z Polyhexes and Connectivity
The analysis in Section 6 is based on the notion of level-z connectedness that is introduced here. First, we define a 'side' of each level-z polyhex. Second, we introduce the concepts of connected level-z polyhexes and level-z connectedness in a WSN region. Finally, we show how level-z connectedness implies that all non-faulty nodes in a level-z polyhex of a WSN are connected. We use this result and the definition of level-z connectedness to derive a lower bound on the probability of network connectivity in Section 6.
Side. The set of boundary hexagons that are adjacent to a given level-z polyhex is said to be a 'side' of that level-z polyhex. Since a level-z polyhex can have 6 neighboring level-z polyhexes, every level-z polyhex has 6 'sides'. The number of hexagons along each 'side' (also called the 'length of the side') is given by $sidelen(z) = 1 + \sum_{i=0}^{z-2} 3^{i}$, where $z \ge 2$.⁴

⁴ The proof of this equation is a straightforward induction on z and has been omitted.
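For reference, the two counting formulas are one-liners; the helper below is ours, added purely for illustration.

def size(z: int) -> int:
    """Number of hexagons in a level-z polyhex: 7**(z-1)."""
    return 7 ** (z - 1)

def sidelen(z: int) -> int:
    """Length of one 'side' of a level-z polyhex, z >= 2."""
    assert z >= 2
    return 1 + sum(3 ** i for i in range(z - 1))

# e.g. size(7) == 117649 hexagons (the level-7 polyhex used in Sect. 7),
# and sidelen(3) == 5.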
We now define what it means for a level-z polyhex to be connected. Intuitively, we say that a level-z polyhex is connected if the network of nodes in the level-z polyhex is not partitioned.
Connected level-z polyhex. A level-z polyhex $T_{zi}$ is said to be connected if, given the set Λ of all hexagons in $T_{zi}$ that contain at least one non-faulty node, for every pair of hexagons p and q from Λ, there exists some (possibly empty) sequence of hexagons $t_1, t_2, \ldots, t_j$ such that $\{t_1, t_2, \ldots, t_j\} \subseteq \Lambda$, $t_1$ is a neighbor of p, every $t_i$ is a neighbor of $t_{i+1}$, and $t_j$ is a neighbor of q. Note that if a level-z polyhex is connected, then all the non-faulty nodes in the level-z polyhex are connected as well. We are now ready to define the notion of level-z connectedness in a WSN region.
Level-z connectedness. A WSN region W is said to be level-z connected if there exists some partitioning of W into disjoint level-z polyhexes such that each such level-z polyhex is connected, and for every pair of such level-z polyhexes $T_{zp}$ and $T_{zq}$, there exists some (possibly empty) sequence of (connected) level-z polyhexes $T_{z1}, T_{z2}, \ldots, T_{zj}$ (from the partitioning of W) such that $T_{z1}$ is a neighbor of $T_{zp}$, every $T_{zi}$ is a neighbor of $T_{z(i+1)}$, and $T_{zj}$ is a neighbor of $T_{zq}$. Additionally, each 'side' of every $T_{zi}$ has at least $sidelen(z)/2$ non-empty hexagons.
We are now ready to prove the following theorem:
Theorem 1. Given a WSN region W, if W is level-z connected, then all non-faulty nodes in W are connected.
Proof. Suppose that the region W is level-z connected. It follows that there exists some partitioning Λ of W into disjoint level-z polyhexes such that each such level-z polyhex is connected, and for every pair of such level-z polyhexes $T_{zp}$ and $T_{zq}$, there exists some (possibly empty) sequence of (connected) level-z polyhexes $T_{z1}, T_{z2}, \ldots, T_{zj}$ (from the partitioning of W) such that $T_{z1}$ is a neighbor of $T_{zp}$, every $T_{zi}$ is a neighbor of $T_{z(i+1)}$, and $T_{zj}$ is a neighbor of $T_{zq}$. Additionally, each 'side' of every $T_{zi}$ has at least $sidelen(z)/2$ non-empty hexagons.
To prove the theorem, it is sufficient to show that for any two non-faulty nodes in W located in hexagons p and q, respectively, the hexagons p and q are connected. Let hexagon p lie in a level-z polyhex $T_{zp} \in \Lambda$, and let q lie in a level-z polyhex $T_{zq} \in \Lambda$. Note that since Λ is a partitioning of W, either $T_{zp} = T_{zq}$ or $T_{zp}$ and $T_{zq}$ are disjoint. If $T_{zp} = T_{zq}$, then since $T_{zp}$ is connected, it follows that p and q are connected. Hence, all non-faulty nodes in p are connected with all non-faulty nodes in q, and the theorem is satisfied. If $T_{zp}$ and $T_{zq}$ are disjoint, then it follows from the definition of level-z connectedness that there exists some sequence of connected level-z polyhexes $T_{z1}, T_{z2}, \ldots, T_{zj}$ such that $T_{z1}$ is a neighbor of $T_{zp}$, every $T_{zi}$ is a neighbor of $T_{z(i+1)}$, and $T_{zj}$ is a neighbor of $T_{zq}$. Additionally, each 'side' of every $T_{zi}$ has at least $sidelen(z)/2$ non-empty hexagons.
Consider any two neighboring level-z polyhexes $(T_{zm}, T_{zn}) \in \Lambda \times \Lambda$. Each 'side' of $T_{zm}$ and $T_{zn}$ has $sidelen(z)$ hexagons. Therefore, $T_{zm}$ and $T_{zn}$ have
$sidelen(z)$ boundary hexagons such that each such hexagon from $T_{zm}$ (respectively, $T_{zn}$) is adjacent to two boundary hexagons in $T_{zn}$ (respectively, $T_{zm}$), except for the two boundary hexagons on either end of the 'side' of $T_{zm}$ (respectively, $T_{zn}$); these two hexagons are adjacent to just one hexagon in $T_{zn}$ (respectively, $T_{zm}$). We know that at least $sidelen(z)/2$ of these boundary hexagons are non-empty. It follows that there exists at least one non-empty hexagon in $T_{zm}$ that is adjacent to a non-empty hexagon in $T_{zn}$. Such a pair of non-empty hexagons (one in $T_{zm}$ and the other in $T_{zn}$) forms a "bridge" between $T_{zm}$ and $T_{zn}$, allowing nodes in $T_{zm}$ to communicate with nodes in $T_{zn}$. Since $T_{zm}$ and $T_{zn}$ are connected level-z polyhexes, it follows that the nodes within $T_{zm}$ and within $T_{zn}$ are connected as well. Additionally, we have established that there exist at least two hexagons, one in $T_{zm}$ and one in $T_{zn}$, that are connected. It follows that the nodes in $T_{zm}$ and $T_{zn}$ are connected with each other as well.
Thus, it follows that $T_{zp}$ and $T_{z1}$ are connected, every $T_{zi}$ is connected with $T_{z(i+1)}$, and $T_{zj}$ is connected with $T_{zq}$. From the transitivity of connectedness, it follows that $T_{zp}$ is connected with $T_{zq}$. That is, all non-faulty nodes in hexagon p are connected with all non-faulty nodes in q. Since p and q are arbitrary hexagons in W, it follows that all the nodes in W are connected.
Theorem 1 provides the following insight into the connectivity analysis of a WSN: for appropriate values of z, a level-z polyhex has fewer nodes than the entire region W. In fact, a level-z polyhex could have orders of magnitude fewer nodes than W. Consequently, the analysis of connectedness of a level-z polyhex is simpler and easier than that of the entire region W. Using Theorem 1, we can leverage such an analysis of a level-z polyhex to derive a lower bound on the connectivity probability of W. This motivation is explored next.
6 On Fault Tolerance of WSN Regions
We are now ready to derive a lower bound on the connectivity probability of an arbitrarily-shaped WSN region. Let W be a WSN region with a node density of D nodes per hexagon such that the region is approximated by a patch of x level-z polyhexes that constitute a set Λ. Let each node in the region fail independently with probability ρ. Let $Conn_W$ denote the event that all the non-faulty nodes in the region W are connected. Let $Conn_{(T,z,side)}$ denote the event that a level-z polyhex T is connected and each 'side' of T has at least $sidelen(z)/2$ non-empty hexagons. We know that if W is level-z connected, then all the non-faulty nodes in W are connected. Also, W is level-z connected if $\forall T \in \Lambda :: Conn_{(T,z,side)}$. Therefore, the probability that W is connected is bounded by $\Pr[Conn_W] \ge (\Pr[Conn_{(T,z,side)}])^{x}$. Thus, in order to find a lower bound on $\Pr[Conn_W]$, we have to find a lower bound on $(\Pr[Conn_{(T,z,side)}])^{x}$.
Lemma 2. In a level-z polyhex T with a node density of D nodes per hexagon, suppose each node fails independently with a probability ρ. Then the probability
that T is connected and each 'side' of T has at least $sidelen(z)/2$ non-empty hexagons is given by
$$\Pr[Conn_{(T,z,side)}] = \sum_{i=0}^{size(z)} N_{z,i} (1 - \rho^{D})^{size(z)-i} \rho^{D \times i},$$
where $N_{z,i}$ is the number of ways in which we can have i empty hexagons and $size(z) - i$ non-empty hexagons in a level-z polyhex such that the level-z polyhex is connected and each 'side' of the level-z polyhex has at least $sidelen(z)/2$ non-empty hexagons.
Proof. Fix i hexagons in T to be empty such that T is connected and each 'side' of T has at least $sidelen(z)/2$ non-empty hexagons. Since nodes fail independently with probability ρ, and there are D nodes per hexagon, the probability that a hexagon is empty is $\rho^{D}$. Therefore, the probability that exactly these i hexagons are empty in T is given by $(1 - \rho^{D})^{size(z)-i} \rho^{D \times i}$. By assumption, there are $N_{z,i}$ ways to fix i hexagons to be empty. Therefore, the probability that T is connected and each 'side' of T has at least $sidelen(z)/2$ non-empty hexagons despite i empty hexagons is given by $N_{z,i}(1 - \rho^{D})^{size(z)-i} \rho^{D \times i}$. However, note that we can set i (the number of empty hexagons) to be anything from 0 to $size(z)$. Therefore, $\Pr[Conn_{(T,z,side)}]$ is given by $\sum_{i=0}^{size(z)} N_{z,i}(1 - \rho^{D})^{size(z)-i} \rho^{D \times i}$.
Given the probability of $Conn_{(T,z,side)}$, we can now establish a lower bound on the probability that the region W is connected.
Theorem 3. Suppose each node in a WSN region W fails independently with probability ρ, W has a node density of D nodes per hexagon, and W is tiled by a patch of x level-z polyhexes. Then the probability that all non-faulty nodes in W are connected is at least $(\Pr[Conn_{(T,z,side)}])^{x}$.
Proof. There are x level-z polyhexes in W. Note that if W is level-z connected, then all non-faulty nodes in W are connected. However, observe that W is level-z connected if each such level-z polyhex is connected and each 'side' of each such level-z polyhex has at least $sidelen(z)/2$ non-empty hexagons. Recall from Lemma 2 that the probability of such an event for each polyhex is given by $\Pr[Conn_{(T,z,side)}]$. Since there are x such level-z polyhexes, and the failures of nodes (and hence of disjoint level-z polyhexes) are independent, it follows that the probability of W being connected is at least $(\Pr[Conn_{(T,z,side)}])^{x}$.
Note that the lower bound we have established depends on the function $N_{z,i}$ defined in Lemma 2. Unfortunately, to the best of our knowledge, there is no known algorithm that computes $N_{z,i}$ in a reasonable amount of time. Since this is a potentially infeasible approach for large WSNs with millions of nodes, we provide an alternate lower bound for $\Pr[Conn_{(T,z,side)}]$.
Lemma 4. The value of $\Pr[Conn_{(T,z,side)}]$ from Lemma 2 is bounded below by
$$\Pr[Conn_{(T,z,side)}] \ge (\Pr[Conn_{(T,z-1,side)}])^{7} + (\Pr[Conn_{(T,z-1,side)}])^{6} \times \rho^{D \times size(z-1)},$$
where $\Pr[Conn_{(T,1,side)}] = 1 - \rho^{D}$.
Proof. Recall that a level-z polyhex consists of seven level-$(z-1)$ polyhexes, with one internal level-$(z-1)$ polyhex and six outer level-$(z-1)$ polyhexes. Observe
that a level-z polyhex satisfies $Conn_{(T,z,side)}$ if either (a) all seven level-$(z-1)$ polyhexes satisfy $Conn_{(T,z-1,side)}$, or (b) the internal level-$(z-1)$ polyhex is empty and the six outer level-$(z-1)$ polyhexes satisfy $Conn_{(T,z-1,side)}$. From Lemma 2 we know that the probability of a level-$(z-1)$ polyhex satisfying $Conn_{(T,z-1,side)}$ is given by $\Pr[Conn_{(T,z-1,side)}]$, and the probability of a level-$(z-1)$ polyhex being empty is $\rho^{D \times size(z-1)}$. For a level-1 polyhex (which is a regular hexagon tile), the probability that the hexagon is not empty is $1 - \rho^{D}$. Therefore, for $z > 1$, the probability that case (a) or (b) is satisfied is given by $(\Pr[Conn_{(T,z-1,side)}])^{7} + (\Pr[Conn_{(T,z-1,side)}])^{6} \times \rho^{D \times size(z-1)}$. Therefore, $\Pr[Conn_{(T,z,side)}] \ge (\Pr[Conn_{(T,z-1,side)}])^{7} + (\Pr[Conn_{(T,z-1,side)}])^{6} \times \rho^{D \times size(z-1)}$, where $\Pr[Conn_{(T,1,side)}] = 1 - \rho^{D}$.
Analyzing the connectivity probability of WSN regions that are level-z connected for large z can be simplified by invoking Lemma 4, reducing the complexity of the computation to smaller values of z for which $\Pr[Conn_{(T,z,side)}]$ can be computed (by brute force) fairly quickly.
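The recursion of Lemma 4 is straightforward to compute. The sketch below is ours, not the authors' code: it iterates the bound upward from a seed value. Seeding with the (partial) Lemma 2 sum at a small level, using the $N_{z,i}$ counts of Table 1 below (we assume $N_{3,0} = 1$, since a polyhex with no empty hexagons trivially qualifies), yields the kind of tighter bounds behind Table 2; seeding at level 1 yields the much looser bound discussed under 'On the lower bounds' in Sect. 7. Since the recursion is monotone in its argument, seeding with any valid lower bound keeps the result a lower bound.

TABLE1_Z3 = {0: 1, 1: 49, 2: 1176, 3: 18346, 4: 208372, 5: 1830282,
             6: 12899198, 7: 74729943, 8: 361856172, 9: 1481515771}

def lemma2_partial(rho, D, counts=TABLE1_Z3, z=3):
    """Partial Lemma 2 sum over the tabulated N_{z,i}; still a valid lower
    bound because the omitted i >= 10 terms are nonnegative."""
    q = rho ** D
    s = 7 ** (z - 1)
    return sum(n * (1 - q) ** (s - i) * q ** i for i, n in counts.items())

def conn_lower_bound(z, rho, D, seed_level=1, seed=None):
    """Iterate the Lemma 4 recursion upward from a trusted value at
    seed_level (default: the level-1 base case 1 - rho**D)."""
    q = rho ** D
    pr = (1.0 - q) if seed is None else seed
    size = 7 ** (seed_level - 1)          # size(seed_level)
    for _ in range(seed_level + 1, z + 1):
        pr = pr ** 7 + pr ** 6 * q ** size
        size *= 7                         # size grows by a factor of 7
    return pr

# Theorem 3: a region tiled by x level-z polyhexes stays connected with
# probability at least conn_lower_bound(...) ** x, e.g.
# conn_lower_bound(7, 0.15, 3, seed_level=3, seed=lemma2_partial(0.15, 3)).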
7 Discussion
Choosing the size of the hexagon. For the results from the previous section to be of practical use, it is important that we choose the size of the hexagons in our system model carefully. On the one hand, choosing very large hexagons could violate the system model assumption that nodes can communicate with nodes in neighboring hexagons; on the other hand, choosing small hexagons could result in poor lower bounds and thus in over-engineered WSNs that incur high costs with incommensurate benefits. If we make no assumptions about the locations of nodes within hexagons, then the length l of the sides of a hexagon must be at most $R/\sqrt{13}$ to ensure connectivity between non-faulty nodes in neighboring hexagons. However, if the nodes are "evenly" placed within each hexagon, then l can be as large as R/2 while still ensuring connectivity between neighboring hexagons. In both cases, the requirement is that the distance between two non-faulty nodes in neighboring hexagons is at most R.
Computing $N_{z,i}$ from Lemma 2. The function $N_{z,i}$ does not have a closed-form solution. It needs to be computed through exhaustive enumeration. We computed $N_{z,i}$ for some useful values of z and i and included them in Table 1. Using these values, we applied Theorem 3 and Lemma 4 to sensor networks of different sizes, node densities, and node failure probabilities. The results are presented in Table 2. Next, we demonstrate how to interpret and use the entries in these tables through an illustrative example.
Practicality. Our results can be utilized in the following two practical scenarios. (1) Given an existing WSN with known node failure probability, node density, and area of coverage, we can estimate the probability of connectivity of the entire network. First, we decide on the size of a hexagon as discussed previously,
Table 1. Computed Values of Nz,i

z      i   Nz,i                   z   i   Nz,i
k > 2  1   size(k) = 7^(k-1)      3   7   74729943
3      2   1176                   3   8   361856172
3      3   18346                  3   9   1481515771
3      4   208372                 4   2   58653
3      5   1830282                4   3   6666849
3      6   12899198               5   2   2881200
Table 2. Various values of node failure probability ρ, node density D, and level-z polyhex that yield network connectivity probability exceeding 99%

Node density D   No. of Nodes   Node failure prob. ρ   No. of Nodes   Node failure prob. ρ

                 z = 2 (level-2 polyhex)               z = 5 (level-5 polyhex)
3                21             35%                    7203           24%
5                35             53%                    12005          40%
10               70             70%                    24010          63%
                 z = 3 (level-3 polyhex)               z = 6 (level-6 polyhex)
3                147            37%                    50421          19%
5                245            50%                    84035          36%
10               490            70%                    168070         63%
                 z = 4 (level-4 polyhex)               z = 7 (level-7 polyhex)
3                1029           29%                    352947         15%
5                1715           47%                    588245         31%
10               3430           67%                    1176490        57%
and then we consider level-z polyhexes that cover the region. Next, we apply Theorem 3 and Lemma 4 to compute the probability of connectivity of the network for the given values of ρ, D and z, using the precomputed values of $N_{z,i}$ in Table 1. (2) The results in this paper can be used to design a network with a specified probability of connectivity. In this case, we decide on a hexagon size that best suits the purposes of the sensor network and determine the level of the polyhex(es) needed to cover the desired area.
As an example, consider a 200 sq. km region (approximately circular, so that there are no 'bottleneck' regions) that needs to be covered by a sensor network with a 99% connectivity probability. Let the communication radius of each sensor be 50 meters. The average-case value of the length l of the side of the hexagon is then 25 meters, and the 200 sq. km region is tiled by a single level-7 polyhex. From Table 2, we see that if the network has 3 nodes per hexagon, then the region will require about 352947 nodes with a failure probability of 15% (85% reliability). However, if the node redundancy is increased to 5 nodes per hexagon, then the region will require about 588245 nodes with a failure probability of 31% (69% reliability). If the node density is
increased further to 10 nodes per hexagon, then the region will require about 1176490 nodes with a failure probability of 57% (43% reliability).
On the lower bounds. An important observation is that these values for node reliability are lower bounds, but are definitely not tight bounds. This is largely because, in order to obtain tighter lower bounds, we would need to compute the probability of network connectivity from Theorem 3. However, this requires us to compute the values of $N_{z,i}$ for all values of i ranging from 1 to $size(z)$, which is expensive for z exceeding 3. Consequently, we are forced to use the recursive function in Lemma 4 for computing the network connectivity of larger networks. This reduces the accuracy of the lower bound significantly. A side effect of this error is that in Table 2, we see that for a given D, ρ decreases as z increases. If we were to invest the time and computing resources to compute $N_{z,i}$ for higher values of z (5, 6, 7, and greater), then the computed values for ρ in Table 2 would be significantly larger.
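The numbers in this example can be sanity-checked with a few lines of arithmetic (ours, added for illustration):

import math

l = 25.0                                   # hexagon side, meters (R/2)
hex_area = 3 * math.sqrt(3) / 2 * l ** 2   # ~1624 m^2 per hexagon
tiles = 7 ** 6                             # size(7) = 117649 hexagons
print(tiles * hex_area / 1e6)              # ~191 sq. km, close to 200
for D in (3, 5, 10):
    print(D, D * tiles)                    # 352947, 588245, 1176490 nodes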
References

1. Akkaya, K., Younis, M.: A survey on routing protocols for wireless sensor networks. Ad Hoc Networks 3(3), 325–349 (2005), http://dx.doi.org/10.1016/j.adhoc.2003.09.010
2. Akyildiz, I.F., Su, W., Sankarasubramaniam, Y., Cayirci, E.: Wireless sensor networks: a survey. Computer Networks 38(4), 393–422 (2002), http://dx.doi.org/10.1016/S1389-1286(01)00302-4
3. Ammari, H.M., Das, S.K.: Fault tolerance measures for large-scale wireless sensor networks. ACM Transactions on Autonomous and Adaptive Systems 4(1), 1–28 (2009), http://doi.acm.org/10.1145/1462187.1462189
4. Bai, X., Kumar, S., Xuan, D., Yun, Z., Lai, T.H.: Deploying wireless sensors to achieve both coverage and connectivity. In: MobiHoc 2006: Proceedings of the 7th ACM International Symposium on Mobile Ad Hoc Networking and Computing, pp. 131–142. ACM, New York (2006), http://doi.acm.org/10.1145/1132905.1132921
5. Bhandari, V., Vaidya, N.H.: Reliable broadcast in wireless networks with probabilistic failures. In: Proceedings of the 26th IEEE International Conference on Computer Communications, pp. 715–723 (2007), http://dx.doi.org/10.1109/INFCOM.2007.89
6. Cai, H., Jia, X., Sha, M.: Critical sensor density for partial connectivity in large area wireless sensor networks. In: Proceedings of the 27th IEEE International Conference on Computer Communications, pp. 1–5 (2010), http://dx.doi.org/10.1109/INFCOM.2010.5462211
7. Chen, J., Kanj, I.A., Wang, G.: Hypercube network fault tolerance: A probabilistic approach. Journal of Interconnection Networks 6(1), 17–34 (2005), http://dx.doi.org/10.1142/S0219265905001290
8. Chen, J., Wang, G., Lin, C., Wang, T., Wang, G.: Probabilistic analysis on mesh network fault tolerance. Journal of Parallel and Distributed Computing 67, 100–110 (2007), http://dx.doi.org/10.1016/j.jpdc.2006.09.002
9. Chen, J., Wang, G., Chen, S.: Locally subcube-connected hypercube networks: theoretical analysis and experimental results. IEEE Transactions on Computers 51(5), 530–540 (2002), http://dx.doi.org/10.1109/TC.2002.1004592
10. Kikiras, P., Avaritsiotis, J.: Unattended ground sensor network for force protection. Journal of Battlefield Technology 7(3), 29–34 (2004)
11. Kim, D., Hsin, C.F., Liu, M.: Asymptotic connectivity of low duty-cycled wireless sensor networks. In: Military Communications Conference, pp. 2441–2447 (2005), http://dx.doi.org/10.1109/MILCOM.2005.1606034
12. Li, M., Yang, B.: A survey on topology issues in wireless sensor network. In: Proceedings of the 2006 International Conference on Wireless Networks, pp. 503–509 (2006)
13. Meguerdichian, S., Koushanfar, F., Potkonjak, M., Srivastava, M.: Coverage problems in wireless ad-hoc sensor networks. In: Proceedings of the Twentieth Annual Joint Conference of the IEEE Computer and Communications Societies, pp. 1380–1387 (2001), http://dx.doi.org/10.1109/INFCOM.2001.916633
14. Rhoads, G.C.: Planar tilings by polyominoes, polyhexes, and polyiamonds. Journal of Computational and Applied Mathematics 174(2), 329–353 (2005), http://dx.doi.org/10.1016/j.cam.2004.05.002
15. Sastry, S., Radeva, T., Chen, J.: Reliable networks with unreliable sensors. Tech. Rep. TAMU-CSE-TR-2010-7-4, Texas A&M University (2010), http://www.cse.tamu.edu/academics/tr/2010-7-4
16. Shakkottai, S., Srikant, R., Shroff, N.B.: Unreliable sensor grids: coverage, connectivity and diameter. Ad Hoc Networks 3(6), 702–716 (2005), http://dx.doi.org/10.1016/j.adhoc.2004.02.001
17. Su, H., Wang, Y., Shen, Z.: The condition for probabilistic connectivity in wireless sensor networks. In: Proceedings of the Third International Conference on Pervasive Computing and Applications, pp. 78–82 (2008), http://dx.doi.org/10.1109/ICPCA.2008.4783653
18. Szewczyk, R., Polastre, J., Mainwaring, A., Culler, D.: Lessons from a sensor network expedition. In: Proceedings of the First European Workshop on Wireless Sensor Networks, pp. 307–322 (2004), http://dx.doi.org/10.1007/978-3-540-24606-0_21
19. Vincent, P., Tummala, M., McEachen, J.: Connectivity in sensor networks. In: Proceedings of the Fortieth Hawaii International Conference on System Sciences, p. 293c (2007), http://dx.doi.org/10.1109/HICSS.2007.145
20. Wang, X., Xing, G., Zhang, Y., Lu, C., Pless, R., Gill, C.: Integrated coverage and connectivity configuration in wireless sensor networks. In: SenSys 2003: Proceedings of the 1st International Conference on Embedded Networked Sensor Systems, pp. 28–39 (2003), http://doi.acm.org/10.1145/958491.958496
21. Werner-Allen, G., Lorincz, K., Welsh, M., Marcillo, O., Johnson, J., Ruiz, M., Lees, J.: Deploying a wireless sensor network on an active volcano. IEEE Internet Computing 10, 18–25 (2006), http://dx.doi.org/10.1109/MIC.2006.26
22. Zhang, H., Hou, J.: Maintaining sensing coverage and connectivity in large sensor networks. Ad Hoc & Sensor Wireless Networks 1(1-2) (2005), http://oldcitypublishing.com/AHSWN/AHSWNabstracts/AHSWN1.1-2abstracts/AHSWNv1n1-2p89-124Zhang.html
Energy Aware Fault Tolerant Routing in Two-Tiered Sensor Networks

Ataul Bari, Arunita Jaekel, and Subir Bandyopadhyay

School of Computer Science, University of Windsor, 401 Sunset Avenue, Windsor, ON, N9B 3P4, Canada
{bari1,arunita,subir}@uwindsor.ca
Abstract. The design of fault-tolerant sensor networks is receiving increasing attention in recent times. In this paper we point out that simply ensuring that a sensor network can tolerate fault(s) is not sufficient; it is also important to ensure that the network remains viable for the longest possible time, even if a fault occurs. We have focussed on the problem of designing 2-tier sensor networks using relay nodes as cluster heads. Our objective is to ensure that the network has a communication strategy that extends, as much as possible, the period for which the network remains operational when there is a single relay node failure. We describe an Integer Linear Program (ILP) formulation and use this formulation to study the effect of single faults. We have compared our results to those obtained using standard routing protocols, the Minimum Transmission Energy Model (MTEM) and the Minimum Hop Routing Model (MHRM), and have shown that our routing algorithm performs significantly better than both.
1 Introduction
A wireless sensor network (WSN) is a network of battery-powered, multi-functional devices, known as sensor nodes. Each sensor node typically consists of a micro-controller, a limited amount of memory, sensing device(s), and wireless transceiver(s) [2]. A sensor network performs its tasks through the collaborative efforts of a large number of sensor nodes that are densely deployed within the sensing field [2], [3], [4]. Data from each node in a sensor network are gathered at a central entity, called the base station [2], [5]. Sensor nodes are powered by batteries, and recharging or replacing the batteries is usually not feasible due to economic reasons and/or environmental constraints [2]. Therefore it is extremely important to design communication protocols and algorithms that are energy efficient, so that the duration of useful operation, often called the lifetime [6] of the network, can be extended as much as possible [3], [4], [5], [7], [24]. The lifetime of a sensor network is defined as the time interval from the inception of the operation of the network to the time when a number of critical nodes "die" [5], [6].
A. Jaekel and S. Bandyopadhyay have been supported by discovery grants from the Natural Sciences and Engineering Research Council of Canada.
Recently, some special nodes, called relay nodes, have been proposed for sensor networks [8] - [17]. Relay nodes, provisioned with higher power, have been proposed as cluster heads in two-tiered sensor networks [10], [12], [17], [18], [19], where each relay node is responsible for collecting data from the sensor nodes belonging to its own cluster and for forwarding the collected data to the base station. The model for transmission of data from a relay node to the base station may be categorized either as the single-hop data transmission model (SHDTM) or the multi-hop data transmission model (MHDTM) [15], [16], [17], [20]. In MHDTM, each relay node, in general, uses some intermediate relay node(s) to forward the data to the base station. The MHDTM is considered in this paper since it is particularly suitable for larger networks. In the non-flow-splitting model, a relay node is not allowed to split the traffic and forwards all its data to a single relay node (or to the base station), so that there is always a single path from each relay node to the base station. This is a more appropriate model for 2-tier networks, with important technological advantages [8], [15], and has been used in this paper. In the periodic data gathering model [8] considered in this paper, each period of data gathering (starting from sensing until all data reach the base station) is referred to as a round [20].
Although provisioned with higher power, the relay nodes are also battery operated and hence power constrained [16], [17]. In 2-tier networks, the lifetime is primarily determined by the duration for which the relay nodes are operational [10], [19]. It is therefore very important to allocate the sensor nodes to the relay nodes appropriately, and to find an efficient communication scheme that minimizes the energy dissipation of the relay nodes. We have measured the lifetime of a 2-tier network, following the N-of-N metric [6], by the number of rounds the network operates from the start until the first relay node depletes its energy completely. In a 2-tier network using the N-of-N metric, assuming equal initial energy provisioning in each relay node, the lifetime of the network is given by the ratio of the initial energy to the maximum energy dissipated by any relay node in a round. Thus, maximizing the lifetime is equivalent to minimizing the maximum energy dissipated by any relay node in a round [8], [19]. In the first-order radio model [5], [6] used here, energy is dissipated at a rate of $\alpha_1$/bit ($\alpha_2$/bit) for receiving (transmitting) data. The transmit amplifier also dissipates $\beta$ amount of energy to transmit a unit bit of data over unit distance. The energy dissipated to receive b bits (transmit b bits over a distance d) is given by $E_{Rx} = \alpha_1 b$ ($E_{Tx}(b, d) = \alpha_2 b + \beta b d^{q}$), where q is the path loss exponent, $2 \le q \le 4$, for free space using short to medium-range radio communication.
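As a concrete reference, the first-order radio model, with the constants used later in the experiments of Sect. 3 ($\alpha_1 = \alpha_2 = 50$ nJ/bit, $\beta = 100$ pJ/bit/m², q = 2, 5 J per relay node), can be written as the following sketch (ours, not from the paper):

ALPHA1 = ALPHA2 = 50e-9   # J/bit, receive / transmit electronics
BETA = 100e-12            # J/bit/m^2, transmit amplifier coefficient
Q = 2                     # path loss exponent

def e_rx(bits):
    """Energy to receive `bits` bits."""
    return ALPHA1 * bits

def e_tx(bits, d):
    """Energy to transmit `bits` bits over distance d meters."""
    return ALPHA2 * bits + BETA * bits * d ** Q

# Lifetime under the N-of-N metric: a relay node with 5 J of initial energy
# whose worst-case per-round dissipation is F_max lasts 5.0 / F_max rounds.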
Due to the nature of the wireless media, and based on the territory of the deployment, nodes in a sensor network are prone to faults [23]. A sensor network should ideally be resilient with respect to faults. In 2-tier networks, the failure of a single relay node may have a significant effect on the overall lifetime of the network [8]. In a fault-free environment, it is sufficient that each sensor node is able to send the data it collects to at least one relay node. To provide fault tolerance, we need a placement strategy that allows some redundancy of the relay nodes, so that, in the event of any failure(s) of relay node(s), each sensor node belonging to the cluster of a failed relay node is able to send its data to another fault-free relay node, and data from all fault-free relay nodes are still able to reach the base station successfully. In [22], the authors have proposed an approximation algorithm to achieve single-connectivity and double-connectivity. In [25], the authors have presented a two-step approximation algorithm to obtain a 1-connected and a 2-connected network. In [17], a 2-tier architecture is considered and an optimal placement of relay nodes for each cell is computed to allow 2-connectivity.
Even though a significant amount of work has focussed on extending the lifetime of fault-free sensor networks, including two-tier networks [5], [10], [15], [19], the primary objective of research on fault-tolerant sensor networks has been to ensure k-connectivity of the network, for some pre-specified k > 1. Such a design ensures that the network can handle k − 1 faults, since there exists at least one route of fault-free relay nodes from every sensor node to the base station. However, when a fault occurs, some relay nodes will be communicating more data compared to the fault-free case, and it is quite likely that the fault may significantly affect the lifetime of the network. To the best of our knowledge, no research has attempted to minimize the effect of faults on the lifetime of a sensor network. Our approach is different from other research in this area since our objective is to guarantee that the network will be operational for the maximum possible period of time, even in the presence of faults. We have confined our work to the most likely scenario of a single relay node failure and have shown how to design the network to maximize its lifetime when a single relay node becomes faulty. Obviously, to handle single faults in any relay node, all approaches, including ours, must design a network which guarantees that each sensor node has a fault-free path to the base station avoiding the faulty relay node. In our approach, for any case of single relay node failure, we select the paths from all sensor nodes to the base station in such a way that (a) we avoid the faulty relay node and (b) the selected paths are such that the lifetime of the network is guaranteed to be as high as possible.
2 Fault Tolerant Routing Design

2.1 Network Model
We consider a two-tiered wireless sensor network model with n relay nodes and a base station. All data from the network are collected at the base station. For convenience, we assign labels 1, 2, 3, 4, ..., n to the relay nodes and label n + 1 to the base station. If a sensor node i can send its data to relay node j, we will say that j covers i. We assume that relay nodes are placed in such a way that each sensor node is covered by at least two relay nodes. This ensures that when a relay node fails, all sensor nodes in its cluster can be reassigned to other cluster(s), and the load (in terms of the number of bits) generated in the cluster of the failed node is redistributed among the neighboring relay nodes. A number of recent papers have addressed the issue of fault-tolerant placement of relay nodes to implement double (or multiple) coverage of each sensor node [17], [21], [22],
[25]. Such fault-tolerant placement schemes can also indicate the "backup" relay node for each sensor node, to be used when its original cluster head fails. We assume that the initial placement of the relay nodes has been done according to one of these existing approaches, so that the necessary level of coverage is achieved.
Based on the average amount of data generated by each cluster and the location of the relay nodes, the Integer Linear Program (ILP) given below calculates the optimal routing schedule such that the worst-case lifetime for any single-fault scenario is maximized. In the worst-case situation, a relay node fails from the very beginning. We therefore consider all single relay node failures occurring when the network starts operating, and determine which failure has the worst effect on the lifetime, even if an optimal routing schedule is followed to handle the failure. This calculation is performed offline, so it is reasonable to use an ILP to compute the most energy-efficient routing schedule. The backup routing schedule for each possible fault can be stored either at the individual relay nodes or at the base station. In the second option, the base station, which is not power constrained, can transmit the updated schedule to the relay nodes when needed. For each relay node, the energy required for receiving the updated schedules is negligible compared to the energy required for data transmission, and hence is not expected to affect the overall lifetime significantly.
In our model, applications are assumed to have long idle times and to be able to tolerate some latency [26], [27]. The nodes sleep during the idle time, and transmit/receive when they are awake. Hence, energy is dissipated by a node only while it is either transmitting or receiving. We further assume that both sensor and relay nodes communicate through an ideal shared medium. As in [12], [13], we assume that communication between nodes, including the sleep/wake scheduling and the underlying synchronization protocol, is handled by appropriate state-of-the-art MAC protocols, such as those proposed in [26], [27], [28], [29].

2.2 Notation Used
In our formulation we are given the following data as input:
• $\alpha_1$ ($\alpha_2$): Energy coefficient for reception (transmission).
• $\beta$: Energy coefficient for the amplifier.
• $q$: Path loss exponent.
• $b_i$: Number of bits generated per round by the sensor nodes belonging to cluster i, in the fault-free case.
• $b_i^{k}$: Number of bits per round, originally from cluster k, that are reassigned to cluster i when relay node k fails. Clearly, $\sum_{i=1; i \ne k}^{n} b_i^{k} = b_k$.
• $n$: Total number of relay nodes.
• n + 1: Index of the base station. • C: A large constant, greater than the total number of bits received by the base station in a round. • dmax : Transmission range of each relay node. • di,j : Euclidean distance from node i to node j. We also define the following variables: k : A binary variable defined as follows: • Xi,j 1 if node i selects j to send its data when relay node k fails, k Xi,j = 0 otherwise.
• Tik : Number of bits transmitted by relay node i when relay node k fails. • Gki : Amount of energy needed by the amplifier in relay node i to send its data to the next hop in its path to the base station when relay node k fails. • Rik : Number of bits received by relay node i from other relay nodes when relay node k fails. k • fi,j : Amount of flow from relay node i to relay node j, when relay node k fails. • Fmax : The total energy spent per round by the relay node which is being depleted at the fastest rate when any one relay node fails. 2.3
2.3 ILP Formulation for Fault Tolerant Routing (ILP-FTR)

Minimize $F_{max}$   (1)

Subject to:

a) The range of transmission from a relay node is $d_{max}$:

$X_{i,j}^k \cdot d_{i,j} \le d_{max}$,  $\forall i,k,\ 1 \le i,k \le n,\ k \ne i,j,\ i \ne j$; $\forall j,\ 1 \le j \le n+1$   (2)

b) Ensure that the non-flow-splitting model is followed, so that all data from relay node i are forwarded to only one other node j:

$\sum_{j=1;\,j \ne i,k}^{n+1} X_{i,j}^k = 1$,  $\forall i,k,\ 1 \le i,k \le n,\ k \ne i$   (3)

c) Only one outgoing link from relay node i can have non-zero data flow:

$f_{i,j}^k \le C \cdot X_{i,j}^k$,  $\forall i,k,\ 1 \le i,k \le n,\ k \ne i,j,\ i \ne j$; $\forall j,\ 1 \le j \le n+1,\ j \ne k$   (4)

d) Satisfy flow constraints:

$\sum_{j=1;\,j \ne i,k}^{n+1} f_{i,j}^k - \sum_{j=1;\,j \ne i,k}^{n} f_{j,i}^k = b_i + b_i^k$,  $\forall i,k,\ 1 \le i,k \le n,\ k \ne i$   (5)

e) Calculate the total number of bits transmitted by relay node i:

$T_i^k = \sum_{j=1;\,j \ne i,k}^{n+1} f_{i,j}^k$,  $\forall i,k,\ 1 \le i,k \le n,\ k \ne i$   (6)

f) Calculate the amplifier energy dissipated by relay node i to transmit to the next hop:

$G_i^k = \beta \sum_{j=1;\,j \ne i,k}^{n+1} f_{i,j}^k \cdot (d_{i,j})^q$,  $\forall i,k,\ 1 \le i,k \le n,\ k \ne i$   (7)

g) Calculate the number of bits received by node i from other relay node(s):

$R_i^k = \sum_{j=1;\,j \ne i,k}^{n} f_{j,i}^k$,  $\forall i,k,\ 1 \le i,k \le n,\ k \ne i$   (8)

h) The energy dissipated per round by relay node i, when node k has failed, must not exceed $F_{max}$:

$\alpha_1 (R_i^k + b_i^k) + \alpha_2 T_i^k + G_i^k \le F_{max}$,  $\forall i,k,\ 1 \le i,k \le n,\ k \ne i$   (9)

2.4 Justification of the ILP Equations
Equation (1) is the objective function for the formulation; it minimizes the maximum energy dissipated by any individual relay node in one round of data gathering, over all possible fault scenarios. Constraint (2) ensures that a relay node i cannot transmit to a node j if j is outside the transmission range of node i. Constraints (3) and (4) indicate that, for any given fault (e.g., a fault in node k), a non-faulty relay node i can only transmit data to exactly one other (non-faulty) node j. Constraint (5) is the standard flow constraint [1], used to find a route to the base station for the data originating in each cluster when node k fails. We note that the total data (number of bits) generated in cluster i, when node k fails, is given by the number of bits $b_i$ originally generated in cluster i, plus the additional number of bits $b_i^k$ reassigned from cluster k to cluster i, due to the failure of relay node k. Constraint (6) specifies the total number of bits $T_i^k$ transmitted by relay node i when node k has failed. Constraint (7) is used to calculate $G_i^k$, the total amplifier energy needed at relay node i when node k fails, by directly applying the first order radio model [5], [6]. Constraint (8) is used to calculate the total number of bits $R_i^k$ received at relay node i from other relay node(s) when node k fails. Finally, (9) gives the total energy dissipated by each relay node when node k fails. The total energy dissipated by a relay node, for any possible fault scenario (i.e., any value of k), cannot exceed $F_{max}$, which the formulation attempts to minimize.
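To make the formulation concrete, the following is a minimal sketch of how constraints (1)-(9) could be assembled with a generic modeling interface. It is an illustration, not the authors' implementation: the paper uses CPLEX [30], whereas this sketch uses the open-source PuLP library (our choice), and the toy distances and per-cluster loads are invented.

```python
# Sketch of ILP-FTR on a toy instance (PuLP is an assumption; the paper used CPLEX).
import pulp

n = 4                      # relay nodes 1..n; index n+1 is the base station
alpha1 = alpha2 = 50e-9    # J/bit, reception / transmission coefficients
beta, q = 100e-12, 2       # amplifier coefficient (J/bit/m^2), path loss exponent
dmax, C = 200.0, 10**6     # relay range (m) and a large constant
dist = {(i, j): 60.0 * abs(i - j) for i in range(1, n + 1)
        for j in range(1, n + 2) if i != j}          # toy distances (assumed)
b = {i: 1000.0 for i in range(1, n + 1)}             # bits per round per cluster
bk = {(i, k): b[k] / (n - 1) for k in range(1, n + 1)
      for i in range(1, n + 1) if i != k}            # even reassignment (assumed)

prob = pulp.LpProblem("ILP_FTR", pulp.LpMinimize)
Fmax = pulp.LpVariable("Fmax", lowBound=0)
prob += Fmax                                          # objective (1)

for k in range(1, n + 1):                             # each single-fault scenario
    live = [i for i in range(1, n + 1) if i != k]
    X = {(i, j): pulp.LpVariable(f"X_{i}_{j}_{k}", cat="Binary")
         for i in live for j in range(1, n + 2) if j not in (i, k)}
    f = {(i, j): pulp.LpVariable(f"f_{i}_{j}_{k}", lowBound=0) for (i, j) in X}
    for i in live:
        out = [j for j in range(1, n + 2) if j not in (i, k)]
        for j in out:
            prob += dist[i, j] * X[i, j] <= dmax      # (2) transmission range
            prob += f[i, j] <= C * X[i, j]            # (4) single outgoing link
        prob += pulp.lpSum(X[i, j] for j in out) == 1 # (3) no flow splitting
        inflow = pulp.lpSum(f[j, i] for j in live if (j, i) in f)
        outflow = pulp.lpSum(f[i, j] for j in out)
        prob += outflow - inflow == b[i] + bk[i, k]   # (5) flow conservation
        T = outflow                                    # (6) bits transmitted
        G = pulp.lpSum(beta * f[i, j] * dist[i, j] ** q for j in out)  # (7)
        R = inflow                                     # (8) bits received
        prob += alpha1 * (R + bk[i, k]) + alpha2 * T + G <= Fmax       # (9)

prob.solve()
print("worst-case per-round energy:", pulp.value(Fmax))
```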
3 Experimental Results
In this section, we present the simulation results for our fault-tolerant routing scheme. We considered a 240m × 240m networking area with seventeen relay nodes, and with sensor nodes randomly distributed in the area. The results were obtained using the ILOG CPLEX 9.1 solver [30]. For each fault scenario (i.e., a specified relay node becomes faulty), we measured the achieved lifetime of the network as the number of rounds until the first fault-free relay node runs out of battery power. The lifetime in the presence of a fault can vary widely, depending on the location and load of the faulty node. When reporting the lifetime achieved in the presence of a fault, we have taken the worst-case value, i.e., the lowest lifetime obtained after considering all possible single node failures, with the node failure occurring immediately after the network has been deployed. For experimental purposes, we considered a number of different sensor node distributions, with the number of sensor nodes in the network varying from 136 to 255. We have assumed that:
1. the communication energy dissipation is based on the first order radio model, described in Section 1;
2. the values of the constants are the same as in [5], so that (a) $\alpha_1 = \alpha_2 = 50$ nJ/bit, (b) $\beta = 100$ pJ/bit/m² and (c) the path-loss exponent $q = 2$;
3. the range of each sensor node is 80m;
4. the range of each relay node is 200m, as in [17]; and
5. the initial energy of each relay node is 5J, as in [17].
We also assumed that a separate node placement and clustering scheme (as in [17], [21]) is used to ensure that each sensor and relay node has a valid path to the base station for all single fault scenarios, and to pre-assign the sensor nodes to clusters. Under these assumptions, we compared the performance of our scheme with two existing, well-known schemes that are widely used in sensor networks:
i) the minimum transmission energy model (MTEM) [5], where each node i transmits to its nearest neighbor j such that node j is closer to the base station than node i; and
ii) the minimum hop routing model (MHRM) [12], [14], where each node finds a path to the base station that minimizes the number of hops.
Figure 1 compares the network lifetime obtained using the ILP-FTR, MTEM and MHRM schemes. As shown in the figure, our formulation substantially outperforms both the MTEM and MHRM approaches under any single relay node failure. Furthermore, the ILP guarantees the "best" solution (with respect to the objective being optimized). The results show that, under any single relay node failure, our method can typically achieve an improvement of more than 2.7 times the network lifetime of MTEM, and 2.3 times that of MHRM.
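As a quick sanity check on these constants, the per-round energy terms of constraint (9) can be evaluated directly. The sketch below illustrates the first order radio model computation; the load values in the example are invented for illustration.

```python
# First order radio model with the constants listed above (toy loads assumed).
ALPHA1 = ALPHA2 = 50e-9   # J/bit for reception / transmission electronics
BETA, Q = 100e-12, 2      # amplifier energy (J/bit/m^2), path loss exponent

def energy_per_round(bits_rx, bits_own, bits_tx, hop_dist_m):
    """Energy spent by one relay node in a single data-gathering round."""
    receive = ALPHA1 * (bits_rx + bits_own)        # electronics, incoming bits
    transmit = ALPHA2 * bits_tx                    # electronics, outgoing bits
    amplify = BETA * bits_tx * hop_dist_m ** Q     # amplifier, distance-dependent
    return receive + transmit + amplify

# A relay forwarding 4000 bits over 100 m plus 2000 bits from its own cluster:
e = energy_per_round(bits_rx=4000, bits_own=2000, bits_tx=6000, hop_dist_m=100.0)
print(f"{e * 1e3:.2f} millijoules per round")  # 0.3 + 0.3 + 6.0 = 6.6 mJ
```

With a 5J initial battery, such a node would last roughly 750 rounds, which matches the order of magnitude of the lifetimes reported below.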
Fig. 1. Comparison of the lifetimes in rounds obtained using ILP-FTR, MTEM and MHRM on networks with different numbers of sensor nodes
Fig. 2. Variation of the lifetimes in rounds under the failure of different relay nodes, obtained using ILP-FTR, MTEM and MHRM on a network with 170 sensor nodes
Figure 2 shows how the network lifetime varies with the failure of each relay node under the ILP-FTR, MTEM and MHRM schemes, on a network with 170 sensor nodes. As the figure shows, our approach provides a substantial improvement over the other approaches, in terms of network lifetime, across all failure scenarios. Using our approach, it is also possible to identify the most critical relay node(s) and, possibly, provide some additional protection (e.g., deployment of backup node(s)) to guarantee the lifetime. Finally, we note that MTEM appears to be much more vulnerable to fluctuations in lifetime, depending on the particular node that failed, than the other two schemes.
4 Conclusions
In this paper we have addressed the problem of maximizing the lifetime of a two-tiered sensor network in the presence of faults. Although many papers have considered energy-aware routing for the fault-free case, and others have proposed deploying redundant relay nodes to meet connectivity and coverage requirements, we believe this is the first paper to investigate energy-aware routing for different fault scenarios. Our approach optimizes the network lifetime that can be achieved, and provides the corresponding routing scheme to be followed to achieve this goal, for any single node fault. The simulation results show that the proposed approach can significantly improve network lifetime compared to standard schemes such as MTEM and MHRM.
References

1. Ahuja, R.K., Magnanti, T.L., Orlin, J.B.: Network Flows: Theory, Algorithms, and Applications. Prentice Hall, Englewood Cliffs (1993)
2. Akyildiz, I.F., Su, W., Sankarasubramaniam, Y., Cayirci, E.: Wireless sensor networks: a survey. Computer Networks 38, 393-422 (2002)
3. Akkaya, K., Younis, M.: A survey on routing protocols for wireless sensor networks. Ad Hoc Networks 3(3), 325-349 (2005)
4. Chong, C.-Y., Kumar, S.P.: Sensor Networks: Evolution, Opportunities, and Challenges. Proceedings of the IEEE 91(8), 1247-1256 (2003)
5. Heinzelman, W., Chandrakasan, A., Balakrishnan, H.: Energy efficient communication protocol for wireless micro-sensor networks. In: 33rd HICSS, pp. 3005-3014 (2000)
6. Pan, J., Hou, Y.T., Cai, L., Shi, Y., Shen, S.X.: Topology Control for Wireless Sensor Networks. In: Proceedings of the International Conference on Mobile Computing and Networking, pp. 286-299 (2003)
7. Duarte-Melo, E.J., Liu, M.: Analysis of energy consumption and lifetime of heterogeneous wireless sensor networks. In: Proceedings of the IEEE Global Telecommunications Conference, vol. 1, pp. 21-25 (2002)
8. Bari, A.: Energy Aware Design Strategies for Heterogeneous Sensor Networks. PhD thesis, University of Windsor (2010)
9. Bari, A., Jaekel, A., Bandyopadhyay, S.: Integrated Clustering and Routing Strategies for Large Scale Sensor Networks. In: Akyildiz, I.F., Sivakumar, R., Ekici, E., de Oliveira, J.C., McNair, J. (eds.) NETWORKING 2007. LNCS, vol. 4479, pp. 143-154. Springer, Heidelberg (2007)
10. Bari, A., Jaekel, A., Bandyopadhyay, S.: Optimal Placement and Routing Strategies for Resilient Two-Tiered Sensor Networks. Wireless Communications and Mobile Computing 9(7), 920-937 (2008), doi:10.1002/wcm.639
11. Cheng, X., Du, D.-Z., Wang, L., Xu, B.B.: Relay Sensor Placement in Wireless Sensor Networks. Wireless Networks 14(3), 347-355 (2008)
12. Gupta, G., Younis, M.: Load-balanced clustering of wireless sensor networks. In: IEEE International Conference on Communications, vol. 3, pp. 1848-1852 (2003)
13. Gupta, G., Younis, M.: Fault-tolerant clustering of wireless sensor networks. In: IEEE WCNC, pp. 1579-1584 (2003)
14. Gupta, G., Younis, M.: Performance evaluation of load-balanced clustering of wireless sensor networks. In: International Conference on Telecommunications, vol. 2, pp. 1577-1583 (2003)
15. Hou, Y.T., Shi, Y., Pan, J., Midkiff, S.F.: Maximizing the Lifetime of Wireless Sensor Networks through Optimal Single-Session Flow Routing. IEEE Transactions on Mobile Computing 5(9), 1255-1266 (2006)
16. Hou, Y.T., Shi, Y., Sherali, H.D., Midkiff, S.F.: On Energy Provisioning and Relay Node Placement for Wireless Sensor Networks. In: IEEE International Conference on Sensor and Ad Hoc Communications and Networks (SECON) (2005)
17. Tang, J., Hao, B., Sen, A.: Relay node placement in large scale wireless sensor networks. Computer Communications 29(4), 490-501 (2006)
18. Bari, A., Jaekel, A., Bandyopadhyay, S.: Clustering Strategies for Improving the Lifetime of Two-Tiered Sensor Networks. Computer Communications 31(14), 3451-3459 (2008)
19. Bari, A., Jaekel, A., Bandyopadhyay, S.: A Genetic Algorithm Based Approach for Energy Efficient Routing in Two-Tiered Sensor Networks. Ad Hoc Networks Journal, Special Issue: Bio-Inspired Computing 7(4), 665-676 (2009)
20. Kalpakis, K., Dasgupta, K., Namjoshi, P.: Efficient algorithms for maximum lifetime data gathering and aggregation in wireless sensor networks. Computer Networks 42(6), 697-716 (2003)
21. Bari, A., Wu, Y., Jaekel, A.: Integrated Placement and Routing of Relay Nodes for Fault-Tolerant Hierarchical Sensor Networks. In: IEEE ICCCN - SN, pp. 1-6 (2008)
22. Hao, B., Tang, J., Xue, G.: Fault-tolerant relay node placement in wireless sensor networks: formulation and approximation. In: Workshop on High Performance Switching and Routing (HPSR), pp. 246-250 (2004)
23. Alwan, H., Agarwal, A.: A Survey on Fault Tolerant Routing Techniques in Wireless Sensor Networks. In: SensorComm, pp. 366-371 (2009)
24. Wu, Y., Fahmy, S., Shroff, N.B.: On the Construction of a Maximum-Lifetime Data Gathering Tree in Sensor Networks: NP-Completeness and Approximation Algorithm. In: INFOCOM, pp. 356-360 (2008)
25. Liu, H., Wan, P.-J., Jia, X.: Fault-Tolerant Relay Node Placement in Wireless Sensor Networks. In: Wang, L. (ed.) COCOON 2005. LNCS, vol. 3595, pp. 230-239. Springer, Heidelberg (2005)
26. Ye, W., Heidemann, J., Estrin, D.: An Energy-Efficient MAC Protocol for Wireless Sensor Networks. In: IEEE INFOCOM, pp. 1567-1576 (2002)
27. Ye, W., Heidemann, J., Estrin, D.: Medium access control with coordinated adaptive sleeping for wireless sensor networks. IEEE/ACM Transactions on Networking 12(3), 493-506 (2004)
28. Wu, Y., Fahmy, S., Shroff, N.B.: Optimal Sleep/Wake Scheduling for Time-Synchronized Sensor Networks with QoS Guarantees. In: Proceedings of IEEE IWQoS, pp. 102-111 (2006)
29. Wu, Y., Fahmy, S., Shroff, N.B.: Energy Efficient Sleep/Wake Scheduling for Multi-Hop Sensor Networks: Non-Convexity and Approximation Algorithm. In: Proceedings of IEEE INFOCOM, pp. 1568-1576 (2007)
30. ILOG CPLEX 9.1 Documentation, http://www.columbia.edu/~dano/resources/cplex91_man/index.html
Scheduling Randomly-Deployed Heterogeneous Video Sensor Nodes for Reduced Intrusion Detection Time

Congduc Pham

University of Pau, LIUPPA Laboratory, Avenue de l'Université - BP 1155, 64013 Pau Cedex, France
[email protected]
http://web.univ-pau.fr/~cpham
Abstract. This paper proposes to use video sensor nodes to provide an efficient intrusion detection system. We use a scheduling mechanism that takes into account the criticality of the surveillance application, and present a performance study of various cover set construction strategies that take into account cameras with heterogeneous angles of view and those with very small angles of view. We show by simulation how a dynamic criticality management scheme can provide fast event detection for mission-critical surveillance applications, by increasing the network lifetime and providing a low stealth time for intrusions.

Keywords: Sensor networks, video surveillance, coverage, mission-critical applications.
1 Introduction

The monitoring capability of Wireless Sensor Networks (WSN) makes them very suitable for large scale surveillance systems. Most of these applications have a high level of criticality and cannot be deployed with the current state of technology. This article focuses on Wireless Video Sensor Networks (WVSN), where sensor nodes are equipped with miniaturized video cameras. We consider WVSN for mission-critical surveillance applications, where sensors can be thrown in mass when needed for intrusion detection or disaster relief applications. This article also focuses on taking into account cameras with heterogeneous angles of view and those with very small angles of view. Surveillance applications [1,2,3,4,5] have very specific needs due to their inherently critical nature associated with security. Early surveillance applications involving WSN have been applied to critical infrastructures such as production systems or oil/water pipeline systems [6,7]. There have also been some propositions for intrusion detection applications [8,9,10,11], but most of these studies focused on coverage and energy optimizations without explicitly having the application's criticality in the control loop, which is the main concern in our work.
For instance, with video sensors, the higher the capture rate is, the better relevant events can be detected and identified. However, even in the case of very mission-critical applications, it is not realistic to consider that video nodes should always capture at their maximum rate when in active mode. The notion of cover set has been introduced to define the redundancy level of the sensor nodes that monitor the same region. In [12] we developed the idea that when a node has several cover sets, it can increase its frame capture rate because, if it runs out of energy, it can be replaced by one of its cover sets. Then, depending on the application's criticality, the frame capture rate of those nodes with a large number of cover sets can vary: a low criticality level indicates that the application does not require a high video frame capture rate, while a high criticality level does. According to the application's requirements, an r0 value indicating the criticality level can be initialized accordingly in all sensor nodes prior to deployment. Based on the criticality model we developed previously in [12], this article makes two contributions. The first contribution is an enhanced model for determining a sensor's cover sets that takes into account cameras with heterogeneous angles of view and those with very small angles of view. The performance of this approach is evaluated through simulation. The second contribution is to show the performance of the multiple cover sets criticality-based scheduling method proposed in [12] for fast event detection in mission-critical applications. The paper is organized as follows: Section 2 presents the coverage model and our approach for quickly building multiple cover sets per sensor. In Section 3 we briefly present the dynamic criticality management model, and in Section 4 we present the main contribution of this paper, which focuses on fast event detection. We conclude in Section 5.
2 Video Sensor Model

A video sensor node v is represented by the FoV of its camera. In our approach, we consider a commonly used 2-D model of a video sensor node where the FoV is defined as a triangle (pbc) denoted by a 4-tuple $v(P, d, \vec{V}, \alpha)$. Here P is the position of v, d is the distance pv (depth of view, DoV), $\vec{V}$ is the vector representing the line of sight of the camera's FoV, which determines the sensing direction, and $\alpha$ is the angle of the FoV on either side of $\vec{V}$ ($2\alpha$ can be denoted as the angle of view, AoV). The left side of Figure 1(a) illustrates the FoV of a video sensor node in our model. The AoV ($2\alpha$) is 30° and the distance bc is the linear FoV, which is usually expressed in ft/1000yd or millimeters/meter. By using simple trigonometric relations we can link bc to pv with the relation $bc = 2\,\frac{\sin\alpha}{\cos\alpha}\cdot pv$.

Fig. 1. Coverage model: (a) coverage model; (b) heterogeneous AoV

We define a cover set $Co_i(v)$ of a video node v as a subset of video nodes such that $\bigcup_{v' \in Co_i(v)}$ (v''s FoV area) covers v's FoV area. Co(v) is defined as the set of all the cover sets $Co_i(v)$ of node v. One of the first embedded cameras on wireless sensor hardware is the Cyclops board designed for the CrossBow Mica2 sensor [13], which is advertised to have an AoV of 52°. Recently, the IMB400 multimedia board has been designed for the Intel Mote2 sensor and has an AoV of about 20°, which is rather small. Obviously, the linear FoV and the AoV are important criteria in video sensor networks deployed for mission-critical surveillance applications. The DoV is a more subjective parameter: technically, the DoV could be very large, but practically it is limited by the fact that an observed object must be sufficiently big to be identified.

2.1 Determining Cover Sets
In the case of omnidirectional sensing, a node can simply determine which parts of the coverage disc are covered by its neighbors. For FoV coverage the task is more complex: determining whether a sensor's FoV is completely covered by a subset of neighbor sensors is a time consuming task, usually too resource-consuming for autonomous sensors. A simple approach presented in [14] is to use significant points of a sensor's FoV to quickly determine cover sets that may not completely cover sensor v's FoV, but a high percentage of it. First, sensor v can classify its neighbors into 3 categories of nodes: (i) those that cover point p, (ii) those that cover point b and (iii) those that cover point c. Then, in order to avoid selecting neighbors that cover only a small portion of v's FoV, we add a fourth point taken near the center of v's FoV to construct a fourth set, and require that candidate neighbors cover at least one of the 3 vertices and the fourth point. It is possible to use pbc's center of gravity, noted point g, as depicted in Figure 1(a)(right). In this case, a node v can practically compute Co(v) by finding the following sets, where N(v) represents the set of neighbors of node v:
– P/B/C/G = {v' ∈ N(v) : v' covers point p/b/c/g of the FoV}
– PG = {P ∩ G}, BG = {B ∩ G}, CG = {C ∩ G}
Then, Co(v) can be computed as the Cartesian product of the sets PG, BG and CG ({PG × BG × CG}). However, compared to the basic approach described in [14], point g may not be the best choice in the case of heterogeneous camera AoVs and very small AoVs, as will be explained in the next subsections.
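This construction maps directly onto a few set operations. The sketch below is our own hypothetical illustration of the step: the point-coverage predicate `covers()` is left abstract (in practice it would be a point-in-triangle test against a neighbor's FoV), and the tabulated toy data is invented.

```python
from itertools import product

def cover_sets(points, neighbors, covers):
    """Compute Co(v) from the significant points of v's FoV.

    points:    (p, b, c, g), the triangle vertices and the center of gravity
    neighbors: iterable of neighbor node ids, N(v)
    covers:    predicate covers(u, pt) -> True when point pt lies inside
               neighbor u's FoV (a point-in-triangle test in practice)
    """
    p, b, c, g = points
    P = {u for u in neighbors if covers(u, p)}
    B = {u for u in neighbors if covers(u, b)}
    C = {u for u in neighbors if covers(u, c)}
    G = {u for u in neighbors if covers(u, g)}
    PG, BG, CG = P & G, B & G, C & G       # candidates must also cover g
    # Cartesian product PG x BG x CG; collapse triples naming the same nodes.
    return [set(s) for s in {frozenset(t) for t in product(PG, BG, CG)}]

# Toy check with a tabulated coverage relation instead of real geometry:
table = {"v1": {"b", "g"}, "v2": {"b", "g"}, "v3": {"p", "g"}, "v4": {"c", "g"}}
print(cover_sets(("p", "b", "c", "g"), table, lambda u, pt: pt in table[u]))
# e.g. [{'v3', 'v1', 'v4'}, {'v3', 'v2', 'v4'}] (order may vary)
```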
2.2 The Case of Heterogeneous AoV

It is quite possible that video sensors with different angles of view are randomly deployed. In this case, a wide-angle FoV could be covered by narrow-angle FoV sensors, and vice-versa. Figure 1(b) shows these cases, and the left part of the figure shows the most problematic case, where a wide FoV (2α = 60°) has to be covered by a narrow FoV (2α = 30°). As we can see, it becomes very difficult for a narrow-angle node to cover pbc's center of gravity g and one of the vertices at the same time.
Fig. 2. Using more alternate points: (a) heterogeneous AoV; (b) very small AoV
The solution we propose in this paper is to use alternate points gp, gb and gc, set in Figure 2(a)(left) as the mid-points of segments [pg], [bg] and [cg] respectively. It is also possible to give them different weights, as shown in the right part of the figure. When using these additional points, it is possible to require that a sensor vx either covers both c and gc, or gc and g (and similarly for b and gb, and for p and gp), depending on whether the edges or the center of sensor v's FoV are privileged. Generalizing this method by using different weights to set gc, gb and gp closer to or farther from their respective vertices can be used to specify which parts of v's FoV have higher priority, as depicted in Figure 2(a)(right), where gc has moved closer to g, gb closer to b and gp closer to p.

2.3 The Case of Very Small AoV
On some hardware, the AoV can be very small. This is the case, for instance, with the IMB400 multimedia board on the iMote2, which has an AoV of 2α = 20°. Figure 2(b)(left) shows that in this case the most difficult scenario is to cover both point p and point gp if gp is set too far from p. As it is not interesting to move gp closer to p with such a small AoV, the solution we propose is to discard point p and only consider point gp, which can move along segment [pg] as previously. Therefore, in the scenario depicted in Figure 2(b)(right), we have PG = {v3, v6}, BG = {v1, v2, v5} and CG = {v4}, resulting in Co(v) = {{v3, v1, v4}, {v3, v2, v4}, {v3, v5, v4}, {v6, v1, v4}, {v6, v2, v4}, {v6, v5, v4}}.
2.4 Accuracy of the Proposed Method
Using specific points is of course approximate, and a cover set can satisfy the specific-point coverage conditions without ensuring coverage of the entire FoV. To evaluate the accuracy of our cover set construction technique, especially for very small AoVs, we conducted a series of simulations based on the discrete event simulator OMNeT++ (http://www.omnetpp.org/). The results were obtained from iterations with various node populations on a 75m × 75m area. Nodes have a random position P, a random line of sight $\vec{V}$, equal communication ranges of 30m (which determine neighbor nodes), an equal DoV of 25m and an offset angle α. We test with 2α = 20° (α = π/18), 2α = 36° (α = π/10) and 2α = 60° (α = π/6). We ran each simulation 15 times to reduce the impact of randomness. The results (averaged over the 15 simulation runs) are summarized in Table 1. We denote by COpbcG, COpbcApbc and CObcApbc the following respective strategies: (i) the triangle points are used with g, pbc's center of gravity, when determining eligible neighbors to be included in a sensor's cover sets; (ii) the alternate points gp, gb and gc are used together with the triangle points; and (iii) the same as previously, except that point p is discarded. The "stddev of %coverage" column is the standard deviation over all the simulation runs. A small standard deviation means that the various cover sets have percentages of coverage of the initial FoV close to each other. When "stddev of %coverage" is 0, it means that each simulation run gives only 1 node with 1 cover set. This is usually the case when the strategy to construct cover sets is too restrictive. Table 1 is divided into 3 parts. The first part shows the COpbcG strategy with 2α = 60°, 2α = 36° and 2α = 20°. We can see that using point g gives a very high percentage of coverage, but with 2α = 36° very few nodes have cover sets, compared to the case when 2α = 60°. With a very small AoV, the position of point g is not suitable, as no cover sets are found. The second part of Table 1 shows the COpbcApbc strategy, where the alternate points gp, gb and gc are used along with the triangle vertices, with 2α = 36° and 2α = 20°. For 2α = 36°, this strategy succeeds in providing both a high percentage of coverage and a larger number of nodes with cover sets. When 2α = 20° the percentage of coverage is over 70%, but once again very few nodes have cover sets. This second part also shows CObcApbc (point p discarded) with 2α = 20°. We can see that this strategy is quite interesting, as the number of nodes with cover sets increases for a percentage of coverage very close to the previous case. In addition, the mean number of cover sets per node greatly increases, which is highly interesting as nodes with a high number of cover sets can act as sentry nodes in the network. The last part of Table 1 uses a mixed AoV scenario where 80% of the nodes have an AoV of 20° and 20% of the nodes an AoV of 36°. This last part shows the performance of the 3 strategies, and we can see that CObcApbc presents the best tradeoff in terms of percentage of coverage, number of nodes with cover sets and mean number of cover sets per node when many nodes have a small AoV.
Table 1. Results for COpbcG, COpbcApbc and CObcApbc with 2α = 60°, 2α = 36°, 2α = 20° and mixed AoV (80% of nodes at 20°, 20% at 36°). Columns: number of nodes; % of nodes with cover sets; mean %coverage; min,max %coverage per cover set; stddev of %coverage; min,max #cover sets per node; mean #cover sets per node.

COpbcG, 2α = 60°
 75  |  4.89 | 94.04 | 90.16,98.15 | 3.67 | 1,5.66    | 2.10
100  |  7.13 | 94.63 | 86.99,98.49 | 4.40 | 1,6       | 2.99
125  | 11.73 | 95.06 | 85.10,99.52 | 4.12 | 1,13      | 3.53
150  | 17.11 | 95.44 | 84,99.82    | 3.98 | 1,16.13   | 4.15
175  | 26.19 | 94.64 | 83.57,99.89 | 4.01 | 1,35.66   | 6.40

COpbcG, 2α = 36°
 75  | 0    | 0     | 0,0         | nan  | 0,0       | 0
100  | 1    | 92.03 | 89.78,98.64 | 0    | 1,1       | 1
125  | 1.87 | 91.45 | 88.83,93.15 | 2.97 | 1.13,2    | 1.56
150  | 1.78 | 95.06 | 91.47,98.19 | 4.06 | 1,3       | 1.94
175  | 3.43 | 94.42 | 87.60,99.03 | 4.40 | 1.13,2.66 | 1.92

COpbcG, 2α = 20°
all  | 0 | 0 | 0,0 | nan | 0,0 | 0

COpbcApbc, 2α = 36°
 75  | 12.44 | 77.48 | 56.46,91.81 | 13.13 | 1.13,9.13 | 3.62
100  | 20.13 | 79.62 | 53.65,98.98 | 12.05 | 1,10.66   | 3.94
125  | 30.67 | 76.89 | 50.53,97.92 | 11.58 | 1,34      | 5.40
150  | 35.11 | 78.47 | 52.07,96.09 | 10.60 | 1,31.13   | 6.90
175  | 48.57 | 77.76 | 49.97,98.10 | 10.54 | 1,50.13   | 11.57

COpbcApbc, 2α = 20°
 75  | 1.13 | 70.61 | 57.60,91.54 | 0     | 1,1     | 1
100  | 2    | 73.89 | 69.45,79.80 | 9.50  | 1.13,2  | 1.58
125  | 2.67 | 71.78 | 58.67,84.98 | 12.45 | 1.13,2  | 1.75
150  | 4    | 71.67 | 54.18,92.19 | 14.10 | 1,3.66  | 1.91
175  | 7.43 | 75.50 | 54.69,94.01 | 12.87 | 1,8     | 2.74

CObcApbc, 2α = 20°
 75  |  7.56 | 73.79 | 56.18,88.54 | 12.45 | 1,5      | 2.10
100  |  9.13 | 67.16 | 47.78,88.71 | 13.80 | 1,4.66   | 2.14
125  | 12.53 | 70.12 | 40.41,87.46 | 13.11 | 1,11.13  | 3.17
150  | 21.13 | 70.10 | 45.72,91.57 | 11.57 | 1,19.13  | 4.18
175  | 25.13 | 71.79 | 44.15,94.18 | 11.91 | 1,37     | 7.05

COpbcG, mixed AoV
 75,100,125 | 0    | 0     | 0,0         | nan | 0,0 | 0
150         | 0.66 | 92.13 | 83.64,95.83 | 0   | 1,1 | 1
175         | 0.57 | 93.45 | 85.75,96.14 | 0   | 1,1 | 1

COpbcApbc, mixed AoV
 75  |  3.11 | 81.89 | 78.13,89.02 | 8.15  | 1.13,2  | 1.58
100  |  3    | 69.83 | 65.50,74.55 | 8.18  | 1,3.66  | 1.89
125  |  4.80 | 78.58 | 69.52,90.92 | 8.03  | 1,3.13  | 1.56
150  |  8.67 | 78.12 | 56.41,97.59 | 13.71 | 1,5     | 1.95
175  | 10.19 | 76.60 | 50.4,95.47  | 13.48 | 1,8.66  | 2.62

CObcApbc, mixed AoV
 75  |  9.13 | 81.48 | 69.18,93.72 | 9.72  | 1,5.66    | 2.06
100  |  6    | 80.10 | 62.82,90.16 | 11.81 | 1,3.66    | 1.94
125  | 10.93 | 73.15 | 47.14,92.14 | 14.43 | 1.13,9.13 | 3.65
150  | 20    | 72.12 | 45.53,95.94 | 12.19 | 1,16.66   | 4.83
175  | 20.95 | 75.15 | 43.01,97.57 | 12.59 | 1,18.13   | 5.15
3 Criticality-Based Scheduling of Randomly Deployed Nodes with Cover Sets
As said previously, the frame capture rate is an important parameter that defines the surveillance quality. In [12], we proposed to link a sensor's frame capture rate to the size of its cover set. In our approach we define two classes of applications, of low and high criticality. The criticality behavior can range from a concave to a convex shape, as illustrated in Figure 3, with the following interesting properties:
– Class 1, "low criticality", does not need a high frame capture rate. This characteristic can be represented by a concave curve (Figure 3(a), box A): most projections of x values are gathered close to 0.
– Class 2, "high criticality", needs a high frame capture rate. This characteristic can be represented by a convex curve (Figure 3(a), box B): most projections of x values are gathered close to the maximum frame capture rate.
Fig. 3. Modeling criticality: (a) application classes; (b) the behavior curve functions
[12] proposes to use a Bezier curve to model the 2 application classes. The advantage of using Bezier curves is that with only three points we can easily define a ready-to-use convex (high criticality) or concave (low criticality) curve. In Figure 3(b), P0(0, 0) is the origin point, P1(bx, by) is the behavior point, and P2(hx, hy) is the threshold point, where hx is the highest cover set cardinality and hy is the maximum frame capture rate determined by the sensor node hardware capabilities. As illustrated in Figure 3(b), by moving the behavior point P1 inside the rectangle defined by P0 and P2, we are able to adjust the curvature of the Bezier curve, thereby adjusting the risk level r0 introduced in the introduction of this paper. According to this level, we define the risk function, called Rk, which operates on the behavior point P1 to control the curvature of the behavior curve. Depending on the position of point P1, the Bezier curve will morph between a convex and a concave form. As illustrated in Figure 3(b), the first and the last points delimit the curve frame. This frame is a rectangle defined by the source point P0(0, 0) and the threshold point P2(hx, hy). The middle point P1(bx, by) defines the risk level. We assume that this point can move along the second diagonal of the defined rectangle, i.e. $b_x = \frac{-h_x}{h_y}\, b_y + h_x$. Table 2 shows the corresponding capture rate for some relevant values of r0. The cover set cardinality |Co(v)| ∈ [1, 12] and the maximum frame capture rate is set to 3 fps.

Table 2. Capture rate in fps when P2 is at (12,3); rows are r0 values, columns are |Co(v)|

 r0 |   1    2    3    4    5    6    7    8    9   10   11   12
 0  | .01  .02  .05  0.1  .17  .16  .18  .54  .75  1.1  1.5    3
 .1 | .07  .15  .15  .17  .51  .67  .86  1.1  1.4  1.7  2.1    3
 .4 | .17  .15  .55  .75  .97  1.1  1.4  1.7  2.0  2.1  2.6    3
 .6 | .16  .69  1.0  1.1  1.5  1.8  2.0  2.1  2.4  2.6  2.8    3
 .8 | .75  1.1  1.6  1.9  2.1  2.1  2.5  2.6  2.7  2.8  2.9    3
 1  | 1.5  1.9  2.1  2.4  2.6  2.7  2.8  2.9  2.9  2.9    2    3
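The mapping from cover set cardinality to capture rate can be reproduced from the quadratic Bezier definition: find the parameter t whose x-projection equals the cardinality, then read off the y-projection. The sketch below is our own reconstruction under stated assumptions (a quadratic Bezier through P0, P1, P2, with P1 placed on the second diagonal as by = r0·hy); it approximates, but does not exactly reproduce, the Table 2 entries, since the paper's precise placement of P1 is not given.

```python
def capture_rate(cardinality, r0, hx=12.0, hy=3.0):
    """Frame capture rate from |Co(v)| via a quadratic Bezier curve.

    B(t) = (1-t)^2 P0 + 2(1-t)t P1 + t^2 P2, with P0 = (0, 0), P2 = (hx, hy)
    and P1 on the second diagonal: by = r0 * hy, bx = (-hx/hy) * by + hx
    (our reading of the model).
    """
    by = r0 * hy                   # r0 = 1 -> P1 = (0, hy): high-criticality curve
    bx = (-hx / hy) * by + hx      # r0 = 0 -> P1 = (hx, 0): low-criticality curve
    # Solve Bx(t) = cardinality for t in [0, 1] by bisection (Bx is monotonic).
    lo, hi = 0.0, 1.0
    for _ in range(60):
        t = (lo + hi) / 2.0
        x = 2 * (1 - t) * t * bx + t ** 2 * hx
        if x < cardinality:
            lo = t
        else:
            hi = t
    t = (lo + hi) / 2.0
    return 2 * (1 - t) * t * by + t ** 2 * hy

# A node with 6 cover sets at criticality r0 = 1:
print(round(capture_rate(6, 1.0), 2))  # ~2.74; Table 2 lists 2.7 for this entry
```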
4 Fast Event Detection with Criticality Management

In this section we evaluate the performance of an intrusion detection system by investigating its stealth time. For this set of simulations, 150 sensor nodes are randomly deployed in a 75m × 75m area. Unless specified otherwise, sensors have a 36° AoV and the COpbcApbc strategy is used to construct cover sets. Each sensor node captures at a given number of frames per second (between 0.01 fps and 3 fps) according to the model defined in Figure 3(b). Nodes with 12 or more cover sets capture at the maximum speed. A simulation ends when there are no active nodes left.

4.1 Static Criticality-Based Scheduling
We ran simulations for 4 levels of criticality: r0 = 0.1, 0.4, 0.6 and 0.8. The corresponding capture rates are those shown in Table 2. Nodes with a high capture rate will use more battery power until they run out of battery (the initial battery level is 100 units and 1 captured image consumes 1 unit) but, according to the scheduling model, nodes with a high capture rate are also those with a large number of cover sets. Note that it is the number of valid cover sets that defines the capture rate, and not the number of cover sets found at the beginning of the cover set construction procedure. In order to show the benefit of the adaptive behavior, we computed the mean capture rate for each criticality level and then used that value as a fixed capture rate for all the sensor nodes in the simulation model: r0 = 0.1 gives a mean capture rate of 0.12 fps, r0 = 0.4 gives 0.56 fps, r0 = 0.6 gives 0.83 fps and r0 = 0.8 gives 1.18 fps. Table 3 shows the network lifetime for the various criticality and frame capture rate values. Using the adaptive frame rate is very efficient, as the network lifetime is 2900s for r0 = 0.1 while the 0.12 fps fixed capture rate lasts only 620s.
Table 3. Network lifetime

 r0 = 0.1 | 0.12 fps | r0 = 0.4 | 0.56 fps | r0 = 0.6 | 0.83 fps | r0 = 0.8 | 1.18 fps
 2900s    | 620s     | 1160s    | 360s     | 560s     | 240s     | 270s     | 170s
In order to evaluate the quality of surveillance further, we show in Figure 4(top) the mean stealth time for r0 = 0.1, fps = 0.12, r0 = 0.4 and fps = 0.56, and in Figure 4(bottom) the case of r0 = 0.6, fps = 0.83, r0 = 0.8 and fps = 1.18. The stealth time is the time during which an intruder can travel in the field without being seen. The first intrusion starts at time 10s at a random position in the field. The scan line mobility model is then used, with a constant velocity of 5m/s, to move the intruder towards the right-hand side of the field. When the intruder is seen for the first time by a sensor, the stealth time is recorded and the mean stealth time computed. Then a new intrusion appears at another random position. This process is repeated until the simulation ends.
Fig. 4. Mean stealth time. Top: r0 = 0.1, fps = 0.12, r0 = 0.4, fps = 0.56. Bottom: r0 = 0.6, fps = 0.83, r0 = 0.8, fps = 1.18.
Figure 5(left) shows, for a criticality level of r0 = 0.6, the special case of small-AoV sensor nodes. When 2α = 20°, we compare the stealth time under the COpbcGpbc and CObcGpbc strategies. Discarding point p in the cover set construction procedure gives a larger number of nodes with a larger number of cover sets, as shown previously in Table 1. In Figure 5(left) we can see that the stealth time is very close to the COpbcGpbc case, while the network lifetime almost doubles, reaching 420s instead of 212s. The explanation is as follows: as more nodes have cover sets, they act as sentry nodes, allowing the other nodes to be in sleep mode while ensuring a high responsiveness of the network.
Fig. 5. Left: stealth time, sliding winavg with 20-sample batches, r0 = 0.6, AoV = 20°, COpbcGpbc and CObcGpbc. Right: rectangle with 8 significant points, initial sensor v and 2 different cover sets.
In addition, for the particular case of disambiguation, we introduce an 8m × 4m rectangle at random positions in the field. COpbcGpbc is used and 2α = 36°. The rectangle has 8 significant points, as depicted in Figure 5(right), and moves at a velocity of 5m/s in a scan line mobility model (left to right). Each time a sensor node covers at least 1 significant point, or when the rectangle reaches the right boundary of the field, it reappears at another random position. This process starts at time t = 10s and is repeated until the simulation ends. The purpose is to determine how many significant points are covered by the initial sensor v, and how many can be covered by using one of v's cover sets. For instance, Figure 5(right) shows a scenario where v's FoV covers 3 points, the left cover set ({v3, v1, v4}) covers 5 points, while the right cover set ({v3, v2, v4}) covers 6 points. In the simulations, each time a sensor v covers at least 1 significant point of the intrusion rectangle, it determines how many significant points are covered by each of its cover sets. The minimum and maximum numbers of significant points covered by v's cover sets are recorded, along with the number of significant points v was able to cover initially. Figure 6 shows these results using a sliding window averaging filter with a batch window of 10 samples. We can see that a node's cover sets always succeed in identifying more significant points. Figure 7 shows that with the rectangle intrusion (which could represent a group of intruders instead of a single intruder) the stealth time can be further reduced.
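Counting the covered significant points per cover set reduces to a union-and-extrema computation. A hypothetical sketch of that step (the point sets below are invented; only the 5/6 result mirrors the Figure 5(right) example):

```python
def coverset_point_counts(cover_sets, points_covered_by):
    """For each cover set, count the distinct significant points of the
    intrusion rectangle covered by the union of its nodes; return min/max."""
    counts = [len(set().union(*(points_covered_by[n] for n in cs)))
              for cs in cover_sets]
    return min(counts), max(counts)

pts = {"v1": {2, 3, 4}, "v2": {2, 3, 4, 5}, "v3": {0}, "v4": {1}}
print(coverset_point_counts([{"v3", "v1", "v4"}, {"v3", "v2", "v4"}], pts))
# -> (5, 6): one cover set sees 5 significant points, the other 6
```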
Fig. 6. Number of covered points of an intrusion rectangle. Sliding winavg of 10.
Fig. 7. Stealth time, winavg with 10-sample batches, r0 = 0.8, fps = 1.18, and r0 = 0.8 with rectangle intrusion
4.2 Dynamic Criticality-Based Scheduling
In this section we present preliminary results on dynamically varying the criticality level during the network lifetime. The purpose is to set the surveillance network in an alerted mode (high criticality value) only when needed, i.e., on intrusions. With the same network topology as in the previous simulations, we set the initial criticality level of all the sensor nodes to r0 = 0.1. As shown in the previous simulations, some nodes with a large number of cover sets will act as sentries in the surveillance network. When a sensor node detects an intrusion, it sends an alert message to its neighbors and increases its criticality level to r0 = 0.8. Alerted nodes then also increase their criticality level to r0 = 0.8. Both the node that detects the intrusion and the alerted nodes will run at the high criticality level for an alerted period, noted Ta, before going back to r0 = 0.1. Nodes may be alerted several times, but an already alerted node will not increase its Ta value any further in this simple scenario. As said previously, we do not attempt here to optimize the Ta value, nor do we use several levels of criticality. Figure 8 shows the mean stealth time with this dynamic behavior, with Ta varied from 5s to 60s. We can see that this simple dynamic scenario already succeeds in reducing the mean stealth time while increasing the network lifetime, when compared to a static scenario that provides the same level of service.
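The alert logic described here reduces to a small per-node state machine. The sketch below is a hypothetical rendering of it (all names, such as `alerted_until`, are ours): criticality is raised for Ta seconds on a detection or on an alert message, then falls back to the low level.

```python
R_LOW, R_HIGH = 0.1, 0.8   # idle and alerted criticality levels
TA = 20.0                  # alerted period Ta in seconds (varied 5-60s above)

class Node:
    def __init__(self):
        self.alerted_until = 0.0

    def criticality(self, now):
        """Current r0: high while the alerted period runs, low otherwise."""
        return R_HIGH if now < self.alerted_until else R_LOW

    def on_intrusion_detected(self, now, neighbors):
        self.alerted_until = now + TA          # enter alerted mode
        for nb in neighbors:                   # one-hop alert message
            nb.on_alert(now)

    def on_alert(self, now):
        # An already alerted node does not extend Ta in this simple scenario.
        if now >= self.alerted_until:
            self.alerted_until = now + TA
```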
Fig. 8. Mean stealth time with dynamic criticality management
5 Conclusions
This paper presented the performance of cover set construction strategies and of dynamic criticality scheduling that enable fast event detection for mission-critical surveillance with video sensors. We focused on taking into account cameras with heterogeneous angles of view and those with very small angles of view. We showed that our approach improves the network lifetime while providing a low stealth time in intrusion detection systems. Preliminary results with dynamic criticality management also show that the network lifetime can be increased further. These results show that, besides providing a model for translating a subjective criticality level into a quantitative parameter, our approach for video sensor nodes also optimizes resource usage by dynamically adjusting the provided service level.

Acknowledgment. This work is partially supported by the FEDER POCTEFA EFA35/08 PIREGRID project, the Aquitaine-Aragon OMNI-DATA project and by the PHC Tassili project 09MDU784.
References

1. Collins, R.T., Lipton, A.J., Fujiyoshi, H., Kanade, T.: Algorithms for cooperative multisensor surveillance. Proceedings of the IEEE 89(10) (2001)
2. Yan, T., He, T., Stankovic, J.A.: Differentiated surveillance for sensor networks. In: ACM SenSys (2003)
3. He, T., et al.: Energy-efficient surveillance system using wireless sensor networks. In: ACM MobiSys (2004)
4. Oh, S., Chen, P., Manzo, M., Sastry, S.: Instrumenting wireless sensor networks for real-time surveillance. In: International Conference on Robotics and Automation (2006)
5. Cucchiara, R., et al.: Using a wireless sensor network to enhance video surveillance. J. Ubiquitous Computing and Intelligence 1(2) (2007)
6. Stoianov, I., Nachman, L., Madden, S.: PipeNet: A wireless sensor network for pipeline monitoring. In: ACM IPSN (2007)
7. Albano, M., Pietro, R.D.: A model with applications for data survivability in critical infrastructures. J. of Information Assurance and Security 4 (2009)
8. Dousse, O., Tavoularis, C., Thiran, P.: Delay of intrusion detection in wireless sensor networks. In: ACM MobiHoc (2006)
9. Zhu, Y., Ni, L.M.: Probabilistic approach to provisioning guaranteed QoS for distributed event detection. In: IEEE INFOCOM (2008)
10. Freitas, E., et al.: Evaluation of coordination strategies for heterogeneous sensor networks aiming at surveillance applications. In: IEEE Sensors (2009)
11. Keally, M., Zhou, G., Xing, G.: Watchdog: Confident event detection in heterogeneous sensor networks. In: IEEE Real-Time and Embedded Technology and Applications Symposium (2010)
12. Makhoul, A., Saadi, R., Pham, C.: Risk management in intrusion detection applications with wireless video sensor networks. In: IEEE WCNC (2010)
13. Rahimi, M., et al.: Cyclops: In situ image sensing and interpretation in wireless sensor networks. In: ACM SenSys (2005)
14. Makhoul, A., Pham, C.: Dynamic scheduling of cover-sets in randomly deployed wireless video sensor networks for surveillance applications. In: IFIP Wireless Days (2009)
An Integrated Routing and Medium Access Control Framework for Surveillance Networks of Mobile Devices

Nicholas Martin, Yamin Al-Mousa, and Nirmala Shenoy

College of Computing and Information Science, Networking Security and Systems Administration Dept., Rochester Institute of Technology, 1 Lomb Dr, Rochester, NY, USA 14623
[email protected], {ysa49,nxsvks}@rit.edu
Abstract. In this paper we present an integrated solution that combines routing, clustering and medium access control operations, basing them on a common meshed tree algorithm. The aim is to achieve an efficient airborne surveillance network of unmanned aerial vehicles, wherein any loss of captured data is kept to a minimum while maintaining low latency in packet and data delivery. Surveillance networks of varying sizes were evaluated with varying numbers of senders, while the physical layer was kept invariant.

Keywords: meshed trees, burst forwarding medium access control, surveillance.
1 Introduction

Mobile Ad Hoc Networks (MANETs) of unmanned aerial vehicles (UAVs) face severe challenges in delivering surveillance data without loss of information to specific aggregation nodes. Depending on the time sensitivity of the captured data, the end to end packet and file delivery latency can also be critical metrics. Two major protocols from a networking perspective that can impact lossless and timely delivery are the medium access control (MAC) and routing protocols. Physical layer and transport layer protocols will certainly play a major role as well; however, we limit the scope of this work to MAC and routing protocols. These types of surveillance networks require several UAVs to cover a wide area, while the UAVs normally travel at speeds of 300 to 400 km/h. These features pose additional significant challenges to the design of MANET routing and MAC protocols, as they must now be both scalable and resilient: able to handle the frequent route breaks due to node mobility. The predominant traffic pattern in surveillance networks is converge-cast, where data travels from several nodes to an aggregation node. We leverage this property in the proposed solution. We also integrate routing and MAC functions into a single protocol layer, where both routing and MAC operations are achieved with a single address. The routing protocol uses the inherent path information contained in the addresses, while the MAC uses the same addresses for hop by hop packet forwarding. Data aggregation or converge-cast types of traffic are best handled through multi hop clustering, wherein a cluster head (CH) is a special type of node that aggregates
the data and manages the cluster. The solution we propose uses one such clustering scheme, based on a 'meshed tree' principle [1], where the root of the meshed tree is the CH. As the cluster is a tree, the branches connecting the cluster clients (CCs) to the CH provide a path to send data from the CCs to the CH. Thus, a clustering mechanism is integrated into the routing and MAC framework. The 'meshing' of the tree branches allows one node to reside in multiple tree branches that originate from the root, namely the CH. The duration of residency on a branch depends on the movement patterns and speeds of the nodes. Thus, as nodes move, they may leave one or more branches and connect to new branches. Most importantly, even if a node loses one path to the CH, it likely remains connected to the CH via another branch and thus has an alternate path. The clustering scheme also allows for the creation of several overlapped multi hop clusters, leading to the notion of multi meshed trees (MMT). The overlap is achieved by allowing the branches of one meshed tree to further mesh with the branches of neighboring clusters. This provides connectivity to cluster clients moving across clusters. It also helps extend the coverage area of the surveillance network to address scalability.
2 Related Work

The major topic areas of contribution in this article relate to routing protocols, clustering algorithms and MAC protocols for mobile ad hoc networks. The significance of this framework solution lies in the closely integrated operations of routing, clustering and MAC. To the best of our knowledge, no solution published thus far in the literature targets such an approach. Cross layered approaches, which break down the limitations of inter-layer communications to facilitate more effective integration and coordination between protocol layers, are one direction with similar goals. However, our solution is not a cross layered approach. We felt that in a dedicated and critical MANET application, such as a surveillance network, one should not be constrained by the protocol layers or stacks, but should achieve the operations through efficient integration of the required functions. For the above reasons, it is difficult to cite and discuss related work with an approach similar to ours. However, as we use a multi hop clustering scheme, we will highlight multi hop clustering algorithms discussed in the literature. Our solution includes a routing scheme, so we will discuss some proactive, reactive and hybrid routing algorithms to highlight the differences in the proposed routing scheme. We will also cite some framework solutions that combine clustering and routing to explain the difference in the approaches. The MAC adopted in this work is based on CSMA/CA, but uses the addresses adopted in our solution to achieve efficient data aggregation by sending several packets in a burst, i.e., a sequence of packets. Several survey articles published on MANET routing and clustering schemes from different perspectives indicate the continuing challenges in this topic area [2, 3]. Proactive routing protocols require periodic dissemination of link information, so that a node can use standard algorithms such as Dijkstra's to compute routes to all other nodes in the network or in a given zone [4]. Link information dissemination requires flooding of messages that contain such information. In large networks such transmissions or control messages can consume significant amounts of the bandwidth,
making the proactive routing approach not scalable. Several proactive routing protocols thus target mechanisms to reduce this control overhead, i.e., the bandwidth used for control messages. Fisheye State Routing (FSR) [8], Fuzzy Sighted Link State/Hazy Sighted Link State [10], Optimized Link State Routing [6] and Topology Broadcast based on Reverse Path Forwarding [9] are some such approaches. Reactive routing protocols avoid the periodic link information dissemination and allow a node to discover routes to a destination node only when it has data to send to that destination. The reactive route discovery process can result in the source node receiving several route responses, which it may cache. As mobility increases, route caching may become ineffective, as pre-discovered routes may become stale and unusable. Dynamic Source Routing (DSR) [5], Ad Hoc On-demand Distance Vector (AODV) [4], the Temporally Ordered Routing Algorithm [3] and Light-Weight Mobile Routing [13] are some of the more popular reactive routing approaches. Partitioning a MANET physically or logically and introducing hierarchy has been used to limit message flooding, and also addresses scalability. Mobile Backbone Networks (MBNs) [14] use hierarchy to form a higher level backbone network by utilizing special backbone nodes with low mobility that have an additional, powerful radio to establish wireless links amongst themselves. LANMAR [13], the Sharp Hybrid Adaptive Routing Protocol (SHARP) [15], Hybrid Routing for Path Optimality [11] and the Zone Routing Protocol (ZRP) [12] are protocols in this category. Nodes physically close to each other form clusters, with a CH communicating with other nodes on behalf of the cluster. Different routing strategies can be used inside and outside a cluster. Several cluster based routing protocols address the scalability issues faced in MANETs. Cluster head Gateway Switch Routing (CGSR) [16] and Hierarchical State Routing (HSR) [17] are two popular cluster based routing schemes.
3 Meshed Tree Clustering

It is important to understand the cluster formation in the clustering scheme under consideration, and the routing capabilities within the cluster for data aggregation at the CH. The multi hop clustering scheme and the cluster formation based on the 'meshed tree' algorithm are described with the aid of Figure 1. The dotted lines connect nodes that are in communication range of one another at the physical layer. The data aggregation node or cluster head is labeled 'CH'. Nodes A through G are the CCs.
Fig. 1. Cluster Formation Based on Meshed Trees
Several 'values' are noted at each node; these are the virtual IDs (VIDs) assigned to the node when it joins the cluster. In Figure 1, each arrow from the CH is a branch of connection to the CCs. Each branch is a sequence of VIDs assigned to the CCs connecting at different points of the branch. The branch denoted by VIDs 14, 142 and 1421 connects nodes C (via VID 14), F (via VID 142) and E (via VID 1421) respectively to the CH. Assuming that the CH has a VID '1', the CCs in this cluster will have '1' as the first prefix in their VIDs. Any CC that attaches to a branch is assigned a VID, which inherits its prefix from its parent node, followed by an integer indicating the child number under that parent. This pattern of inheriting the parent's VID will be clear if the reader follows through the branches identified in Figure 1 by the arrows. The meshed tree cluster is formed in a distributed manner, where a node listens to its neighbor nodes advertising their VIDs, and decides to join any or all of the branches noted in the advertised VIDs. A VID contains information about the number of hops from the CH; this is inherent in the VID length, which can then be used by a node to decide which branch it would like to join, if shortest hop count is a criterion. Once a node decides to join a branch, it has to inform the CH. The CH then registers the node as its CC, confirms its admittance to the cluster, and accordingly updates a VID table of its CCs. A CH can restrict admittance to nodes that are within a certain number of hops, or not admit new nodes, to keep the number of CCs in the cluster under a certain value. This is useful to contain the data collection zone of a cluster.

Routing in the Cluster: The branches of the meshed tree provide the routes to send and receive data and control packets between the CCs and the CH. As an example, consider packet routing where the CH has a packet to send to node E. The CH may decide to use the path given by VID 1421 to E. The CH will include its VID '1' as the source address and E's VID 1421 as the destination address, and broadcast the packet. The nodes that will perform the hop by hop forwarding are nodes C and F. This is so because, from the source VID and destination VID, C knows that it is the next hop en route: it has VID 14, and the packet came from VID '1' and is destined to 1421, i.e., the scheme uses a path vector concept. When C subsequently broadcasts the packet, F will receive it and eventually forward it to E. The VID of a node thus provides a virtual path vector from the CH to itself. Note that the CH could also have used VIDs 143 or 131 for node E, in which case the path taken by the packet would have been CH-C-E or CH-D-E respectively. Thus, between the CH and node E there are multiple routes, as identified by the multiple VIDs. The support for multiple routes through multiple VIDs allows for robust and dynamic route adaptability to topology changes in the cluster.
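Because a VID encodes the whole path from the CH, the downstream forwarding decision is a prefix computation. The sketch below is our own illustration of that idea, under the assumption that VIDs are digit strings with single-digit child indices, as in Figure 1; it is not the authors' implementation.

```python
def is_next_hop(my_vid: str, src_vid: str, dst_vid: str) -> bool:
    """Downstream forwarding check (CH towards a CC).

    A node forwards a packet destined to dst_vid if its own VID is the
    next-longer prefix of dst_vid after the sender's VID.
    """
    return (dst_vid.startswith(my_vid)
            and dst_vid.startswith(src_vid)
            and len(my_vid) == len(src_vid) + 1)

# The example above: the CH ('1') sends to E via VID 1421.
assert is_next_hop("14", src_vid="1", dst_vid="1421")      # node C forwards
assert is_next_hop("142", src_vid="14", dst_vid="1421")    # then node F
assert not is_next_hop("13", src_vid="1", dst_vid="1421")  # node D stays silent
```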
Fig. 2. Overlapped Cluster Formation Based on Meshed Trees
Route failures: Capturing all data without loss is very important in surveillance networks used in tactical applications. Loss of data can be caused by route failures or by collisions at the MAC. There are two cases of route failure that can occur, yet be swiftly rectified, in the proposed solution. In the first case, a node may be in the process of sending data, and may even have sent part of the data using a particular VID, only to discover that said VID or path is no longer valid. In the second case, a node may be forwarding data for another node, but after collecting and forwarding a few data packets, this forwarding node also loses the VID that was being used.

Case I: Source node loses a route: For example, node B in Figure 2 is sending a 1 MB file to the CH using its shortest VID '11'. Assume that node B was able to send ½ MB, at which time, due to its mobility, it lost its VID '11' but could still continue with VID '121', sending the remaining ½ MB of data using VID '121'.

Case II: Intermediate node loses a route: Let us continue the above example. Node A is forwarding the data from node B on its VID 12 (the data comes from node B via its VID 121). After sending ¼ MB, assume that node A moves in the direction of node D, loses its VID 12, but gains a new VID '131' as it joins the branch under node D. Node A can continue sending the rest of the file using its new VID 131. As the knowledge about the destination node is consistent (i.e., it is the CH with VID '1'), any node is able to forward the collected data towards the CH.

Disconnects: In a disconnect situation, a missing VID link may first be noticed by the parent or the child of the node with whom the link is shared. In such cases, the parent node will inform the CH of the missing child VID, so that the CH will not send any messages to it. Meanwhile the child node, which is downstream on the branch, will notify its children about their lost VIDs (VIDs derived from the missing VID) so that they will invalidate those VIDs, and hence not use them to send data to the CH.

Inter-cluster Overlap and Scalability: As a surveillance network can have several tens of nodes, the solution proposed must be scalable. We assume that several data aggregation nodes (i.e., CHs) are uniformly distributed among the non-data aggregation nodes during deployment of the surveillance network. Meshed tree clusters can be formed around each of the data aggregation nodes by assuming them to be the CHs. Nodes bordering two or more clusters are allowed to join branches originating from different CHs, and will accordingly inform their respective CHs about their multiple VIDs under the different clusters. When a node moves away from one cluster, it can still be connected to other clusters, and the surveillance data collected by that node will not be lost. Also, by allowing nodes to belong to multiple clusters, the single meshed tree cluster-based data collection can be extended to multiple overlapping meshed tree clusters that can collect data from several nodes deployed over a wider area, with a very low loss probability for the captured data. Figure 2 shows two overlapped clusters and some border nodes that share multiple VIDs across the two clusters. The concept is extendable to several neighboring clusters. Nodes G and F have VIDs 142 and 132 under CH1, and VIDs 251 and 252 under CH2, respectively.¹
Note that a node is aware of the cluster under which it has a VID, as this information is inherent in the VIDs it acquires. A node thus has some intelligence in deciding which VIDs to acquire: it can choose to hold several VIDs under one cluster, or acquire VIDs that span several clusters, and so on.
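The failover behavior in Cases I and II, and the invalidation of derived VIDs after a disconnect, can be pictured with a small sketch. This is an illustrative reconstruction, not the authors' implementation; the VID-prefix convention (a child VID extends its parent's VID by one digit) is taken from the examples above, and all class and function names are ours.

```python
class MeshedTreeNode:
    """Toy model of a CC holding multiple VIDs in one or more clusters."""

    def __init__(self, vids):
        self.vids = set(vids)          # e.g. {"11", "121"} for node B

    def shortest_vid(self):
        """Prefer the VID closest to the CH (fewest hops = shortest VID)."""
        return min(self.vids, key=len) if self.vids else None

    def lose_vid(self, vid):
        """Case I/II: a VID becomes invalid; fall back to a remaining VID."""
        self.vids.discard(vid)
        return self.shortest_vid()     # route used for the rest of the file

    def invalidate_derived(self, missing_vid):
        """Disconnect handling: drop all VIDs derived from the missing one."""
        self.vids = {v for v in self.vids if not v.startswith(missing_vid)}

# Case I from the text: node B loses VID '11' mid-transfer, continues on '121'.
b = MeshedTreeNode(["11", "121"])
assert b.lose_vid("11") == "121"
```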
Significance of the Approach: From the meshed tree based clustering and routing scheme described thus far, it should be clear that our scheme adopts a proactive routing approach, where the proactive routes between CCs and the CH in a cluster are established as the meshed trees or clusters are formed around each CH. Thus, using a single algorithm during the cluster joining process, a node automatically acquires routes to the CH. There is flexibility in dimensioning the cluster, in terms of the number of CCs in a cluster and the maximum hops a CC is allowed from a CH. The tree formation differs from the spanning trees discussed in the literature, as a node is allowed to reside simultaneously in several branches, which allows for dynamic adaptation to route changes as nodes move. This also enhances robustness of connectivity to the CH. The approach is ideal for data aggregation from the CCs to the CH, and is very suitable for MANETs with highly mobile nodes.
4 Burst Forwarding Medium Access Control Protocol

The Burst Forwarding Medium Access Control (BF-MAC) protocol focuses primarily on reducing collisions while providing the capability of MAC forwarding of multiple data packets from one node to another in the same cluster. Additionally, the MAC allows for sequential ‘node’ forwarding, where all intermediate nodes forward a burst of packets one after another in sequence between a source and a destination node across multiple hops. These capabilities are created through careful creation of MAC data sessions, which encompass the time necessary to burst multiple packets across multiple hops. For non-data control packets, such as those from the routing and cluster formation process, the MAC uses a system based on Carrier Sense Multiple Access/Collision Avoidance (CSMA/CA).
Fig. 3. Illustration of traffic forwarding along a single tree branch
The above type of MAC forwarding is possible due to the VIDs, which carry information about a node’s CH and about the intermediate nodes, and which the MAC makes use of. A node’s data will physically travel up a VID branch to the CH in that tree. Therefore, by knowing which VID was used by a node to send a data packet, and that packet’s intended destination (the CH), an overhearing node can determine the next VID in the path. Each overhearing node applies this process in turn, forwarding the packet all the way to the CH. This is illustrated in Figure 3: when the node with VID 121 has data to send to CH1, the intermediate node with VID 12 will pick it up and forward it towards the CH. The MAC process at a node that has data to send creates a MAC data session. A Request to Send (RTS) packet is sent by the node and is forwarded by the intermediate nodes till it reaches the CH. When a recipient node (i.e. a forwarding
node) along the path receives the RTS, it becomes part of the data session. A set of data packets may then be sent to the intended destination, in this case the CH, along the same path as the RTS packet. The final node in the path, the CH, will send an explicit acknowledgement (eACK) packet to the previous node as a reliability check. eACKs are not forwarded back to the initial sender. Nodes in the path of the data session, except for the penultimate node, instead listen for the packet just sent to the next node. This packet will be the same packet being forwarded by the next node in the data session path (be it either an RTS or a data packet). Receiving this packet constitutes an implicit acknowledgment (iACK), as the next node must have received the sent packet if it is now attempting to forward it. Note that the iACK is really the forwarded RTS or data packet. Not receiving any type of acknowledgment will cause a node to use the MAC retry model, discussed below.

During a data session, collisions from neighboring nodes are prevented in the same way as by the collision avoidance mechanism in CSMA/CA: nodes that hear a session in progress keep silent. When a node overhears an RTS, eACK or data packet for which it is not the destination or the next node in line to forward, it will switch to a Not Clear to Send (NCTS) mode. This prevents the node from sending any control packets or joining a data session. If a node is already part of a separate data session, it will continue with that data session. The NCTS mode lasts for a duration specified as the Session on Wait (SOW) time, noted in the packets being transmitted during the session. The SOW time is calculated by the initial sender within a data session, and marks the amount of time left for that data session. At each hop, it is decremented by the transmission time of the current packet plus a guard time to account for propagation delay, as shown in Figure 4. When the SOW time has elapsed, the data session is over and all nodes return to a Clear to Send (CTS) mode. A node in CTS mode may start a new data session, join a data session via forwarding, or send control packets.
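Since every data packet travels up a branch toward the CH, the overhearing-based forwarding decision reduces to a prefix check on VIDs. The sketch below is our hedged reading of the rule illustrated in Figure 3 (the parent's VID is the sender's VID with the last digit removed); the function names are ours, not code from the paper.

```python
def parent_vid(vid: str) -> str:
    """The next VID toward the CH: drop the last digit (e.g. '121' -> '12')."""
    return vid[:-1]

def should_forward(my_vid: str, sender_vid: str, ch_vid: str) -> bool:
    """An overhearing node forwards iff it is the sender's parent on the
    branch leading to the destination CH."""
    return my_vid == parent_vid(sender_vid) and my_vid.startswith(ch_vid)

# Figure 3: node '121' transmits toward CH1 (VID '1'); node '12' picks it up.
assert should_forward("12", "121", "1")
assert not should_forward("13", "121", "1")
```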
Fig. 4. Illustration of dissemination of SOW timings. ‘p’ represents the time necessary for all remaining packets to be sent, while ‘n’ represents the time to transmit a single packet plus propagation delay.
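Figure 4's timing can be restated arithmetically: the initiator advertises SOW = p for the whole burst, and each hop advertises a value reduced by n (one packet transmission time plus a guard time for propagation delay). A hedged sketch follows; the numeric values are illustrative only and do not come from the paper.

```python
def next_sow(current_sow: float, tx_time: float, guard: float) -> float:
    """At each hop the advertised SOW shrinks by one packet time plus guard
    ('n' in Fig. 4); neighbors stay in NCTS mode until this timer expires."""
    return max(0.0, current_sow - (tx_time + guard))

# One packet crossing four hops, as in Fig. 4: p-n, p-2n, p-3n, p-4n.
p, tx, guard = 0.050, 0.009, 0.001      # seconds; illustrative values only
sow = p
for hop in range(4):
    sow = next_sow(sow, tx, guard)
    print(f"hop {hop + 1}: SOW = {sow * 1000:.0f} ms")   # 40, 30, 20, 10 ms
```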
Control packets from the routing and clustering process are queued and sent using CSMA/CA whenever a node is in CTS mode. To take further advantage of the MAC’s data sessions in preventing possible collisions, nodes are also allowed to send control packets within a data session by extending the SOW time by a fixed amount.

Retry Model: The MAC stores any RTS or data packet sent into a retry queue. Until an eACK or iACK is heard for that packet, the packet will be retried up to three times within a single data session. Nodes will continue to receive data and issue eACKs for
data packets while retrying the other packet. At the end of the data session, nodes will move any outstanding packets into their own data queues and will subsequently send them as if they were the initial sender. If a packet fails to be sent in two separate data sessions, an error report is sent to the routing and clustering process for further action. The MAC thus adds the capability of any node taking over and forwarding packets to the destination (the CH), using the VIDs to burst packets from the CCs to the CH. This is the unique feature of the proposed solution and the primary reason for integrating the different operations: all three schemes depend naturally on the one algorithm. Separating them into different layers would have resulted in suboptimal performance of the framework, which would not be an efficient solution for such critical applications as surveillance networks.
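The retry model reduces to three rules: at most three retries per packet within a session, adoption of unacknowledged packets at session end, and an error report to the routing layer after failures in two separate sessions. A minimal sketch of that bookkeeping follows; it is our reconstruction, and all names and structure are illustrative.

```python
from collections import deque

MAX_RETRIES_PER_SESSION = 3
MAX_FAILED_SESSIONS = 2

class RetryQueue:
    def __init__(self):
        # packet_id -> (retries in this session, sessions already failed)
        self.pending = {}
        self.data_queue = deque()      # packets this node adopts as sender

    def on_send(self, packet_id):
        self.pending.setdefault(packet_id, (0, 0))

    def on_ack(self, packet_id):
        """eACK or iACK heard: the packet is confirmed downstream."""
        self.pending.pop(packet_id, None)

    def on_timeout(self, packet_id):
        """Retry within the current session, at most three times."""
        retries, fails = self.pending[packet_id]
        if retries < MAX_RETRIES_PER_SESSION:
            self.pending[packet_id] = (retries + 1, fails)
            return "retransmit"
        return "defer"                 # wait for the session to end

    def on_session_end(self, report_error):
        """Adopt outstanding packets as if this node were the initial sender;
        escalate packets that have failed in two separate sessions."""
        for pid, (_, fails) in list(self.pending.items()):
            if fails + 1 >= MAX_FAILED_SESSIONS:
                report_error(pid)      # notify routing/clustering process
                del self.pending[pid]
            else:
                self.pending[pid] = (0, fails + 1)
                self.data_queue.append(pid)
```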
5 Simulation and Performance

While there are numerous routing and cluster-based routing algorithms proposed in the literature, they have not been evaluated for the type of surveillance applications considered in this article, nor are the published performance metrics the same as ours. Hence the results published for these algorithms cannot be compared with our results, nor would it be reasonable for us to model candidate solutions ourselves for a comparative study with the proposed solution. We therefore decided to conduct our comparison with two well-known routing protocols, OLSR and AODV. The first is a proactive routing protocol and the second is a reactive routing protocol. We used the proactive protocol OLSR to evaluate and compare against the performance of our solution in small networks of around 20 nodes. Furthermore, to make the studies comparable, we designated certain nodes as data collection nodes and as the destination for data-sending nodes in their vicinity, so as to sidestep the cluster formation problem. We used the reactive protocol AODV to evaluate and compare performance from the control overhead perspective in networks of 50 and 100 nodes. In this case also, the collection nodes were designated as the destination for nodes in their vicinity. For completeness we evaluated OLSR, AODV and MMT for all 20, 50 and 100 node scenarios, with varying numbers of senders. This work was conducted as part of an ONR-funded project, where we were expected to use the ns2 simulation tool; we used ns-2.34. However, the OLSR and AODV models available in ns2 were not designed to operate in network scenarios such as those outlined above, hence for OLSR and AODV we used the custom-developed 802.11 CSMA/CA models available with Opnet. These Opnet models provide flexibility in selecting optimal parameters, and thus optimal operational conditions, through proper setting of retry times and of the intervals for sending ‘hello’, ‘topology control’ and other control messages for OLSR and AODV. The MMT scenarios in ns2, however, faced constraints due to the random placement and selection of sending nodes, as compared to selecting the closest nodes to send to the closest designated destination (alias the CH) as in Opnet. We therefore recorded the average hops between a source and destination node in all our test scenarios, to serve as a baseline for comparison.
Simulation parameters: The transmission range was maintained at approximately 10 km. The data rate was set to 11 Mbps, a standard 802.11 rate. No error correction was used for the transmitted packets, and any packet with a single bit error was dropped. Circular trajectories with radii of 10 km were used. The reason for using circular trajectories was to introduce more stress into the test scenarios, as these trajectories result in more route breaks than the elliptical trajectories that would normally have been used. Some of the trajectories used clockwise movement, while others used anti-clockwise movement; this too was done to stress the test scenarios. The UAV speeds of the nodes varied between 300 and 400 km/h. The hello interval was maintained at 10 seconds. The above scenario parameters were kept consistent across all test scenarios. The performance metrics targeted were:

• Success rate, calculated as the percentage of packets successfully delivered to the destination node.
• Average end-to-end packet delivery latency, in seconds.
• Overhead, calculated as the ratio of control bits to the sum of control and data bits during data delivery, for comparability with reactive routing.
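For concreteness, the three metrics reduce to the following computations. The sketch uses made-up sample numbers, not data from this study.

```python
def success_rate(delivered: int, sent: int) -> float:
    """Percentage of packets successfully delivered to the destination."""
    return 100.0 * delivered / sent

def avg_latency(latencies_s: list) -> float:
    """Mean end-to-end delivery latency in seconds (delivered packets only)."""
    return sum(latencies_s) / len(latencies_s)

def overhead(control_bits: int, data_bits: int) -> float:
    """Control bits over total bits moved during data delivery."""
    return control_bits / (control_bits + data_bits)

# Illustrative numbers only:
print(success_rate(980, 1000))            # 98.0 (%)
print(overhead(10_000_000, 90_000_000))   # 0.1, i.e. 10% overhead
```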
All the above performance metrics were recorded, along with the average hops between sender and receiver nodes, for 20, 50 and 100 nodes, where the number of sending nodes was varied depending on the test scenario. The files used for data sessions were each 1 MB, and the packet size was 2 KB. In a session, all senders would start sending the 1 MB file simultaneously towards the CH. We provide an in-depth explanation for the 20-node graphs; the graphs for 50 and 100 nodes show similar trends, hence we do not repeat the explanations.
Fig. 5. Performance Graphs for 20 Node Scenario
Analysis of results for the 20 node test scenario: Figure 5 shows the four performance graphs based on results collected under the 20 node scenario. The number of senders was varied from 5 to 10 to 16; in the last case, as there were 4 data aggregation nodes, all other nodes (i.e., all CCs) were sending data to their respective CHs. The first graph plots the success rate versus the number of sending nodes. In the MMT based framework, the success rate was 100% as the number of sending nodes was increased from 5 to 10 to 16. For AODV and OLSR the success rate was high with 5 senders but decreased with an increasing number of senders. While the success rate for AODV dropped to 82%, for OLSR it dropped only to 87%. The success rate for OLSR with 10 senders is less than with 16 senders. This discrepancy becomes clear if we look at the average number of hops between sending and receiving nodes: with 5 senders the average hops recorded was 1.38, with 10 senders it was 1.32, and with 16 senders it dropped to 1.22. This happened because the 5 senders selected first were further away from the designated destination node. In the case of 10 senders, the added 5 senders were closer to the destination node, and when the last 6 senders were included they were closer still, bringing down the average hops and thus raising the packet delivery success rate. Between 5 and 10 senders, however, although the average hops dropped by 0.06, the success rate still decreased due to the increase in traffic in the network. A similar explanation holds for the MMT framework, where the average hops with 10 senders is lower than with 16 senders; however, this did not affect the success rate, and all packets were delivered successfully.

MMT and AODV show very low latency compared to OLSR. Due to the reduced success rate in the case of AODV, fewer packets were delivered, producing a dip in the average latency for 10 sending nodes: there is less data traffic in the network, and the packets that were taking longer did not make it to the destination. OLSR shows a higher latency due to the control traffic, which delays the data traffic.

The MMT solution has very low overhead compared to OLSR and AODV in all 3 cases of 5, 10 and 16 senders. This can be attributed to MMT's local recovery from link failures, as compared to OLSR, which requires resending updated link information, or AODV, which has to rediscover routes if the cached routes are stale. A second reason could be the reduced collisions and better throughput due to the BF-MAC. A point worth noting is that, though MMT adopts a proactive routing approach, its overhead is much lower than that of the reactive routing used in AODV, even with few sending nodes (5 senders).

Validation of the Comparison Process: It may seem to the reader that several improved variations of OLSR and AODV may have performed better than plain OLSR and AODV. However, it should be noted that the proposed framework outperforms OLSR and AODV significantly in all performance aspects, especially for the type of surveillance applications considered in this work.
This is despite the fact that the average number of hops between the sending and receiving nodes in the MMT framework is significantly higher than in OLSR in all 3 cases of 5, 10 and 16 senders, and comparable with AODV for 10 and 16 senders but higher in the case of 5 senders.
Fig. 6. Performance Graphs for 50 Node Scenario
Analysis of results for the 50 node test scenario: Figure 6 shows the four graphs for the 50 node scenario. The MMT based solution continues to maintain a success rate very close to 100% as the number of senders increases to 40, where all CCs send to their respective CHs. OLSR and AODV show a decrease in the success rate, with AODV's drop being larger than OLSR's at 40 senders; this can be attributed to the increased number of senders, a well-known phenomenon with reactive routing protocols. The average end-to-end packet delivery latency for OLSR is higher than for AODV, because of the higher average number of hops with 20 senders and the larger number of successfully transmitted packets at 40 senders. The end-to-end packet delivery latency for MMT is still quite low and comparable to that achieved with AODV, in which 15 to 35% of the packets were not delivered. The overhead with MMT is now at 10%, compared with around 20% for OLSR and over 30% for AODV.

Analysis of results for the 100 node test scenario: Figure 7 shows the four graphs for the 100 node scenario. MMT consistently exhibits performance similar to that seen for 20 and 50 nodes, with a slight increase in overhead and latency as the number of senders grows, even though its average hop count remains greater than those of AODV and OLSR. OLSR shows a further drop in the success rate compared to the 50 node scenario, due to the limitations faced when flooding the topology control messages. The AODV success rate starts at 75% and drops to 68% for 40 senders and 47.5% for 80 senders, as expected. Overhead for AODV is higher than in the 50 node scenario, as there are more discovery messages, while OLSR maintains an overhead between 20% and 30%.
Fig. 7. Performance Graphs for 100 Node Scenario
6 Conclusion

In this paper, we presented an integrated routing, clustering and MAC framework based on a meshed tree principle, where all three operations use features of the meshed tree algorithm. The framework was designed especially to handle airborne surveillance networks, collecting surveillance data with minimal data loss and in a timely manner. We evaluated the framework and compared it with two standard protocols, OLSR and AODV, providing comparable network settings in each case. The performance of the proposed solution indicates its high suitability for such surveillance applications.
Security in the Cache and Forward Architecture for the Next Generation Internet

G.C. Hadjichristofi¹, C.N. Hadjicostis¹, and D. Raychaudhuri²

¹ University of Cyprus, Cyprus
² WINLAB, Rutgers University, USA
[email protected], [email protected], [email protected]
Abstract. The future Internet architecture will be composed predominantly of wireless devices. It is evident at this stage that the TCP/IP protocol developed decades ago will not properly support the required network functionalities, since contemporary communication profiles tend to be data-driven rather than host-based. To address this paradigm shift in data propagation, a next generation architecture has been proposed: the Cache and Forward (CNF) architecture. This research investigates security aspects of this new Internet architecture. More specifically, we discuss content privacy, secure routing, key management and trust management. We identify security weaknesses of this architecture that need to be addressed and we derive security requirements that should guide future research directions. Aspects of the research can be adopted as a stepping stone as we build the future Internet.

Keywords: wireless networks, security, cache and forward, key management, trust management, next generation Internet.
1 Introduction

The number of wireless devices has increased exponentially in the last few years, indicating that wireless will be the key driver of future communication paradigms. This explosion of wireless devices has shifted the Internet architecture from one whose structure is based mainly on wired communication to a hybrid of wired and wireless communication. Wireless devices are no longer merely the edge devices of the Internet, but are also shifting into the role of mobile routers that transmit data over multiple hops to other wireless devices. In the current Internet, TCP/IP was designed as the network protocol for transmitting information and has served the Internet well for several decades. However, wireless connections are characterized by intermittent, error-prone, and low-bandwidth connectivity, which causes TCP to fail [1]. Therefore, the nature of the networking problem is now different, requiring a drastic shift in the solution space and, with that, a new Internet architecture. These next-generation Internet architectures aim to shift away from TCP/IP-based communication that assumes stable connectivity between end-hosts, and instead move into a paradigm where communication is content-driven.
A recently proposed next-generation Internet architecture is the Cache and Forward (CNF) architecture. The objective of this architecture is to move files or packages from source to destination over both wired and wireless hops as connectivity becomes available, i.e., to use opportunistic transport. The architecture is built on a subset of existing Internet routers and leverages the decreasing cost of memory storage. Due to the difference in the operation of this architecture, security aspects (such as key management) need to be revisited and augmented accordingly. This research is an investigation of security aspects in the future Internet architecture. We investigate ways in which the CNF architecture can be used to provide the required security regarding data privacy, secure routing of files at higher OSI layers (i.e., at the CNF layers), key management, and trust management. The aim of this paper is not to present complete solutions for the aforementioned security areas, but rather to present the security strengths and weaknesses of the CNF architecture and to discuss possible solution scenarios, as a means to point out security vulnerabilities and to motivate and direct future research. Based on the discussion we extract key challenges that need to be addressed to provide a more complete system security solution. It is important to ensure that security is built into systems to allow the secure and dynamic access of information. To the best of our knowledge, this is the first investigation of security issues in this architecture. Section 2 describes the CNF architecture. Section 3 provides the security analysis of the CNF architecture and extracts the security requirements for this new architecture; topics covered are content privacy, secure routing, key management, and trust management. Section 4 concludes the paper.
2 Cache and Forward Architecture

Existing and, even more so, future Internet routers will have higher processing power and storage. In the CNF architecture it is envisioned that the wired core network consists of such high-capacity routers [2]. It is not necessary that all the nodes in the network have high-capacity storage; we use the term CNF routers to signify routers with these higher capabilities. In addition to CNF routers, the future Internet will have edge networks with access points called post offices (POs), and multi-hop wireless forwarding nodes called Cache and Carry (CNC) routers. POs are CNF routers that link mobile nodes to the wired backbone and act as post offices by holding files for mobile nodes that are disconnected or unavailable. CNC routers are mobile wireless routers with relatively smaller capacity compared to CNF routers. The storage cache of nodes in the CNF architecture is used to store packets in transit, as well as to offer in-network caching of popular content. The unit of transportation in the CNF architecture is a package. A package may represent an entire file, or a portion of a file when the file is very large (e.g., a couple of gigabytes). Therefore, it is expected that fragmentation of files will be executed by the CNF architecture. Fragmentation allows more flexibility in terms of routing and Quality of Service (QoS), and makes data propagation more robust over single CNF hops, especially between wireless devices. In this paper, we use the terms package and file interchangeably to denote the unit of transportation within the CNF architecture.
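Fragmentation of a large file into packages might look like the following sketch. The package size is an arbitrary illustrative choice, since the paper does not fix one, and the function names are ours.

```python
def fragment(file_bytes: bytes, package_size: int = 1 << 20) -> list:
    """Split a file into CNF packages (here 1 MB each; size is illustrative).
    Small files travel as a single package."""
    return [file_bytes[i:i + package_size]
            for i in range(0, len(file_bytes), package_size)] or [b""]

def reassemble(packages: list) -> bytes:
    """Inverse operation, performed by the CNF transport at the receiver."""
    return b"".join(packages)

blob = b"x" * (3 * (1 << 20) + 5)       # a file of 3 MB plus 5 bytes
pkgs = fragment(blob)
assert len(pkgs) == 4 and reassemble(pkgs) == blob
```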
The main service offered by this architecture is content delivery that overcomes the pitfalls of TCP/IP in the face of the intermittent connectivity that characterizes wireless networks. Files are transferred hop-to-hop in either “push” or “pull” mode, i.e., a mobile end user may request a specific piece of content, or the content provider may push the content to one (unicast) or more (multicast) end users. A performance evaluation of the CNF architecture is out of the scope of this paper and has been offered in [3]-[7]. Fig. 1 shows the concept of the CNF network. At the edge of the wired core network, POs such as CNF1 and CNF4 serve as holding and forwarding points for content (or pointers to content) intended for mobiles, which may be disconnected at times. The sender, Mobile Node 1 (MN1), forwards the file, or portions of the file, to the receiver’s PO (CNF4) using conventional point-to-point routing. CNF4 holds the file or pointer until contacting MN3 to arrange delivery. Delivery from CNF4 could be by direct transmission if the mobile is in range, or by a series of wireless hops as determined by the routing protocol. In the latter case a CNC node, such as MN2, is used. A GUI-based demonstration of the operation of the CNF architecture, as opposed to the traditional TCP type of communication, has been developed and can be viewed at [8].
Fig. 1. The Cache and Forward Architecture
The CNF architecture operates above the IP layer and its operation is supported by a series of protocols that handle the propagation of packages (see Fig. 2). The role of the various CNF layers is similar to that of the OSI layer stack, but their scope is different, as they focus on the handling of packages. The CNF Transport Protocol (CNF TP) is responsible for sending content queries and receiving content; content fragmentation and reassembly are also implemented here. It also checks content for errors and maintains a content cache. The CNF Network Protocol (CNF NP) is responsible for content discovery and for routing content towards the destination after it has been located in the network. The CNF Link Protocol (CNF LP) is designed to reliably deliver the IP packets of a package to the next hop. The control plane of the CNF architecture is supported by three protocols. The routing protocol is responsible for establishing routing paths across CNF routers. The Content Name Resolution Service (CNRS) provides an indexing mechanism to map a content ID to multiple locations of the content. The location closest to the client can be chosen for content
Security in the CNF Architecture for the Next Generation Internet
331
retrieval. The Cache Management Protocol (CMP) is used to facilitate content discovery by maintaining and updating a summary cache containing all the contents that are cached within an autonomous system (AS). Nodes within an AS update the summary cache, and adjacent AS gateways exchange summary cache information.
Fig. 2. The Cache and Forward protocol stack
3 Security Analysis

In this section we look at security in the CNF architecture. Before proceeding with the security analysis, we describe some key aspects of the structure of the CNF architecture that are considered in our analysis. The objective of the CNF architecture is to overcome intermittent connectivity over wireless links and to facilitate communication among a multitude of wireless devices. Thus, we envision the existing wired Internet as a sphere surrounded by an increasing number of wireless devices, as shown in Fig. 3. Outer layers represent wireless devices at different numbers of hops away from the wired Internet. For a specific flow, the number of outer layers or hops represents the number of wireless routers required to provide service to nodes (see Fig. 3). This figure emphasizes that communication may come in different forms over this spherical representation of the future Internet. We classify the communication patterns of the CNF architecture into three generic variations: 1) communication strictly within the wireless Internet or strictly within the wired Internet; 2) communication from the wireless Internet to a node that belongs to the wired Internet, and vice versa; and 3) communication that links two wireless parts of the Internet through the existing wired infrastructure. The third communication pattern poses more of a challenge in this architecture than the first and second, as it is more dynamic due to changes in connectivity. Furthermore, wireless nodes may move and connect to the Internet through different CNC routers that may belong to different POs that are not even collocated. Therefore, the communication patterns change dynamically as connectivity to POs and other wireless devices varies. This architecture uses content caching, which introduces more complexity, as the communication patterns for acquiring specific content can vary with time due to the dynamically changing number of CNF routers holding that specific file. Caching is vital in this architecture, as it can decrease bandwidth utilization in the Internet by increasing content availability. During content propagation over the Internet, CNF
routers may save a copy of a file prior to transmission. Nodes may then obtain data from a cache that is closer than the original source. Although an investigation of cache optimization over the CNF architecture is offered in [3], security aspects were not taken into account.
Fig. 3. Layering in the CNF architecture. Multiple layers representing intermediate wireless routers can exist.
In this research, we assume that the CNF architecture provides a methodology to name content. The development of supporting mechanisms for such a service has not been addressed, and no issues of security have been taken into account.

3.1 Privacy and File Propagation

One of the key aspects of data propagation is that entire files can be located on CNF routers. This functionality enables the caching of content at specific CNF locations while in transit. In terms of security, this aspect provides the benefit of stopping malicious activity early in its transmission stage by having the CNF router check the content and validate that it is not a virus or spam. Furthermore, it can counteract attacks whose aim is to overload the network and simply consume its bandwidth, e.g., by checking for repeated transmissions of the same content. However, it concurrently enables access to sensitive content, as entire files are located on routers; it thus breaches privacy policies for the specific file. Caching over this architecture further complicates privacy issues, as it increases the exposure of sensitive content. A CNF router can be dynamically configured via a set of security policies to execute selective caching and stop caching sensitive content. However, there is a need to verify that such security policies have been correctly executed, i.e., that a file has not been cached. Such verification is complicated, as a control mechanism needs to be in place across CNF routers to check for possible propagation of specific content that should have been deleted by the routers. Furthermore, such a selective mechanism provides privacy from the perspective of
limiting the disclosure of information by limiting the number of cached copies of sensitive files. However, it does not really prevent CNF routers from viewing the content prior to transmission. Therefore, there still exists the issue of how privacy can be guaranteed while harnessing the advantages of spam control or virus detection, which is an inherent security strength of the CNF architecture.

To promote privacy, cryptographic methods need to be utilized such that propagation of entire files from one CNF router to the next minimizes the disclosure of information. Typically, cryptography utilized at the user level, i.e., between two end users, can hinder the disclosure of information and provide privacy. However, Internet users do not tend to utilize cryptographic methods, for a variety of reasons, such as lack of knowledge regarding security mechanisms, the hidden belief that the Internet is safe, or carelessness in handling their company’s data. Regardless, at the Internet architecture level, end-to-end cryptography is not desirable because it removes the benefits that can be obtained through this architecture in terms of spam or virus control: CNF routers can no longer analyze the content of encrypted files. A simple way around this is to check the content for spam or viruses prior to transmission and then encrypt it. More specifically, the CNF architecture protocol that assigns content identifiers to files could also verify and sign that content with a public key signature, certifying that it is spam- or virus-free. The assigned content identifier can also be checked by CNF routers for replay attacks that aim to consume network bandwidth. This approach can be a good preliminary solution, but it does not account for the vast amount of data generated every second on the Internet. It is virtually impossible to have a service verify the validity of the content prior to assigning a content identifier, because vast amounts of data are being generated from all possible locations on the Internet and from all types of nodes and networks. A 2008 International Data Corporation analysis estimated that 281 exabytes of data were produced in 2007, which is equivalent to 281 trillion digitized novels [9]. Therefore, the challenge is to have the distributed service that provides globally certifiable content identifiers check the content and guarantee virus- or spam-free content. A variation on the above solution would be to assign the responsibility of checking content to the CNF routers: CNF routers close to the source (i.e., at the initial steps of the data propagation) can check the data for spam and then carry out encryption and provide privacy.

The above methodology of handling content provides spam- or virus-free content as intended by this architecture. However, relating these mechanisms back to the original scope of the architecture, encrypting files to provide confidentiality across CNF routers hinders caching. To provide privacy, content is encrypted at the first CNF router close to the source and decrypted by the CNF router closest to the destination. When using symmetric key cryptography between these two CNF routers, an encrypted file that is cached at any intermediate CNF router cannot be accessed. The file can only be decrypted by the two CNF routers that have the corresponding key, i.e., the key that was originally used to facilitate the communication. Even though privacy is provided, caching is infeasible.
Caching on intermediate CNF routers along the routing path cannot work, as an intermediate CNF router would need the symmetric key used by the two CNF routers in order to redistribute the encrypted cached file. Utilizing public keys still cannot address this issue: a package encrypted with the public key of a CNF router will provide privacy of the content, as only the CNF router with the
corresponding private key can decrypt the content. Thus, a file en route cannot be cached and redistributed as needed. It is evident that more complex content distribution schemes are needed among CNF routers to balance the requirements of privacy and caching, while allowing the inherent security features (such as spam or virus control) of this architecture to exist. These solutions should strategically cache files on intermediary CNF routers while concurrently allowing them to decrypt and redistribute files to multiple parties. To achieve this, symmetric keys could be shared among dynamically selected groups of CNF routers (a minimal sketch of one such scheme appears below). These group selections need to be integrated with caching algorithms that optimize content availability. In addition, the selection of groups of CNF routers to handle specific content needs to be in accordance with the privacy requirements of the content. This is a topic of future work that needs to be addressed for this architecture.

Summarizing, content in the CNF architecture must be handled at its initial stages of transmission to provide spam- or virus-free content. This checking can be executed during the provision of a content identifier or at the first CNF router close to the source. Issues of privacy need to be revisited with caching in mind, such that strategically chosen CNF routers can increase availability while complying with the content security requirements. The key management system should provide the key distribution methodologies needed to support dynamic group keys for dynamically formed groups of CNF routers based on traffic patterns.

3.2 Secure Routing over the CNF Architecture

Our focus in terms of secure routing is to look at ways in which the selection of secure paths for the propagation of packages can be guided through the internal components of the CNF architecture. Even though packetization of data in the future Internet may still be facilitated by the IP protocol, the CNF architecture operates at higher layers, as shown in Fig. 2 by the CNF TP, CNF NP, and CNF LP. The CNF architecture deals with the propagation of data files or packages between CNF routers. Thus, the CNF routers create an overlay network with a topology different from the actual connectivity of the existing routers. The aspect that needs to be investigated in terms of secure routing at the CNF layers is the security requirements of the content. The CNF architecture needs to facilitate the creation of a content-based security classification. More specifically, information may have restrictions in terms of how, when, and where it is propagated. Some content may need to traverse the Internet through specific routes: it is evident that certain locations in the world may have more malicious activity than others, or that specific data should not be disclosed to certain areas. In addition, data may need to propagate within specific time boundaries, or within specific periods after which they have to be deleted. Moreover, data exposure requirements in terms of visibility may differ based on the content. These content security requirements create a complex problem, as they need to be taken into account while trying to optimize caching and address other functional aspects of the CNF architecture, such as QoS. Another aspect of secure routing is trust management. Trust can be used as a means to guide the selection of routing paths for packages.
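Before turning to trust, the group-key idea from Section 3.1 can be made concrete with envelope encryption: a package is encrypted once under a fresh content key, and that key is then wrapped for each CNF router in a dynamically selected caching group, so any group member can decrypt, cache, and redistribute the file. This is only our sketch of one possible realization (using symmetric key wrapping via the Python `cryptography` library); the paper explicitly leaves the actual scheme as future work.

```python
from cryptography.fernet import Fernet

def protect_package(package: bytes, group_keys: dict):
    """Encrypt a package once, then wrap the content key for every CNF
    router in the caching group."""
    content_key = Fernet.generate_key()
    ciphertext = Fernet(content_key).encrypt(package)
    wrapped = {router: Fernet(k).encrypt(content_key)
               for router, k in group_keys.items()}
    return ciphertext, wrapped

def open_package(ciphertext, wrapped, router_id, router_key):
    """Any group member recovers the content key, so cached ciphertext
    remains decryptable and redistributable within the group."""
    content_key = Fernet(router_key).decrypt(wrapped[router_id])
    return Fernet(content_key).decrypt(ciphertext)

# Hypothetical caching group; in practice group membership would be driven
# by caching algorithms and the content's privacy requirements.
group = {r: Fernet.generate_key() for r in ("CNF2", "CNF3", "CNF4")}
ct, wk = protect_package(b"sensitive-file", group)
assert open_package(ct, wk, "CNF3", group["CNF3"]) == b"sensitive-file"
```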
Over the years several methods have been developed that enable the dynamic assessment of the trustworthiness of nodes. This assessment is done by looking at certain functionalities,
such as packet forwarding, and allowing routers to check and balance one another. CNF routers can be graded for trust at the IP layer by counting IP packet forwarding (or, generally, packet forwarding below the CNF layers). However, not every router on the Internet will have CNF capabilities, which implies that non-CNF routers within one-hop connectivity of CNF routers would need to evaluate CNF routers. This mechanism requires communication between the overlay of CNF routers and the non-CNF routers, since reporting needs to be communicated to, and made believable by, the CNF routers. If such integration is not feasible, then reporting will need to be executed among CNF routers. In this case, reputation in the CNF overlay will provide granularity at the CNF layer, meaning that assessments have to be made in terms of packages and not IP packets (or lower-layer packets); more specifically, the control mechanism will need to exist among CNF routers. The checks executed to evaluate trust can extend beyond forwarded packages: they can cover the correctness of the applied cryptographic algorithms, the correctness of the secure routing operation at the CNF layers, and compliance with the security policies regarding content requirements. If there is an overwhelmingly large amount of content to be handled, the checks and balances may be confined to randomly chosen packages or to packages with highly sensitive content.

Summarizing, there is a need to classify content based on its security and functional requirements, such as QoS, to effectively execute secure path selection over the CNF overlay. Trustworthiness information derived from internal characteristics of the operation of the CNF architecture can further guide routing decisions among CNF routers.

3.3 Key and Trust Management

A Key Management System (KMS) creates, distributes, and manages the identification credentials used during authentication. The role of key management is important, as it provides a means of user authentication. Authentication is needed for two main reasons: accountability and continuity of communication. Knowing a peer's identity during a communication provides an entity that can be held accountable for any malicious activity or for any untrustworthy information shared. In addition, it provides a link for continuing communication in the future. This continuity enables the communicating parties to assess and build trust towards one another based on their shared experiences of collaboration. Thus, the verification of identity is linked with authentication and provides accountability, whereas behavior grading is linked with continuity of communication. These aspects have been treated separately, through key management and trust management respectively. The authors in [10] argue that in social networks, as trust develops, it takes the form of an identity (i.e., identity-based trust), since two peers know each other well and know what to expect from one another. Thus, it is important for the future Internet that emphasis is placed on the link between the two areas. Verifying an identity’s credentials does not and should not imply that a peer is trustworthy, as trust is a quantity that typically varies dynamically over time. Authentication implies some initial level of trust, but individual behavior should dynamically adjust that initial trust level. Therefore, trust management needs to be taken into account so as to assess the reliability of authenticated individuals.
In the CNF architecture, global user authentication is required for the multitude of wireless nodes that dynamically connect to the Internet via POs. Thus, in our description of trust and key management, we focus on the wireless Internet that the CNF architecture aims to accommodate. More specifically, wireless nodes need to have an identity that is believable across multiple POs, enabling secure communication between wireless nodes that may not be in the same wireless network. POs have a vital role in terms of key management because they are the last link on the wired Internet and are responsible for package delivery to MNs. Their location is empowering, as it enables them to link wireless nodes to the rest of the Internet. POs deliver data to wireless nodes by transmitting files using opportunistic connectivity. Thus, they can act as delegated certificate authorities in a public key infrastructure, verifying the identities of nodes and managing the certificates of wireless nodes. Their location is also key in that it can facilitate integration with trust management.

Until now there has been no global methodology for quantifying trust. Such a global metric requires an Internet architecture that enables the extraction of this information dynamically at a global level. One of the main differences between the existing Internet architecture and the CNF architecture is the paradigm of communication: in the CNF architecture the emphasis is placed on content. Another difference is the presence of POs at the edges of the wired Internet. Using these structural characteristics, the CNF architecture can provide a base on top of which to address trust and key management. More specifically, the coupling of identification and trustworthiness can be achieved by utilizing the aforementioned two characteristics of the CNF architecture: (1) the content of the data, and (2) the location of POs.

Since the CNF architecture is content-driven, the content can provide some form of characterization of an identity, so as to better understand the trust that can be placed on a user. For example, an entity that downloads movies can be placed on a different trust level compared to an entity that downloads documents regarding explosives. Utilizing such functionality to provide trust for this architecture requires the careful marking of content to indicate certain categories of data classification that can characterize trust. In addition to this classification, the location of POs within the CNF architecture can further assist in assessing trust. Nowadays certain areas in the world suffer from higher crime tendencies than others, and it is increasingly possible that Internet activity will reflect that behavior. Based on this conjecture, one can approximate that if sensitive data are acquired by POs at specific locations, the level of trust must be adjusted taking the locations of the POs into account as well. Trust metrics based on PO location can be carefully selected to reflect malicious activity in those locations; those values would have to be monitored and dynamically adjusted over time. (Note that IP address assignment is handled by IANA, and therefore the location of POs can be obtained with some level of accuracy.) Utilizing the above architectural characteristics allows a PO to manage certificates for all the wireless nodes that it services.
Based on the specific data that nodes acquire, POs can introduce some form of trust marking on the certificates that they publish, characterizing the behavior of wireless nodes. That information can guide the decision on security association (SA) establishment during authentication. Other criteria that the POs can record are the type of node (e.g., visiting, duration of visit) and activity in
bytes downloaded. In addition, those trust marking decisions could be further guided by metrics of behavior obtained from local reputation mechanisms within a wireless network (e.g., whether wireless nodes collaborate with their peers by forwarding packets). Overall, this trustworthiness information can guide future interactions among nodes. An overview of previous work on extracting trust information in mobile wireless environments is offered in [11].

At the global scale, there is also a need to grade and monitor the trustworthiness of the POs that act on behalf of the wireless network. As when grading nodes in the wireless network, the type of content and the location of the POs can be considered; to grade a PO, however, the paths of data content to/from that PO can also be considered. A distributed mechanism can be introduced in the CNF overlay to grade the trustworthiness of POs in the Internet based on the flows they handle. Fig. 4 demonstrates the various flows that may exist: the POs form a circle at the edge of the wired Internet, and data paths flow in multiple directions.
Fig. 4. Directions of flows in the CNF architecture
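As a toy illustration of PO-side trust marking, the sketch below combines a content-category penalty with a location-risk adjustment into a bounded trust score. Every category, weight, and function name here is a hypothetical placeholder; the paper explicitly leaves the design of meaningful trust metrics as an open problem.

```python
# Hypothetical placeholders; deriving real values is the open problem
# discussed in the surrounding text.
CONTENT_RISK = {"movies": 0.05, "documents": 0.0, "explosives": 0.60}
LOCATION_RISK = {"low-crime-region": 0.0, "high-crime-region": 0.25}

def trust_mark(downloads: list, po_location: str,
               local_reputation: float) -> float:
    """Trust score in [0, 1] that a PO could stamp on a node's certificate,
    blending content categories, PO location, and local reputation reports."""
    content_penalty = sum(CONTENT_RISK.get(c, 0.1) for c in downloads)
    score = local_reputation - content_penalty - LOCATION_RISK[po_location]
    return max(0.0, min(1.0, score))

# Repeated sensitive downloads lower the mark more than a single one,
# echoing the frequency concern raised below.
print(trust_mark(["movies"], "low-crime-region", 0.9))           # 0.85
print(trust_mark(["explosives"] * 3, "high-crime-region", 0.9))  # 0.0
```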
Based on the integration with trust information, a KMS can provide trust criteria for authenticated users at a global scale. One aspect that needs to be addressed is whether one can predict the future behavior of nodes based on their existing trust level. This need for behavior prediction opens up the question of whether trust can be modeled at a global scale, and how accurately it can be assessed dynamically. Another issue is the notion of positive grading for trust. For example, downloading information about helping other human beings or helping the environment may not indicate the presence of a trustworthy destination. If that aspect is used to improve the trust of a destination, then the issue that needs to be resolved is that certain data may be requested simply to trick the grading mechanism into improving one’s trustworthiness in the Internet environment. Another aspect that needs to be considered is the frequency with which certain data types are directed to specific POs: acquiring one file about making explosives may not be the same as acquiring a hundred such documents. In terms of the type of information, it is very important that the categories that characterize trust are carefully selected. If a person acquires medical documents about
diseases, and someone else requests physics-related information, that does not necessarily translate to a specific level of trust. There is a need to investigate how the type of data or content can represent the trust placed on nodes and, indirectly, on users. In addition, there is a need to link different types of content that may indicate certain trends of behavior. Such a classification is a complex issue, as it is rooted in the complexity of human societal behavioral patterns.

Summarizing, POs in the CNF architecture can provide identification criteria to the multitude of wireless devices and link them to the Internet. Identity verification coupled with behavior at the local and global scale can guide the trust placed on interactions among peers. Further research is required to come up with meaningful ways of assessing trust based on the content-driven paradigm of communication of the CNF architecture.
4 Conclusions

In this research, we have looked at security aspects of the CNF architecture, identified strengths and weaknesses, and derived security requirements that need to be taken into account. We have discussed possible security solutions for data privacy, secure routing, key management and trust management, while concurrently identifying future issues that need to be addressed. Even though the CNF architecture has functional benefits in terms of overcoming the limitation of intermittent connectivity in the now predominantly wireless Internet, it also introduces security challenges. A balance needs to be found between data content privacy, caching, and secure routing. We need to ensure content privacy while taking advantage of security features that naturally emerge from the CNF architecture, such as spam control, virus control, and defenses against related attacks. In addition, since the architecture is content-driven, there is a need to define content security requirements and to route based on those requirements. However, for secure routing to exist there is also a need to assess the trustworthiness of CNF routers, which requires additional mechanisms to be in place to assess the correct operation of CNF routers. The need for trustworthiness implies the existence of authentication, as it provides the base on which to build trust. Authentication of the multitude of wireless devices in the CNF architecture may be facilitated by a key management system operated by the POs; their location enables the integration of key management with trustworthiness information. Since the CNF architecture is content-driven, trust can be extracted by examining the content flows through POs. This brings the challenge of coming up with meaningful ways of evaluating trust at a global scale to match the requirements of users or applications. These issues need to be carefully assessed in the future, keeping other aspects, such as QoS, in mind as well. Overall, this analysis has brought to light the security requirements and open research issues that exist in the CNF architecture. Aspects of this investigation can serve as guidance for the design of secure future Internet architectures.

Acknowledgments. George Hadjichristofi was supported by the Cyprus Research Promotion Foundation under grant agreement (ΤΠΕ/ΠΛΗΡΟ/0308(ΒΕ)/10). Christoforos Hadjicostis would also like to acknowledge funding from the European
Commission’s Seventh Framework Programme (FP7/2007-2013) under grant agreements INFSO-ICT-223844 and PIRG02-GA-2007-224877. Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the authors and do not necessarily reflect the views of CRPF or EC.
References
1. Zhang, Z., Wang, X., Li, Y.: Analyzing the Link Connectivity Characteristic in Intermittently Connected Wireless Networks. In: Second International Conference on Information and Computing Science, vol. 2, pp. 241–244 (2009)
2. Dong, L., Liu, H., Zhang, Y., Paul, S., Raychaudhuri, D.: On the Cache-and-Forward Network Architecture. In: ICC, pp. 1–5 (2009)
3. Dong, L., Liu, H., Zhang, S., Raychaudhuri, D.: Gateway Controlled Content Caching and Retrieval for Cache-and-Forward Networks. In: Globecom Workshop EFSOI (2009)
4. Saleem, A.: Performance Evaluation of the Cache and Forward Link Layer Protocol in Multihop Wireless Subnetworks. Master’s Thesis, Rutgers University, WINLAB (2008)
5. Liu, H., Zhang, Y., Raychaudhuri, D.: Performance Evaluation of the Cache-and-Forward (CNF) Network for Mobile Content Delivery Services. In: ICC, pp. 1–5 (2009)
6. Jain, S., Saleem, A., Liu, H., Zhang, Y., Raychaudhuri, D.: Design of Link and Routing Protocols for Cache-and-Forward Networks. In: Sarnoff Symposium (2009)
7. Paul, S., Yates, R., Raychaudhuri, D., Kurose, J.: The Cache-And-Forward Network Architecture for Efficient Mobile Content Delivery Services in the Future Internet. In: Proceedings of the First ITU-T Kaleidoscope Academic Conference on Innovations in NGN: Future Network and Services (2008)
8. Transmission Control Protocol vs. Cache and Forward, http://www.eng.ucy.ac.cy/trust/Applets/BasicApp.html
9. Berman, F.: Got data?: A guide to data preservation in the information age. Communications of the ACM 51(12), 50–56 (2008)
10. Lewicki, R.J., Tomlinson, E.C.: Trust and Trust Building. Beyond Intractability. In: Burgess, G., Burgess, H. (eds.) Conflict Research Consortium, University of Colorado, Boulder (Posted: December 2003), http://www.beyondintractability.org/essay/trust_building/
11. Mejia, M., Peña, N., Muñoz, J.L., Esparza, O.: A review of trust modeling in ad hoc networks. Internet Research 19(1), 88–104 (2009)
Characterization of Asymmetry in Low-Power Wireless Links: An Empirical Study
Prasant Misra¹, Nadeem Ahmed¹, Diethelm Ostry², and Sanjay Jha¹
¹ School of Computer Science and Engineering, The University of New South Wales, Sydney, Australia
{pkmisra,nahmed,sanjay}@cse.unsw.edu.au
² CSIRO ICT Centre, Sydney, Australia
[email protected]
Abstract. Experimental studies in wireless sensor networks (WSNs) have shown that asymmetry in low-power wireless links has a significant effect on the performance of WSN protocols. Protocols that work in simulation studies often fail when link asymmetry is encountered in real deployments. Therefore, the characterization of link asymmetry is important for the design and operation of resilient WSN protocols in real scenarios. This paper details an empirical study to characterize link asymmetry in WSNs. It presents a systematic approach to measuring the effects of hardware performance and environmental factors on link asymmetry using off-the-shelf WSN devices. It shows that, for the given hardware platform, transmitter power and receiver sensitivity are the major factors responsible for asymmetry in low-power wireless links, while frequency misalignment in the transmitter and power variations in the antenna are unlikely causes. Keywords: wireless link, low-power wireless, measurement study.
1 Introduction
The practical realization of wireless systems is challenged by the assumptions made in theoretical models, primarily because of the complexity of the wireless channel, i.e., the total environment between the transmitter and the receiver, including reflectors, absorbers, etc., in which the radio signals propagate. As a result, depending on the operating environment, the robustness of wireless systems is affected by the communication channel conditions. Therefore, studies are required to analyze the characteristics of the wireless channel under varying channel conditions, which can provide pointers for accurate modeling. The models used in the simulation study of low-powered wireless systems make simplifying assumptions about the characteristics of the wireless link. The properties of these links, as observed in real-world scenarios, differ from these models, especially in their asymmetric nature. Link asymmetry is a common phenomenon observed in devices utilized in sensor networks, and it has a significant impact on the performance of various protocols. Protocols that work in simulation studies often fail when link asymmetry is encountered in real deployments.
Hence, understanding it is vital for designing and developing reliable, robust, and energy-efficient protocols with higher throughput that would prolong the network lifetime. In this paper, we provide an empirical study to characterize link asymmetry in WSNs. We address this problem by analyzing various system components (i.e., transmitter, antenna, receiver, and transmit/receive modes), identifying deviations in their functionality, and assessing the influence of environmental conditions. In addition to using traditional methods of data collection and off-line processing and analysis, we capture real-time data using a spectrum analyzer for many of our experiments. Our results agree with previous work and provide new evidence for the possible causes of link asymmetry. We show that frequency misalignment in the transmit mode and the standard antennas used in the WSN motes are unlikely causes of link asymmetry. The major contributing factors are found to be the transmitter power and the receiver sensitivity. The remainder of the paper is organized as follows: Section 2 provides details of our experimental methodology in terms of the radio, hardware, and experimental setup (topology) used in our study. Section 3 presents an evaluation of the various factors responsible for the link asymmetry problem. Related work on low-powered wireless links is reviewed in Section 4. Finally, a discussion of future work and the conclusion is provided in Section 5.
2 Experimental Methodology
This section describes the low-powered radios, hardware platforms, and the testbed (or topology) utilized in our study. Chipcon CC1000 [1] and CC2420 [2] radios were used in our evaluation; however, the majority of the experiments were performed using the CC2420. The CC1000 is an 868/916 MHz radio used in the Mica2. It supports the frequency shift keying (FSK) modulation scheme with Manchester encoding, and is capable of sustaining a maximum data rate of 38.4 kBaud. The CC2420 is a ZigBee-compliant 2.4 GHz IEEE 802.15.4 radio (used in the MicaZ and TelosB) that can be tuned to scan channels from 11 (2.405 GHz) to 26 (2.480 GHz), each separated by 5 MHz. It uses offset quadrature phase shift keying (OQPSK) and direct sequence spread spectrum (DSSS) to send chips at 2 MHz. A data rate of 250 kbps is achieved by encoding a 4-bit symbol with 32 chips. It measures the RF signal strength of the received packet (in dBm) by averaging over eight symbol periods (128 µs). The experiments were performed using Mica2, MicaZ, and TelosB [3] motes as the primary hardware platforms. A detachable monopole antenna of one-quarter wavelength insulated wire is attached to the Mica2 and MicaZ motes through the MMCX connector. The length of the antenna is approximately 3.20 inches and 1.20 inches, respectively, for these platforms. The TelosB has an internal 2.4 GHz Planar Inverted Folded Antenna (PIFA) built into the printed circuit board and tuned to match the radio circuitry. Different experimental setups were created for the evaluation.
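As a quick aside, the channel-to-frequency mapping and chip-rate arithmetic quoted above can be checked with a few lines of code. The following sketch is ours, not part of the original study; it simply encodes the IEEE 802.15.4 figures stated in this section.

```python
# Illustrative sketch (not from the paper): IEEE 802.15.4 2.4 GHz band
# channel-to-frequency mapping and the CC2420 chip-rate arithmetic.

def channel_center_mhz(channel: int) -> int:
    """Center frequency in MHz for channels 11..26 (5 MHz spacing)."""
    assert 11 <= channel <= 26
    return 2405 + 5 * (channel - 11)

# 250 kbps data rate with 32 chips per 4-bit symbol -> 2 Mchips/s
DATA_RATE_BPS = 250_000
CHIPS_PER_SYMBOL = 32
BITS_PER_SYMBOL = 4
chip_rate = DATA_RATE_BPS * CHIPS_PER_SYMBOL // BITS_PER_SYMBOL

print(channel_center_mhz(26), "MHz")  # 2480 MHz, the channel used later in Setup-C
print(chip_rate, "chips/s")           # 2000000
```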
Fig. 1. Experimental setup: (a) Setup-A — transmitter measured over the air by a spectrum analyzer; (b) Setup-B — transmitter with antenna removed, connected to the spectrum analyzer through an attenuator via an RF cable
Setup-A: A sensor mote was configured as the transmitter to send 6000 broadcast packets, each of length 29 Bytes, at the rate of 100 packets/second with an inter-packet interval (IPI) of 10 ms, and at the highest transmission power of 0 dBm (Figure 1-(a)). Setup-B: This is the wired version of Setup-A (Figure 1-(b)). The antenna was removed from the respective mote, and the mote was connected to a spectrum analyzer (SA) by a coaxial cable through an attenuator (configured with an attenuation level of 25 dB). Setup-C: A straight-line topology was created using 4 MicaZ motes (numbered 0, 3, 5, 7) (Figure 2-(a)). Mote 0 was configured as the base-station (BS), and the remaining motes 3, 5, and 7 were placed at distances of 1 m, 5 m, and 10 m from Mote 0. Motes 3, 5, and 7 send data packets containing a sequence number (SN) in the payload to Mote 0, which receives each packet and sends an acknowledgment to the respective sender by copying the SN of the received packet into the payload. A simple TDMA approach was utilized wherein each mote transmitted
Fig. 2. Experimental setup: (a) Setup-C — two-way communication between Mote-0 (BS) and Motes 3, 5, and 7 placed at 1 m, 5 m, and 10 m; (b) Setup-D — one-way communication over the same topology
in its respective time-slot only, thereby eliminating packet collisions from other motes in the same experiment. These motes were also programmed to communicate over ZigBee channel 26 to avoid external interference from WiFi. The RF transmit power was set at its minimum level of -25 dBm to create intermediate links [14]. All the motes were connected to a logging workstation to record the received signal strength indicator (RSSI), link quality index (LQI), and SN of each packet exchanged. The motes were interfaced with programming boards powered by line current, thereby eliminating any irregularities in the observed data due to variations in supply power levels that may arise as battery power is exhausted. This setup was used with different configurations of IPI (100 ms, 200 ms, and 1000 ms) and packet length (PL) (29 Bytes, 39 Bytes, and 49 Bytes), each for a period of 24 hours. Setup-D: This is the one-way version of Setup-C (Figure 2-(b)), wherein only the BS broadcast packets to Motes 3, 5, and 7, while all other parameters and configurations remained the same.
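The TDMA discipline of Setup-C can be pictured with a small sketch. Everything below is a hypothetical illustration of the scheme described above — the slot duration is our assumption, not a value reported by the authors.

```python
# Hypothetical sketch of Setup-C's TDMA slotting: sender motes 3, 5, 7
# each own one slot per round, so their transmissions never collide.
SENDERS = [3, 5, 7]
SLOT_MS = 200  # assumed slot duration

def slot_owner(t_ms: int) -> int:
    """Mote allowed to transmit at time t_ms; the BS answers with an ACK
    echoing the packet's sequence number (SN)."""
    return SENDERS[(t_ms // SLOT_MS) % len(SENDERS)]

assert [slot_owner(t) for t in (0, 200, 400, 600)] == [3, 5, 7, 3]
```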
Fig. 3. Spectrum envelope of the transmitting (a) Mica2 and (b) MicaZ motes
3 Performance Evaluation
3.1 Transmitter
The aim of this study was to measure the transmitter frequency alignment for different motes. The observation data was collected using experimental Setup-A in an indoor office environment. An Anritsu SA, bearing model number MS2721A [4], was used to capture the spectrum envelope of the transmitting mote. The SA was configured to the maximum-hold level, so as to lock the maximum power level of the transmitted packets in every frequency sweep. This exercise was cyclically repeated for 4 different motes of each type. The motes and the SA were tuned to their nominal center frequencies, which are 2.480 GHz for the MicaZ and TelosB (channel 26) and 916.7 MHz for the Mica2 motes.
Fig. 4. Spectrum envelope of the transmitting (a) TelosB and (b) Mica2 (wired) motes
Figures 3-(a) & (b) and Figure 4-(a) show the peak envelope level (for each mote), observed at each frequency slot, during the process of data transmission for multiple Mica2, MicaZ, and TelosB motes, respectively. The Mica2, MicaZ, and TelosB motes show a peak envelope variation between [-30 dBm, -36 dBm], [-43 dBm, -50 dBm], and [-37 dBm, -50 dBm], respectively. In addition, the motes were observed to transmit at their designated center frequencies within their specified bands. These results show that all the motes do transmit in the frequency band allocated for the specified channel, and that the spectra occupied by each mote's transmissions are similar, with no significant frequency offsets. In addition, variations were also observed in the measured received power for the different motes, which varied within 6-7 dB for both the Mica2 and MicaZ motes, and ≈ 13 dB for the TelosB motes. This variation can be attributed to the manufacturing process [3], as there are discrepancies in the power levels of the transmitted packets, which are close to the specified power level but do not
match exactly. Such significant differences in transmit power would impact the forward and reverse link performance of the respective motes. On the basis of these measurements, we conclude that transmitter power is a significant contributor to link asymmetry; however, with respect to our testbed study, frequency misalignment in the transmit mode cannot account for asymmetric links. Thus, our results differ from those of Liu et al. [5], where it was reported (using Mica2 motes) that different motes show large differences in transmission power [0 dBm, -14 dBm], and that their respective spectra are shifted to the extent that they only partially overlap.
3.2 Antenna
The study in this section was conducted to analyze the effect on spectral variations of using a wired medium instead of the wireless antenna, and to compare the results with the wireless transmission experiments. It was performed using Setup-B in an indoor office environment using four different Mica2 motes. The SA settings were configured as described in Section 3.1. Figure 4-(b) shows the result obtained from this wired setup. The Mica2 motes show a peak envelope variation between [-27 dBm, -34 dBm], thereby displaying a variation of ≈ 7 dB. A comparison of this result with Figure 3-(a) shows nearly identical spectrum results for the wired and wireless setups of the same experiment. The only noticeable difference is that the peak power level has changed from -30 dBm (wireless) to -27 dBm (wired). Also, we observe that the power level of the transmitted packets, with the attenuator at -25 dB, is further attenuated by ≈ 2 dB in its propagation over the short wired medium. This indicates that the standard antennas used in the motes do not have a significant effect on the received spectrum envelope, and hence are not a major contributor to link asymmetry.
3.3 Receiver
In this section, we characterize the receiver using a set of MicaZ motes. We found that motes transmitting at 0 dBm were able to successfully send and receive packets, without any antennas or wires, when placed at a distance of about 1 m from each other. This is due to energy leakage through the MMCX connector of the Mica2 and MicaZ, which was circumvented by placing an attenuator between the transmitter and receiver motes, as described in Setup-B. The attenuation values were varied to control the packet reception rate (PRR), corresponding to the condition of minimum to maximum attenuation or received signal strength (RSS) at the receiver. Figure 5 compares the PRR against the RSS for different receiver motes. The results show that the PRR drops from 100% to 0% over a range of about 8 dB in attenuation, or decrease in RSS; in addition, different motes show different response characteristics to the change in received power levels. All motes show a 100% PRR at attenuation values less than 54 dB, while none of them are able to receive any packets with attenuation greater than 62 dB.
Fig. 5. Receiver: PRR vs. RSS
The actual RSS values at the motes can be obtained by considering the mote's transmission power (-25 dBm), the power loss over the wired medium (≈ 2 dB), and the attenuation level set at the attenuator [54 dB, 62 dB]. This implies that the PRR drops from 100% to 0% over the RSS range of [-81 dBm, -89 dBm]. The drop in PRR to 0% at values close to -89 dBm can be explained by the fact that the minimum power required by the mote's hardware to decode packets is ≈ -90 dBm [2]. This experiment shows the effect of differences in receiver sensitivity. Even with the same input signal, different receivers can have different error rates, especially when the signal-to-noise ratio is close to the manufacturer's specified minimum (-90 dBm). The result indicates that the PRR is largely dependent on the received signal power at the receiver, and asymmetric links can result if the RSS falls in the receiver's sensitivity band of ≈ [-80 dBm, -90 dBm]. Low transmission power, or excessive propagation losses over long links, are responsible for an RSS in this sensitivity band. For a fixed-transmission-power WSN, link asymmetry can therefore be reduced by selecting short links instead of long links through proper topology control.
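The link-budget arithmetic above is easy to verify mechanically; the following check simply re-derives the quoted RSS endpoints from the transmit power, wire loss, and attenuator settings stated in the text.

```python
# Re-deriving the quoted RSS endpoints (all values from the text above).
tx_power_dbm, wire_loss_db = -25, 2
for atten_db in (54, 62):
    print(tx_power_dbm - wire_loss_db - atten_db, "dBm")  # -81 and -89
```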
3.4 Transmit-Receive Mode
This section studies the mote’s characteristics when it is operating in both transmit and receive states. This is a very likely configuration in a group of wireless sensor nodes, wherein a couple of intermediate nodes may be concurrently relaying packets, back and forth, between the sink and the source nodes. Another common scenario is that the BS may be receiving packets from different sources, and acknowledging them alongside its routine operations (such as data logging
Fig. 6. Transmit/receive effect on RSS: (a) 29 Bytes, (b) 39 Bytes, (c) 49 Bytes, (d) RSSI trace for link 0→3
or transfer). We analyze this characteristic using MicaZ motes in experimental Setups C & D. Figure 6-(a) shows the difference in RSSI values between the two-way and one-way operational modes at the BS (Mote 0), for links 0→3, 0→5, and 0→7, for a PL of 29 Bytes with different IPIs. The RSSI values of the packets sent as acknowledgments by the BS to the respective motes (from Setup-C) are compared with their corresponding values obtained from Setup-D (two-way vs. one-way). The result shows that the transmit/receive mode results in a higher RSSI for links 0→3 and 0→5, and a lower RSSI for link 0→7. Note that link 0→7 is operating in the receiver's sensitivity band below -80 dBm (Section 3.3), which supports our observation that links operating in this power region can produce unexpected variations in RSS and hence lead to link asymmetry. All the cases show differences in RSS; it is just that the trend is reversed for link 0→7. Figure 6-(d) shows the temporal statistics for the entire observation period for the short link (of 1 m) between the BS and Mote 3. The link 0→3 shows a higher RSSI value under transmit/receive mode, of the order of 2-4 dB.
Table 1. Environmental Conditions

Exp. No. | Time of Day | Environmental Conditions                  | Peak Envelope A/B/C/D (dBm)
1        | 7 a.m.      | 21°C, Wind: NW at 13 km/h, Humidity: 60%  | -56 / -56 / -50 / -53
2        | 12 noon     | 19°C, Wind: S at 27 km/h, Humidity: 56%   | -60 / -60 / -53 / -57
3        | 1 p.m.      | 30°C, Wind: SW at 8 km/h, Humidity: 29%   | -53 / -53 / -56 / -56
4        | 3 p.m.      | 25°C, Wind: NW at 21 km/h, Humidity: 28%  | -53 / -50 / -53 / -55
5        | 5 p.m.      | 26°C, Wind: N at 19 km/h, Humidity: 39%   | -54 / -52 / -56 / -56
6        | 8:30 p.m.   | 21°C, Wind: NW at 14 km/h, Humidity: 38%  | -56 / -53 / -49 / -51
Similar observations were also noted for PLs of 39 Bytes (Figure 6-(b)) and 49 Bytes (Figure 6-(c)). We conclude that there is a variation in power level when a low-powered mote is operating in transmit/receive mode, and that this can be a cause of link asymmetry if the power difference becomes large, especially if the mote is operating in the receiver's sensitivity band close to the manufacturer's specified minimum level.
3.5 Environmental Conditions
This section details the experimental findings on the effect of environmental factors on link performance. The experiment was performed using Setup-A, for a set of 4 different sender motes (MicaZ), at different times of the day over a period of one week. The environmental conditions for each time period are described in Table 1. The peak envelope (dBm) recorded by the 4 different motes (A, B, C, D) is also given in Table 1. We refer our readers to [6] for the spectrum envelope plots. The spectrum results show that the spectrum envelope is not consistent across observations performed at different times of the day, and varies in the range of ≈ 1-10 dB. One would expect this difference in RF power to affect WSN transmissions at different times of day under different environmental conditions. Under normal WSN operation, where both forward and reverse links are simultaneously operational, environmental conditions are expected to be similar, and hence to have a negligible effect on link performance in both directions. However, if there is a substantial delay between these operations, during which the environmental conditions change, then link asymmetry may become apparent.
4 Related Work
The problem of asymmetric links in low-powered sensor devices has been widely discussed in the existing literature. Liu et al. [5] have conducted field trials for the analysis of link asymmetry, and indicated that frequency misalignment between motes is a significant contributor to it, in addition to the motes' transmission
power. Ganesan et al. [8] presented studies of different layers in the protocol stack using Rene motes. Through empirical data collection, they suggested that radio links in low-powered devices exhibit bi-directionality, provided statistics measuring the number of asymmetric links as a function of distance, and attributed this behavior to differences in receiver sensitivity and mote transmission power. This information was further verified by Cerpa et al. [9]. Woo et al. [10], through packet loss models between sending and receiving motes, demonstrated that the PRR is uncorrelated with large distances between node pairs, and hence concluded that differences in the motes' hardware (or radio) are the prime factor responsible for these asymmetries. Zhao et al. [11] studied link asymmetry with respect to different environments, power levels, and coding schemes, and suggested multipath as one of the reasons for this behavior. As all of these previous studies had been conducted on older motes such as the Rene, Mica, and Mica2, Zhou et al. [12] presented their findings on hardware mis-calibration by experimenting with ZigBee-compliant motes as well as the older ones, and proposed a model to capture these characteristics. Cerpa et al. [7] showed that the PRR is inadequate for characterizing long-term links, as it does not account for the underlying loss distribution, and hence utilized the required number of packets (RNP) as a measure to characterize links. Raman et al. [13] studied the impact of external antennas on improving the communication range, with analysis of the temporal variation of RSSI and packet error rates. Srinivasan et al. [14] consider a link to be asymmetric if the difference between the forward and reverse PRR is more than 40%; they suggested that asymmetric links are the outcome of per-node variations in RSSI and noise floor that affect the long-term PRR. Our work, though similar to previous studies, investigates the performance of the physical components of a sensor device (radios, antennas, etc.) and provides new evidence of the factors that are major contributors to link asymmetry.
5 Conclusion and Future Work
This paper characterizes link asymmetry behavior in low-power wireless systems, and may provide practical benefits to other researchers who are working on similar problems and projects. The experimental study presented in this paper has highlighted the following characteristics. Frequency misalignment is unlikely to be the cause of link asymmetry in the tested WSN platforms, as there is no major variation in the center frequency across different motes. The power variations due to antenna characteristics are also likely to have negligible effects. The major factors found to contribute to link asymmetry were transmitter power and receiver sensitivity. As future work, we plan to characterize links and their performance with respect to different environmental conditions, as well as perform long-term temporal measurements with different packet lengths and inter-packet intervals. Many of these preliminary results are available in [6]. Additionally, we plan to perform all of these experiments in an anechoic chamber and benchmark our observations. We believe that an exhaustive study of the asymmetry of
low-powered wireless links and their characteristics would help in designing more efficient protocols in the future.
References
1. http://focus.ti.com/lit/ds/symlink/cc1000.pdf
2. http://focus.ti.com/lit/ds/symlink/cc2420.pdf
3. http://memsic.com/products/wireless-sensor-networks/wireless-modules.html
4. http://www.us.anritsu.com/products/MS2721A Spectrum-Master ARSPG ARQQSidZ654.aspx
5. Liu, P.R., Rosberg, Z., Collings, I., Wilson, C., Dong, Y.A., Jha, S.: Overcoming Radio Link Asymmetry in Wireless Sensor Networks. In: IEEE PIMRC (2008)
6. Misra, P., Ahmed, N., Jha, S.: Characterization of Asymmetry in Low-Power Wireless Links: An Empirical Study. Technical Report UNSW-CSE-TR-1016, UNSW (2010)
7. Cerpa, A., Wong, J.L., Potkonjak, M., Estrin, D.: Temporal properties of low power wireless links: modeling and implications on multi-hop routing. In: ACM MobiHoc, pp. 414–425 (2005)
8. Ganesan, D., Krishnamachari, B., Woo, A., Culler, D., Estrin, D., Wicker, S.: Complex behavior at scale: An experimental study of low-power wireless sensor networks. Technical Report CS TR 02-0013, UCLA (2002)
9. Cerpa, A., Busek, N., Estrin, D.: SCALE: A tool for simple connectivity assessment in lossy environments. Technical Report 0021, UCLA
10. Woo, A., Tong, T., Culler, D.: Taming the underlying challenges of reliable multihop routing in sensor networks. In: ACM SenSys, pp. 14–27 (2003)
11. Zhao, J., Govindan, R.: Understanding packet delivery performance in dense wireless sensor networks. In: ACM SenSys, pp. 1–13 (2003)
12. Zhou, G., He, T., Krishnamurthy, S., Stankovic, J.: Models and solutions for radio irregularity in wireless sensor networks. ACM TOSN 2(2), 221–262 (2006)
13. Raman, B., Chebrolu, K., Madabhushi, N., Gokhale, Y.D., Valiveti, K.P., Jain, D.: Implications of Link Range and (In)Stability on Sensor Network Architecture. In: WiNTECH, a MobiCom Workshop (2006)
14. Srinivasan, K., Dutta, P., Tavakoli, A., Levis, P.: An empirical study of low-power wireless. ACM TOSN 6(2), 1–49 (2010)
Model Based Bandwidth Scavenging for Device Coexistence in Wireless LANs
Anthony Plummer Jr., Mahmoud Taghizadeh, and Subir Biswas
Michigan State University, USA
{plumme23,taghizad,sbiswas}@msu.edu
Abstract. Dynamic Spectrum Access in a Wireless LAN can enable a set of secondary users’ devices to access unused spectrum, or whitespace, which is found between the transmissions of a set of primary users’ devices. The primary design objectives for an efficient secondary user access strategy are to be able to “scavenge” spatio-temporally fragmented bandwidth while limiting the amount of interference caused to the primary users. In this paper, we propose a secondary user access strategy which is based on measurement and modeling of the whitespace as perceived by the secondary users in a WLAN. A secondary user continually monitors and models its surrounding whitespace, and then attempts to access the available spectrum so that the effective secondary throughput is maximized while the resulting interference to the primary users is limited to a pre-defined bound. We first develop analytical expressions for the secondary throughput and primary interference, and then perform ns2 based simulation experiments to validate the effectiveness of the proposed access strategy, and evaluate its performance numerically using the developed expressions.
1 Introduction
1.1 Dynamic Spectrum Access in WLANs
Recent research on Dynamic Spectrum Access (DSA) has paved the way for a set of Secondary Users (SU) to access underutilized spectrum between Primary Users' (PU) transmissions in space, time, and frequency. It has been shown [4] that such SU access can be feasible through studies of the primary users' spectrum usage and the protocols that govern them. While this is particularly true for the licensed bands, it also applies to the unlicensed bands due to the recent proliferation and subsequent overcrowding of wireless consumer technologies such as Bluetooth, WLAN, cordless phones, and related applications. Of particular interest in DSA is the issue of prioritized coexistence of various devices utilizing unlicensed bands [6, 5, 1]. An example of such coexistence is the sharing of the 2.4 GHz Industrial, Scientific and Medical (ISM) band among Bluetooth, WLAN, and cordless devices [6]. Coexistence between these systems can be achieved through spatial or frequency separation, as reported in [2] and [10].
This work was partially supported by grant CMMI 0800103 from the National Science Foundation.
In this paper, we propose a secondary user access strategy that utilizes temporal separation between primary and secondary devices accessing the same RF spectral segments within a WLAN. A representative application scenario would be ISM band coexistence among different device types, including Voice-over-IP handsets (e.g., Skype phones), cordless phones, and data terminals such as laptops and data-enabled 3G/4G phones. A suitable primary-secondary relationship among those co-existing devices is imposed based on their respective traffic priorities.
1.2 Bandwidth Scavenging Approach
With non-deterministic primary traffic patterns, an SU is not able to deterministically predict when whitespaces will occur and when a detected whitespace will end. As a result, the SU access strategy cannot be deterministic. A reasonable solution is for the SUs to access a given whitespace based on a statistical profile of the previously observed whitespaces. Once an SU identifies a whitespace, it sends a packet in that whitespace only if the estimated chance of completing the transmission before the whitespace ends is high. We term this opportunistic access during these ultra-short and non-deterministic whitespaces bandwidth scavenging by the secondary users. In other words, the SUs scavenge capacity left over by the PUs. From an application standpoint, this would correspond to data devices within a WLAN scavenging bandwidth that is left over by the VoIP handsets, which are defined as the PUs. In this paper, we focus on dynamic spectrum access by proposing a secondary user channel access strategy for efficient bandwidth scavenging.
2 Related Work
Prioritized device coexistence with CSMA-based WLANs can be achieved [11, 12] by using different inter-frame spacing (IFS) periods for different user and/or traffic classes. The MAC protocol 802.11e [11], for instance, uses different Arbitration IFS (AIFS) periods to provide CSMA-based prioritized access among different device and/or traffic classes. When a channel is found free, a node waits for a specific AIFS period, depending on its device or traffic class, before it attempts to send a packet. For higher-priority primary traffic, a node waits for a smaller AIFS period. This ensures that when multiple nodes contend for the channel, the primary users' nodes (with the smallest AIFS) win. While providing reasonable access differentiation, these approaches rely only on the instantaneous channel status (i.e., free or busy) for granting access. This leads to undesirable disruptions of the PU traffic, as follows. Consider a situation in which an SU intends to transmit a packet and does so after finding the channel free (i.e., a whitespace) for the AIFS specified for the SUs. Now, if in the middle of this SU's packet transmission a PU in the vicinity intends to send a packet, it needs to wait until the current SU transmission is over. This causes an undesirable delay for the PU traffic, which in turn will affect the PUs' application performance. Since this is mainly a result of the SUs' reliance only on the instantaneous channel
354
A. Plummer Jr., M. Taghizadeh, and S. Biswas
state, a more robust approach for the SUs would be to also consider the long-term whitespace model. Networks with primary users running the 802.11 MAC protocol have recently been investigated for possible dynamic spectrum access by secondary users. The authors in [3, 4] develop a methodology for formally analyzing the whitespace available within 802.11 primary traffic in infrastructure mode. The key idea is to model the whitespace as a semi-Markov process that relies on the underlying 802.11 state model involving DIFS, SIFS, DATA, and ACK transactions. The model describes the whitespace profile in terms of the holding times of the idle and busy states of the channel. Building on this whitespace model, the authors further develop [5] a WLAN dynamic spectrum access strategy for secondary users in which the SUs utilize packet-sized slots for channel access. At the beginning of each slot, an SU senses the channel, and if the channel is free, it transmits with a specified probability calculated from previous measurements. The objective is to minimize PU interference and maximize SU throughput. The use of time-slotting in their approach requires time synchronization across the secondary users. In the access mechanism proposed in this paper, the need for such inter-SU time synchronization is avoided via asynchronous whitespace access based on a stochastic whitespace modeling approach.
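For readers unfamiliar with AIFS-based differentiation, the toy sketch below illustrates the contention rule discussed above. The AIFS values are arbitrary placeholders chosen for illustration, not constants from the 802.11e standard.

```python
# Toy sketch of AIFS-based priority: after the channel goes idle, the
# class with the smallest AIFS transmits first.
AIFS_SLOTS = {"primary": 2, "secondary": 6}  # illustrative values only

def contention_winner(waiting_classes):
    """Among classes waiting at the same idle instant, smallest AIFS wins."""
    return min(waiting_classes, key=AIFS_SLOTS.__getitem__)

print(contention_winner(["secondary", "primary"]))  # -> primary
```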
3 Whitespace Characterization
3.1 Whitespace Measurement and Model
The available whitespace can be measured by a secondary user by detecting the received signal strength (RSSI) at a given channel frequency [3]. We denote the sensing period as Tp. Based on these sensed samples, the whitespace is modeled using its probability density function, w(n), which represents the probability that the duration of an arbitrarily chosen whitespace is nTp. The quantity w(n) is periodically recomputed in order to capture any dynamic changes in the primary user behavior.
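As an illustration of this measurement step, the sketch below (ours, not the authors' code) estimates w(n) from a stream of busy/idle samples taken once every Tp.

```python
# Illustrative sketch: estimating the whitespace pdf w(n) from a stream
# of busy/idle channel samples taken every Tp.
from collections import Counter

def whitespace_pdf(samples):
    """samples: iterable of booleans, True = channel busy at that Tp tick.
    Returns w(n): probability that a whitespace lasts exactly n ticks."""
    runs = Counter()
    run = 0
    for busy in samples:
        if busy:
            if run > 0:
                runs[run] += 1  # an idle run (whitespace) just ended
            run = 0
        else:
            run += 1
    total = sum(runs.values())
    return {n: c / total for n, c in runs.items()} if total else {}

# Example: busy pattern containing whitespaces of 2 and 3 ticks
print(whitespace_pdf([True, False, False, True, False, False, False, True]))
# -> {2: 0.5, 3: 0.5}
```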
3.2 Primary User Topology and Traffic Profile
In order to represent varying WLAN topologies and traffic characteristics, three scenarios are established (see Fig. 1): (TOP1) a linear chain topology with a single multi-hop primary flow, (TOP2) two multi-hop flows in a network on two intersecting linear chains, and (TOP3) two multi-hop flows in a parallel chain network. In each case, the secondary users are within the interference range of all primary users. We model the primary user traffic as bidirectional UDP with a fixed packet size, TCP, and a Video Stream. For the UDP traffic, the packet arrival intervals are modeled as 1) uniformly distributed with a mean of α ms and variability ±v around the mean, and 2) a Poisson process with a mean packet arrival interval of α ms. The packet arrival intervals for each Uniform and Poisson flow are varied from low to high values to represent varying intensities of primary traffic.
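The two UDP arrival models can be reproduced with standard random-number generators; the following sketch is an illustrative assumption of how such inter-arrival times might be drawn, not the authors' ns2 configuration.

```python
# Illustrative sketch (ours, not from the paper): generating PU
# inter-packet arrival times for the two UDP models described above.
import random

def uniform_interarrival(alpha_ms: float, v: float) -> float:
    """Uniform in [alpha*(1-v), alpha*(1+v)], e.g. v=0.2 for +/-20%."""
    return random.uniform(alpha_ms * (1 - v), alpha_ms * (1 + v))

def poisson_interarrival(alpha_ms: float) -> float:
    """Exponential with mean alpha_ms, i.e. a Poisson arrival process."""
    return random.expovariate(1.0 / alpha_ms)

# e.g. the 90 ms +/-20% Uniform flow used for Fig. 2(a)
gaps = [uniform_interarrival(90, 0.2) for _ in range(5)]
```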
[Fig. 1 diagram: TOP (1) multi-hop linear chain PU1–PU5 with SU1 and SU2; TOP (2) two intersecting chains PU1–PU9; TOP (3) two parallel chains PU1–PU10; arrows indicate the traffic flow directions]
Fig. 1. Primary user topologies and traffic flows
3.3 Whitespace Analysis
The whitespace pdfs w(n) for varying topology, traffic rate, and traffic distribution are shown in Fig. 2. For this figure, the channel sensing period Tp is kept smaller than the shortest possible whitespace, which is the SIFS period (i.e., 10 μs) in 802.11 WLANs; we set Tp to 5 μs. Note that the legends in the figures use the naming convention [topology, traffic profile, packet arrival interval]. Fig. 2(a) shows the w(n) for UDP traffic corresponding to a 90 ms average inter-packet duration, uniformly distributed ±20% around the average, for each topology. Fig. 2(b) shows the whitespace for primary traffic with exponentially distributed inter-packet arrivals with a 90 ms average. Figs. 2(d) and (e) present the w(n) for TCP and Video Streams. The distributions presented in Fig. 2 represent the whitespace properties for a wide range of topology and traffic variations that were experimented with using ns2. Two key observations can be made from the whitespace statistics in Fig. 2. First, the detected whitespace durations from the measured RSSI trace can vary a great deal across different topologies and traffic patterns. For example, in the multi-hop (TOP1) case, although all the traffic is generated at about 90 ms intervals, the whitespace durations are distributed over the range from 10 μs to 300 ms. This variation results from the dynamic nature of the bidirectional multihop primary traffic. In contrast, as shown in Fig. 2d, for TCP traffic over all the topologies, the whitespace durations are very small, indicating little opportunity for secondary user access to the spectrum. Second, as shown in Figs. 2(c) and (f), regardless of the traffic profile or topology, the vast majority of the whitespaces (above 90%) last for less than 1 ms. This results from the large number of small whitespaces created during the RTS-CTS-DATA-ACK cycle for each packet transmission. Additionally, small whitespaces are generated in between transmissions during multi-hop forwarding. Given that the small whitespaces are of very small durations and frequently
Fig. 2. Impacts of PU topology with Uniform, Poisson, TCP and Video Stream traffic on whitespace distributions. The graphs (a), (b), (d), and (e) show the pdf, along with (c) and (f) presenting the corresponding cumulative distribution function (cdf) from 0 to 1 ms
occurring, there is a high probability of these small whitespaces being accessed by the SUs. Note that this whitespace property is a feature of 802.11 WLAN traffic, and can be heavily exploited by the SUs when accessing this type of network.
4 Model Based Access Strategy
It was observed that, due to the inter-frame spacing at the MAC layer, over 90% of the detected whitespaces are too small to fit the secondary packets. This general observation suggests that an efficient SU access strategy should avoid initiating transmissions during those small whitespaces. Without knowledge of the actual duration of a whitespace at its beginning, an SU needs to rely on statistical models in order to maximize the Effective Secondary user Throughput (EST) while keeping the Primary User Interference (PUI) within tolerable bounds. The proposed Model Based Access Strategy (MBAS) attempts to accomplish these goals.
4.1 MBAS Scavenge Algorithm
Without knowledge of the actual duration of a whitespace, if an SU transmits at the beginning of the whitespace, then a vast majority of transmissions will end up interfering with the primary users. However, deferring the transmission attempts for a suitably chosen wait-threshold duration μ can potentially reduce such PUI without sacrificing the EST. In line with this logic, we propose a secondary access algorithm that utilizes a minimum wait-threshold duration μ dimensioned based on the statistical properties of the primary users within a WLAN. MBAS comprises three main steps. First, when an SU intends to make a packet transmission, it waits until the next whitespace
Fig. 3. Functional diagram of the MBAS access strategy: channel measurement and modeling produces the whitespace pdf w(n) from sensing at period Tp; access-strategy parameter extraction combines w(n) with the inputs S, I, and γ to compute μ and Jmax, which drive the SU access module (Algorithm 1) that senses and transmits data packets on the wireless channel shared with the primary user network PU1 … PUn
is detected. Second, the SU waits for a minimum wait-threshold duration μ. Third, at any given point within a detected whitespace, the SU transmits a packet only if the estimated probability of interference, computed from the recently measured whitespace pdf w(n) and the time elapsed within the current whitespace, is lower than a pre-defined threshold.
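A compact way to summarize these three steps is the following hedged sketch. The interference estimator is our own construction over the pdf w(n), since the paper specifies the decision structure rather than an implementation; all durations are in Tp ticks and the example values are toys.

```python
# Hedged sketch of the three-step MBAS access logic over one whitespace.

def interference_prob(w, elapsed, S):
    """P(whitespace ends during a packet of duration S started now),
    conditioned on the whitespace having already lasted `elapsed` ticks."""
    tail = sum(p for n, p in w.items() if n > elapsed)
    if tail == 0:
        return 0.0
    hit = sum(p for n, p in w.items() if elapsed < n <= elapsed + S)
    return hit / tail

def mbas_schedule(w, mu, j_max, S, gamma, bound):
    """Return transmit start times (in Tp ticks) within one whitespace."""
    starts, t = [], mu                 # steps 1+2: whitespace found, wait mu
    while len(starts) < j_max:         # vacate after Jmax packets
        if interference_prob(w, t, S) > bound:   # step 3: risk check
            break
        starts.append(t)
        t += S + gamma                 # packet duration + inter-packet spacing
    return starts

example_w = {1: 0.5, 2: 0.2, 40: 0.3}   # toy pdf: many tiny whitespaces
print(mbas_schedule(example_w, mu=3, j_max=4, S=5, gamma=1, bound=0.1))
# -> [3, 9, 15, 21]
```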
4.2 Functional Overview of MBAS
Fig. 3 shows all the functional components of MBAS. The user-tunable control inputs into the system are the secondary users' packet size (S), the primary interference bound (I), the inter-packet spacing (γ) for secondary user traffic, and the sensing period (Tp), which is a property of the secondary user hardware. An SU continuously builds the whitespace pdf w(n) through channel sensing at a rate of once every Tp duration. The MBAS parameters μ and Jmax are computed using the whitespace pdf w(n) and the other system parameters, including S, I, and Tp. These two parameters are then fed into the access module, which, in addition to instantaneous channel sensing, transmits packets based on the MBAS algorithm.
4.3 MBAS Algorithmic Analysis
Primary User Interference (PUI). Three access scenarios can occur in MBAS. In the first case, if a whitespace duration is less than or equal to μ, the whitespace is not accessed, and therefore there is no interference to the PUs. The second scenario occurs when a whitespace duration lies in the range between μ and the end of the last SU packet sent in the current whitespace. Let us define the term Jmax, which denotes the maximum number of packets that the secondary user can send before the expected interference reaches a pre-defined bound I. In this case, there may be PUI, since a transmitted packet can overlap the end of the whitespace. Detection of a collision with a PU can be realized by an SU by
sensing the channel immediately after its packet transmission is completed. If the SUs' packet size is smaller than that of the PUs', a collision can be inferred if the PU's signal is found on the channel after the SU's transmission is completed [8]. The third scenario corresponds to the case when a whitespace lasts past the transmission of the Jmax-th SU packet. No PUI is caused in this case because the SU will vacate, or exit, the whitespace after Jmax packets are transmitted. Thus, for a given whitespace profile w(n), the PUI can be defined as the probability that the whitespace duration lies in the range between μ and μ + Jmax S + (Jmax − 1)γ, where S is the SUs' packet size and γ is the spacing between the SUs' packets. Therefore, an SU accessing a whitespace characterized by pdf w(n) will create primary user interference (PUI):

\[ PUI = \sum_{n=\mu}^{\mu + J_{max}S + (J_{max}-1)\gamma} w(n). \quad (1) \]
Effective Secondary Throughput (EST). The Effective Secondary user Throughput (EST) represents the efficiency of channel usage by the secondary users. It is defined as the average number of SU packets successfully transmitted per whitespace, normalized by the number of packets that could have been transmitted per whitespace for a given whitespace profile w(n). Let Sj be the probability of sending exactly j packets in a whitespace. This corresponds to the event in which the j-th packet from an SU in a whitespace interferes with the PUs, meaning that the whitespace ends during the transmission of the j-th packet. The quantity Sj can be written as:

\[ S_j = \sum_{n=\mu+(j-1)(S+\gamma)}^{\mu + jS + (j-1)\gamma} w(n). \quad (2) \]
For a whitespace that lasts between μ and μ + (Jmax − 1)(S + γ), the number of packets sent by an SU is j. For a whitespace duration between μ + Jmax S + (Jmax − 1)γ and ∞, the number of packets sent by an SU is Jmax, since the secondary user vacates the whitespace after sending Jmax packets. Therefore, the expected number of secondary packets sent for a given whitespace can be expressed as:

\[ Throughput = \sum_{j=1}^{J_{max}-1} j\,S_j \;+\; J_{max} \sum_{j=J_{max}}^{\infty} S_j. \quad (3) \]
The throughput equation is valid only when the secondary user can capture all of the available whitespace, using perfect sensing (Tp set to zero). With non-zero Tp, however, whitespaces smaller than Tp may not be detected and accessed by the SU. The probability of missing a whitespace of length w is equal to (Tp − w)/Tp. We account for this loss of throughput using the following equation:

\[ Loss = \sum_{j=1}^{T_p/S} j\,S_j\,\frac{T_p/S - j}{T_p/S}. \quad (4) \]
From the SUs' standpoint, the effective capacity available can be expressed as the average number of packets that can be sent per whitespace. This quantity can be written as:

\[ Cap = \sum_{j=1}^{\infty} j\,S_j. \quad (5) \]
Therefore, the Effective Secondary user Throughput (EST) can be expressed as:

\[ EST = \frac{Throughput - Loss}{Cap}, \quad (6) \]

which is computed from Eqns. 3, 4, and 5.
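For concreteness, Eqns. 1–6 can be evaluated numerically for any measured w(n). The sketch below does so under our own toy parameter choices (durations expressed in Tp ticks) and should be read as an illustration, not the authors' evaluation code.

```python
# Hedged numerical sketch of Eqns. 1-6 for a measured whitespace pdf w
# (a dict mapping duration n, in Tp ticks, to probability).

def S_j(w, j, mu, S, gamma):
    lo = mu + (j - 1) * (S + gamma)       # Eqn. 2 lower summation limit
    hi = mu + j * S + (j - 1) * gamma     # Eqn. 2 upper summation limit
    return sum(p for n, p in w.items() if lo <= n <= hi)

def pui_and_est(w, mu, j_max, S, gamma, Tp_ticks=1):
    js = range(1, max(w) + 2)
    Sjs = {j: S_j(w, j, mu, S, gamma) for j in js}
    pui = sum(p for n, p in w.items()
              if mu <= n <= mu + j_max * S + (j_max - 1) * gamma)  # Eqn. 1
    thr = (sum(j * Sjs[j] for j in js if j < j_max)
           + j_max * sum(Sjs[j] for j in js if j >= j_max))        # Eqn. 3
    m = Tp_ticks / S
    loss = sum(j * Sjs[j] * (m - j) / m for j in js if j <= m)     # Eqn. 4
    cap = sum(j * Sjs[j] for j in js)                              # Eqn. 5
    return pui, (thr - loss) / cap if cap else 0.0                 # Eqn. 6

w = {1: 0.6, 20: 0.4}   # toy pdf: 60% one-tick whitespaces
print(pui_and_est(w, mu=2, j_max=3, S=4, gamma=1))  # -> (0.0, 0.75)
```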
Dimensioning Wait-threshold μ. The selection of the wait-threshold μ is critical, since it affects both the PUI and the EST, as shown in Eqns. 1 and 6. A large μ can reduce primary user interference by preventing the SUs from transmitting during very short whitespaces. However, a large μ can also bring down the EST, since some portion of the large whitespaces will be lost to this conservative wait period. The goal is to optimally pre-dimension the parameter μ, based on the measured whitespace pdf w(n), such that a desired balance between PUI and EST is obtained. Since at least one channel sensing is needed to determine the channel state, the lowest value of μ is a single Tp period. According to the MBAS logic, the wait-threshold period determines the PUI caused by the first SU packet during a whitespace. For a wait-threshold tx, this PUI can be written as:

\[ f(t_x) = \sum_{n=t_x}^{t_x + S} w(n). \quad (7) \]
Now, the minimum and maximum values of tx are zero and 2S, where S is the secondary packet duration. The maximum wait-threshold is 2S, since waiting for a duration greater than or equal to 2S would mean missing enough whitespace for the secondary user to have sent at least one packet. Therefore, for a given whitespace pdf w(n), the optimal wait-threshold μ is chosen as the smallest tx, over the range 0 to 2S, for which the quantity f(tx) in Eqn. 7 is minimized. Computing Jmax. According to the MBAS logic, the number of packets an SU is allowed to transmit in a given whitespace, Jmax, can be computed by finding the maximum j that satisfies the inequality:

\[ 1 \;-\; \sum_{n=\mu+(j-1)(S+\gamma)}^{\infty} w(n) \;\le\; I. \quad (8) \]
The left side of the inequality represents the PUI when the SU sends Jmax packets in a given whitespace. Therefore, the inequality in Eqn. 8 gives the Jmax that limits the interference to the pre-defined bound I. By plugging this value of Jmax into Eqn. 1, we can find the overall PUI in MBAS.
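Continuing the numerical sketch above, μ and Jmax can be dimensioned directly from Eqns. 7 and 8 by simple search; again, the parameter values below are illustrative assumptions.

```python
# Hedged sketch: dimensioning mu (Eqn. 7) and Jmax (Eqn. 8) from a
# whitespace pdf w (durations in Tp ticks).

def f(w, tx, S):
    """Eqn. 7: interference risk of the first packet after waiting tx."""
    return sum(p for n, p in w.items() if tx <= n <= tx + S)

def dimension_mu(w, S, Tp_ticks=1):
    """Smallest tx in [Tp, 2S] minimizing f(tx)."""
    return min(range(Tp_ticks, 2 * S + 1), key=lambda tx: f(w, tx, S))

def dimension_jmax(w, mu, S, gamma, I, j_cap=10_000):
    """Largest j for which 1 - sum_{n >= mu+(j-1)(S+gamma)} w(n) <= I."""
    j_max = 0
    for j in range(1, j_cap + 1):
        tail = sum(p for n, p in w.items() if n >= mu + (j - 1) * (S + gamma))
        if 1 - tail > I:   # once violated, it stays violated as j grows
            break
        j_max = j
    return j_max

w = {1: 0.04, 30: 0.96}            # toy pdf: mostly long whitespaces
mu = dimension_mu(w, S=4)          # -> 2, skipping the mass at n=1
print(mu, dimension_jmax(w, mu, S=4, gamma=1, I=0.05))  # -> 2 6
```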
5 Access Strategy Evaluation
Performance of MBAS has been evaluated experimentally (via simulation) and theoretically in terms of its PUI and EST. The network simulator ns2 was used to create the network topologies shown in Fig. 1. Results in Sections 5.B, C, and E correspond to a sensing period Tp of 5 μs, which is smaller than the smallest possible 802.11 whitespaces (i.e., SIFS periods). The results in Section 5.3 depict the impacts of choosing different Tp values.
5.1 Implemented Strategies
We compare MBAS with a sense-and-transmit based protocol, called VX Scheme [7], and a benchmark scheme (Benchmark). The VX (Virtual-transmit-if-Busy) scheme utilizes a strategy that is similar to CSMA, in which the SU senses the channel and transmits when idle [7]. The key idea in this access strategy is to wait for a simulated PU transmission (Virtual Xmit) when the channel
Fig. 4. Performance for Uniform traffic
becomes busy, and then transmit when the channel becomes free. This allows the SU to avoid transmissions when the channel is busy. After a Virtual Xmit or an actual SU transmission, the SU waits for a duration Vs before the next transmission in order to allow a PU to capture the channel. The parameters of the protocol are found using the average busy and idle times of the channel. An algorithmic analysis of this scheme has been presented in [7]. The Benchmark protocol assumes full knowledge of the starting and ending times of every whitespace. Therefore, it transmits SU packets in every available whitespace and stops when the next SU transmission would cause interference. The Benchmark protocol can be run only offline to find the best SU performance. This protocol essentially determines the highest throughput an SU can expect without causing interference to the PUs.
5.2 Comparison with VX Scheme and Benchmark
Fig. 4 demonstrates the impacts of varying PU load on the throughput and interference for TOP1-3 in Fig. 1. In order to represent different loading conditions, the PU inter-packet arrival is varied from 50 ms to 120 ms. For all the primary packet rates, a ±20% uniformly distributed inter-packet arrival variation was introduced. The secondary packet duration was set to 1 ms, which represents half of the PU packet duration. The user-defined interference bound I was set to 0.05 [5], which indicates that only 5% of the PU traffic was allowed to experience interference from the SU traffic. For the VX scheme, the vacation time Vs was set to 0.5 ms and the Virtual Xmit duration was set to 16 ms, which was experimentally shown to deliver the best performance for that access protocol. Observe that, in terms of PUI, MBAS is able to maintain the required interference bound I across all the different topologies and data rates that were experimented with. This is because the μ and Jmax values are set according to the observed whitespace statistics. Conversely, the VX Scheme causes approximately 40% interference to the PU traffic. Although the Virtual Xmit may avoid a PU transmission if an SU finds the channel busy, once the SU starts accessing a whitespace, it will only stop sending if a PU returns during a Vs wait time, which still has a probability of causing interference.
[Fig. 5 plots: PUI and EST vs. sensing period Tp (ms), for TOP (1) with Poisson traffic at 60 ms and TOP (3) with Uniform traffic at 70 ms; experimental (EXP) and theoretical (THY) curves]
Fig. 5. Impacts of channel sensing period Tp
It is evident that as the PU data rate increases, the amount of available usable whitespace in the channel for the SU decreases, causing the EST to decrease. For all the experimented topologies, the throughput of MBAS remains close to that of the Benchmark protocol. Using full prior knowledge of the whitespace, the Benchmark protocol is able to evaluate the maximum possible EST with zero PUI. This shows that MBAS is able to maximize throughput while maintaining PUI bounds by decreasing Jmax as necessary as the PU traffic load increases. The VX Scheme has a relatively lower throughput since it needs to wait for a duration Vs after every transmission. In our experiments with Poisson traffic, the trends are similar to those for the Uniform case (i.e., Fig. 4), although the throughput is lower since MBAS needs to be more conservative with Poisson traffic in order to maintain the desired interference bound. It is worth noting that the PUI is quite low and, more importantly, relatively insensitive to the primary packet rates. This indicates that, with varying primary packet rates, by appropriately computing μ and Jmax, the secondary user is able to vacate the whitespace judiciously, thus keeping the PUI insensitive to the primary packet rates, which is indicative of the flexibility of the proposed MBAS access strategy. In experiments with Video Stream primary traffic, very similar results were observed. With TCP PU traffic, however, it was found that since almost all whitespaces were below 1 ms, it was difficult for an SU to send packets within the whitespaces without causing a large amount of interference to the PUs.
Impacts of the Channel Sensing Granularity
The channel sensing period Tp can modify the whitespace pdf w(n), which the MBAS solely depends on. In this subsection we investigate the performance of MBAS over a practical range [9] of sensing period from 5 μs to 1 ms. Fig. 5 shows the impacts of varying Tp for TOP1 and 3 with uniform and Poisson distributed PU traffic at 60 ms and 70 ms inter-packet arrival times. As Tp increases, the EST and PUI remain relatively constant since with Tp within the range of 5 μs to 1 ms, an SU can detect most of the whitespaces that are smaller than the SU packet size, which was set to 0.6 ms. With MBAS, an SU does not access a whitespace until the whitespace lasts for at least a Tp period. This ensures that whitespaces that cannot be detected are not accessed. As a
0.05 0.04 0.03 0 20 40 60 80 100120140 Time (s)
400 350 300
MBAS, TOP(2) MBAS, TOP(3)
PU Traffic Change 1
0.06
SU Throughput (pkt/s)
0.07
450 PU Traffic Change 2
0.08
PU Traffic Change 1
0.09
PU Traffic Change 2
A. Plummer Jr., M. Taghizadeh, and S. Biswas
Primary User Interference (PUI)
362
250 200 150 100 0 20 40 60 80 100120140 Time (s)
Fig. 6. Performance of MBAS with time-varying w(n)
result of the requirement of waiting at least a Tp period, the EST and PUI decrease slightly, because the SU needs to wait longer in each whitespace before sending its first packet. Note that for smaller values of Tp, the Loss parameter (Eqn. 4) is zero, since most of the whitespace can be detected. This explains why the EST values remain similar. Fig. 5 presents both experimentally measured (via ns2 simulation) and theoretical values (using Eqns. 1 through 8), which are marked as EXP and THY, respectively. It is evident that the experimental and theoretical results match well. These results indicate that the proposed MBAS access strategy can work under varying channel sensing rates that are well within the practical range [9] for currently available secondary user hardware.
5.4 Performance Under Dynamic PU Traffic
MBAS relies on continuous whitespace measurement and characterization in order to adapt to any time-varying w(n) resulting from primary user network dynamics. Performance under such a time-varying whitespace profile is shown in Fig. 6, for which w(n) was updated every 1000 whitespace sample collections. The interference bound I was set to 0.05 [5], and the PU traffic was changed for both TOP2 and TOP3 in order to create the change in w(n). For TOP2, the PU traffic was initially generated with a 60 ms inter-packet time (IPT) and a Uniform profile. The pattern was changed after 60 seconds, when the IPT was increased to 120 ms. Finally, at 100 seconds, the traffic was changed to a Poisson distribution with a 90 ms IPT. The PU traffic pattern for TOP3 started at an 80 ms IPT with a Poisson profile; after 60 seconds the IPT was changed to 50 ms, and finally at 100 seconds it was changed to a Uniform distribution with a 90 ms IPT. Observe that while the PU traffic pattern changes over time, the PUI resulting from the applied MBAS strategy stays consistently within the vicinity of the pre-set interference bound of 0.05. Also, observe that there is very little PUI during the transitions of traffic patterns, as indicated in both frames of Fig. 6. This is mainly because the whitespace pdf for ad-hoc mode 802.11 traffic has very similar properties across different traffic patterns, i.e., above 90% of the whitespace is below 1 ms (see Fig. 2). The SU throughput increases during
a traffic pattern transition if the usable whitespace increases due to the increase in the IPT of the PU traffic. Additionally, the SU dynamically throttles back its transmissions to protect the PUs when a traffic change reduces the available amount of whitespace. These results show the robustness of MBAS against varying topologies and data rates in the primary network over time.
6 Conclusions
In this paper, we have proposed a novel secondary user access strategy based on measurement and modeling of the whitespace as perceived by the secondary network users. Once the surrounding whitespace of a secondary user is modeled by that user, it simply executes the proposed access strategy without having to rely on primary cooperation. Through ns2 simulation experiments and the developed analytical expressions, it has been demonstrated that, with ad-hoc 802.11 traffic, the access strategy is able to provide a reasonable amount of secondary throughput while bounding the primary interference to pre-set values.
References
1. Chiasserini, C., Rao, R.: Coexistence Mechanisms for Interference Mitigation between IEEE 802.11 WLANs and Bluetooth. In: INFOCOM (2002)
2. Chiasserini, C., Rao, R.: Coexistence Mechanisms for Interference Mitigation in the 2.4-GHz ISM Band. IEEE Transactions on Wireless Communications 2(5) (2003)
3. Geirhofer, S., Tong, L., Sadler, B.M.: Dynamic Spectrum Access in WLAN Channels: Empirical Model and Its Stochastic Analysis. In: TAPAS (2006)
4. Geirhofer, S., Tong, L., Sadler, B.M.: A Measurement-Based Model for Dynamic Spectrum Access in WLAN Channels. In: MILCOM (2006)
5. Geirhofer, S., Tong, L., Sadler, B.M.: Cognitive Medium Access: Constraining Interference Based on Experimental Models. IEEE Journal on Selected Areas in Communications 26(1) (January 2008)
6. Golmie, N., Chevrollier, N., Rebala, O.: Bluetooth and WLAN Coexistence: Challenges and Solutions. IEEE Trans. Wireless Communications 10(6) (December 2003)
7. Huang, S., Liu, X., Ding, Z.: Opportunistic Spectrum Access in Cognitive Radio Networks. In: IEEE INFOCOM (2008)
8. Huang, S., Liu, X., Ding, Z.: Optimal Transmission Strategies for Dynamic Spectrum Access in Cognitive Radio Networks. IEEE Transactions on Mobile Computing 8, 1636–1648 (2009)
9. Intersil: Direct Sequence Spread Spectrum Baseband Processor, http://www.datasheetarchive.com/datasheet-pdf/012/DSA00211974.html
10. Jingli, L., Xiangqian, L., Swami, A.: Collision Analysis for Coexistence of Multiple Bluetooth Piconets and WLAN with Dual Channel Transmission. IEEE Transactions on Communications 57(4) (2009)
11. Mangold, S., Sunghyun, C., Hiertz, G.R., Klein, O., Walke, B.: Analysis of IEEE 802.11e for QoS Support in Wireless LANs. IEEE Wireless Communications 10, 40–50 (2003)
12. Mishra, M., Sahoo, A.: A Contention Window Based Differentiation Mechanism for Providing QoS in Wireless LANs. In: 9th International Conference on Information Technology (ICIT), pp. 72–76 (2006)
Minimal Time Broadcasting in Cognitive Radio Networks* Chanaka J. Liyana Arachchige, S. Venkatesan, R. Chandrasekaran, and Neeraj Mittal Erik Jonsson School of Engineering and Computer Science, The University of Texas at Dallas, 800 West Campbell Road, Richardson, TX 75080, USA {chanaka.liyana,venky,chandra,neerajm}@utdallas.edu
Abstract. This paper addresses the time-efficient broadcast scheduling problem in Cognitive Radio (CR) Networks. Cognitive Radio is a promising technology that enables the use of unused spectrum in an opportunistic manner. Because of the unique characteristics of CR technology, the broadcast scheduling problem in CR networks needs unique solutions. Even for single channel wireless networks, finding a minimum-length broadcast schedule is an NP-hard problem. In addition, the multi-channel nature of CR networks, especially the non-uniform channel availability, makes it a more complex problem to solve. In this paper, we first present an Integer Linear Programming (ILP) formulation to determine the minimum broadcast schedule length for a CR network. We then present two heuristics to construct minimal-length broadcast schedules. Comparison of optimal results (found by solving the ILP formulation) with the results of the heuristics through simulation shows that both heuristics produce schedules of optimal or near-optimal length. Keywords: cognitive radio networks, time efficient broadcasting, broadcast scheduling.
1 Introduction
Cognitive Radio (CR) is a technology that enables the use of spectrum in an efficient manner. CR nodes can sense the spectrum over a wide range of frequency bands and utilize unused or underutilized licensed frequency bands. Since CR nodes do not own these bands, they cannot use them indefinitely. When the owners of the frequency bands (primary users) start using their bands, CR nodes need to vacate the band and move to another band. Since the availability of free channels at a node depends on factors such as the physical location of the node and the hardware capability of its transceiver (the frequency range supported by the radio hardware), even neighboring CR nodes may have different channels available to them. Broadcasting is a fundamental operation in computer networks. The inherent broadcast nature of the wireless medium makes it appealing for wireless networks to support broadcast as a primary network operation. In this paper, we focus on constructing *
This work was partly supported by a grant from Research In Motion (RIM) Ltd.
TDMA-based broadcast schedules that enable a given node to send a message to all other nodes in a CR network. A TDMA-based schedule consists of multiple timeslots. In each timeslot of the schedule, every node transmits on some channel, listens on some channel, or stays idle in a pre-determined manner.
1.1 Motivation
CR technology introduces new complexity to the broadcast scheduling problem due to its non-uniform channel availability and multi-channel nature. Solutions to the broadcast scheduling problem in single channel wireless networks do not need to be concerned about which channel to use in broadcasting. Also, traditional multi-channel networks can utilize single channel algorithms, since all the nodes have uniform channel availability. But in a CR network, channel availability is not uniform and the availability of a network-wide common channel is not guaranteed. Therefore, any solution to the broadcast scheduling problem in CR networks needs to consider channel availability at each node. To further illustrate the challenges of broadcast scheduling in a CR network, consider a simple star network with node A as the center node and N edge nodes. In the single channel case (figure 1(a)), node A can broadcast a message to all of its N neighbors using a single timeslot. In a traditional multi-channel network where channel availability is uniform, node A can again broadcast to all its neighbors using a single timeslot by transmitting on any channel, provided all its neighbors listen on that channel. However, in a CR network, node A may not be able to broadcast a message using only a single timeslot; it may have to transmit the message multiple times using a different channel each time. For example, in figure 1(b), node A needs at least three timeslots to transmit the message to all its neighbors. In the worst case, each neighboring node may share a different channel with node A and broadcasting will take N timeslots (see figure 1(c)).
Fig. 1. Star network topology. Available channels for each node are shown in the square brackets. (a) Single channel network. (b) CR network 1. (c) CR network 2 (each node has a unique common channel with node A).
1.2 Related Work
Time-efficient broadcast scheduling is a widely studied problem for single channel networks. Chlamtac and Kutten [1] proved that finding a time-optimal broadcast schedule is an NP-hard problem. This has led to the development of approximation algorithms and global bounds for the minimum-time broadcast scheduling problem [2]. Peleg [2] presented a comprehensive literature review on time-efficient broadcast
scheduling algorithms for single channel networks under different models and assumptions. The non-uniform channel availability (especially the possible unavailability of a globally common channel) makes it hard to use single channel broadcast scheduling heuristics in CR networks. Qadir et al. [6] studied the minimum-latency broadcast scheduling problem for multi-channel, multi-hop radio networks. They have proposed a set of centralized heuristic solutions to the problem. However, they assume each node has multiple interfaces (or transceivers) and that the number of interfaces is equal to the number of channels. Broadcast scheduling in CR networks is a relatively new problem. To the best of our knowledge, the only published work in this area is due to Kondareddy and Agrawal [8]. They proposed a broadcast algorithm for the exchange of control information based on flooding the message using a limited set of channels. Constructing a time-efficient broadcast schedule was not considered in their work.
1.3 Our Contributions
Our focus is on finding a minimal-length schedule for transmitting a message from a node to all other nodes in a multi-hop CR network. To the best of our knowledge, this is the first paper that addresses the time-efficient broadcast scheduling problem for CR networks. Assuming time is slotted, we propose to find a schedule that minimizes the number of timeslots in the schedule. Our main contributions are as follows:
• An Integer Linear Programming solution to find an optimal broadcast schedule for a CR network.
• Two polynomial-time heuristic solutions to find a time-efficient broadcast schedule for a CR network.
• A simulation study to compare heuristic results with the optimal schedule length.
The rest of the paper is organized as follows. Section 2 presents the system model used in our solution. Section 3 presents the ILP formulation and section 4 presents two heuristics. Simulation and results are presented and discussed in section 5. Section 6 concludes the paper.
2 System Model
2.1 Node
We consider a CR network consisting of N nodes. Each node is assigned a unique identifier from the range 1..N. Each node knows its own ID and is equipped with one wireless transceiver (transmitter and receiver). Thus, each node is capable of either transmitting or receiving (but not both) at any given moment.
2.2 Medium
The CR network has M channels available for potential use. This is termed the global channel set (Cglobal = {1, 2, ..., M}). Each node knows the global channel set. A node can operate on a subset of this global channel set depending on the channel availability at that node. We assume that each node knows its channel availability set;
finding and maintaining the channel availability set is outside the scope of this work. We also assume the channel availability set to be relatively stable with respect to the algorithm execution time.
2.3 Network Operation
We assume a multi-hop wireless network formed by CR nodes that operates in a time slotted manner [9]. The broadcast schedule is calculated by the broadcast-initiating node (source node). The source node knows the network topology and the channel availability information at each node. Information about the schedule can be sent to other nodes using either a common signaling channel or other signaling mechanisms. Note that as long as the physical topology and the channel availability sets do not change, the same broadcast schedule can be used. We represent the CR network as an undirected graph. Vertices represent nodes. Edges represent connectivity between nodes. There is an edge between two nodes if they share at least one common channel and are within the communication range of each other. All edges are bidirectional. (If A can send a message to B on channel x, B can also send a message to A on channel x.)
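As a small illustration of this model, the following sketch builds such a graph from node positions and channel availability sets (the function name and inputs are our assumptions, not from the paper):

import itertools
import math

def build_cr_graph(positions, channel_sets, comm_range):
    # Edge between u and v iff they are within communication range AND
    # share at least one channel, as in Section 2.3.
    # positions: id -> (x, y); channel_sets: id -> set of channel ids.
    edges = set()
    for u, v in itertools.combinations(positions, 2):
        if (math.dist(positions[u], positions[v]) <= comm_range
                and channel_sets[u] & channel_sets[v]):
            edges.add((u, v))  # undirected: one entry per pair
    return list(positions), edges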
3 ILP Formulation
In this section, we present an Integer Linear Programming (ILP) formulation to determine the optimal (minimum) broadcast schedule length for CR networks. Consider an undirected graph G = (V, E), where V = {1, ..., N} is the set of nodes and E is the set of edges. Each node can operate on a subset of the Cglobal channels. The available channel set at node i is denoted by C_i, where C_i ⊆ Cglobal. L is the schedule length in number of timeslots. Initially, node 1 has the message. Our aim is to determine whether the message can be propagated to all the nodes in the network using L timeslots. First, we define the following variables for i ∈ V, k ∈ C_i and t ∈ {1, ..., L}:

y_{i,k,t} = 1 if node i transmits on channel k during timeslot t, and 0 otherwise.
z_{i,k,t} = 1 if node i receives on channel k during timeslot t, and 0 otherwise.

The necessary constraints for the ILP are as follows:

1. Since a node has only one transceiver, it cannot simultaneously transmit and receive during a single timeslot. Also, a node cannot transmit or receive on more than one channel during a timeslot. This is ensured by the following constraint:

\[ \sum_{k \in C_i} (y_{i,k,t} + z_{i,k,t}) \le 1 \qquad \forall i,\ \forall t \]

2. A node cannot transmit the message before receiving it, except for the source node:

\[ y_{i,k,t} \le \sum_{k' \in C_i} \sum_{t' < t} z_{i,k',t'} \qquad \forall i \ne 1,\ \forall k \in C_i,\ \forall t \]

3. A node cannot receive the message unless a neighbor transmits it:

\[ z_{i,k,t} \le \sum_{j : (i,j) \in E,\ i \ne j} y_{j,k,t} \qquad \forall i,\ \forall k \in C_i \cap C_j,\ \forall t \]

4. A node cannot receive correctly if two or more neighbors transmit during the same timeslot on the same channel. This constraint deals with collisions:

\[ y_{j,k,t} + y_{j',k,t} \le 2 - z_{i,k,t} \qquad \forall (i,j) \in E,\ (i,j') \in E,\ i \ne j \ne j',\ \forall k \in C_i \cap C_j \cap C_{j'},\ \forall t \]

5. All nodes except the source node must receive the message:

\[ \sum_{k \in C_i} \sum_{t} z_{i,k,t} \ge 1 \qquad \forall i \ne 1 \]
This formulation does not directly provide the optimal broadcast schedule length. Instead, for a given schedule length L, it determines whether a feasible solution exists. If a solution is feasible, we can obtain the schedule by observing which of the y_{i,k,t} and z_{i,k,t} values are equal to 1. For any given graph, we cannot produce a broadcast schedule shorter than the minimum hop distance from the source node to the farthest node. Thus, the graph radius of the source node acts as a lower bound on the broadcast schedule length. To find the optimal schedule length, we can start from this lower bound and keep incrementing it until the ILP gives a feasible solution.
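To make the procedure concrete, the following is a minimal sketch of the feasibility check in Python using the PuLP modeling library (the authors used GLPK; the solver choice, function name, and data layout here are illustrative assumptions):

import itertools
import pulp

def feasible_schedule(nodes, edges, chans, L, source=1):
    # Return True iff an L-slot broadcast schedule exists under the five
    # constraints of Section 3. chans maps node -> available channel set.
    nbrs = {i: set() for i in nodes}
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    T = range(1, L + 1)
    prob = pulp.LpProblem("broadcast", pulp.LpMinimize)
    prob += 0  # pure feasibility check: dummy constant objective
    y = {(i, k, t): pulp.LpVariable(f"y_{i}_{k}_{t}", cat="Binary")
         for i in nodes for k in chans[i] for t in T}
    z = {(i, k, t): pulp.LpVariable(f"z_{i}_{k}_{t}", cat="Binary")
         for i in nodes for k in chans[i] for t in T}
    for i in nodes:
        for t in T:  # (1) one transceiver: at most one action per slot
            prob += pulp.lpSum(y[i, k, t] + z[i, k, t] for k in chans[i]) <= 1
        for k in chans[i]:
            txers = [j for j in nbrs[i] if k in chans[j]]
            for t in T:
                if i != source:  # (2) transmit only after receiving
                    prob += y[i, k, t] <= pulp.lpSum(
                        z[i, k2, t2] for k2 in chans[i] for t2 in range(1, t))
                # (3) reception requires a transmitting neighbor on k
                prob += z[i, k, t] <= pulp.lpSum(y[j, k, t] for j in txers)
                # (4) two neighbors transmitting on k block reception at i
                for j, j2 in itertools.combinations(txers, 2):
                    prob += y[j, k, t] + y[j2, k, t] <= 2 - z[i, k, t]
        if i != source:  # (5) every non-source node must receive
            prob += pulp.lpSum(z[i, k, t] for k in chans[i] for t in T) >= 1
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return pulp.LpStatus[prob.status] == "Optimal"

The optimal length is then found exactly as described above: start L at the source's graph radius and increment it until feasible_schedule() returns True.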
4 Heuristic Algorithm
Even though the ILP can be used to obtain the optimal broadcast schedule length, it takes a substantial amount of time to terminate. Therefore, we present two centralized heuristics to find a time-efficient broadcast schedule.
4.1 Heuristic 1
The idea behind heuristic 1 is to give priority to nodes that have paths to the farthest nodes (from the source node) in the network. Whenever we have to choose between several nodes, we give priority to nodes that have the shortest paths to the farthest nodes. The heuristic consists of two phases. In the first phase, we assign a level number to each node based on its distance from the source node and then select transmitting instances (transmitting node, receiving nodes, and channels) to serve each level. During the second phase, we assign these transmitting instances to the timeslots.
When assigning levels, the source node gets level 0, while the farthest node gets the highest level. When selecting transmitting instances, we start from the highest level (farthest from the source node) and assign transmitting nodes to serve all the nodes in each level using nodes from the immediately lower level (e.g., level 4 is served using nodes from level 3). Transmitting (Tx) node selection is done as follows. Consider the nodes at levels k and k-1. Our aim is to find a smallest set of nodes from level k-1 to cover all the nodes in level k. This is equivalent to the set cover problem, which is known to be NP-hard [10]. We use the following simple greedy heuristic to solve this problem (a code sketch of this selection appears below):
1. Find a node and a channel from level k-1 which cover the maximum number of nodes at level k. If more than one channel can be used by the Tx node to cover the same receiving (Rx) node set, we keep all such channels as possible options when we move to the next phase.
2. Remove the covered nodes from level k. If no nodes remain in level k, selection is complete for the level. Otherwise, go to step 1.
To prioritize the nodes that have paths to the farthest nodes, we use a rank. Initially, each node has rank 0. Each time we select a Tx instance, the rank of the Tx node is updated based on the ranks of the nodes it covers (Tx node rank = maximum rank of its Rx nodes + 1). For example, if a selected Tx node covers a set of nodes with ranks 0, 3, and 1, then the Tx node gets 4 as its rank. At the end of the first phase, we have a set of Tx instances, each consisting of a Tx node, its rank, the corresponding Rx nodes, and the possible channels for transmission. Essentially, this set forms the broadcasting tree. Also note that a node can appear in more than one Tx instance and can have different rank values. When a node has multiple rank values, the maximum rank is used when computing the Tx node rank. During the second phase, we assign Tx instances to the timeslots. For the first timeslot, we start with the source node. If we have more than one Tx instance with the source node as the Tx node, we select the Tx instance with the highest rank. From the second timeslot onwards, we select Tx instances based on the following criteria:
1. From the Tx instances whose Tx node already has the message, select a Tx instance with the highest rank and schedule it to transmit in the current timeslot. In case of a tie, we select the Tx instance with the largest number of Rx nodes; if the numbers of Rx nodes are also equal, we select one randomly.
2. Try to schedule another Tx instance in the same timeslot. From the Tx instances whose Tx node already has the message and is not already scheduled to transmit during the current timeslot, pick a Tx instance with the highest rank.
3. Check whether the selected Tx instance causes collisions with already scheduled transmissions. If it does not cause collisions, schedule it to transmit in the current timeslot and go back to step 2. If it causes a collision, go to step 4.
4. Check the next-highest-rank Tx instance whose Tx node already has the message and is not already scheduled to transmit during the current timeslot. If such a Tx instance can be found, go to step 3. If not, scheduling for the current slot is done; move to the next timeslot and start from step 1. Once all nodes are scheduled to receive the message, the algorithm terminates.
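A minimal sketch of the phase-one greedy cover described above (rank bookkeeping, the multiple-channel options, and phase two are omitted for brevity; names and data layout are our assumptions):

def select_tx_instances(level_nodes, nbrs, chans):
    # Greedy set cover of each level k by (node, channel) pairs from
    # level k-1, per steps 1-2 above. level_nodes: level -> node set;
    # nbrs: node -> neighbor set; chans: node -> channel set.
    instances = []
    for k in range(max(level_nodes), 0, -1):
        uncovered = set(level_nodes[k])
        while uncovered:
            best = None
            for tx in level_nodes[k - 1]:
                for ch in chans[tx]:
                    rx = {v for v in nbrs[tx] & uncovered if ch in chans[v]}
                    if best is None or len(rx) > len(best[2]):
                        best = (tx, ch, rx)
            if not best or not best[2]:
                break  # no pair covers anything (would mean a disconnected level)
            instances.append(best)
            uncovered -= best[2]
    return instances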
Checking for collisions is done by checking the following two conditions:
1. The candidate Tx node's transmission on the selected channel interferes with already scheduled Rx nodes.
2. The candidate Rx nodes' receptions on the selected channel are interfered with by already scheduled Tx nodes.
4.2 Heuristic 2
The idea behind the second heuristic is to select a set of Tx nodes and channels for each timeslot such that the number of Rx nodes served is maximum. While selecting Tx nodes and channels, we make sure collisions are avoided. At each timeslot, we consider the set of nodes that have the message (covered), and the set of nodes that do not have the message (uncovered). Then, in each timeslot, we try to cover the maximum number of nodes from the uncovered set using the covered node set, without causing collisions. This is again equivalent to the set cover problem, which is NP-hard [10]. We use the following greedy approach to solve the problem (a code sketch follows the complexity analysis below). Consider the following sets of nodes (i = current timeslot):
Ai = {nodes that have the message at the end of timeslot i-1}; A1 = S = {source node}
Bi = {single-hop neighbors of nodes in Ai that do not have the message at the end of timeslot i-1}
txSeti = {nodes scheduled to transmit in timeslot i}
rxSeti = {nodes scheduled to receive in timeslot i}
Scheduling is done as follows:
1. Start from the first timeslot (i = 1).
2. Select a node and a channel (the node must be in the set Ai) which cover the maximum number of nodes from Bi (breaking ties randomly). Add the selected Tx node to txSeti and the corresponding Rx nodes to rxSeti.
3. Find another node and a channel (the node must be in the set {Ai - txSeti}) which cover the maximum number of nodes from the set {Bi - rxSeti} and do not cause collisions. Collisions are identified as described for heuristic 1.
a. If such a node can be found, add it to txSeti and its Rx nodes to rxSeti. Then repeat step 3.
b. If no such node can be found, scheduling for this timeslot is done. If all the nodes are covered, scheduling is complete. Otherwise, move to the next timeslot (i+1), update the sets Ai and Bi, and go to step 2.
4.3 Time Complexity Analysis – Heuristic 1
Heuristic 1 consists of two phases. Phase 1 consists of assigning levels based on the distance from the source node and selecting Tx instances. Level assignment is similar to finding the shortest paths to all nodes from the source and can be done in O(N²) time. Selection of Tx instances for each level is done using a greedy heuristic. Here we assume each node in a Tx level keeps, for each available channel, a list of the nodes it can cover in the Rx level. These lists can be constructed in O(MN²) time. Selecting the
node and the channel that cover the maximum number of nodes from the Rx level takes O(MN) time. (Recall that M is the number of available channels.) Once a Tx instance is selected, we need to update the Rx node lists of each Tx node based on the already scheduled Tx instance information. Updating one list takes O(N log N) time. (We have two lists of size O(N), and we need to check whether a node in the already selected Tx list is present in the Rx node's list; this can be done in O(N log N) time.) Since we have O(N) Tx instances and O(M) channels, updating all the lists takes O(MN² log N) time. Therefore, the time required to find one Tx instance and update the Tx node details is O(MN + MN² log N) = O(MN² log N). For each level, we may have to select up to O(N) nodes. Therefore, to complete each level we need O(MN³ log N) time. If the network radius is R, we need O(RMN³ log N) time to complete the Tx instance selection procedure over all levels. Therefore, phase one takes O(RMN³ log N) time. In the second phase, we assign the selected Tx instances to the timeslots. Selection of the Tx instance with the highest rank takes O(N) time. Then, we try to schedule another Tx instance for the same timeslot. Here, we have to check for collisions. The first step in checking for collisions is to check for interference to already scheduled Rx nodes. For each scheduled Rx node, we need to check whether it is a neighbor of the candidate Tx node and whether they use the same channel. This can be done in O(N log N) time. The second step is to check whether the candidate Rx nodes' receptions are interfered with by the already selected Tx nodes. Here we check whether any of the Tx nodes scheduled to use the same channel as the candidate Rx nodes are present in the neighbor lists of the candidate Rx nodes. This can take O(N² log N) time. Hence, the total time for collision checking is O(N² log N + N log N) = O(N² log N). The total time for scheduling one node is O(N + N² log N) = O(N² log N). We might have to schedule up to O(N) nodes in one timeslot. Therefore, the time required to schedule one timeslot is O(N³ log N). If the schedule length is L timeslots, the time complexity of phase two is O(LN³ log N). Therefore, the time complexity of heuristic 1 is O(RMN³ log N + LN³ log N).
4.4 Time Complexity Analysis – Heuristic 2
The time complexity of the second heuristic follows from the first one. In the second heuristic, we try to pack as many Tx nodes as possible into each timeslot while avoiding collisions. This is similar to the greedy method used in heuristic 1 to select Tx instances, so the same analysis applies. The time complexity of the collision checking procedure is also the same. Therefore, the time complexity of scheduling one timeslot is O(MN³ log N + N² log N) = O(MN³ log N). If the schedule length is L timeslots, then we need O(LMN³ log N) time. So the total time complexity of heuristic 2 is O(LMN³ log N).
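As referenced above, a minimal sketch of heuristic 2's per-timeslot greedy packing, including the two collision conditions of Section 4.1 (function names and data layout are our assumptions; a connected graph is assumed so that every slot makes progress):

def no_collision(tx, ch, rx, slot, nbrs):
    # Both conditions of Section 4.1: the new transmission must not hit
    # already scheduled receivers on ch, and the new receivers must not
    # hear already scheduled transmitters on ch.
    for tx2, ch2, rx2 in slot:
        if ch2 == ch and (rx2 & nbrs[tx] or rx & nbrs[tx2]):
            return False
    return True

def heuristic2_schedule(nodes, nbrs, chans, source):
    # Per timeslot, greedily add the non-colliding (tx, channel) pair
    # covering the most uncovered neighbors, until nothing more fits.
    covered = {source}
    schedule = []  # one list of (tx, ch, rx_set) per timeslot
    while len(covered) < len(nodes):
        slot, tx_set, rx_set = [], set(), set()
        while True:
            best = None
            for tx in covered - tx_set:
                for ch in chans[tx]:
                    rx = {v for v in nbrs[tx] - covered - rx_set
                          if ch in chans[v]}
                    if (rx and no_collision(tx, ch, rx, slot, nbrs)
                            and (best is None or len(rx) > len(best[2]))):
                        best = (tx, ch, rx)
            if best is None:
                break  # slot is full
            slot.append(best)
            tx_set.add(best[0])
            rx_set |= best[2]
        schedule.append(slot)
        covered |= rx_set
    return schedule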
5 Simulation and Results
5.1 Simulation Setup
To evaluate the performance of our heuristics, we compared the optimal schedule length with the lengths of the schedules generated by our heuristics. The
optimal schedule length was obtained using the ILP formulation presented in Section 3. We implemented the ILP formulation using the GNU Linear Programming Kit (GLPK) [11]. Schedule lengths for the two heuristics were obtained using computer simulations. We ran the simulations on a wide range of network topologies and scenarios. Network topologies were generated by creating random graphs as follows: first, we generated a predefined number of random points (x, y coordinates) within a 1000 × 1000 square. Each point represents a node in the network. To avoid nodes being too close to each other (the clustering problem), we rejected new points within a threshold distance of existing nodes. Edges were then added to the graph using a circle of radius R centered on each point: whenever another point is found within the circle, an edge is added between them. Since we require a connected graph, whenever no points were found within the circle, we increased the radius by r percent until points were found. Once edges are added for each node, there is still a possibility that the graph is disconnected. In such cases, we added an edge between the nearest two points of the disconnected components, ensuring the final graph is connected. We used the JGraphT [12] graph library for the implementation of graphs.
Fig. 2. Channel availability among the nodes and average number of common channels per edge. (Total number of available channels = 15)
Once we have the network topology, the next question is how to assign channels to each node. We used the following method: first, we assigned k (k = 1..M) channels to each edge randomly. Then, based on the channels assigned to each edge, the channel set for each node was determined by taking the union of the channels assigned to its edges. Note that even though we assign k channels to each edge, the final number of channels available on each edge may be higher than k.
From the simulation results, we have observed that when there are multiple common channels available on the edges, the schedule lengths given by both heuristics were either optimal or very close to optimal. To simulate worst-case scenarios, we decided to keep the number of common channels per edge to a small value; therefore we selected k = 1. Average statistics for the channel distribution among the nodes and edges are given in figure 2. For the final simulation, we used 100 nodes in a 1000 × 1000 area. The total number of available channels was set to 15. By changing the value of R (within the range 30 to 200), we obtained graphs with degrees ranging from 2 to 10.
5.2 Results
Simulation results (broadcast schedule lengths) were collected by running both heuristics and the ILP on randomly generated network topologies. We varied the graph topology from sparse to dense by changing the average graph degree from 2 to 10. For each average degree value, we ran both heuristics and the ILP on 10 randomly generated graphs; we limited the study to 10 graphs per degree because of the time taken to run the ILP. Table 1 summarizes the main results. Figure 3 presents the comparison between optimal and heuristic schedule lengths.

Table 1. Results for the simulation and ILP (all results are averaged over 10 graphs). The last three columns give the average schedule length.

Average Graph Degree | Average Graph Radius | Optimal (ILP) | Heuristic 1 | Heuristic 2
         2           |        22.7          |     22.8      |    22.8     |    26.2
         3           |        21.4          |     22.1      |    22.3     |    24.8
         4           |        19.7          |     19.9      |    20.2     |    21.7
         5           |        15.0          |     15.2      |    15.5     |    16.0
         6           |        12.4          |     12.4      |    13.1     |    13.1
         7           |        10.8          |     11.0      |    11.3     |    11.5
         8           |        10.0          |     10.3      |    10.5     |    10.6
         9           |         9.1          |      9.1      |     9.2     |     9.3
        10           |         8.5          |      8.5      |     9.0     |     8.6
5.3 Discussion
From the results, we can clearly see that heuristic 1 performed very well compared to the optimal results: the average difference is only 2.41%, and we never observed it deviating by more than 2 timeslots from the optimal schedule length. The maximum difference between the optimal length and the length produced by heuristic 1 is 5.88% (at degree 10). One possible explanation for heuristic 1 performing so close to optimal is as follows. Assume a scenario where a node has to deliver a message to n neighboring nodes. Even if the node could not send the message to all n nodes during timeslot t, there is a high possibility that it can deliver the message in the subsequent timeslots (e.g.,
t+1, t+2, etc.) while the recipient nodes of timeslot t can also transmit in timeslots t+1, t+2, etc. This is because of the availability of multiple channels, which enables simultaneous communication among neighboring nodes without causing collisions. (Note that in a single channel network, this is not possible.) Therefore, in CR networks, there is a high possibility that a node can transmit a message to all of its neighbors fairly quickly (within successive timeslots). The remaining question is the order of transmissions. Heuristic 1 takes care of this by assigning priorities to the nodes that have paths to the farthest nodes. Therefore, the combination of the availability of multiple channels and giving priority to nodes that have paths to the farthest nodes works well to produce faster schedules.
Fig. 3. Comparison between the optimal schedule length and heuristic schedule lengths (Number of nodes = 100, Number of available channels = 15)
Heuristic 2 also performed very well when the graph was not sparse, but for sparse graphs its performance was not as good as that of heuristic 1. When the graph is sparse, the number of available paths from any given node to other nodes is limited. Since heuristic 2 does not give priority to nodes that have paths to the farthest nodes, this performance difference is understandable. When the graph is dense, there are many paths of equal length from any given node to all other nodes, so giving priority to nodes that have paths to the farthest nodes does not make a big difference. An important observation that we have made is the relationship between the graph radius (hop distance from the source node to the farthest node) and the optimal schedule length: these two values were very close (see Table 1). The average difference between the two values was only 1.21%; in terms of timeslots, the difference was always 0 or 1. Therefore, we believe that the radius of the network can be taken as a good approximation for the optimal broadcast schedule length in CR networks. This also leads to the conclusion that the characteristics of CR networks (i.e., availability of multiple channels and non-uniform channel availability) do not hinder achieving a short broadcast schedule.
6 Conclusion
In this paper, we addressed the time-efficient broadcast scheduling problem in Cognitive Radio networks. We presented an Integer Linear Programming (ILP) formulation for finding an optimal broadcast schedule for a CR network. We also presented two centralized heuristics to find minimal-length broadcast schedules. In the first heuristic, we start from the farthest nodes (from the source node) in the network and select a set of transmitting nodes/channels to cover the entire network. Scheduling is then done by giving priority to the node/channel pairs that cover nodes that have paths to the farthest nodes. In the second heuristic, we start from the broadcast-initiating node and pick the best node/channel pair at each stage, using the number of nodes covered by each potential transmitting node on each available channel as the selection criterion. In both heuristics, we try to pack as many transmitting node/channel pairs as possible into each timeslot, while avoiding collisions. We implemented the ILP to obtain the optimal schedule, implemented our heuristics, and performed simulations on different graph topologies. Comparison of the optimal results with the simulation results shows that both heuristics produce schedule lengths that are very close to the optimal schedule lengths.
References
1. Chlamtac, I., Kutten, S.: On broadcasting in radio networks - problem analysis and protocol design. IEEE Transactions on Communications 33, 1240–1246 (1985)
2. Peleg, D.: Time-Efficient Broadcasting in Radio Networks: A Review. In: Janowski, T., Mohanty, H. (eds.) ICDCIT 2007. LNCS, vol. 4882, pp. 1–18. Springer, Heidelberg (2007)
3. Chlamtac, I., Weinstein, O.: The Wave Expansion Approach to Broadcasting in Multihop Radio Networks. IEEE Transactions on Communications 30(3) (1991)
4. Kowalski, D., Pelc, A.: Optimal deterministic broadcasting in known topology radio networks. Distributed Computing 19(3), 185–195 (2007)
5. Gasieniec, L., Peleg, D., Xin, Q.: Faster communication in known topology radio networks. Distributed Computing 19, 289–300 (2007)
6. Qadir, J., Chou, C.T., Misra, A.: Minimum Latency Broadcasting in Multi-Radio Multi-Channel Multi-Rate Wireless Mesh Networks. In: Proc. of IEEE Sensor and Ad Hoc Communications and Networks (September 2006)
7. Li, L., Qin, B., Zhang, C., Li, H.: Efficient Broadcasting in Multi-radio Multi-channel and Multi-hop Wireless Networks Based on Self-pruning. In: Perrott, R., Chapman, B.M., Subhlok, J., de Mello, R.F., Yang, L.T. (eds.) HPCC 2007. LNCS, vol. 4782, pp. 484–495. Springer, Heidelberg (2007)
8. Kondareddy, Y.R., Agrawal, P.: Selective Broadcasting in Multi-Hop Cognitive Radio Networks. In: IEEE Sarnoff Symposium (April 2008)
9. Krishnamurthy, S., Thoppian, M., Kuppa, S., Chandrasekaran, R., Mittal, N., Venkatesan, S., Prakash, R.: Time-efficient Distributed Layer-2 Auto-configuration for Cognitive Radio Networks. Computer Networks 52(4) (2008)
10. Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms, 2nd edn.
11. GNU Linear Programming Kit (GLPK), http://www.gnu.org/software/glpk/
12. JGraphT, http://www.jgrapht.org
Traffic Congestion Estimation in VANETs and Its Application to Information Dissemination Rayman Preet Singh and Arobinda Gupta Department of Computer Science & Engineering Indian Institute of Technology Kharagpur, WB - 721302, India
Abstract. Traffic congestion estimation in vehicular ad hoc networks can have many interesting applications. In this paper, we propose two measures of traffic congestion around a vehicle suited for different applications. A scheme for a vehicle to estimate the congestion around it is proposed and evaluated through simulations. We also show an application of the estimated congestion to information dissemination, achieving high coverage while sending far fewer redundant messages than a flooding-based scheme.
1 Introduction
A vehicular ad hoc network (VANET) is an ad hoc wireless communication system set up between multiple vehicles (vehicle-to-vehicle or V2V) or between a vehicle and roadside infrastructure (V2I). Many applications have been proposed on VANETs for different purposes, such as safety, infotainment, financial services, and navigational aid [3]. Traffic congestion has been studied extensively in traffic flow theory for various reasons, such as road capacity planning and estimating average commute times. Congestion information can be useful for many VANET applications also, such as for route planning or traffic advisories. Typically, congestion information is collected as the number of vehicles passing a point per unit time by some roadside equipment, and transmitted to other places for broadcasting to vehicles. However, in the absence of such roadside infrastructure, congestion information is not available. Moreover, the congestion information is usually available only at a single macroscopic level for all vehicles, and is not customized for the requirements of each vehicle. Many VANET scenarios discussed in the literature assume the presence of periodic beacon messages from vehicles broadcasting kinematic information such as position, velocity, and heading. In the presence of such beacons, a vehicle may be able to estimate traffic congestion in its immediate neighborhood by counting the number of beacons received from different vehicles. However, all beacons may not be of equal relevance to a vehicle for congestion estimation, depending on parameters such as the distance of the vehicle, its relative velocity, etc. Moreover, depending on the application, the notion of congestion may also be different. In this paper, we first propose two measures of local traffic congestion around a vehicle, Instantaneous Congestion and Stabilized Local Congestion, arguing
that each has its own use in certain applications. A method by which a vehicle can estimate both types of congestion around it based on beacon messages is then proposed and evaluated. The method uses a novel scheme for computing the relevance of beacons to each type of congestion value. We then show that the two notions of congestion proposed can be used effectively in an information dissemination application to achieve high coverage with far fewer redundant messages than a flooding-based scheme. Congestion has been studied earlier in traffic flow theory [1,2]. However, there has been hardly any work on congestion estimation in VANETs without using any infrastructure. A traffic congestion detection and estimation application is described by Ghazy et al. [7] and by Padron et al. [8]. However, these approaches perform congestion detection at a macroscopic level, using extensive collaborative processes among the vehicles.
2 Measures of Congestion
We identify two types of congestion around a vehicle that are of primary relevance to VANET applications.
– Instantaneous Congestion: This gives the instantaneous picture of the traffic in the vicinity of a vehicle u at any instant, measured as the set of vehicles in the communication range of u at that instant.
– Stabilized Local Congestion: This is measured as the set of neighboring vehicles of a vehicle u which have been stable members of the instantaneous congestion of u for a certain amount of time.
The two congestion measures are relevant in different applications. Short-term applications, such as information broadcasting, may employ instantaneous congestion. For example, in a probabilistic flooding algorithm where a node floods a message with a certain probability, the probability can be adjusted based on the congestion around the node; larger congestion means the probability can be lower. Likewise, long-term applications such as collaborative misbehavior detection require nodes to collaborate, and thus choose vehicles in their stabilized local congestion, as those vehicles are more likely to remain within communication range for the rest of the collaboration.
3 Estimating Congestion Using Beacon Relevance
We assume that each vehicle periodically sends a beacon containing its position and velocity. A receiving vehicle first assigns a relevance value to each beacon, and then estimates its congestion based on the beacons received. A beacon received at a vehicle may convey different types of information with respect to congestion around the vehicle: beacons from nearby vehicles convey more information with regard to vehicular congestion; a continuous stream of beacons from the same sender signifies close proximity of the sender; beacons
from a vehicle with a lower relative velocity indicate that it is expected to stay close for a longer duration of time. In order to quantitatively classify the beacons received at a vehicle, a weight is assigned to each received beacon representing its relevance. Formally, the relevance of a beacon BM(v) sent by a node v and received at a node u, denoted by Rel(BM(v)), is defined as follows:

\[ Rel(BM(v)) = \frac{p_v}{v_{rel} \times d_v} \qquad (1) \]
where d_v is the distance of v from u (obtained from the position sent in the beacon and the receiving vehicle's own position), p_v is the transmission power level at which the beacon was broadcast (estimated from the received signal strength and d_v, and assumed to be the same for all vehicles), and v_rel is the relative velocity between u and v (estimated from the velocity of the sender sent in the beacon and the current velocity of the receiver). Using the beacon relevance, we now propose a simple scheme to measure the two types of congestion specified in Section 2. A vehicle u, on receiving a beacon, computes its relevance. If the beacon relevance is above a threshold, it stores the sender node's identity, the time the beacon was received, and its relevance. Instantaneous and stabilized local congestion are computed periodically with a certain update period. The instantaneous congestion estimate IC(u) is computed as the set of those vehicles which have sent beacons bearing relevance at least RelTH, and have sent a beacon to u at least once in the last tTH time units. The stabilized local congestion estimate SLC(u) comprises the vehicles which have been part of IC(u) for at least the last K update steps performed at node u, and measures the set of stable neighborhood members of u.
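As an illustration, a minimal sketch of this scheme in Python (class and method names are our assumptions; the paper does not prescribe an implementation):

import time

def relevance(p_v, v_rel, d_v):
    # Equation (1): higher for near, slow-moving, strongly transmitting
    # senders; zero-division guards are omitted for brevity.
    return p_v / (v_rel * d_v)

class CongestionEstimator:
    def __init__(self, rel_th, t_th, K):
        self.rel_th, self.t_th, self.K = rel_th, t_th, K
        self.last_heard = {}   # sender id -> time of last relevant beacon
        self.ic_history = []   # last K instantaneous congestion sets

    def record_beacon(self, sender, p_v, v_rel, d_v, now=None):
        if relevance(p_v, v_rel, d_v) >= self.rel_th:
            self.last_heard[sender] = time.time() if now is None else now

    def update(self, now=None):
        # Called once per update period; returns (IC(u), SLC(u)).
        now = time.time() if now is None else now
        ic = {s for s, t in self.last_heard.items() if now - t <= self.t_th}
        self.ic_history = (self.ic_history + [ic])[-self.K:]
        slc = (set.intersection(*self.ic_history)
               if len(self.ic_history) == self.K else set())
        return ic, slc

As a consistency check against the parameters used in the next section: relevance(0.2818, 1, 125) = 0.2818/125 = 0.0022544, which is exactly the threshold RelTH used there.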
4 Simulation Results
The congestion estimation algorithm is simulated using an integrated traffic-cum-network simulator designed by integrating the traffic simulator VanetMobiSim [6] with the network simulator ns-2 [5] (essentially a network simulation using ns-2 where the mobility model is supplied online by VanetMobiSim for realistic VANET simulation). The number of vehicles (n) simulated is 100. The other algorithm parameters used are as follows: time between successive beacons from a node = 5 s; update period Up (time between successive updates of congestion at a node) = 15 s; threshold time tTH = 20 s; window size K = 3; and threshold relevance RelTH = 0.0022544 (calculated from Equation (1) assuming pv = 0.2818 dBm (transmission range = 250 m), vrel = 1 m/s, and dv = 125 m, half the transmission range). Three traffic scenarios are considered, corresponding to the traffic theory results [2], namely Scenario 1: Wide Moving Jam, Scenario 2: Synchronized Traffic Flow, and Scenario 3: Free Flow Traffic. These scenarios are simulated by varying the maximum speed and the number of lanes on different road segments across the scenario. The metrics are averaged over all vehicles present in the scenario. Two performance measures are studied for both SLC and IC, namely the Precision and Recall of the reported congestion, defined as follows. For any node u,
let ψ(u) denote the set of all nodes within u's transmission range which would produce a beacon with relevance greater than RelTH with respect to u if they were to send a beacon to u at the instant at which the set is computed. Similarly, let ϕ(u) be the set of nodes in the reported congestion (either SLC or IC). Then the Precision is the percentage of nodes in ϕ(u) that are also members of ψ(u). Similarly, the Recall is the percentage of nodes in ψ(u) that are also members of ϕ(u).

Fig. 1. Precision and Recall of SLC and IC vs. the update period

Figure 1 shows the precision and recall of the Stabilized Local Congestion (SLC) and Instantaneous Congestion (IC) versus the update period (Up) for the three traffic scenarios. For Scenario 1, the precision increases as Up is increased. This is because as more and more time is allocated for the congestion to converge, the accuracy is expected to increase. Similarly, the recall decreases with increasing Up, because "newer" nodes entering the congestion go un-reported, having spent less time in the neighbourhood. For both Scenarios 1 and 2, choosing Up properly is important to trade off between precision and recall. In the case of Scenario 3, which is low-density traffic, a reduced gradient of increase and decrease in the precision and recall values, respectively, is observed.

Fig. 2. Precision and Recall of SLC vs. the window size (K)

Figure 2 shows the precision and recall of SLC versus the window size (K) for the three traffic scenarios. For Scenario 1, the precision increases as K is increased. This is because as more instances of IC are considered for estimating congestion, only the "true" neighbor nodes form a part of the reported congestion. However, since Scenario 1 traffic is clustered in nature, there exists an amount of time for which any node stays in the vicinity and hence forms a part of the congestion, and a value of K corresponding to this optimal time value yields the maximum precision. For Scenario 2, owing to no distinct clusters of traffic, no such tentative value of the time that a node spends in the vicinity of another exists, and hence the precision increases with increasing K. Likewise, the recall
value decreases because a greater K leads to the exclusion of “newer” nodes from the reported congestion. For Scenario 3, a reduced gradient of increase and decrease in the precision and recall values is observed, which can be attributed to the same reason as for the variation with the update period.
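The two metrics reduce to simple set arithmetic; a minimal sketch (assuming the evaluator has access to both the reported set ϕ(u) and the ideal set ψ(u)):

def precision_recall(reported, ideal):
    # Precision: fraction of reported nodes that belong to the ideal set.
    # Recall: fraction of ideal nodes that were reported. Both in percent.
    if not reported or not ideal:
        return 0.0, 0.0
    hits = len(reported & ideal)
    return 100.0 * hits / len(reported), 100.0 * hits / len(ideal)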
5 Congestion-Based Information Dissemination
The estimate of stabilized local congestion can be used in many applications, such as beacon frequency control and transmission power control, as suggested in [4]. In this section, we show that the two types of congestion estimated can be used effectively for information dissemination. In a local danger warning (LDW) application, vehicles exchange information about current road conditions and dangerous situations to warn their drivers about upcoming dangers. Any vehicle can initiate dissemination of the relevant information, and the aim is to achieve maximum node coverage with the least possible redundancy. Using a flooding-based approach causes too many messages to be sent. However, in a high-congestion environment, broadcast by a few nodes suffices to provide acceptable levels of coverage. In a p-persistent broadcast, each node broadcasts with a fixed probability p, thereby restricting the number of broadcasts. However, the probability p is fixed and hence may cause either too many messages to be sent (high p in a high-congestion scenario) or low node coverage (low p in a low-congestion scenario). In our approach, we introduce a measure (bcast) for each node, which must be greater than a given threshold (bcastTH) for the node to propagate information received by broadcast. The bcast value of the sender is propagated along with the information message, and can be used to better estimate the extent to which the information has already been propagated. If a node u receives a message from a node v, the bcast value of u is set as

bcast_u = q · (|SLC| / |IC|) + (1 − q) · bcast_v,

where q is a constant. Node u re-broadcasts if and only if bcast_u ≥ bcastTH. The initiator broadcasts the message with a bcast value of 1.0. q provides a tradeoff between controlled flooding using the local congestion information and the traditional flooding strategy: a q value of 0 corresponds to full flooding, whereas a value of 1 corresponds to controlled flooding using only the local congestion information. The information dissemination scheme is simulated with the same system model and setup as earlier. We analyze a sample highway scenario (a 4 km stretch with 2 lanes per direction) with 200 vehicles. All results are averaged over a randomly selected set of 20 nodes which initiated broadcasts at independent instances. Figures 3 and 4 show the observed variations in the number of nodes reached and the number of redundant messages with q and bcastTH, respectively. Shown in red and green are the results for flooding and the proposed congestion-governed broadcast, respectively.
Fig. 3. Variation of coverage and no. of redundant messages with q (bcastTH = 0.5)

Fig. 4. Variation of coverage and no. of redundant messages with bcastTH (q = 0.5)

We observe that the number of redundant messages decreases as q is increased from 0 to 1, as expected. However, the number of nodes reached remained very high and almost independent of q. This shows that
even a high q that greatly reduces redundant messages can be used to achieve the same degree of node coverage as flooding. Both the number of nodes reached and the number of redundant messages decrease as bcastTH is increased. This is expected, as with a higher bcastTH, fewer and fewer nodes broadcast; in the extreme case of bcastTH = 1, no nodes forward the message.
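A minimal sketch of the forwarding rule (the function name and return convention are our assumptions):

def on_message(bcast_v, slc, ic, q, bcast_th):
    # Blend the sender's bcast value with the local |SLC|/|IC| ratio
    # (Section 5); the initiator starts with bcast = 1.0. Returns the
    # rebroadcast decision and the bcast value to attach when forwarding.
    ratio = len(slc) / len(ic) if ic else 0.0
    bcast_u = q * ratio + (1 - q) * bcast_v
    return bcast_u >= bcast_th, bcast_u

With q = 0 every node inherits the initiator's bcast of 1.0 and rebroadcasts (full flooding); with q = 1 the decision depends only on the local congestion estimate, which matches the tradeoff described above.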
References
1. Haberman, R.: Mathematical Models: Mechanical Vibrations, Population Dynamics and Traffic Flow: An Introduction to Applied Mathematics. Prentice-Hall, Englewood Cliffs (1977)
2. Kerner, B.S.: The Physics of Traffic: Empirical Freeway Pattern Features, Engineering Applications, and Theory. Springer, Heidelberg (2004)
3. Bai, F., ElBatt, T., Holland, G., Krishnan, H., Sadekar, V.: Towards Characterizing and Classifying Communication-based Automotive Applications from a Wireless Networking Perspective. In: IEEE Workshop on Automotive Networking and Applications (AUTONET) (2006)
4. Torrent-Moreno, M., Santi, P., Hartenstein, H.: Fair Sharing of Bandwidth in VANET. In: 2nd ACM International Workshop on Vehicular Ad Hoc Networks (VANET), Cologne, Germany (September 2005)
5. Network Simulator ns-2, http://www.isi.edu/nsnam/ns/
6. VanetMobiSim, http://vanet.eurecom.fr/
7. Ali, G., Tarik, O.: Design and simulation of an artificially intelligent VANET for solving traffic congestion. Journal of Basic and Applied Sciences 1(2) (September 2009)
8. Padron, M.F.: Traffic congestion detection using VANET. Master's Thesis, Florida Atlantic University, USA (2009)
A Tiered Addressing Scheme Based on a Floating Cloud Internetworking Model
Yoshihiro Nozaki, Hasan Tuncer, and Nirmala Shenoy
College of Computing and Information Science, Networking Security and Systems Administration Dept., Rochester Institute of Technology, 1 Lomb Dr, Rochester, NY 14623, USA
{yxn4279,hxt3948,nxsvks}@rit.edu
Abstract. Scalability in inter-domain routing is becoming a pressing problem as routing table sizes grow at very high rates. In this paper, we present a tiered addressing scheme to be implemented over a Floating Cloud Tiered internetworking model, which uses tier-based aggregation to address routing scalability. An analysis of the HD-ratio of the addressing scheme and its possible implementation on the AT&T network is presented. Keywords: Internetworking model, Tiered architecture, Scalable inter-domain routing, Tiered addressing scheme.
1 Introduction
The current Internet architecture has exhibited a remarkable capability for evolution and growth. However, this architecture was based on design decisions made in the 1970s, resulting in the TCP/IP protocol suite, which was intended to handle relatively few networks as compared to the huge networked system it is currently supporting. One attempt to accommodate the increase in the number of computing devices that connect to the Internet resulted in the decisions to develop IPv6 and provide a transition path from IPv4 to IPv6. This may not solve the routing scalability problem faced by the current Internet routing protocols, which has been of increasing concern over the last few decades as the routing table sizes in the core routers experienced very high growth rates [1]. As the problem escalated over these years, several research efforts were directed at the serious scalability issues. However, these research efforts were constrained by the fact that they had to operate within the existing highly meshed internetwork architecture, the Internet Protocol, its logical addresses, and its forwarding and routing mechanisms. The research outcomes thus resulted in incremental and point solutions, which in some cases introduced new vulnerabilities into the evolving Internet [2]. The underlying premise of our solution is that the logical IP addresses and the address assignment process adopted in the Internet are the main reasons for the routing scalability problem. We need a flexible addressing scheme that is amenable to growth and provides better aggregation capabilities than those offered by IPv4 or IPv6. The
fact that address aggregation is a key solution is evidenced by interim solutions such as Classless Inter-Domain Routing and the hierarchical and geographical aggregation recommended in IPv6. Besides basing our solution on the above premise, our project followed the clean-slate Future Internet Design (FIND) [3] initiatives of the National Science Foundation (NSF), and hence the solution was designed with a greater degree of freedom with respect to the current internetwork architecture and the Internet Protocol. The Floating Cloud Tiered (FCT) internetwork model, which is the outcome of this project, differs from any prior work in this area in that, for the first time, the tiered ISP topological structure is leveraged to distribute the routing load across the routing domains and thus solve the routing scalability problem. This approach uses a tiered addressing scheme, and a new and very efficient form of tier-based address aggregation becomes possible. The rest of the paper is organized as follows. In section 2, we discuss some background and related work on Internet scalability issues. In section 3, we introduce the FCT model and the communications and packet forwarding using the tiered addressing scheme. Section 4 describes the 'nesting' concept in the FCT model. Section 5 provides an evaluation of the tiered addressing scheme using the AT&T topology as an example, and section 6 concludes this work.
2 Background and Related Work
In the Internet, the route discovery process is essential to establish communication links and maintain information flow between devices and networks. The discovery process uses the IP address as the location identifier. However, the process becomes difficult because the IP address is a logical address that is allocated dynamically to a node and does not have any relation to the actual location of the node. Further, the route itself is a path through a huge mesh of networks. In the event of failure of a network or of a device that connects networks, the connectivity information for thousands of networks and networked devices can be impacted, which in turn may cause very long network convergence delays. The complex IP address allocation and the highly meshed topology have resulted in huge routing table sizes, leading to routing scalability problems. The BGP routing table size at the core routers today has exceeded 304,500 entries [6]. This high load in the core routers is indicative of an imbalance in the 'routing information handling', which could adversely impact the advantages of the meshed structure by making the routers a potential bottleneck. Management of the IPv6 address space has been discussed in the Internet Assigned Numbers Authority (IANA) and the Regional Internet Registries (RIRs). It has been recommended that IPv6 address allocation be done in a hierarchical manner to avoid fragmentation of the address space and to better aggregate routing information. At the same time, however, the IPv6 policy tries to avoid unnecessary and wasteful allocation [5]. It is difficult to avoid both fragmentation and wasteful address allocation at the same time because future address requirements from organizations and end sites are unpredictable [5]. We will now look at some of the solutions that were proposed to address the routing scalability problem under the current Internet architecture. The Hybrid Link State Protocol (HLP) used the AS structures to provide a solution to excessive route
churning through route information aggregation within an AS hierarchy [7]. The New Inter-Domain Routing Architecture (NIRA) used a provider-rooted hierarchy and showed improvements in the number of forwarding entries and in convergence times [8]. A routing research group at the Internet Engineering Task Force proposed 'core-edge separation' to temporarily solve the routing table size problem by 'address indirection' or 'Map-and-Encap', which keeps the de-aggregated IP prefixes out of the global routing table [9]. The Routing Architecture for the Next Generation Internet uses locator/identifier split ideas [10]. Routing on Flat Labels uses flat routing to separate location and identity for both inter- and intra-domain routing [11]. The Enhanced Mobility and Multi-homing Supporting Identifier Locator Split Architecture is a hybrid design of ID/locator split and core-edge separation [12]. Among the recent and more revolutionary efforts are the projects funded by the NSF in the United States as it initiated the clean-slate Future Internet Design (FIND) program [3]. Meanwhile, in Europe, under the Seventh Framework Programme (FP7), the European Future Internet Initiative [14] funded projects focused on several key areas, one of them being routing scalability.
3 The Floating Cloud Tiered Internetwork Model

The Internet comprises more than 30,000 ASes and ISPs that carry the major flow of Internet communication, and the current IP traffic in a way reflects their business relationships. In general, ASes have either a customer-provider or a peer-to-peer relationship with neighboring ASes. A customer pays its provider for transit, and peers provide connectivity between their respective neighbor ASes. Based on these AS relationships, the tiered structure and the hierarchy in the AS topology become obvious when looking at the Internet. In the US, there are several tier 1 ISPs, who connect several tier 2 ISPs as their customers, and the tier 2 ISPs in turn connect tier 3 ISPs as their customers. Inside an ISP, there are several Points of Presence (POPs) which form the backbone of that service provider. Each POP has several routers, some of which are backbone routers that are primarily meant to connect to backbone routers in other POPs. An interesting observation at this point is the tiered structure that is also noticeable inside an ISP POP. Inside an ISP POP there is a set of backbone (BB) routers that can be associated with tier 1 within the POP. The BB routers connect to the distribution routers (DR), which can be associated with tier 2. The DRs provide redundancy and load-balancing between the backbone and the access routers (AR), which connect to customer or stub networks. The ARs and the stub networks can thus be associated with tier 3. Note that each set of routers identified above is considered a network cloud. Thus, in the FCT internetwork model, we define a network cloud as a set of routers that have a specific purpose. A cloud can also have several clouds within itself; for example, an ISP cloud can contain POP clouds. Thus the tiered internetwork model exhibits some very interesting 'nesting' and modularity properties.

3.1 Inter-cloud Communications

In Fig. 1, we show a simplified version of an ISP topological structure. ISPs A and B are tier 1 ISPs, while ISP C is a tier 2 ISP and ISP E is a tier 3 ISP. In the figure, we
also show stub AS D. Note that each ISP or AS is presented as a network cloud. The broad arrows are indicative of multiple connections between any two ISPs or between ISPs and the AS. We use the 'relative position' among the ISP clouds across the tiers to introduce structured packet forwarding. To achieve this, we associate a 'tiered cloud address' (CloudAddr) with each ISP or AS cloud, which serves as an identifier for the cloud. Clouds can associate with or disassociate from a tier via the acquisition or release of one or more CloudAddrs. A CloudAddr is thus a function of the cloud's tier and of the other clouds associated with it. For instance, ISP C has two addresses: address 2.1:1 based on its connection to ISP A, and address 2.2:1 based on its connection to ISP B. This property allows a cloud to have more than one CloudAddr, which can be used for multi-homing.
Fig. 1. The Concept of Floating Clouds across Tiers
As noted in Fig. 1, the FCT model allows clouds to change their service providers by simply giving up one CloudAddr and acquiring another. Let us illustrate this movement with a further example. Assume AS D is first connected to ISP B with the address 2.2:2 and then desires to change its service provider to ISP C. It can then relinquish address 2.2:2 and acquire the address 3.1:1:2 under ISP C. What changes, however, is only the CloudAddr, and not the internal addressing of the cloud, as will be explained in detail later. During the change in AS D's service provider and CloudAddr, ISP B and ISP C will both be informed of the change. However, the movement or address change is not required to be disseminated to all clouds in the network, as only those that are directly related to the moving cloud need to be informed. The TierValue, the first field in the CloudAddr, is used to forward packets across clouds. The decision to forward in a particular direction (up, down, or sideways across the tiers) depends on the relative positions of the source (SRC) and destination (DST) clouds in the tier structure and on the links between sibling clouds in a tier. To illustrate packet forwarding, we use another simple example from Fig. 1. Assume the SRC cloud is 3.1:1:1 and the DST cloud is 2.2:2. The source compares the two addresses to determine the tier of a common parent (or grandparent) cloud for the SRC and DST. In this case, it will be tier '1', as there are no common address components after the TierValue in the SRC and DST CloudAddrs. The remaining fields in the DST address (after the common part) are then appended to the TierValue to provide the forwarding address; in this case it will be 1.2:2. All intermediate clouds between 3.1:1:1 and 1.2 will forward the packet upwards, using the tier value, until it reaches cloud 1.2. Cloud 1.2 then identifies that the destination is at tier 2 because of the two
address fields following the tier value. Thus, it replaces the TierValue with 2 and forwards the packet down to the DST cloud. However, if there were a link between ISP C and AS D (as shown by the dashed arrows), the border routers in ISP C could be made aware of the sibling cloud connection, and ISP C could then forward the packet directly to cloud 2.2:2. Packets will be forwarded to the appropriate cloud based on their CloudAddr, which has global visibility. The forwarding and routing within a cloud can adopt either the tiered approach or any other mechanism, such as OSPF or RIP based on IP (which could also be useful during transition). We thus decouple the inter- and intra-cloud dynamics, such that a change in CloudAddr will not impact the internal structure or addresses within a cloud. This decoupling allows for easy movement (or floating) of network clouds across tiers.
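To make the forwarding-address computation concrete, the following Python sketch mirrors the worked example above. It is our illustration, not the authors' implementation: the string address format, the parse helper, and the rule that an empty common prefix maps to tier 1 are assumptions drawn from the single example in the text.

```python
def parse(addr):
    # "3.1:1:1" -> (3, ["1", "1", "1"]): the TierValue, then the address fields
    tier, _, rest = addr.partition(".")
    return int(tier), rest.split(":")

def forwarding_address(src, dst):
    """Compute the forwarding address from the SRC and DST CloudAddrs."""
    _, src_fields = parse(src)
    _, dst_fields = parse(dst)
    # count the common address components after the TierValue
    common = 0
    while (common < min(len(src_fields), len(dst_fields))
           and src_fields[common] == dst_fields[common]):
        common += 1
    parent_tier = common if common > 0 else 1  # tier of the common parent cloud
    # append the remaining DST fields to the common parent's TierValue
    return "{}.{}".format(parent_tier, ":".join(dst_fields[common:]))

print(forwarding_address("3.1:1:1", "2.2:2"))  # -> 1.2:2
```

Once the packet reaches cloud 1.2, that cloud replaces the TierValue with the destination tier (giving 2.2:2) and forwards the packet downwards, as described above.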
4 The Nesting Concept

We use the AT&T topology to explain nesting with the tiered addresses. Let us assume that ISP A in Fig. 1 represents the AT&T cloud in the US. As part of our initial study, we abstracted the AT&T topology from the Rocketfuel database [15] using the Cytoscape tool [16]. We implemented the cloud concept on the AT&T topology using the following assumptions. Through the IP address information obtained from the Rocketfuel database, we identified routers that connect across POPs in the AT&T topology. These routers were designated as the backbone routers. The set of backbone routers belonging to a POP was then assigned to a 'backbone network cloud' (see Fig. 2). We then considered the edge routers to be access routers belonging to an 'access network cloud'. Routers that connected backbone and edge routers were the 'distribution routers', and each set of distribution routers connecting to a backbone router was considered a 'distribution network cloud'.
Fig. 2. Tiered Addresses in an ISP POP; BB=Backbone, DB=Distribution, AC=Access
4.1 Nested Tiered Address Inside an ISP POP

We now illustrate the use of tiered addresses in the POPs inside the AT&T network in the US. AT&T is a tier 1 ISP; using the AS number for the whole AT&T network
cloud in the US, we assigned the CloudAddr 1.7018. The AT&T network has several POPs. Each POP in the AT&T network is viewed as a cloud, as stated earlier, and we show the Seattle POP in Fig. 2. Without loss of generality, let us assume that the Seattle POP is the 7th POP in the AT&T network. We view the backbone cloud in a POP to be at tier 1 and hence assigned CloudAddr 1.7 to the Seattle POP backbone cloud. Based on our assumptions, and as per our studies of the Rocketfuel data, there were 17 distribution clouds in the Seattle POP. The CloudAddrs for the Seattle distribution clouds were hence assigned as 2.7:1, 2.7:2 (see Fig. 2), and so on, with the last distribution cloud getting the CloudAddr 2.7:17. The clouds of access routers connecting to the first distribution cloud were given CloudAddrs starting with 3.7:1:1. The packet forwarding across POPs in the AT&T network can now follow a process similar to that outlined for the ISP clouds in Fig. 1. If a packet has to be forwarded to another ISP, then the globally visible CloudAddr of the AT&T network has to be used. This requires the internal address to be nested behind the globally visible AT&T ISP address, for example 1.7018{3.7:1:1}, where the second part is shown in curly braces (a notation we adopt for nesting). This is the address that would be used in a packet that has to be forwarded by a device in the access cloud 3.7:1:1 to a device in another ISP. Note that unlike tunneling, in this approach we are not encapsulating one address inside another; rather, an outer cloud address is prepended when packets have to be forwarded outside of a given cloud. Furthermore, if the AT&T CloudAddr changes from 1.7018 to 1.2, this does not affect the internal addressing of the POPs. Only when packets leave the AT&T network will they carry the address 1.2{3.7:1:1}.

4.2 Tiered Addresses in a Stub AS

We now extend the above nesting concept to a stub network connected to an access cloud (let us say in the Seattle POP) as shown in Fig. 3. The backbone cloud, distribution clouds, and access clouds have the addresses noted for the Seattle POP example discussed above. The stub AS is now considered a cloud and has a global CloudAddr of 4.7:1:1:1, as it is now in tier 4. All addresses marked in red in Fig. 3 have visibility within the AT&T network and can be considered as having global visibility within the AT&T network in the US. Inside the stub AS, a new tiered address space has been started, and the BB cloud has CloudAddr 1.1. The DB clouds have CloudAddrs 2.1:1 and 2.1:2, and the AC clouds similarly have addresses 3.1:1:1 and 3.1:2:1, depending on their connectivity to the DB clouds. To forward packets within the stub AS, the internal tiered address can be used in a similar fashion as explained earlier. However, when a node inside the cloud wishes to communicate with external networks and devices, the outer CloudAddr has to be prepended to the internal address. For example, when a device in the stub AS wishes to communicate with a device in the Chicago POP, the packets would carry an address such as 4.7:1:1:1{3.1:2:1} if originating from AC cloud 3.1:2:1, where the second part in curly braces is the internal address of the device in the stub AS. If the device has to communicate with another ISP's network, then as the packet leaves the AT&T network, it will carry the address 1.7018{4.7:1:1:1{3.1:2:1}}.
This nesting capability allows for easy network cloud movement and attachment to multiple other network clouds, without interfering with the structure or address used inside the cloud, a very powerful feature in an internetworking architecture.
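As a small illustration of the curly-brace notation, the snippet below (our own sketch, not part of the paper's implementation) builds the nested addresses used in this section.

```python
def nest(outer, inner):
    # prepend an outer CloudAddr to an inner address using the curly-brace notation
    return "{}{{{}}}".format(outer, inner)

# a device in AC cloud 3.1:2:1 of the stub AS, whose global CloudAddr is 4.7:1:1:1
inside_att = nest("4.7:1:1:1", "3.1:2:1")    # '4.7:1:1:1{3.1:2:1}'
outside_att = nest("1.7018", inside_att)     # '1.7018{4.7:1:1:1{3.1:2:1}}'
```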
Fig. 3. The Nesting Concept
5 Evaluation of the Tiered Address Scheme

To further evaluate the proposed tiered address scheme on a real network, we used the Rocketfuel data, which contains router-level topologies and maps routers to ASes [15]. From the Rocketfuel database we imported the AT&T topology using a tool called Cytoscape to visualize it [16]. Using the topological information from the Cytoscape tool, we then imported the AT&T topology into OPNET [17] for conducting simulation studies. We identified a total of 11,403 routers with 13,689 links interconnecting them. Fig. 4 shows the entire AT&T network in the US after the topology was imported into the OPNET simulation tool, including the geographical locations of the different POPs. There were a total of 110 POPs in the AT&T network, which can be seen as dots in Fig. 4. The numbers on the links represent the number of physical connections between POPs.
Fig. 4. AT&T network in OPNET imported from Rocketfuel data
5.1 Applying Tiered Addresses

After we imported the entire US AT&T topology into the OPNET tool, we assigned tiered addresses to the different POPs as explained earlier. As briefly explained, the backbone (BB) routers in a POP were identified by the links between different POPs, i.e., if a router has a link to another POP, the router was categorized as a backbone router. The shortest path between the BB routers and every other router within a POP was then determined. If a router's shortest path to a BB router was longer, in terms of hop count, than that of all its immediate neighbor routers, then the router was categorized as an Access Router (AR), which is an edge router in the POP. Finally, any nodes between BB routers and AR routers were categorized as Distribution Routers (DR) within a POP. After categorizing the 11,403 routers, we assigned tiered addresses to every router in the entire AT&T network as explained in the sections above. The next step was identifying tiers and clouds inside every POP in the AT&T network. At tier 1 within a POP, all BB routers are assumed to belong to a single cloud, which means each POP has a single tier 1 cloud. At tier 3, each AR router is recognized as a cloud, because AR routers may be connected to other ASes (stub or otherwise) and networks. In this presentation, we limit our connectivity and address allocation study to the POP level within a single ISP, i.e., the AT&T ISP. At tier 2, however, the DR routers, which provide connectivity between BB and AR routers, should provide redundancy, and hence each set of DR routers is considered a cloud. Since we did not have link weight information, the shortest path knowledge between BB and AR routers was used to identify a cloud of DR routers. Based on the shortest paths between BB and AR routers, DR routers that are on the shortest path to the same BB router were assumed to belong to one cloud. For example, if DR routers A and B are on the shortest path to BB router C, DR routers A and B will belong to one DR cloud. If a DR router is on paths to different BB routers, then the DR router chooses the shorter hop count to a BB router and is considered to belong to the distribution cloud under that BB router.
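The categorization heuristic described above can be sketched as follows. This is our reading of the procedure, not the authors' code; the graph representation (here, a hypothetical networkx graph per POP) and the function names are ours.

```python
import networkx as nx  # hypothetical choice of graph library

def classify_pop_routers(g, routers_with_inter_pop_links):
    """Split a POP's routers into backbone (BB), distribution (DR), and access (AR)."""
    bb = set(routers_with_inter_pop_links)  # routers with links to other POPs
    # hop count from every non-BB router to its nearest BB router
    hops = {r: min(nx.shortest_path_length(g, r, b) for b in bb)
            for r in g.nodes if r not in bb}
    # AR: the shortest path to a BB is longer than that of every immediate neighbor
    ar = {r for r in hops
          if all(hops.get(n, 0) < hops[r] for n in g.neighbors(r))}
    # DR: everything left between the BB and AR routers
    dr = set(g.nodes) - bb - ar
    return bb, dr, ar
```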
Fig. 5. The AT&T Seattle POP Topology with the FCT model
Fig. 5 shows the result of the actual tiered address allocation to the Seattle POP in the AT&T network. There are 393 routers and 437 links in the Seattle POP, and each dot in Fig. 5 represents a single router. In the Seattle POP, 6 BB routers were identified based on their connections to other POPs, and all BB routers thus belong to one cloud, which has CloudAddr 1.7. In our study, we used integers between 1 and 110 to uniquely identify each POP in the AT&T network, and 7 is the Seattle POP ID
assigned by us (one could use any other numbering strategy). At tier 2, there are 94 DR routers and 17 clouds as identified. Each block of dots (i.e., routers) at tier 2 in the figure represents a cloud. At tier 3, there are 293 AR routers and hence 293 clouds, because each AR router is recognized as a cloud.

Table 1. AT&T network statistics based on tiered addresses

Total routers                 11392
Total links                   13689
Total POPs                    110
Total BB                      389
Total DR                      6395
Total AR                      4608
Max CloudAddr at tier 1       110
Max CloudAddr at tier 2       429 (New York)
Max CloudAddr at tier 3       99 (Seattle)
Max POP size                  1007 routers (Chicago)
Max BB cloud size             44 routers (New York)
Max DR routers in a POP       542 routers (New York)
Max distribution cloud size   56 routers (Dallas)
Table 1 shows interesting address statistics for the entire AT&T network using the tiered addressing scheme with only 3 tiers. There are a total of 110 POPs, 11,392 routers, and 13,689 links in the AT&T network in the US. From these we identified 389 BB routers, 6,395 DR routers, and 4,608 AR routers. The Chicago POP is the largest POP based on the number of routers in the POP. The New York POP has the largest BB cloud and the highest number of DR routers. The Dallas POP has the largest distribution cloud, with 56 routers. The Seattle POP has the maximum number of AR routers connected to the same distribution cloud. These statistics can be used to identify and optimize the proper size of a single cloud, to help in cloud nesting decisions, and to decide the type of intra-domain routing protocol best suited to a cloud in future optimization studies.

5.2 Tiered Address vs. IP Address

For the implementation of the tiered addresses, it is essential to have a delimiter field that identifies the boundaries of the addresses in a given tier. For this purpose, we introduced a 2-bit type field before each MyCloudID field. As per our preliminary studies based on the type field, a MyCloudID can be 4, 8, or 12 bits long. The pie chart in Fig. 6 shows the lengths of the addresses that would be required when using tiered addresses. Due to the flexibility in address sizes, less than 1 percent of the addresses would exceed 32 bits, and 83.93% of the addresses would be less than or equal to 28 bits. Moreover, current IPv4 and IPv6 based routers require a different address on each of their routing interfaces. In contrast, the tiered addressing scheme uses only one address per router, similar to Network Service Access Point (NSAP) addresses in Intermediate System to Intermediate System (IS-IS) [13]. The bar graph in Fig. 6 compares the number of addresses required for all the routers in the AT&T network using IP (v4 or v6) addresses and the tiered addresses.
Fig. 6. Tiered Address Length distribution across AT&T network
5.3 HD Ratio Analysis

The tiered addressing scheme allows a maximum of 2^(12n) addresses at tier level n. The addresses can include ISP, AS cloud, network, or device addresses within a network, and the maximum address length can be calculated by Equation (1):

AL = 14n + 6 , (1)

where AL is the maximum address length and n is the total number of tiers in the network. In the current Internet, the efficiency of IP address assignment was analyzed with the H-ratio, given by Equation (2) [18]:

H-ratio = log10(NAO) / NAB , (2)

where NAO is the number of allocated objects and NAB is the number of available bits. However, since Equation (2) does not account for the multiplicative effect of the loss of efficiency at each level of a hierarchical plan, we decided to use the Host-Density ratio (HD-ratio), which was adopted by the IETF to analyze IPv6 address allocation efficiency [19] and is given in Equation (3), where x is any integer value bigger than 0:

HD-ratio = logx(NAO) / logx(MAX NAO) . (3)

In [20], an HD-ratio of 0.94 is identified as the utilization threshold for IPv6 address space allocations. Equation (3) can be rewritten to find the NAO, as in Equation (4):

NAO = (MAX NAO)^(HD-ratio) . (4)

According to Equation (4), IPv6 reaches the HD-ratio of 0.94 when about 1.6593 x 10^36 addresses are allocated to objects. At this point, new address space will be required for new nodes. In Table 2, for the tiered address scheme, the maximum number of entities, such as ISPs, POPs, networks, or devices that can be accommodated, and in turn the available address space at a given tier, is given in column 3. The first column gives the TierValue. The second column gives the maximum address length at any given tier, calculated using Equation (1) assuming maximum address fields of 12 bits each. The total number of supported addresses for a given TierValue, including all of the addresses within the tiered hierarchy, is given in the fourth column.
Table 2. Number of nodes in each tier level

Tier  Max address  Max address capacity  Total capacity of  Network capacity
value length       at the tier           the network        at HD: 0.94
1     20           4096                  4096               2486.671123
2     34           16777216              16781312           6184952.337
3     48           68719476736           68736258048        15379943237
4     62           2.81475E+14           2.81544E+14        3.82449E+13
5     76           1.15292E+18           1.1532E+18         9.51024E+16
6     90           4.72237E+21           4.72352E+21        2.36488E+20
7     104          1.93428E+25           1.93475E+25        5.88069E+23
8     118          7.92282E+28           7.92475E+28        1.46233E+27
9     132          3.24519E+32           3.24598E+32        3.63634E+30
10    146          1.32923E+36           1.32955E+36        9.04239E+33
11    160          5.44452E+39           5.44585E+39        2.24854E+37
12    174          2.23007E+43           2.23062E+43        5.59139E+40
13    188          9.13439E+46           9.13662E+46        1.3904E+44
Let us explain this with an example: at tier 2 we have a maximum address space of 16777216 (= 2^(12n), where n = 2). However, there are also the addresses supported at tier 1, under which we have tier 2. So the total number of addresses that can be supported in a system with 2 tier levels is 16781312, which is 4096 (at tier 1) + 16777216 (at tier 2). The values in column 4 are thus a cumulative count of the addresses from all tiers above a given tier, including that tier. The total number of addresses that can be supported by the network until it reaches the HD-ratio of 0.94 was calculated using Equation (4) and is given in the last column. As seen in Table 2, the tiered addresses reach the IPv6 address allocation threshold capacity at tier 11, with at most 160 bits of address length. However, the threshold in the tiered address scheme is not fixed as it is for IPv6; it is flexible and can be extended as needed by increasing the tier value. The only restricting factor could be the address length. As explained earlier, with the nesting concept, the maximum address length that any router has to deal with is bounded, since the first address field in a tiered address suffices to direct or forward a packet. Another concern that may arise if the address length increases is the use of tiered addresses in wireless networks, which are bandwidth constrained. However, in such a case only the nested address used within the wireless network is needed for forwarding within that network; with a tier 3 address, this would be at most 48 bits. It has to be further noted that the use of the tiered address would preclude MAC addresses, and that all forwarding, whether inter- or intra-cloud, can be supported by the tiered address.
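The columns of Table 2 can be reproduced directly from Equations (1) and (4). The short sketch below is our own check of those numbers, under the paper's assumption of 12-bit address fields and an HD-ratio threshold of 0.94.

```python
def max_addr_len(n):
    return 14 * n + 6                 # Equation (1)

def tier_capacity(n):
    return 2 ** (12 * n)              # 2^(12n) addresses at tier n

def total_capacity(n):
    # cumulative count over tiers 1..n (column 4 of Table 2)
    return sum(tier_capacity(k) for k in range(1, n + 1))

def capacity_at_hd(n, hd=0.94):
    return tier_capacity(n) ** hd     # Equation (4): NAO = (MAX NAO)^HD-ratio

print(max_addr_len(1), tier_capacity(1), total_capacity(1), capacity_at_hd(1))
# -> 20 4096 4096 2486.67...   (the first row of Table 2)
```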
6 Conclusions

This paper introduced a new tiered addressing scheme designed to work with a new internetworking communications model, called the Floating Cloud Tiered internetworking model, based on the tiers in the ISP topological structures. The main goal was to address routing scalability and to support future growth in an unrestricted manner, whether in terms of address space or of networks. We highlighted the efficient use of address space with the tiered address scheme. We presented some operational aspects of the internetworking model to explain the application of the tiered
addresses. We introduced a novel property of the tiered addresses in the form of 'nesting' and explained its recursive use through examples. We also illustrated tier-based address aggregation with examples and applied the same to the AT&T network in the US. Using this application and the HD-ratio, we then analyzed some performance characteristics of the tiered addressing scheme.
References

1. Internet Usage Statistics, http://www.internetworldstats.com/stats.htm (retrieved on July 13, 2010)
2. Anderson, T., Blumenthal, D., Casey, D., Clark, D., Estrin, D.: GENI: Global environment for network innovations, Version 4.5 (April 23, 2007)
3. NSF NeTS Future Internet Design Initiative, http://www.nets-find.net
4. Grubesic, T.H., O'Kelly, M.E., Murray, A.T.: A Geographic Perspective on Commercial Internet Survivability. Telematics and Informatics (2002)
5. ARIN, Number Resource Policy Manual (NRPM), Version 2010.1 (April 2010), http://www.arin.net/policy/
6. Oppenheimer, P.: Top-down Network Design. Cisco Press, Lebanon (1999)
7. Subramanian, L., Caesar, M., Ee, C.T., Handley, M., Mao, M., Shenker, S., Stoica, I.: HLP: A next generation inter-domain routing protocol. In: Proceedings of ACM SIGCOMM, pp. 13–24 (2005)
8. Yang, X.: NIRA: A new Internet routing architecture. In: Proceedings of the ACM SIGCOMM Workshop on Future Directions in Network Architecture, Karlsruhe, Germany, pp. 301–312 (August 2003)
9. Internet Research Task Force Routing Research Group (2008), http://trac.tools.ietf.org/group/irtf/trac/wiki/RoutingResearchGroup
10. Xu, X., Jain, R.: Routing Architecture for the Next Generation Internet (RANGI) (March 2009), http://tools.ietf.org/id/draft-xu-rangi
11. Caesar, M., Condie, T., Kannan, J., Lakshminarayanan, K., Stoica, I.: ROFL: Routing on flat labels. SIGCOMM CCR 36(4), 363–374 (2006)
12. Pan, J.L., Jain, R., Paul, S., Bowman, M., Xu, X., Chen, S.: Enhanced MILSA Architecture for Naming, Addressing, Routing and Security Issues in the Next Generation Internet. In: Proceedings of IEEE ICC, Dresden, Germany (June 2009)
13. Subharthi, P., Jianli, P., Raj, J.: Architectures for the Future Networks and the Next Generation Internet: A Survey, http://cse.wustl.edu/Research/Lists/Technical%20Reports/Attachments/891/I3SURVEY.pdf
14. The European Future Internet Initiative (EFII), http://initiative.future-internet.eu/
15. Spring, N., Mahajan, R., Wetherall, D.: Measuring ISP topologies with Rocketfuel. ACM SIGCOMM (August 2002)
16. Cytoscape 2.6.0, http://www.cytoscape.org
17. OPNET Technologies. OPNET Modeler. Commercial, http://www.mil3.com/products/modeler/home.html
18. Huitema, C.: The H Ratio for Address Assignment Efficiency. RFC 1715 (November 1994)
19. Durand, A., Huitema, C.: The Host-Density Ratio for Address Assignment Efficiency: An update on the H ratio. RFC 3194 (November 2001)
20. IPv6 address allocation and assignment policy, APNIC-089, http://www.apnic.net/policy/drafts/ipv6-address-policy
DHCP Origin Traceback

Saugat Majumdar 1, Dhananjay Kulkarni 2, and Chinya V. Ravishankar 3

1 Cisco Systems, Inc., USA
[email protected]
2 Boston University, USA
[email protected]
3 University of California - Riverside, USA
[email protected]

Abstract. Imagine that the DHCP server is under attack from malicious hosts in your network. How would you know where these DHCP packets are coming from, or which path they took in the network? This paper investigates the problem of determining the origin of a DHCP packet in a network. We propose a practical method for adding a new option field that does not violate any RFCs, which we believe should be a crucial requirement for any related solution. The new DHCP option contains the ingress port and the switch MAC address. We recommend that this new option be added at the edge so that the recorded value can be used for performing traceback. The computational overhead of our solution is low, and so is the associated network management burden. We also address issues related to securing the field in order to maintain the privacy of switch MAC addresses, the fragmentation of packets, and possible attack scenarios. Our study shows that the traceback scheme is effective and practical to use in most network environments.
1 Introduction
Though the Dynamic Host Configuration Protocol (DHCP) is a widely used network protocol, it is well known that attacks [5,6] are possible if bad packets are transmitted. Determining the origin of such bad packets is important for defending a network without impacting other network services. If the origin of a bad DHCP packet is known, then the administrator can take reactive security measures to configure the ingress port, for example dropping DHCP packets, maintaining statistics/logs, or turning on other available security features. Proactive approaches, such as adding ACLs to the firewalls or switch TCAMs to filter certain types of DHCP packets, are also applicable, but proactive techniques are not adequate because they require a priori knowledge of the signature of bad packets. Moreover, adding general filters to drop DHCP packets is not recommended because it may impact the connectivity of the network. In our work, we assume that the network is honest, but the end-hosts or external traffic cannot be trusted. Hence, there are two points of origin that are of interest: (1) a DHCP packet is received from an internal host in the network,
in which case the point of origin of the packet is the physical port of the switch to which the host is connected, and (2) a packet is received from an external host, in which case the point of origin of the packet is the port through which the packet entered the network. Determining the origin of DHCP packets can be challenging because packets may be forged by malicious hosts.

Attacks when Packet Origin is Not Known: Following are some attacks where it is important to know the origin of the packet.

Attack 1: An attacker may send a single DHCP packet that causes a switch or a server to go to an unexpected state. To stop such attacks, certain DHCP security solutions [5] and [6] are designed only to filter out BOOTP replies at certain ports. However, there can be unexpected use cases or design gaps in switch software/hardware [8], or mis-configurations, due to which the replies can bypass the security guards. It is not a good idea to filter BOOTP requests. In the event of such an attack, the administrator needs to drop such packets until a patch for the vulnerability is available. If the origin of the packet were known, the administrator could configure the network to block such packets at the origin.

Attack 2: A rogue DHCP server may send a DHCP reply or a NACK that causes a switch or a client to go to an unexpected state. To stop a rogue server from sending such packets, the administrator needs to know the point of origin of the packet in the network.

Attack 3: An adversary can send a DHCP packet that gets fragmented on its way to the destination. As it is hard to handle packet fragmentation, fragments can cause a switch or an end-host to go to an unexpected state. To stop these packets from entering the network, it is important to know their origin and then configure the port appropriately. We refer the reader to [4] for examples of such attacks on other networking protocols.

Contributions: In this paper we propose a mechanism to determine the origin of a DHCP packet by adding a new DHCP option field that stores the MAC address and the ingress port of the switch through which the packet entered the network. The edge switch that receives the packet adds this information to the DHCP packet. Adding such an option may increase the packet size sufficiently to cause packet fragmentation, thus adding vulnerability to the scheme. Our proposed solution inserts the option in a manner that makes it harder to evade the traceback mechanism through packet fragmentation. Our mechanism has several advantages. First, it is consistent with current RFCs on L3-L5 protocols, and it has low computational overhead. It is also simple and easy to use, and we do not make assumptions about fragmentation. We use the following terminology: given the common functionality between switches and routers, we use these terms interchangeably in this paper, as both have routing and switching capabilities. We instead differentiate these devices as being edge or core devices. Edge devices usually do not have routing turned on; core switches do.
2 Background
The Dynamic Host Configuration Protocol (DHCP) is an application layer protocol used to dynamically assign IP addresses to hosts in a network. DHCP is also used to provide additional information to the client, such as the location of the boot file. DHCP runs over UDP, so an attacker can simply forge a DHCP packet and send it to the client or server to change the state of their DHCP state machine. DHCP has the following types of messages: DISCOVER, INFORM, OFFER, ACK, NACK, RELEASE, REQUEST, and DECLINE. DHCP servers send only OFFER, ACK, and NACK messages, which are collectively referred to as BOOTP reply packets. DHCP clients send only DISCOVER, INFORM, REQUEST, DECLINE, and RELEASE messages, which are collectively referred to as BOOTP request packets. An attacker can modify the packet type from OFFER to ACK, which will force the client to interpret the packet in a different way. We do not address such man-in-the-middle types of attacks because our network is trusted. The DHCP part of the packet starts with a DHCP header, which contains the fields from the deprecated BOOTP protocol. The header contains fields such as the client MAC address, the server IP address, the client IP address, and the requested IP address. The header is followed by a set of DHCP options. The options are specified as TLV (type-length-value) units and provide information to the client or the server. An option field may contain, for instance, the lease time, the DHCP server IP address, or the type of the DHCP message. The DHCP standards do not limit the length of the value. Each option has a specific opcode that represents its type. The length field specifies the length of the value and is followed by the value itself. The set of options is appended to the DHCP packet after the BOOTP part of the packet. The options can be present in any order.

An Overview of the DHCP Protocol: A DHCP setup involves three entities: a server, a client, and a relay agent. If the server is not located in the same broadcast domain as the client, then the DHCP relay agent routes the packet to the server. Relay agents are not required if the server is in the same broadcast domain as the client. Next, we describe message exchanges in different scenarios. In each of these scenarios, it is easy to inject packets to launch an attack. We assume that the client and the server are in the same broadcast domain. We also assume that there is only one DHCP server in the network.

Client lacks IP address: The client broadcasts a DISCOVER packet. The server replies with an IP address in an OFFER packet. If the client accepts the IP address, it broadcasts a request for that IP address in a REQUEST
Fig. 1. DHCP packet: Physical, Ethernet, IP, UDP, and DHCP headers, followed by DHCP options 1 through n
packet. After the server receives the request, it assigns the address to the client and sends an ACK packet to the client. The OFFER and the ACK are Layer-2 unicasts.

Client renews IP address: The client requests the same IP address by unicasting a REQUEST packet to the server. After the server receives the request, it assigns the address to the client for another period and sends an ACK packet to the client. Both the REQUEST and the ACK packets are Layer-2 unicasts.

Client moves or state changes: Say that a client moves from a different subnet and requests the same IP address. If the DHCP server receives the request, it rejects it by sending a NACK packet. ISPs may provide an IP address to a new client for a few minutes before the client authenticates to the network. Once the client has authenticated successfully, the server sends a NACK and forces the client to send a DISCOVER packet. The server can then provide an IP address with a long lease period to the authenticated client.

Client releases IP address: Before shutting down, a device may release its IP address by sending a RELEASE packet to the server.

Client information request: A client may want to know the DNS server or other configuration information of the network. The client sends an INFORM packet to the server. The server sends an ACK packet with the requested information and with a null IP address.

Multiple DHCP server issues: A synchronization issue may arise when there are multiple DHCP servers. To address this issue, the servers must have non-overlapping address ranges or be statically configured with the same DHCP-address and client MAC-address mappings. Further, to avoid conflicting IP address allocations, a server can send a NACK packet in response to a broadcast REQUEST packet intended for another server. However, misconfiguration of DHCP servers cannot be ruled out in a large ISP. So, knowing the origin of the packet can help in quick troubleshooting.
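Since the traceback option proposed later rides on this TLV encoding, the following sketch shows how the options area of a DHCP packet can be walked. It is our own illustration, not the authors' code; the pad (0) and end (255) opcodes are the standard DHCP values.

```python
def parse_dhcp_options(buf):
    """Walk the TLV-encoded options area of a DHCP packet."""
    options, i = {}, 0
    while i < len(buf):
        opcode = buf[i]
        if opcode == 255:        # 'end' option terminates the list
            break
        if opcode == 0:          # 'pad' option has no length byte
            i += 1
            continue
        length = buf[i + 1]
        options[opcode] = buf[i + 2 : i + 2 + length]
        i += 2 + length
    return options

# option 53 (message type) with value 1 (DISCOVER), followed by 'end'
print(parse_dhcp_options(bytes([53, 1, 1, 255])))   # {53: b'\x01'}
```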
3 Related Work
In this section we discuss some current techniques that provide DHCP security. Given that DHCP servers and packet spoofing tools are easily available, securing DHCP is now a basic requirement in every network. Any host running a DHCP server can listen for DISCOVER messages and is then able to offer IP addresses to clients. Moreover, rogue servers may send a NACK in response to valid REQUEST packets and make the network unavailable to the client. The rogue servers may also forge a NACK packet and send it to a client, forcing a client disconnect. With packet generating tools, attackers can easily generate all types of packets to force the server or client into an unexpected state. In another attack, a malicious host can exhaust a DHCP server (which may not be configured properly) of its address range by continuously requesting IP addresses. This prevents legitimate hosts from connecting to the network, since they are unable to lease a
valid IP address from the DHCP server. Thus, securing the DHCP server is a basic requirement in any routing-enabled network.

Techniques to Stop Rogue Servers at the Edge: Rogue servers can be prevented from sending replies by using the DHCP-snooping security feature [5,6] that is available in most routers and switches. The feature is turned on in a network by assigning trust to all the physical ports that are directly connected to a valid server and to all the uplink ports. If a physical port is trusted, then every BOOTP REPLY packet is forwarded; otherwise the packet is dropped. BOOTP REQUEST packets are always forwarded. This feature, however, is prone to mis-configuration, since it requires an administrator to configure the switches and routers in the network. If there are multiple virtual LANs and trunking is enabled, then configuring DHCP-snooping can become even more cumbersome.

Techniques to Stop Forging of Client Packets at the Edge: We can stop malicious hosts in the network from forging BOOTP request packets by configuring the switches to learn source MAC addresses on the ingress ports. Once the MAC address table for a port is full, no new MAC addresses can be added, so the switch drops any packet with a new MAC address. This approach thwarts a malicious host from sending a lot of forged BOOTP packets, but as the MAC table ages out its entries, the attacker may eventually be able to forge a packet for every host in the network.

Techniques to Prevent Forging of BOOTP Replies: Currently, switches can be configured with the set of authorized DHCP servers. If a BOOTP reply has a source IP address that does not belong to the set of authorized servers, then the packet is dropped. However, this method does not address the case where the source or server IP address itself is forged.

Techniques to Prevent Server Address Exhaustion Attacks: There are a few ways to configure the DHCP server to prevent a malicious host from continuously requesting new IP addresses. ISPs can store static mappings between the MAC addresses of customer devices and IP addresses in the dhcpd.conf file, so that any time a device requests an IP address, it is given the same address. Further, the lease time can be set to a small period, say 30 minutes. This solution is very restrictive because it requires the administrator to know the set of valid MAC addresses in the network. DHCPv6 allows servers to maintain the state of leases, which includes the MAC address of the client and the assigned IP address, so a server can prevent multiple IP address allocations to the same device.

DHCP Packet Authentication: Authenticating a packet enables the server or the client to filter out forged packets. RFC 3118 describes the details of authenticating a DHCP packet. It requires the clients to be configured with a secret key. The client or the server adds the authentication information as a DHCP option in the packet. This solution is hard to implement, however, because it is difficult to install keys in the clients, and it also requires management of the keys.
4 Current Techniques for DHCP Traceback
While the above techniques are useful in preventing attacks, it is still desirable to determine the origin of the packets. There are unexpected use cases for all the security measures in Section 3. In such situations, it is helpful to know the origin of the packet, to allow the network administrator to configure the network properly to drop such packets at the edge.

Leveraging DHCP Relay Agent Information Option 82: Options are added to DHCP packets to provide more information to clients or servers. DHCP option 82 contains information about the relay agent. Usually, the option 82 value contains the MAC address and the physical port on which the packet was received by the relay agent. Option 82 values are added by the switch only to the DHCP DISCOVER and REQUEST packets. Some DHCP servers copy the option 82 value from the BOOTP requests to the BOOTP reply packets, so the option 82 value in a BOOTP reply packet cannot help in traceback. The relay agents forward the packet to the DHCP server when the server is not present in the same broadcast domain as the client. If the client is connected directly to the switch that is acting as a relay agent, then the option 82 values allow the administrator to determine the origin of the packet. In general, any edge switch can add option 82 values to the packet. If the option is not used by the DHCP server, then it is ignored. Option 82 does not solve our problem, however. First, if the client is not directly connected to the relay agent, then the option 82 values cannot identify the client. Second, since option 82 values can be used by the DHCP server to allocate IP addresses, administrators can configure the switches to drop packets containing untrusted option 82 values. Sometimes, option 82 values can be replaced by an upstream switch or router. Third, it is not desirable to overload the option 82 field for other purposes. Fourth, if a DHCP packet is received from an outside network, then the option 82 values present in the packet cannot be useful in determining the entry point.

IP Packet Traceback: IP traceback is a general approach and could solve the DHCP traceback problem. The traceback methods in [9,10,11,1,3] seek to determine the path taken by a packet and the origin of the packet in the network. In this paper, we only seek to determine the origin of a DHCP packet in the network. There are two categories of solutions to IP traceback: the first category marks packets at the IP header level; the second category stores information about the packets at the routers. We next discuss these and explain why they are not suitable.

Packet Marking and Traceback for DOS Flooding: The packet marking techniques covered in [9,11,3] and [1] are used for determining the source of a DOS flooding attack. These techniques overload the IP identification field and the fragment offset with routing information. These methods are effective in locating the source of packets in a DOS flooding attack. However, they cannot trace back a single packet. Overloading the IP identification field with routing
information is also an IPv4 RFC violation. Also, overwriting the packet's offset field with non-zero values may cause a router to treat the packet as a fragmented packet. Packet marking also introduces delay in forwarding a packet. Finally, it is not practical to assume that packets are never fragmented; such an assumption may introduce vulnerabilities in the system. Storing routing information in IP options is undesirable because it causes the packet to be forwarded in software. Appending the routing information at the end of the packet is undesirable because it can cause the packet to be fragmented, and it may also trigger some drop policy in the firewall.

Single Packet Traceback: The approach presented in [10] stores approximate information about packets in a Bloom filter [2]. However, this solution is impractical because it requires a lot of storage in the switch, which is prohibitive in devices with limited storage. Also, executing sophisticated hash functions like SHA1 or MD5 is expensive in hardware, and using simple hash functions adds more false positives or false negatives. If the hash functions are computed in software, then we might swamp the switch CPU with too much load, causing delay. If the CPU is busy computing hash functions for traceback, then the performance of other functionalities like routing will be severely impacted.
5 Our Approach to Add a New DHCP Traceback Option
We propose to add a new DHCP traceback option that contains the MAC address of the switch or router that received the packet and the ingress port number. We define the values of the fields in the option in Fig. 3. Fig. 2 shows how the option is used in a network. For the sake of brevity, we omit the code used to add the traceback option; the source code used in our experiments is available at [7].

Adding traceback values to the option: The rules for adding the option to the packet are as follows.

Insertion Rule: Only edge ports may add the traceback option to a DHCP packet.

Replacement Rule: If a DHCP packet containing a traceback option is received at an edge port, then the switch replaces the option with a new value.

The Replacement Rule prevents any attacker from successfully sending a packet with wrong values in the traceback option (see the sketch below).

Performance Issue: Currently, DHCP options are inserted into the L5 part of the packet by software, so the packet cannot be forwarded in hardware. Our approach will hence add some forwarding delay. However, we expect the traceback option to be used in conjunction with features like DHCP-snooping. Currently, L5 packet inspections (like DHCP-snooping) are usually done in software, so inserting the option will increase the forwarding delay only somewhat.

Infrastructure Requirements: The traceback option is only expected to be added at the entry point to the network. So, the switch only adds the option if
Fig. 2. DHCP Traceback Option
tracebackOption.type = a new opcode (yet to be assigned)
tracebackOption.length = 6 bytes for the MAC address + length of the port number
tracebackOption.value = MACaddress | port number

Fig. 3. Traceback Option Fields
Fig. 4. Adding Traceback Option at an edge port
the ingress port is an edge port. Thus, we must configure the switches so that they can determine whether a port is an edge port. Features like DHCP-snooping or dynamic ARP protection already exist and require the administrator to configure the switch on a per-port basis. We thus assume that our infrastructure requirements also impose acceptable overhead.

Privacy Issue: BOOTP request packets are generally broadcast. If we add the traceback option, any host or server can read the option to determine the location of the client. BOOTP replies can also be sniffed, so an adversary can tell the location of the server, which becomes a privacy concern. We address this issue in Section 5.3. Nevertheless, our solution is simple and easy for administrators to manage.
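The Insertion and Replacement Rules amount to the following per-packet logic at an edge port. This is our own sketch, not the switch firmware from [7]; the opcode value 141 is the one assigned later in Section 6, and the byte layout of the value is simplified.

```python
TRACEBACK_OPCODE = 141

def apply_traceback_rules(options, ingress_is_edge_port, switch_mac, port):
    """options is a list of (opcode, value) pairs from the DHCP packet."""
    if not ingress_is_edge_port:
        return options                      # core ports leave the packet alone
    value = switch_mac + bytes([port])      # MACaddress | port number
    # Replacement Rule: drop any traceback option already present
    options = [(t, v) for (t, v) in options if t != TRACEBACK_OPCODE]
    # Insertion Rule: append our own traceback option
    options.append((TRACEBACK_OPCODE, value))
    return options
```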
5.1 Addressing Fragmentation Attacks
Packet fragmentation and packet reassembly cases are not completely addressed in network devices. Hence, an attacker can exploit fragmented packets to make
Fig. 5. Replacing an untrusted Traceback Option
a device go to an unexpected state. We consider packet fragmentation only in the following scenario. An adversary sends a large DHCP packet to the network. The routers or switches then add the traceback option or option 82 to the packet. The adversary can craft the packet such that, after adding the options, the IP packet gets fragmented into multiple packets. These fragmented packets might make a device go to an unexpected state. Suppose we used the straightforward algorithm to add the traceback option, so that the adversary knows the traceback option is always present at the end of the packet. The adversary could then always craft a packet such that one of the fragments other than the last crashes the switch or the end host; the attack could not be traced back to its origin, since the traceback option would be present in the last fragment. We first observe that the DHCP packet is composed of two sections: the BOOTP section and the DHCP options. The size of the BOOTP section is fixed, while the set of DHCP options can vary. So, there are two ways of constructing a large DHCP packet, namely, (1) by adding a lot of DHCP options or adding options with long values, and (2) by adding a pad to the L3 or L4 packet. We design a solution to address the first method of constructing a large packet. For the second method, any DHCP packet with extra padding in the L3 or L4 layer can be considered an anomalous packet and dropped, so we do not present a traceback solution for it.

Location-Randomized Traceback Insertion: Let there be n DHCP options in the packet. We randomly pick a location i from 0 to n, and insert the option into the ith position in the list. Details of the algorithm are available in [7]. We argue that adding the option at a random location makes it hard for the adversary to determine which fragment will contain the traceback option, so the adversary has less chance of sending bad packets that cannot be traced. Let us assume that the MTU is 1500 bytes. Of these, the L3 header takes 20 bytes and the L4 header takes 8 bytes. The remaining 1472 bytes are used by DHCP. The BOOTP header takes 240 bytes, so the adversary may use the remaining 1232 bytes to craft the packet by adding DHCP options. Let us assume that the adversary sends a 1500-byte L3 packet. Now, adding the traceback option to the packet will cause fragmentation. There is no standard way of fragmenting packets, so we proceed by assuming that the fragmentation will try to keep the L4 header within one fragment. We now have two fragments: (1) bytes 1-1500 as the first fragment, and (2) the 1501st byte to the last byte as the second fragment. Inserting the traceback option at a random location reduces the chances of success for the adversary.
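The location-randomized insertion itself is a one-liner; the sketch below (ours, not the code from [7]) picks one of the n+1 possible positions uniformly at random.

```python
import random

def insert_at_random_location(options, traceback_option):
    """Insert the traceback option at a position i chosen uniformly from 0..n."""
    i = random.randint(0, len(options))
    return options[:i] + [traceback_option] + options[i:]
```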
5.2 Addressing Full Path Traceback
In this section we extend the idea of the DHCP traceback option to determine whether we can perform full path traceback. We consider two cases: traceback for a single packet and traceback for a DOS flooding attack.

For Single Packets: All the routers or switches that forward the packet could append their MAC address and ingress port to the value of the traceback option, so that at the end the destination (DHCP server) can determine the route taken by the packet. However, adding the route information in every switch would cause the packet to be forwarded in software, which would add delay to packet forwarding and reduce the overall throughput of the network. So, we do not extend the traceback option solution to solve the problem of full path traceback of single packets.

For Distributed DOS Flooding Attacks: Traceback techniques for DOS flooding attacks do not mark every packet. Based on a probability distribution, packets are marked with information about the router. When a stream of packets is received, the markers are used to construct the path taken by the stream. Current switches have the capability (e.g., sFlow, NetFlow) to decide in hardware whether a packet needs to be forwarded to the CPU; only the selected packets are forwarded to the switch CPU, and a packet not chosen to be marked is forwarded in hardware. So, it is reasonable to allow non-edge switches to append their MAC address and ingress port to the value of the traceback option. Thus, the traceback option can be extended to support full-path traceback of a stream.
5.3 Private Traceback Option
The solution in the earlier section has the following drawbacks. First, the DHCP traceback option discloses the port number to which the client or the server is connected and the MAC address of the switch. Second, based on the presence of the traceback option, one can tell whether the source MAC address is that of a non-edge device or an edge device. Malicious hosts may misuse this information when packets are broadcast over the network. We solve the first problem by hiding the value of the traceback option. We do not address the second problem, since the MAC addresses of switches can be determined easily by using traceroute or by listening to network configuration protocols. We design the solution such that it is easy for system administrators to configure and use.
5.4 Approach to Secure the Traceback Options
We require the administrator to configure every edge switch with the same 128-bit secret key k. The switch encrypts MACaddress|port with the key k, so the value of the traceback option becomes Encrypt_k(MACaddress|port). Since the encryption is performed in software, we can use any encryption algorithm. We choose AES because it is efficient and has a small key and ciphertext size. AES-CBC assures ciphertext security. To determine the MAC address and the port number, the
Fig. 6. DHCP packet with two different options: (a) DHCP packet with traceback option, (b) DHCP packet with secure traceback option
administrator decrypts the traceback value. The key k can be stored in a MIB variable, so the administrator can use network management tools that use the secure SNMPv3 protocol to change the secret key on all the devices. Such a process can easily be automated. A switch may get compromised, and in the above scheme compromising a single switch will compromise the whole network. We can address this issue in the following way. A different key is assigned to each switch. The network management station holds a secret master key km. Each switch is assigned a key ki = KGF(MACaddr, km), where KGF is a key generating function. The decryption process then involves decrypting with every key and selecting the output that yields a valid MAC address. Since the number of valid keys is small, the administrator will obtain the decrypted text in a reasonable amount of time.
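A minimal sketch of the per-option encryption, using the Python cryptography library, is shown below. The paper does not specify padding or IV handling, so the PKCS7 padding and the random IV prepended to the ciphertext here are our assumptions.

```python
import os
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes
from cryptography.hazmat.primitives.padding import PKCS7

def encrypt_traceback_value(key, switch_mac, port):
    """Compute Encrypt_k(MACaddress|port) with AES-CBC (128-bit key)."""
    plaintext = switch_mac + bytes([port])
    padder = PKCS7(128).padder()                  # pad to the AES block size
    padded = padder.update(plaintext) + padder.finalize()
    iv = os.urandom(16)                           # random IV, sent with the value
    encryptor = Cipher(algorithms.AES(key), modes.CBC(iv)).encryptor()
    return iv + encryptor.update(padded) + encryptor.finalize()
```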
6 Experiment
As shown in Fig. 7, our topology consists of an end host connected to one of the ports of a ProCurve switch. We also have a DHCP server connected to one of the ports of the switch. Both the end host and the DHCP server are in the same VLAN. The switch firmware contains the code changes to add the new DHCP option; we also implemented the code to encrypt the new DHCP option. The pcap files were captured and are displayed in Fig. 6.
Fig. 7. Topology used in experiments: an end host and a DHCP server connected to an HP ProCurve switch implementing the new DHCP option
Fig. 8. Format of the Traceback Value: (a) bits reserved for layout; (b) bit set to indicate secure options
We now design the format of the traceback option. We identify the following issues. First, there can be multiple versions of the traceback option. Second, we can have various encryption ciphers. Third, we need to distinguish between secure and unsecure traceback options. We therefore reserve the first byte of the value to store metadata: 2 bits for the version, 3 bits for the cipher algorithm, 1 bit for secure or unsecure, and the remaining 2 bits are unused. Fig. 8 shows the layout of the value. We have assigned 141 as the opcode for the traceback option. First, we add a traceback option to a DHCP DISCOVER packet. Here, the MAC address of the switch is AAABBB-CCCDDD and the ingress physical port is 4. The version number is 1, the cipher type is 0, and the option is not secured, so the first byte is 10 in hexadecimal. The value of the traceback option is thus "10AAABBBCCCDDD04". Fig. 6(a) shows the pcap file in a Wireshark window. Second, we use a 128-bit AES encryption key to encrypt the MAC address and the ingress port. We encode the ciphertext into printable characters using PEM standards and then add the traceback option to the packet. Fig. 6(b) shows the pcap file in a Wireshark window. Here, the version number is 1, the cipher type is 1 for AES, and the security bit is 1, so the first byte of the value is 13 in hexadecimal. The remainder of the value is the PEM-encoded ciphertext.
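The metadata byte can be packed as follows. The exact bit positions are not spelled out in the text; the layout below, [2 unused][2-bit version][3-bit cipher][1-bit secure], is our inference, checked against the two worked examples (0x10 and 0x13).

```python
def meta_byte(version, cipher, secure):
    """Pack the 1-byte metadata field of the traceback value.

    Assumed layout (inferred from the 0x10 and 0x13 examples in the text):
    [2 unused bits][2-bit version][3-bit cipher type][1-bit secure flag]
    """
    return ((version & 0x3) << 4) | ((cipher & 0x7) << 1) | (secure & 0x1)

assert meta_byte(1, 0, 0) == 0x10   # plain traceback option
assert meta_byte(1, 1, 1) == 0x13   # AES-encrypted (secure) traceback option
```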
7 Conclusion
We provide a proactive solution that traces bad DHCP packets back to their origin. We have shown how our new traceback option can be added to determine the origin of a DHCP packet. Our approach provides single-hop as well as full-path traceback of DHCP packets that are forwarded by an edge switch or router. Our solution is easy for system administrators to use, light-weight, and easy to implement. Additionally, we address the packet fragmentation and privacy concerns related to our approach. We secure the traceback option with a simple encryption scheme that can easily be managed using existing network management techniques.
References
1. Belenky, A., Ansari, N.: On deterministic packet marking. Comput. Netw. 51(10), 2677–2700 (2007)
2. Bloom, B.H.: Space/time trade-offs in hash coding with allowable errors. Communications of the ACM 13(7), 422–426 (1970), citeseer.ist.psu.edu/bloom70spacetime.html
3. Burch, H.: Tracing anonymous packets to their approximate source. In: LISA 2000: Proceedings of the 14th USENIX Conference on System Administration, pp. 319–328. USENIX Association, Berkeley (2000)
4. CERT Advisory CA-2002-03, http://www.cert.org/advisories/ca-2002-03.html
5. CISCO, http://www.cisco.com/en/us/docs/switches/lan/catalyst4500/12.1/13ew/configuration/guide/dhcp.html
6. HP-Procurve, http://h40060.www4.hp.com/procurve/uk/en/pdfs/application-notes/an-s12 procurvedhcpsnoopingfinal.pdf
7. Majumdar, S., Kulkarni, D., Ravishankar, C.V.: http://people.bu.edu/kulkarni/scode.pdf
8. Markopoulou, A., Iannaccone, G., Bhattacharyya, S., Chuah, C., Diot, C.: Characterization of failures in an IP backbone (2004), http://citeseer.ist.psu.edu/markopoulou04characterization.html
9. Savage, S., Wetherall, D., Karlin, A., Anderson, T.: Practical network support for IP traceback. SIGCOMM Comput. Commun. Rev. 30(4), 295–306 (2000)
10. Snoeren, A.C., Partridge, C., Sanchez, L.A., Jones, C.E., Tchakountio, F., Schwartz, B., Kent, S.T., Strayer, W.T.: Single-packet IP traceback. IEEE/ACM Trans. Netw. 10(6), 721–734 (2002)
11. Song, D.X., Perrig, A.: Advanced and authenticated marking schemes for IP traceback. In: Proceedings of the IEEE INFOCOM Conference (2000)
A Realistic Framework for Delay-Tolerant Network Routing in Open Terrains with Continuous Churn
Veeramani Mahendran, Sivaraman K. Anirudh, and C. Siva Ram Murthy
Department of Computer Science and Engineering, Indian Institute of Technology Madras, Chennai-600036, India
Abstract. The conventional analysis of Delay-Tolerant Network (DTN) routing assumes that the terrain over which nodes move is closed, implying that when nodes hit a boundary they either wrap around or are reflected. In this work, we study the effect of relaxing this closed-terrain assumption on routing performance: a continuous stream of nodes enters the terrain, and a node is absorbed upon hitting the boundary. We introduce a realistic framework that models the open terrain as a queue and compares its performance with that of the closed terrain for a variety of routing protocols. With three different mobility scenarios and four different routing protocols, our simulations show that the routing delays in an open terrain are statistically equivalent to those in closed terrains for all routing protocols. In terms of cost, however, some protocols differ widely between the two cases, while others continue to demonstrate statistical equivalence. We believe that this could be a new way to classify routing protocols based on the difference in their behavior under churn.
Keywords: Delay-tolerant network, routing, mobility model, open terrain.
1 Introduction
Delay-Tolerant Networks (DTNs) are a class of networks characterized by intermittent connectivity, long variable delays, and heterogeneous operating environments, over which an overlay or bundle layer works [5,6,7]. DTNs find potential applications in many areas such as satellite networks, vehicular networks, and disaster response systems. Routing protocols in DTNs work on the basis of the store, carry, and forward paradigm [5], where a node carries a message until it encounters the destination node or another node that has a high probability of meeting the destination. Based on this paradigm, various DTN routing protocols have been proposed. Naive approaches such as epidemic routing [16] amount to flooding the network with copies of the message, while more sophisticated approaches such as utility-based routing [12,15] forward the message to an encountered node that is found
to be a good message-carrier (based on heuristic estimates) for the destination node.

DTN routing protocols themselves are not constrained by the closed-terrain assumption; however, theoretical analyses and simulation-based results [8,14] rest on the fundamental assumption that the terrain over which the nodes move is closed, i.e., they make one of two assumptions: nodes either wrap around or are reflected at the terrain boundaries. For instance, one of the best-performing protocols in this class, the spray-n-wait routing protocol [13], computes the total number of copies to spray from a system of recursive equations that treat the number of node infections at any time as a monotonically increasing function bounded by the total number of nodes.

In this work, we study the performance of DTN routing protocols when the closed-terrain assumption is relaxed. In particular, we assume that once a node reaches the boundary of the simulation area, it is "absorbed" and no longer participates in routing, since it is effectively out of range. This simulates a realistic scenario where nodes that move past a particular boundary of a region such as a stadium or an open-air theater are very likely to move out through one of the exits; physical obstacles such as the boundary wall prevent nodes from participating in routing once they cross the boundary. We also explicitly model churn in the network by having nodes enter the terrain as a Poisson process and exit by hitting one of the boundaries.

In summary, the contributions of this paper are threefold:
1. We introduce a novel and realistic open-terrain framework that explicitly models the influx and outflux of nodes in the terrain. The framework also provides a way to compare open and closed terrains in terms of their routing performance.
2. We simulate a variety of routing protocols under different mobility scenarios using this framework and conclude that open terrains are statistically equivalent to closed ones in terms of routing delay.
3. We observe that some protocols exhibit the same statistical equivalence in terms of send cost while others do not, which could provide a new way to classify routing protocols.

The rest of this paper is structured as follows: Section 2 presents the related work in this area. Section 3 describes the framework for studying open terrains. Section 4 explains the simulation setup for the experiments using the new framework. In Section 5, we interpret the simulation results and explain their trends. We conclude our work in Section 6 and describe future directions in Section 7.
2 Related Work
Open terrains have been considered in the analysis of cellular networks where users dynamically enter and leave hexagonal cells according to some mobility model. The time spent by the users in the range of a tower (serving any one
cell), or the time spent by the users in the overlapping region of two towers, is considered an important derived attribute of that mobility model. The distribution of these times is useful in designing appropriate hand-off schemes for cellular networks [9]. However, the scenario considered there is starkly different, since the towers form an infrastructure linked by cables into a stable wired backbone for communication; in contrast, we model completely opportunistic and cooperative routing with no infrastructure support. The work closest to ours in terms of applicability to a DTN is the work on modeling pedestrian content distribution on a network of roads [17]. Every street segment is modeled as an M/G/∞ queue, where every node picks a speed from the entry point to the exit point on the street, uniformly distributed in a range [v_min, v_max]; the road network is treated as a network of such queues. However, the scenario we consider is routing rather than content dissemination, and the mobility on a street is one-dimensional.
3 A Framework for Studying Open Terrains with Continuous Churn
In this section, we describe the framework that models an open terrain as a queue, enabling a fair comparison with the closed terrain when studying routing performance.

3.1 Terrain Model
In our framework, a node enters a square terrain at some point on the boundary, the point being chosen uniformly over the perimeter of the terrain. The node continues to move, according to some mobility model, until it hits one of the boundaries, where it is absorbed. The essential differences between a closed terrain and an open terrain under this framework are:
1. When a node hits the boundary, it either reflects or wraps around in a closed terrain, but is merely absorbed and dies in an open terrain.
2. The continuous churn implies that there will, at any time, be an influx of nodes into the terrain and an outflux of nodes getting absorbed. The influx process is described in the next section.

We believe this model of an open terrain to be realistic, since it represents many day-to-day scenarios where nodes that move toward the boundary of an enclosed space such as a theater are likely to move out through an exit. Beyond that boundary, physical obstacles render any peer-to-peer communication between nodes impossible.

3.2 Open Terrains as Queues
The nodes are assumed to arrive as a Poisson process. This Markovian arrival process models continuous churn too. Thus the terrain taken as one system
behaves like an infinite-server queue with a Markovian arrival process and a service time given by a general distribution; in Kendall notation, it is an M/G/∞ queue. Continuing with this queuing model, the sojourn time t of a node is the time it stays inside the terrain boundaries, with mean E[t].

3.3 Equalizing the Open and Closed Terrains
This framework would not be complete without a way to compare the open and closed terrains for any given application. We "equalize" the two terrains as follows:
1. Let N be the total number of nodes in the closed terrain.
2. The expected sojourn (service) time E[t] of the open terrain is computed as a function of the dimensions of the terrain and the mobility model of the nodes inside it:

   E[t] = f(terrain dimension, mobility model)    (1)

   This can also be computed empirically via simulations (as we have done in this work).
3. Little's law is applied to determine the arrival rate λ such that the average number of nodes in the open terrain, E[n], equals the total number of nodes N of the closed terrain:

   λ = E[n] / E[t]    (2)

   A code sketch of this procedure follows this list.
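The equalization is easy to express in code; the sketch below is an illustration only, where sample_sojourn stands for any routine that runs the mobility model once and returns a sojourn time:

```python
import random

def mean_sojourn_time(sample_sojourn, n_samples=10000):
    # Empirical estimate of E[t] (Eq. 1): release many nodes and time how
    # long each stays inside the terrain before being absorbed.
    return sum(sample_sojourn() for _ in range(n_samples)) / n_samples

def arrival_rate(n_closed, mean_sojourn):
    # Little's law (Eq. 2): choose lambda so that E[n] = lambda * E[t]
    # matches the node count N of the closed terrain.
    return n_closed / mean_sojourn

def poisson_arrivals(lam, horizon):
    # Continuous churn: Poisson influx, i.e., exponentially distributed
    # inter-arrival times, truncated at the simulation horizon.
    t = random.expovariate(lam)
    while t <= horizon:
        yield t
        t += random.expovariate(lam)
```

For example, with N = 100 nodes in the closed terrain and an empirical mean sojourn time of 500 s, the influx rate is λ = 100/500 = 0.2 nodes per second.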
Note that the arrival rate in the closed terrain case is immaterial, since the simulation in the closed terrain starts only once all nodes are inside. The equality of the open and closed terrains under the above procedure is shown in Fig. 1 for three different mobility models; the straight line at 100 represents the number of nodes in the closed terrain.

Fig. 1. Variation of the number of nodes versus time (seconds): (a) MIT trace, (b) VANET trace, (c) RWMM trace
4 Simulation Setup
We used the well-known discrete event network simulator ns-2 [1] to simulate various routing protocols under the open and closed terrain scenarios, using the framework described in Section 3. The simulation parameters are shown in Table 1. We have not considered any traffic model as such, since the goal was to study the propagation of a single message through the open terrain and the closed terrain.

Table 1. Simulation parameters

Parameter                  Value
Number of nodes            50 – 250
Buffer size                ∞
Velocity of nodes V        5 m/s
Terrain size               1000 m × 1000 m
MAC protocol               IEEE 802.11
Transmission range         20 m
Carrier sense range        40 m
Number of simulation runs  100
Total simulation time      1000 s
4.1 Routing Protocols
The four routing protocols used for the simulations are described below.

1. Epidemic routing: In this scheme [16], every node forwards the message to every other node it encounters; it is therefore optimal in terms of delay, but at the expense of very high cost (cost in this context is the message-passing overhead). To preserve a common denominator across the routing protocols under test, no explicit recovery mechanisms are considered to stop the replication; rather, we consider the cost and delay incurred only until the message reaches the destination node.
2. Two-hop routing: The source node forwards the message to either a relay node or the destination node, and the relay nodes in turn forward the message only to the destination node.
3. Spray-n-wait routing: Here, as described in [13], a fixed number of copies of the message, say L, is handed to the source node, which hands over half of its copies to any node it encounters until it runs out of copies (binary spray). Every node that receives copies does the same: if it holds x copies and encounters a node that has none, it hands over x/2 copies and keeps the rest. This continues until a node has only one copy left, at which point it switches to the second phase of the routing, waiting: the node holding a single copy waits until it comes into direct contact with the destination node. (A sketch of the binary spray appears after this list.)
4. Direct transmission: The source node simply waits until it comes into contact with the destination node and transmits only then. It has the minimum cost, but consequently the smallest delivery ratio and the largest delivery delay.
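The following self-contained sketch (an illustration only, not the authors' ns-2 code) traces the binary spray of item 3 for an initial budget of L = 16 copies, assuming every holder of multiple copies meets one empty node per round:

```python
def binary_spray(x):
    # A node holding x > 1 copies hands floor(x/2) to an encountered node
    # that has no copies and keeps the rest; with one copy left it waits.
    give = x // 2
    return x - give, give

holdings = [16]  # L = 16 copies, all at the source
while any(h > 1 for h in holdings):
    next_round = []
    for h in holdings:
        next_round.extend(binary_spray(h) if h > 1 else (h,))
    holdings = next_round
print(holdings)  # [1, 1, ..., 1]: sixteen single-copy carriers waiting
```

After log2(L) = 4 rounds of encounters, sixteen nodes each hold one copy and wait for direct contact with the destination.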
4.2 Mobility of Nodes
We use two forms of mobility models for our simulations, as described below.

1. Random Walk Mobility Model (RWMM): In this model, the nodes are assumed to move at a constant speed V throughout the simulation, with no pause time; the choice of speed only affects the time scaling of the simulation. On every epoch, the node draws a flight length from an exponentially distributed random variable of a given mean, a scaled-down version of the terrain side (assuming a square terrain). The node then picks a direction uniformly from [0, 2π] and executes a flight in that direction. Once the flight terminates, the node repeats the process until, in the open case, one of the flights takes it to the terrain boundary, where it is simply absorbed. We use a variant of the mobility model described in [4] and modify the code provided at [3] for this purpose. (A sketch of the open-terrain walk appears after this list.)
2. Time-Variant Community Model (TVCM): In this model [11], the terrain is divided into many sub-terrains, each called a community. At every point in time, a node is in one of the communities, and nodes move from one community to another (at a fixed global velocity V) using transition probabilities, akin to a Markov chain. The whole structure of communities and their associated transition probabilities remains fixed for one time period of some duration; a node executes a sequence of different time periods and then returns to its original starting period. Every node can have an independent TVCM model, or, as in the vanilla models, the nodes can be i.i.d. The essential features this model captures are skewed location preferences (people tend to stick to their home or office) and the recurrent behavior of human mobility (one community structure for weekdays, one for weeknights, one for weekend mornings, one for weekend nights, and so on, with the same pattern repeating every week). An open terrain is modeled by assuming that a node whose epoch falls outside the boundary of the terrain is absorbed; if the epoch merely falls outside the boundary of the current community, the node is reflected, as in the closed terrain case. This model has the advantage of being more realistic and has been shown to capture real-world traces through an appropriate choice of parameters. We use two types of real-world models to generate traces: one is representative of VANETs and is implemented in [10]; the second generates traces based on parameters derived by matching the TVCM model to the trace observed in [2]. We use the TVCM model to simulate these two traces and will call them VANET and MIT, respectively, for the purpose of discussion.
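The open-terrain variant of the RWMM in item 1 can be sketched as follows (an illustration only: the side length, speed, and mean flight length are placeholder values, and for simplicity the entry direction is not constrained to point inward):

```python
import math
import random

def rwmm_sojourn(side=1000.0, speed=5.0, mean_flight=100.0):
    # Enter at a uniform point on the perimeter, then repeatedly draw an
    # exponential flight length and a uniform direction; the walk ends
    # (absorption) when a flight crosses the terrain boundary.
    edge, u = random.randrange(4), random.uniform(0.0, side)
    x, y = [(u, 0.0), (side, u), (u, side), (0.0, u)][edge]
    t = 0.0
    while True:
        length = random.expovariate(1.0 / mean_flight)
        theta = random.uniform(0.0, 2.0 * math.pi)
        dx, dy = length * math.cos(theta), length * math.sin(theta)
        s = 1.0  # fraction of the flight completed inside the square
        if dx > 0.0:
            s = min(s, (side - x) / dx)
        if dx < 0.0:
            s = min(s, -x / dx)
        if dy > 0.0:
            s = min(s, (side - y) / dy)
        if dy < 0.0:
            s = min(s, -y / dy)
        t += s * length / speed
        if s < 1.0:
            return t  # absorbed at the boundary
        x, y = x + dx, y + dy
```

Averaging this function over a large ensemble of nodes yields the empirical estimate of E[t] used to equalize the terrains in Section 3.3.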
As mentioned above, we use three mobility models: the vanilla RWMM and two models based on the TVCM, called MIT and VANET.

4.3 Handling Transients
Particular care must be taken to ensure that the simulation in the open terrain case is carried out only over the stationary phase (stationary in this context means the period during which the average number of nodes in the terrain remains essentially constant and equal to that of the closed terrain). The simulation plots in Fig. 1 show how the number of nodes in the terrain builds up to a particular point and then oscillates around a mean for the most part; this oscillatory phase corresponds to the stationary phase, and the average value over that period is the average number of nodes in the simulation area in the open case. To handle transients, we use the mean sojourn time of the particular terrain, found by running the mobility model over a large set of nodes (10,000 in our case) and empirically computing the mean; this time is large enough for the queue to stabilize to its average value. The empirical estimation is required, as mentioned earlier, to equalize the two terrains in the framework. Once the transient time has passed, the source begins transmitting its message.

4.4 Handling the Source and the Destination
Since we intend to compare the delivery delay between the open and the closed terrain, we must ensure that, given enough simulation time, the message is delivered from the source to the destination with high probability. To guarantee this, the source and destination are treated as "closed"-terrain nodes even in the open terrain, so that there is little chance of either of them wandering off. Our work focuses on the effect of open terrains on intermediate relay nodes, and we seek to avoid the premature termination of messages due to the absence of the source or destination node in the terrain.

4.5 Performance Metrics
1. Delivery ratio: the average fraction of message transmissions that actually reach the destination.
2. Delivery delay: the average time taken to send a message from the source node to the destination node.
3. Send cost: the average number of message copies sent by all the nodes in delivering a message from the source node to the destination node.

A sketch of how these metrics could be computed from per-run records follows.
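The (delivered, delay, copies_sent) record structure below is an illustrative assumption, not the format used by our simulation scripts:

```python
def summarize(runs):
    # runs: iterable of (delivered, delay, copies_sent) records, one per
    # simulation run; 'delay' is meaningful only when delivered is True.
    runs = list(runs)
    delivered = [r for r in runs if r[0]]
    delivery_ratio = len(delivered) / len(runs)
    delivery_delay = (sum(r[1] for r in delivered) / len(delivered)
                      if delivered else float("inf"))
    send_cost = sum(r[2] for r in runs) / len(runs)
    return delivery_ratio, delivery_delay, send_cost
```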
5 Simulation Results
The four routing protocols (epidemic routing, two-hop routing, direct transmission, and spray-n-wait routing) were simulated over 100 runs for each of the following traces: (i) the VANET profile for the TVCM, (ii) the MIT profile for the TVCM, and (iii) the vanilla RWMM. All performance metrics are plotted with 95% confidence intervals.

In order to verify our framework, we pick a mobility trace for each of the models at random and plot the evolution of the number of nodes inside the terrain over time. The relatively stable part corresponds to the stationary phase of the queue. As seen in Fig. 1, the average number of nodes in the open case is more or less the same as the number of nodes in the closed terrain case.

Fig. 2. Send cost for spray-n-wait and direct transmission: (a) MIT trace, (b) VANET trace, (c) RWMM trace

Fig. 3. Routing delay for spray-n-wait and direct transmission: (a) MIT trace, (b) VANET trace, (c) RWMM trace

Fig. 4. Send cost for epidemic routing and two-hop routing: (a) MIT trace, (b) VANET trace, (c) RWMM trace

Fig. 5. Routing delay for epidemic routing and two-hop routing: (a) MIT trace, (b) VANET trace, (c) RWMM trace

We depict performance in terms of two metrics: delivery delay and send cost. The delivery ratio was also computed across all the protocols, but since it was close to one in all cases except direct transmission, we do not include those plots due to space constraints.

The direct transmission protocol depends only on the source and destination nodes for the delivery of packets; hence, as expected, Fig. 2 and Fig. 3 show no difference in any of the metrics between the two terrains, since the source and the destination are still "closed" nodes. Figure 2 shows very little difference in send cost between the two terrains when the protocol used is either spray-n-wait or direct transmission; the only place where a discernible, though not vast, difference shows up is the VANET trace. The same statistical equivalence in send cost holds for two-hop transmission as well, as seen in Fig. 4.

Epidemic routing in Fig. 4, however, shows a vast difference between the two terrain scenarios that becomes more pronounced as the number of nodes increases. For instance, in the MIT trace the epidemic routing cost differs between the open and closed terrains by a factor of 2 with 50 nodes, growing to a factor of 3 with 250 nodes. A similar trend holds for the other two mobility traces, where the epidemic routing cost in an open terrain is 3 to 4 times that of the closed terrain. This is primarily because epidemic routing does not replicate to a node that already has a copy of the message, which saves far more transmissions in a closed terrain than in an open one. This property should most probably hold for any probabilistic version of epidemic routing as well; we expect the difference between the open and closed cases to be significant when the probability of forwarding is high. This raises the interesting possibility of classifying routing protocols based on whether or not they show a difference in routing cost.
As seen in Fig. 3 and Fig. 5, there is little or no statistical difference in the routing delay between the open and the closed terrains, which allows us to conclude that they are statistically equivalent.
6 Conclusions
This work proposed a novel and realistic framework that explicitly represents open terrains in a DTN and exhaustively compared the open and closed terrains in terms of delivery ratio, routing delay, and send cost across a range of mobility models and routing protocols. We conclude that, for all protocols, the routing delays lie in roughly the same range; hence, as far as routing delay is concerned, the open and closed terrains are statistically equivalent. The significant increase (3 to 4 times) in the send cost of epidemic routing in an open terrain points to the possibility of designing an intelligent routing protocol that accounts for the likelihood of a node moving out, and it also provides a way of classifying routing protocols based on the statistical equivalence of their performance across open and closed terrains.
7 Future Work
The expected service time E[t] for a given simulation configuration is currently determined empirically, from the service times observed over a large ensemble of entering and subsequently exiting nodes; future work will derive the sojourn time distribution analytically. This work also assumes that every node that hits the boundary is absorbed; future work will consider scenarios in which a fraction of the nodes that move out of the terrain are injected back, allowing us to study the performance with a mixture of nodes.
References
1. The Network Simulator, http://www.isi.edu/nsnam/ns
2. Balazinska, M., Castro, P.: Characterizing Mobility and Network Usage in a Corporate Wireless Local-Area Network. In: MobiSys 2003: Proceedings of the 1st International Conference on Mobile Systems, Applications, and Services, pp. 303–316 (2003)
3. Camp, T.: Toilers (2002), http://toilers.mines.edu/Public/CodeList
4. Camp, T., Boleng, J., Davies, V.: A Survey of Mobility Models for Ad hoc Network Research. Wireless Communications and Mobile Computing: Special Issue on Mobile Ad hoc Networking: Research, Trends, and Applications 2(5), 483–502 (2002)
5. Cerf, V., Burleigh, S., Hooke, A., Torgerson, L., Durst, R., Scott, K., Fall, K., Weiss, H.: RFC 4838, Delay-Tolerant Networking Architecture. IRTF DTN Research Group (2007)
6. Fall, K.: A Delay-Tolerant Network Architecture for Challenged Internets. In: SIGCOMM 2003: Proceedings of the Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, pp. 27–34 (2003)
7. Fall, K.R., Farrell, S.: DTN: An Architectural Retrospective. IEEE Journal on Selected Areas in Communications 26(5), 828–836 (2008)
8. Garetto, M., Leonardi, E.: Analysis of Random Mobility Models with PDE's. In: MobiHoc 2006: Proceedings of the 7th ACM International Symposium on Mobile Ad hoc Networking and Computing, pp. 73–84 (2006)
9. Hong, D., Rappaport, S.: Traffic Model and Performance Analysis for Cellular Mobile Radio Telephone Systems with Prioritized and Nonprioritized Handoff Procedures. IEEE Transactions on Vehicular Technology 35(3), 77–92 (1986)
10. Hsu, W.: Time Variant Community Mobility Model (2007), http://nile.cise.ufl.edu/~weijenhs/TVC_model
11. Hsu, W., Spyropoulos, T., Psounis, K., Helmy, A.: Modeling Time-Variant User Mobility in Wireless Mobile Networks. In: INFOCOM 2007: Proceedings of the 26th IEEE International Conference on Computer Communications, pp. 758–766 (2007)
12. Ip, Y.K., Lau, W.C., Yue, O.C.: Performance Modeling of Epidemic Routing with Heterogeneous Node Types. In: ICC 2008: Proceedings of the IEEE International Conference on Communications, pp. 219–224 (2008)
13. Spyropoulos, T., Psounis, K., Raghavendra, C.S.: Spray and Wait: An Efficient Routing Scheme for Intermittently Connected Mobile Networks. In: WDTN 2005: Proceedings of the ACM SIGCOMM Workshop on Delay-Tolerant Networking, pp. 252–259 (2005)
14. Spyropoulos, T., Psounis, K., Raghavendra, C.S.: Performance Analysis of Mobility-Assisted Routing. In: MobiHoc 2006: Proceedings of the 7th ACM International Symposium on Mobile Ad hoc Networking and Computing, pp. 49–60 (2006)
15. Spyropoulos, T., Turletti, T., Obraczka, K.: Routing in Delay-Tolerant Networks Comprising Heterogeneous Node Populations. IEEE Transactions on Mobile Computing 8(8), 1132–1147 (2009)
16. Vahdat, A., Becker, D.: Epidemic Routing for Partially Connected Ad hoc Networks. Tech. Rep. CS-2000-06, Duke University (2000)
17. Vukadinović, V., Helgason, O.R., Karlsson, G.: A Mobility Model for Pedestrian Content Distribution. In: SIMUTools 2009: Proceedings of the 2nd International Conference on Simulation Tools and Techniques, pp. 1–8 (2009)
Author Index
Acharya, H.B. 251
Agarwal, Shivali 143
Ahmed, Nadeem 340
Alistarh, Dan 41
Al-Mousa, Yamin 315
Anirudh, Sivaraman K. 407
Attiya, Hagit 1, 83
Bal, Henri 155
Baldellon, Olivier 215
Bandyopadhyay, Subir 293
Bari, Ataul 293
Bhatt, Vibhor 119
Bhattacharya, Sukanta 263
Biswas, Subir 352
Bradler, Dirk 77
Braginsky, Anastasia 107
Chakaravarthy, Venkatesan T. 53
Chandrasekaran, R. 364
Chen, Jianer 281
Choudhury, Anamitra R. 53
Crowcroft, Jon 29
Datta, Anwitaman 227
De, Tanmay 263
Fokkink, Wan 155
Friedman, Roy 65
Gafni, Eli 191
Garg, Vijay K. 53
Gilbert, Seth 41
Gouda, M.G. 251
Guerraoui, Rachid 41
Gupta, Arobinda 376
Hadjichristofi, G.C. 328
Hadjicostis, C.N. 328
Hillel, Eshcar 83
Hong, Theodore 29
Jaekel, Arunita 293
Jayanti, Prasad 119
Jha, Sanjay 340
Joshi, Saurabh 143
Kalyanasundaram, Bala 269
Kangasharju, Jussi 77
Kielmann, Thilo 155
King, Valerie 203
Kogan, Alex 65
Krepska, Elzbieta 155
Krumov, Lachezar 77
Kulkarni, Dhananjay 394
Kumar, Naga Praveen 167
Kuznetsov, Petr 191
Liyana Arachchige, Chanaka J. 364
Lonargan, Steven 203
Lubowich, Yuval 131
Madhavapeddy, Anil 29
Mahendran, Veeramani 407
Majumdar, Saugat 394
Martin, Nicholas 315
Mehrotra, Ankit 239
Misra, Prasant 340
Mittal, Neeraj 364
Mortier, Richard 29
Mostéfaoui, Achour 215
Mühlhäuser, Max 77
Murthy, C. Siva Ram 407
Narang, Ankur 167
Natu, Maitreya 239
Nozaki, Yoshihiro 382
Ostry, Diethelm 340
Pal, Ajit 263
Patel, Chandrakant D. 12
Patil, Sangameshwar 239
Peri, Sathya 95
Petrank, Erez 107
Pham, Congduc 303
Plummer Jr., Anthony 352
Radeva, Tsvetomira 281
Ravishankar, Chinya V. 394
Raychaudhuri, D. 328
Raynal, Michel 215
Sabharwal, Yogish 53
Sadaphal, Vaishali 239
Saia, Jared 203
Sastry, Srikanth 281
Schwarzkopf, Malte 29
Sharma, Rajesh 227
Sheerazuddin, Shamimuddin 179
Shenoy, Nirmala 315, 382
Shyamasundar, Rudrapatna K. 143, 167
Singh, Rayman Preet 376
Srivastava, Abhinav 167
Taghizadeh, Mahmoud 352
Taubenfeld, Gadi 131
Travers, Corentin 41
Trehan, Amitabh 203
Tuncer, Hasan 382
Velauthapillai, Mahendran 269
Venkatesan, S. 364
Vidyasankar, Krishnamurthy 95
Welch, Jennifer L. 281