Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board David Hutchison Lancaster University, UK Takeo Kanade Carnegie Mellon University, Pittsburgh, PA, USA Josef Kittler University of Surrey, Guildford, UK Jon M. Kleinberg Cornell University, Ithaca, NY, USA Alfred Kobsa University of California, Irvine, CA, USA Friedemann Mattern ETH Zurich, Switzerland John C. Mitchell Stanford University, CA, USA Moni Naor Weizmann Institute of Science, Rehovot, Israel Oscar Nierstrasz University of Bern, Switzerland C. Pandu Rangan Indian Institute of Technology, Madras, India Bernhard Steffen University of Dortmund, Germany Madhu Sudan Microsoft Research, Cambridge, MA, USA Demetri Terzopoulos University of California, Los Angeles, CA, USA Doug Tygar University of California, Berkeley, CA, USA Gerhard Weikum Max-Planck Institute of Computer Science, Saarbruecken, Germany
5860
Sunggu Lee Priya Narasimhan (Eds.)
Software Technologies for Embedded and Ubiquitous Systems 7th IFIP WG 10.2 International Workshop, SEUS 2009 Newport Beach, CA, USA, November 16-18, 2009 Proceedings
Volume Editors Sunggu Lee Pohang University of Science and Technology (POSTECH) Department of Electronic and Electrical Engineering San 31 Hyoja Dong, Nam Gu, Pohang, Gyeongbuk 790-784, South Korea E-mail:
[email protected] Priya Narasimhan Carnegie Mellon University Electrical and Computer Engineering Department 5000 Forbes Avenue, Pittsburgh, PA 15213-3890, USA E-mail:
[email protected]
Library of Congress Control Number: 2009937935
CR Subject Classification (1998): C.2, C.3, D.2, D.4, H.4, H.3, H.5
LNCS Sublibrary: SL 3 – Information Systems and Applications, incl. Internet/Web and HCI
ISSN 0302-9743
ISBN-10 3-642-10264-6 Springer Berlin Heidelberg New York
ISBN-13 978-3-642-10264-6 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. springer.com © IFIP International Federation for Information Processing 2009 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper SPIN: 12793907 06/3180 543210
Preface
The 7th IFIP Workshop on Software Technologies for Future Embedded and Ubiquitous Systems (SEUS) followed on the success of six previous editions in Capri, Italy (2008), Santorini, Greece (2007), Gyeongju, Korea (2006), Seattle, USA (2005), Vienna, Austria (2004), and Hakodate, Japan (2003), establishing SEUS as one of the emerging workshops in the field of embedded and ubiquitous systems. SEUS 2009 continued the tradition of fostering cross-community scientific excellence and establishing strong links between research and industry.

The fields of both embedded computing and ubiquitous systems have seen considerable growth over the past few years. Given the advances in these fields, and also those in the areas of distributed computing, sensor networks, middleware, etc., the area of ubiquitous embedded computing is now being envisioned as the way of the future. The systems and technologies that will arise in support of ubiquitous embedded computing will undoubtedly need to address a variety of issues, including dependability, real-time behavior, human–computer interaction, autonomy, resource constraints, etc. All of these requirements pose a challenge to the research community. The purpose of SEUS 2009 was to bring together researchers and practitioners with an interest in advancing the state of the art and the state of practice in this emerging field, with the hope of fostering new ideas, collaborations and technologies.

SEUS 2009 would not have been possible without the effort of many people. First of all, we would like to thank the authors, who contributed the papers that made up the essence of this workshop. We are particularly thankful to the Steering Committee Co-chairs, Peter Puschner, Yunmook Nah, Uwe Brinkschulte, Franz Rammig, Sang Son and Kane H. Kim, without whose help this workshop would not have been possible. We would also like to thank the General Co-chairs, Eltefaat Shokri and Vana Kalogeraki, who organized the entire workshop, and the Program Committee members, who each contributed their valuable time to review and discuss the submitted papers. We would also like to thank the Publicity Chair, Soila Kavulya, and the Local Arrangements Chair, Steve Meyers, for their help with organizational issues. Thanks are also due to Springer for producing this publication and providing the online conferencing system used to receive, review and process all of the papers submitted to this workshop. Last, but not least, we would like to thank the IFIP Working Group 10.2 on Embedded Systems for sponsoring this workshop.

November 2009
Sunggu Lee Priya Narasimhan
Organization
General Co-chairs

Eltefaat Shokri, The Aerospace Corporation, USA
Vana Kalogeraki, University of California at Riverside, USA
Program Co-chairs

Sunggu Lee, Pohang University of Science and Technology (POSTECH), Korea
Priya Narasimhan, Carnegie Mellon University, USA
Steering Committee

Peter Puschner, Technische Universität Wien, Austria
Yunmook Nah, Dankook University, Korea
Uwe Brinkschulte, Goethe University, Frankfurt am Main, Germany
Franz Rammig, University of Paderborn, Germany
Sang Son, University of Virginia, USA
Kane H. Kim, University of California at Irvine, USA
Program Committee

Allan Wong, Hong Kong Polytechnic University, China
Doo-Hyun Kim, Konkuk University, Korea
Franz J. Rammig, University of Paderborn, Germany
Jan Gustafsson, Mälardalen University, Sweden
Kaori Fujinami, Tokyo University of Agriculture and Technology, Japan
Kee Wook Rim, Sunmoon University, Korea
Lynn Choi, Korea University, Korea
Minyi Guo, University of Aizu, Japan
Paul Couderc, INRIA, France
Robert G. Pettit IV, The Aerospace Corporation, USA
Roman Obermaisser, Vienna University of Technology, Austria
Tei-Wei Kuo, National Taiwan University, Taiwan
Theo Ungerer, University of Augsburg, Germany
Wenbing Zhao, Cleveland State University, USA
Wilfried Elmenreich, University of Klagenfurt, Austria
Yukikazu Nakamoto, University of Hyogo and Nagoya University, Japan
Publicity and Local Arrangements Chairs

Soila Kavulya, Carnegie Mellon University, USA
Steve Meyers, The Aerospace Corporation, USA
Table of Contents
Design and Implementation of an Operational Flight Program for an Unmanned Helicopter FCC Based on the TMO Scheme . . . . . . . . . . . . . . . Se-Gi Kim, Seung-Hwa Song, Chun-Hyon Chang, Doo-Hyun Kim, Shin Heu, and JungGuk Kim
1
Energy-Efficient Process Allocation Algorithms in Peer-to-Peer Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ailixier Aikebaier, Tomoya Enokido, and Makoto Takizawa
12
Power Modeling of Solid State Disk for Dynamic Power Management Policy Design in Embedded Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jinha Park, Sungjoo Yoo, Sunggu Lee, and Chanik Park
24
Optimizing Mobile Application Performance with Model–Driven Engineering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Chris Thompson, Jules White, Brian Dougherty, and Douglas C. Schmidt
36
A Single-Path Chip-Multiprocessor System . . . . . . . . . . . . . . . . . . . . . . . . . . Martin Schoeberl, Peter Puschner, and Raimund Kirner
47
Towards Trustworthy Self-optimization for Distributed Systems . . . . . . . . Benjamin Satzger, Florian Mutschelknaus, Faruk Bagci, Florian Kluge, and Theo Ungerer
58
An Experimental Framework for the Analysis and Validation of Software Clocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Andrea Bondavalli, Francesco Brancati, Andrea Ceccarelli, and Lorenzo Falai
69
Towards a Statistical Model of a Microprocessor’s Throughput by Analyzing Pipeline Stalls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Uwe Brinkschulte, Daniel Lohn, and Mathias Pacher
82
Joining a Distributed Shared Memory Computation in a Dynamic Distributed System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Roberto Baldoni, Silvia Bonomi, and Michel Raynal
91
BSART (Broadcasting with Selected Acknowledgements and Repeat Transmissions) for Reliable and Low-Cost Broadcasting in the Mobile Ad-Hoc Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ingu Han, Kee-Wook Rim, and Jung-Hyun Lee
103
DPDP: An Algorithm for Reliable and Smaller Congestion in the Mobile Ad-Hoc Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ingu Han, Kee-Wook Rim, and Jung-Hyun Lee
114
Development of Field Monitoring Server System and Its Application in Agriculture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Chang-Sun Shin, Meong-Hun Lee, Yong-Woong Lee, Jong-Sik Cho, Su-Chong Joo, and Hyun Yoe
121
On-Line Model Checking as Operating System Service . . . . . . . . . . . . . . . . Franz J. Rammig, Yuhong Zhao, and Sufyan Samara
131
Designing Highly Available Repositories for Heterogeneous Sensor Data in Open Home Automation Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Roberto Baldoni, Adriano Cerocchi, Giorgia Lodi, Luca Montanari, and Leonardo Querzoni
144
Fine-Grained Tailoring of Component Behaviour for Embedded Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Nelson Matthys, Danny Hughes, Sam Michiels, Christophe Huygens, and Wouter Joosen
156
MapReduce System over Heterogeneous Mobile Devices . . . . . . . . . . . . . . . Peter R. Elespuru, Sagun Shakya, and Shivakant Mishra
168
Towards Time-Predictable Data Caches for Chip-Multiprocessors . . . . . . Martin Schoeberl, Wolfgang Puffitsch, and Benedikt Huber
180
From Intrusion Detection to Intrusion Detection and Diagnosis: An Ontology-Based Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Luigi Coppolino, Salvatore D’Antonio, Ivano Alessandro Elia, and Luigi Romano
192
Model-Based Testing of GUI-Driven Applications . . . . . . . . . . . . . . . . . . . . Vivien Chinnapongse, Insup Lee, Oleg Sokolsky, Shaohui Wang, and Paul L. Jones
203
Parallelizing Software-Implemented Error Detection . . . . . . . . . . . . . . . . . . Ute Schiffel, André Schmitt, Martin Süßkraut, Stefan Weigert, and Christof Fetzer
215
Model-Based Analysis of Contract-Based Real-Time Scheduling . . . . . . . . Georgiana Macariu and Vladimir Creţu
227
Exploring the Design Space for Network Protocol Stacks on Special-Purpose Embedded Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hyun-Wook Jin and Junbeom Yoo
240
HiperSense: An Integrated System for Dense Wireless Sensing and Massively Scalable Data Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Pai H. Chou, Chong-Jing Chen, Stephen F. Jenks, and Sung-Jin Kim
252
Applying Architectural Hybridization in Networked Embedded Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Antonio Casimiro, Jose Rufino, Luis Marques, Mario Calha, and Paulo Verissimo
264
Concurrency and Communication: Lessons from the SHIM Project . . . . . Stephen A. Edwards
276
Location-Aware Web Service by Utilizing Web Contents Including Location Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . YongUk Kim, Chulbum Ahn, Joonwoo Lee, and Yunmook Nah
288
The GENESYS Architecture: A Conceptual Model for Component-Based Distributed Real-Time Systems . . . . . . . . . . . . . . . . . . . Roman Obermaisser and Bernhard Huber
296
Approximate Worst-Case Execution Time Analysis for Early Stage Embedded Systems Development . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jan Gustafsson, Peter Altenbernd, Andreas Ermedahl, and Björn Lisper
308
Using Context Awareness to Improve Quality of Information Retrieval in Pervasive Computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Joseph P. Loyall and Richard E. Schantz
320
An Algorithm to Ensure Spatial Consistency in Collaborative Photo Collections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Pieter-Jan Vandormael and Paul Couderc
332
Real-Sense Media Representation Technology Using Multiple Devices Synchronization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jae-Kwan Yun, Jong-Hyun Jang, Kwang-Ro Park, and Dong-Won Han
343
Overview of Multicore Requirements towards Real-Time Communication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ina Podolski and Achim Rettberg
354
Lifting the Level of Abstraction Dealt with in Programming of Networked Embedded Computing Systems (Keynote Speech) . . . . . . . . . . K.H. Kim
365
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
377
Design and Implementation of an Operational Flight Program for an Unmanned Helicopter FCC Based on the TMO Scheme

Se-Gi Kim¹, Seung-Hwa Song², Chun-Hyon Chang², Doo-Hyun Kim², Shin Heu³, and JungGuk Kim¹

¹ Hankuk University of Foreign Studies
{undeadrage,jgkim}@hufs.ac.kr
² Konkuk University
[email protected], {chchang,doohyun}@konkuk.ac.kr
³ Hanyang University
[email protected]

Abstract. HELISCOPE is the name of a project supported by the MKE (Ministry of Knowledge & Economy) of Korea to develop flying-camera services that transmit the scene of a fire from an unmanned helicopter. In this paper, we introduce the design and implementation of the OFP (Operational Flight Program) for the unmanned helicopter's navigation, based on the well-known TMO scheme. Navigation of the unmanned helicopter is directed by flight-mode commands from our GCS (Ground Control System). As the RTOS on the FCC (Flight Control Computer), RT-eCos3.0, which has been developed on the basis of eCos3.0 to support the basic task model of the TMO scheme, is used. To verify this navigation system, a HILS (Hardware-in-the-Loop Simulation) system using the FlightGear simulator has also been developed. The structure and functions of RT-eCos3.0 and the HILS are also introduced briefly.

Keywords: unmanned helicopter, on-flight software, TMO.
1 Introduction

The HELISCOPE [1] project aims to develop an unmanned helicopter and its on-flight embedded computing system for navigation and real-time transmission of motion video of the scene of a fire using the HSDPA or WiBro communication scheme. The unmanned helicopter is intended to be used especially in the disaster response and recovery phases. In this paper, we introduce the design and implementation of the OFP (Operational Flight Program) subpart of the HELISCOPE project based on the well-known TMO scheme [3]. Navigation of the unmanned helicopter is directed by flight-mode commands from our GCS (Ground Control System). As the RTOS on the FCC (Flight Control Computer), RT-eCos3.0 [2], [5], which has been developed on the basis of eCos3.0 to support the basic task model of the TMO scheme, is used. The reason we used the TMO model and RT-eCos3.0 is that an OFP must provide a well-structured real-time control mechanism for the various sensors, actuators
and communication devices connected to the FCC. The OFP must process real-time sensor inputs and commands coming from the GPS/INS (GPS/Inertial Navigation System), the AHRS (Attitude and Heading Reference System) and the GCS in a deadline-based manner. A main FC (Flight Control) task of the OFP calculates real-time control signals from these inputs and must send them to the actuators of the helicopter within the pre-given deadline. Also, the FCC must report the status of the aircraft to the GCS periodically or upon requests from the GCS.

Our OFP consists of one time-triggered task and several message-triggered tasks. Collection of sensor data and commands is done by the message-triggered tasks. Upon receiving data, these tasks store them into the ODS (Object Data Store) of the OFP-TMO. The ODS also contains some parameters on the capability of the aircraft. Periodic calculation and sending of control outputs based on the data in the ODS is performed by the main time-triggered task. All these tasks are scheduled by RT-eCos3.0 based on their timing constraints. To verify the navigation system, a HILS (Hardware-in-the-Loop Simulation) system using the open-source FlightGear simulator has also been developed.

In section 2, the TMO model and the RT-eCos3.0 kernel are introduced briefly as related work. In section 3, the design and implementation of the OFP are described. In section 4, the HILS system for verification is discussed, and in section 5, we conclude.
2 The TMO Model and the RT-eCos3.0 Kernel

In this section, a distributed real-time object model, TMO, and the RTOS that has been used in implementing the OFP are introduced briefly as related work.

2.1 TMO Model [2], [3]

The TMO model is a real-time distributed object model for timeliness-guaranteed computing at design time. A TMO instance consists of three types of object member: an ODS (Object Data Store), time-triggered methods (SpM: Spontaneous Method) and message-triggered methods (SvM: Service Method). An SpM is actually a member thread that is activated by a pre-given timing constraint and must finish its periodic executions within the given deadline. An SvM is also a member thread, activated by an event message from a source outside the TMO. The main differences between the TMO model and conventional objects can be summarized as follows:

- A TMO is a distributed computing component, and thus TMOs distributed over multiple nodes may interact via distributed IPC.
- The two types of method are active member threads in the TMO model (SpMs and SvMs).
- SvMs cannot disturb the executions of SpMs. This rule is called the BCC (Basic Concurrency Constraint). Basically, activation of an SvM is allowed only when potentially conflicting SpM executions are not in place.

2.2 RT-eCos3.0 Scheduling and IPC [2], [5]

RT-eCos3.0, which is a real-time extension of eCos3.0, supports multiple per-thread real-time scheduling policies. The policies are the EDF (Earliest Deadline
First)/BCC, LLF (Least Laxity First)/BCC and the non-preemptive FTFS (First Triggered First Scheduled) schedulers. The non-preemptive FTFS scheduler is used when an off-line scheduling scenario is given by our task serializer [7], which determines the initial offsets of time-triggered tasks so that all task instances can be executed without overlap and preemption. On the other hand, the EDF/BCC and LLF/BCC schedulers are normally used when there is no pre-analysis tool or when task serialization is not possible. This means that the EDF and LLF schedulers are intended for systems with dynamic execution behaviors. With these real-time schedulers, two basic types of real-time task, time-triggered and message-triggered, are supported by the kernel. SpMs and SvMs of the TMO model are mapped to these tasks by the TMOSL (TMO Support Library) for TMO programmers.

The timing precision used for representing timing constraints such as start/stop time, period and deadline has been enhanced to the microsecond level. Although scheduling is performed every millisecond, the kernel computes the current time in microsecond units by checking and compensating the raw PIC clock ticks since the last clock interrupt. With these schedulers, the kernel shows a task-switch overhead of 1.51 microseconds and a scheduling overhead of 2.73 microseconds in a 206 MHz ARM9 processor environment.

Management of message-triggered tasks is always done in conjunction with the logical multicast distributed IPC of RT-eCos3.0. Once a message arrives via the network-transparent IPC at a channel associated with a message-triggered task, the task is activated and scheduled to finish its service within the pre-given deadline. The IPC subsystem consists of two layers: the lower layer is the intra-node channel IPC layer and the upper layer is the network-transparent distributed IPC layer. This layering allows flexible configurations of the RT-eCos3.0 IPC based on various protocols. Besides supporting this basic channel IPC, the TMOSL has been enhanced to support the Gate and the RMMC (Real-time Multicast Memory replication Channel), a highly abstracted distributed IPC model of the TMOSM of U.C. Irvine [4]. Since the role of an SvM is to handle external asynchronous events, the channel IPC has been extended so that an external device such as an I/O device or a sensor-alarm device can be associated with a channel. In this case, a message-triggered task can be activated when an asynchronous input occurs and is scheduled to finish within the predefined deadline.
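To make the task model concrete, the following minimal sketch shows how a time-triggered (SpM-like) task and a channel-bound message-triggered (SvM-like) task might be registered. The API names (tmo_register_spm, tmo_bind_svm_channel), the TimingSpec layout and the stub bodies are illustrative assumptions only, not the actual TMOSL or RT-eCos3.0 interface.

```cpp
#include <cstdint>
#include <cstdio>

// Timing constraints in microseconds, matching the kernel's microsecond
// precision described above (representation assumed for illustration).
struct TimingSpec {
    uint64_t start_us;     // absolute start time of the first activation
    uint64_t period_us;    // activation period
    uint64_t deadline_us;  // relative deadline of each activation
};

using TaskBody = void (*)(void*);

// Stubs standing in for hypothetical TMOSL-style kernel calls; the real
// TMOSL/RT-eCos3.0 API differs. Here they only report the registration.
int tmo_register_spm(TaskBody, void*, const TimingSpec& ts) {
    std::printf("SpM registered: period=%llu us, deadline=%llu us\n",
                (unsigned long long)ts.period_us,
                (unsigned long long)ts.deadline_us);
    return 0;
}
int tmo_bind_svm_channel(TaskBody, void*, int channel, uint64_t deadline_us) {
    std::printf("SvM bound to channel %d, deadline=%llu us\n",
                channel, (unsigned long long)deadline_us);
    return 0;
}

struct Ods { /* object data store shared by the SpM and SvMs */ };

void fc_spm(void*)      { /* periodic flight-control computation */ }
void gps_ins_svm(void*) { /* activated per GPS/INS message */ }

int main() {
    static Ods ods;
    // 20 Hz SpM: period 50,000 us, deadline 40,000 us (the FC task's
    // constraints described in section 3.3 below).
    tmo_register_spm(fc_spm, &ods, TimingSpec{0, 50000, 40000});
    // SvM bound to an IPC channel: activated on message arrival and
    // scheduled to finish within its deadline (BCC enforced by the kernel).
    tmo_bind_svm_channel(gps_ins_svm, &ods, /*channel=*/1, 40000);
    return 0;
}
```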
3 Design and Implementation of the OFP for the Unmanned Helicopter FCC Based on the TMO Scheme

In this section, the control points of a helicopter and the OFP are described.

3.1 Helicopter Mechanics

Unlike a fixed-wing aircraft, a helicopter makes a stable flight by using the thrust and upward force generated by the fixed-speed rotation of the engine and the angles of the main and tail rotor blades. For changes of heading, vertical flight and forward movement, it tilts the rotor disk and uses the tail rotor.
Fig. 1. Main components of a helicopter [1]
A helicopter can maintain a powerful hovering state that is not possible for fixed-wing aircraft, but it has difficulty maintaining a stable attitude because of its complicated lifting mechanism. Fig. 1 shows the main components of a helicopter, and Table 1 shows the major control points and their motion effects.

Table 1. Helicopter controls and effects [1]

Cyclic lateral
- Directly controls: varies main rotor blade pitch left/right
- Primary effect: tilts the main rotor disk left and right through the swashplate
- Secondary effect: induces roll in the direction moved
- Used in forward flight: to turn the aircraft
- Used in hover flight: to move sideways

Cyclic longitudinal
- Directly controls: varies main rotor blade pitch fore/aft
- Primary effect: tilts the main rotor disk forward and back via the swashplate
- Secondary effect: induces pitch nose down or up
- Used in forward flight: to control attitude
- Used in hover flight: to move forwards/backwards

Collective
- Directly controls: collective angle of attack for the main rotor blades via the swashplate
- Primary effect: increases/decreases the pitch angle of the rotor blades, causing the aircraft to rise/descend
- Secondary effect: increases/decreases torque and engine RPM
- Used in forward flight: to adjust power through the rotor blade pitch setting
- Used in hover flight: to adjust skid height/vertical speed

Tail rotor (rudder)
- Directly controls: collective pitch supplied to the tail rotor blades
- Primary effect: yaw rate
- Secondary effect: increases/decreases torque and engine RPM (less than collective)
- Used in forward flight: to adjust the sideslip angle
- Used in hover flight: to control yaw rate/heading
3.2 Flight Modes of the Unmanned Helicopter

The unmanned helicopter receives commands from the GCS, and the OFP on the FCC calculates the values of the control signals to be sent to the control points using the current sensor values from the GPS/INS, AHRS and SWM (Helicopter Servo Actuator Switching Module). These control signals are sent to the control points via the SWM. The OFP also sends information on location and attitude to the GCS for periodic monitoring. In case control is lost, the SWM is set to a manual remote-control mode. Fig. 2 describes the control structure. Actually, Fig. 2 describes the whole structure of the HELISCOPE project, including the MCC (Multimedia Communication Computer) board and its communication mechanism; the description of that part, however, is excluded from this paper. The auto-flight modes of our unmanned helicopter are as follows (Fig. 3):
- Hovering
- Auto-landing
- Point navigation
- Multi-point navigation
In hovering mode, the helicopter tries to maintain its current position and attitude even in wind. Hovering is the final state of every flight mode except auto-landing, and it is the most difficult mode of auto-flight. In auto-landing mode, the landing velocity is controlled according to the current altitude. In point-navigation mode, the helicopter moves to a target position given by the GCS. In this mode, a compensation algorithm is used when the helicopter is blown off the original track to the target by wind. When the craft arrives at the target, it switches to hovering mode. Multi-point navigation is a sequential repetition of point navigations.
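To illustrate the transitions of Fig. 3, a minimal sketch of the mode logic follows; the type and function names are our own illustrative choices, not taken from the OFP sources.

```cpp
enum class FlightMode { Hovering, AutoLanding, PointNav, MultiPointNav };

// One step of the mode-transition logic sketched in Fig. 3: point
// navigation (single or multi-point) ends in hovering; auto-landing and
// hovering persist until the GCS commands another mode.
FlightMode nextMode(FlightMode mode, bool at_target, bool more_waypoints) {
    switch (mode) {
    case FlightMode::PointNav:
        return at_target ? FlightMode::Hovering : FlightMode::PointNav;
    case FlightMode::MultiPointNav:
        if (!at_target) return FlightMode::MultiPointNav;
        return more_waypoints ? FlightMode::MultiPointNav
                              : FlightMode::Hovering;
    case FlightMode::AutoLanding:
        return FlightMode::AutoLanding;   // descend until touchdown
    default:
        return FlightMode::Hovering;      // hold position and attitude
    }
}
```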
Fig. 2. Structure of the HELISCOPE
Fig. 3. Flight Mode Transition
Fig. 4. Structure of the OFP-TMO
3.3 Design of the OFP Based on the TMO Scheme

The OFP basically consists of a TMO instance containing one time-triggered task, four message-triggered tasks and the ODS. The ODS contains data that are periodically read from the GPS/INS, AHRS and SWM. The following describes the roles of these five tasks and the data contained in the ODS.

- GPS/INSReader (IO-SvM): This task collects data sent from the GPS/INS device, which periodically sends temporal and spatial data at 10 Hz. The data mainly consist of information on altitude, position and velocity for each direction.
- AHRSReader (IO-SvM): This task collects data sent from the AHRS device, which periodically sends attitude information of the aircraft at 20 Hz.
- SWMReader (IO-SvM): This task collects data sent from the SWM at 10 Hz. The data consist of the current values of the four control points (cyclic lateral, cyclic longitudinal, collective, rudder).
- GCSReader (SvM): This task receives various flight commands from the GCS sporadically. Commands from the GCS include information on flight mode, target positions, switch-to-manual-mode, etc.
- FC (SpM): FC means Flight Control; this task runs with a 20 Hz period and a 40 milli-second deadline. It calculates the next values for the four control points from the data in the ODS and sends the control values to the SWM for controlling the helicopter.
- ODS: Besides the data collected by the reader tasks, the ODS also contains information on the current flight mode, the next values for the control points and capability parameters of the aircraft.

The frequencies of the SvMs and the FC-SpM can be changed according to the actual frequencies of the real devices and the capability of the CPU used. The frequencies of the IO-SvMs above are set to those of the physical devices that will be used in the field test.

The calculations performed by the FC for the four flight modes consist of four basic auto-flight rule operations: SetForwardSpeed, SetSideSpeed, SetAltitude and SetHeading. SetForwardSpeed and SetSideSpeed generate values of the two-axis cyclic for forward and side speeds. SetAltitude generates a collective value to rise, descend or preserve the current altitude. Finally, SetHeading generates a value to change heading. For each flight mode, an appropriate combination of these four basic operations is used. For example, in the point navigation mode, the SetHeading operation generates a value for heading (rudder) towards the target position and the SetForwardSpeed operation generates values for the two cyclic control points for the maximum speed towards the target. To avoid too rapid changes of the attitude of the aircraft, upper and lower limits are imposed on the values generated by the four basic operations, because the maximum and minimum values for cyclic lateral/longitudinal, collective and rudder depend on the kind of aircraft [6].

3.4 GCS (Ground Control System)

A GCS is an application tool with which a user can monitor the UAV status and send control messages to the FCC. Our GCS provides a graphical interface for easy control. A protocol for the wireless communication devices (RF modem) between the UAV and the GCS has been designed. There are two types of packet: a to-GCS-packet for monitoring and a from-GCS-packet for control. Both packets consist of a header, length, status or control data, and a checksum at the tail.

- to-GCS-packet
The FCC system transmits to-GCS-packets containing the UAV status data described in Table 2 to the GCS at 20 Hz.

Table 2. UAV status data

Attitude: Roll/Pitch/Yaw angle
Location: North, East, Altitude
Velocity: Forward/Sideward/Upward velocity
Control signal: Lateral cyclic / Longitudinal cyclic / Main rotor collective pitch / Tail rotor collective pitch angle
FCC system: Battery voltage
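As a concrete illustration, a to-GCS status packet following the header/length/payload/checksum layout described above might look like the following sketch; the field names, field widths and the XOR checksum rule are our own assumptions for illustration, not the actual protocol definition.

```cpp
#include <cstdint>
#include <cstddef>

// Hypothetical wire layout of a to-GCS status packet (header, length,
// status data, checksum at the tail), sent at 20 Hz. Field widths and
// the checksum rule are illustrative assumptions only.
#pragma pack(push, 1)
struct ToGcsPacket {
    uint8_t header[2];   // e.g. fixed sync bytes
    uint8_t length;      // payload length in bytes
    float   roll, pitch, yaw;                   // attitude [deg]
    float   north, east, altitude;              // location [m]
    float   v_fwd, v_side, v_up;                // velocity [m/s]
    float   lat_cyc, lon_cyc, col_pitch, tail_pitch; // control signals
    float   battery_v;   // FCC system: battery voltage [V]
    uint8_t checksum;    // e.g. XOR of all preceding bytes
};
#pragma pack(pop)

// One simple checksum rule (assumed): XOR over every byte before the tail.
uint8_t computeChecksum(const ToGcsPacket& p) {
    const uint8_t* bytes = reinterpret_cast<const uint8_t*>(&p);
    uint8_t sum = 0;
    for (size_t i = 0; i + 1 < sizeof(ToGcsPacket); ++i)
        sum ^= bytes[i];
    return sum;
}
```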
- from-GCS-packet

A from-GCS-packet contains a command message to control the UAV flight modes described in section 3.2.

- User Interface

The user interface has been designed for prompt monitoring and convenient control. Fig. 5 and 6 show the user interface of our GCS. Various status data including the
Fig. 5. Main interface window
Fig. 6. Control signal window
attitude of the UAV contained in downlink packets are displayed in 2D/3D/bar graphs and gauges of the GCS for easy analysis and for debugging the FCC control logic. The GCS has six widgets for control operations such as landing, taking off, hovering, moving, self-return and update status.
4 Experiments with the Hardware-in-the-Loop Simulation System

To test and verify the OFP in the TMO scheme, we used a HILS system with the open-source FlightGear-v0.9.10 simulator. The FlightGear simulator supports various flying-object models, 3-D regional environments and model-dependent algorithms. In the HILS environment, the OFP receives GPS/INS, AHRS and SWM information from the FlightGear simulator. Fig. 7 shows the HILS architecture. For the FCC, a board with the ST Thomson DX-66 STPC Client chip has been used to enhance floating-point operations. Among the various tests of flight stability that have been performed, the results of hovering and of heading in the point navigation mode are shown in Fig. 8 and 9. In the hovering test, the helicopter takes off at an altitude of 7 m, maintains hovering at an altitude of 17 m for 10 seconds and then performs auto-landing. Fig. 8 shows the desired references and actual responses of the HILS system in this scenario; we can see that there is a deflection of at most about 0.5 m while hovering. This result is tolerable in our application.
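In this setup the simulator feeds the FCC over a network link. The minimal sketch below shows how simulated sensor records could be received over UDP; the port number, the record layout and the assumption that the simulator streams a packed binary record are illustrative only, not the exact HILS configuration.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdio>

// Assumed binary record streamed by the simulator; the actual HILS
// record layout may differ.
struct SimSensorRecord {
    double latitude_deg, longitude_deg, altitude_m;  // GPS/INS-like data
    double roll_deg, pitch_deg, heading_deg;         // AHRS-like data
};

int main() {
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(5500);              // assumed simulator output port
    bind(sock, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));

    SimSensorRecord rec;
    while (true) {
        ssize_t n = recv(sock, &rec, sizeof(rec), 0);
        if (n == sizeof(rec))                 // hand off to the reader tasks
            std::printf("alt=%.1f m hdg=%.1f deg\n",
                        rec.altitude_m, rec.heading_deg);
    }
    close(sock);
    return 0;
}
```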
Fig. 7. Structure of the HILS
Fig. 8. Stability of hovering control
Fig. 9. Heading-control in the point navigation mode
5 Conclusions and Future Works

In this paper, we introduced the design and implementation of the OFP (Operational Flight Program) for an unmanned helicopter's navigation based on the well-known TMO scheme and RT-eCos3.0, and we verified the system using HILS. Since the OFP is naturally composed of time-triggered and message-triggered tasks, using the TMO model, which supports object-oriented real-time concurrent programming, is a well-structured and easily extendable approach to designing and implementing an OFP. Moreover, we also found that the RTOS, RT-eCos3.0,
is very suitable for this kind of application because of its accurate timing behavior and small size. Having finished the HILS testing with the flying-object model supported by FlightGear, the remaining work is to make some minor corrections to the parameters and detailed algorithms that depend on the real flying-object model. Our plan is to start field testing with a real aircraft this year.

Acknowledgement. This work was supported by the MKE (Ministry of Knowledge Economy), Korea, under the ITRC (Information Technology Research Center) support program supervised by the NIPA (National IT Industry Promotion Agency) (NIPA-2009-C1090-0902-0026), and partially supported by the MKE and the KETI, Korea, under the HVI (Human-Vehicle Interface) project (100333312).
References

1. Kim, D.H., Nodir, K., Chang, C.H., Kim, J.G.: HELISCOPE Project: Research Goal and Survey on Related Technologies. In: 12th IEEE International Symposium on Object/Component/Service-Oriented Real-Time Distributed Computing, pp. 112–118. IEEE Computer Society Press, Tokyo (2009)
2. Kim, J.H., Kim, H.J., Park, J.H., Ju, H.T., Lee, B.E., Kim, S.G., Heu, S.: TMO-eCos2.0 and its Development Environment for Timeliness-Guaranteed Computing. In: 1st Software Technologies for Dependable Distributed Systems, pp. 164–168. IEEE Computer Society Press, Tokyo (2009)
3. Kim, K.H., Kopetz, H.: A Real-Time Object Model RTO.k and an Experimental Investigation of Its Potentials. In: 18th IEEE Computer Software & Applications Conference, pp. 392–402. IEEE Computer Society Press, Los Alamitos (1994)
4. Jenks, S.F., Kim, K.H., et al.: A Middleware Model Supporting Time-Triggered Message-Triggered Objects for Standard Linux Systems. Real-Time Systems – J. of Time-Critical Computing Systems 36, 75–99 (2007)
5. Kim, J.G., et al.: TMO-eCos: An eCos-Based Real-Time Micro Operating System Supporting Execution of a TMO Structured Program. In: 8th IEEE International Symposium on Object/Component/Service-Oriented Real-Time Distributed Computing, pp. 182–189. IEEE Computer Society Press, Seattle (2005)
6. Kim, S.P.: Guide and Control Rules for an Unmanned Helicopter. In: 2nd Workshop on HELISCOPE, pp. 1–12. ITRC, Konkuk University, Seoul (2008)
7. Kim, H.J., Kim, J.G., et al.: An Efficient Task Serializer for Hard Real-Time TMO Systems. In: 11th IEEE International Symposium on Object/Component/Service-Oriented Real-Time Distributed Computing, pp. 405–413. IEEE Computer Society Press, Orlando (2008)
Energy-Efficient Process Allocation Algorithms in Peer-to-Peer Systems

Ailixier Aikebaier², Tomoya Enokido¹, and Makoto Takizawa²

¹ Rissho University, Japan
[email protected]
² Seikei University, Japan
[email protected], [email protected]

Abstract. Information systems are composed of various types of computers interconnected in networks. In addition, information systems are shifting from the traditional client-server model to the peer-to-peer (P2P) model. P2P systems are scalable and fully distributed, without any centralized coordinator. It is becoming more significant to discuss how to reduce the total electric power consumption of computers in information systems, in addition to developing distributed algorithms that minimize computation time. In this paper, we do not discuss the micro level, such as the hardware specification of each computer. We discuss a simple model that shows the relation between the computation and the total power consumption of multiple peer computers performing various types of processes at the macro level. We also discuss algorithms for allocating a process to a computer so that the deadline constraint is satisfied and the total power consumption is reduced.
1 Introduction

Information systems are getting scalable, so that various types of computational devices like server computers and sensor nodes [1] are interconnected in various types of networks like wireless and wired networks. Various types of distributed algorithms [6] have been developed, e.g., for allocating computation resources to processes and for synchronizing multiple conflicting processes, with the goals of minimizing the computation and response times, maximizing the throughput, and minimizing the memory space. On the other hand, green IT technologies [4] have to be realized in order to reduce the consumption of natural resources like oil and to resolve air pollution on the Earth. In information systems, the total electric power consumption has to be reduced. Various hardware technologies like low-power-consumption CPUs [2,3] are now being developed. Biancini et al. [8] discuss how to reduce the power consumption of a data center with a cluster of homogeneous server computers by turning off servers which are not required for executing a collection of web requests. Various types of algorithms to find the required number of servers among homogeneous and heterogeneous servers have been discussed [5,9]. In wireless sensor networks [1], routing algorithms to reduce the power consumption of the battery in a sensor node have been discussed. In this paper, we consider peer-to-peer (P2P) overlay networks [7], where computers are by nature heterogeneous and cannot be turned off by persons other than
the owners. In addition, P2P overlay networks are scalable and fully distributed, with no centralized coordination. Each peer has to find peers which not only satisfy the QoS requirements but also spend less electric power. First, we discuss a model for performing processes on a computer. Then, we measure how much electric power a given type of computer spends to perform a Web application process. Next, we discuss a simple power consumption model for performing a process on a computer, based on experiments with servers and personal computers. In the simple model, a computer consumes the maximum electric power if at least one process is performed; otherwise, the computer consumes the minimum electric power, i.e., it is in the idle state. According to our experiments, the simple model describes personal computers with one CPU, independently of the number of cores. A request to perform a process, like a Web page request, is allocated to one of the computers in the network. We discuss a laxity-based allocation algorithm to reduce not only the execution time but also the power consumption in a collection of computers. In the laxity-based algorithm, processes are allocated to computers so that the deadline constraints are satisfied, based on the laxity concept.

In section 2, we present a system model for performing a process on a computer. In section 3, we discuss a simple power consumption model obtained from the experiments. In section 4, we discuss how to allocate each process to a computer so as to reduce the power consumption. In section 5, we evaluate the process allocation algorithms.
2 Computation Model

2.1 Normalized Computation Rate

A system S includes a set C of computers c1, ..., cn (n ≥ 1) interconnected in reliable networks. A user issues a request to perform a process, like a Web page request. The process is performed on one computer. There is a set P of application processes p1, ..., pm (m ≥ 1). The term process means an application process in this paper. We assume each process ps can be performed on any computer in the computer set C. A user issues a request to perform a process ps to a load balancer K. For example, a user issues a request to read a web page on a remote computer. The load balancer K selects one computer ci in the set C for a process ps and sends the request to the computer ci. On receipt of the request, the process ps is performed on the computer ci and a reply, e.g., a Web page, is sent back to the requesting user. Requests from multiple users are performed on a computer ci. A process being performed at time t is referred to as current. A process which has already terminated before time t is referred to as previous. Let Pi(t) (⊆ P) be the set of current processes on a computer ci at time t. Ni(t) shows the number of current processes in the set Pi(t), Ni(t) = |Pi(t)|. Let P(t) show the set of all current processes on computers in the system S at time t, ∪i=1,...,n Pi(t). Suppose a process ps is performed on a computer ci. Here, Tis is the total computation time of ps on ci, and minTis shows the computation time Tis where the process ps is exclusively performed on ci, i.e., without any other process. Hence, minTis ≤ Tis for every process ps. Let maxTs and minTs be max(minT1s, ..., minTns) and min(minT1s, ..., minTns), respectively. If a process ps is exclusively performed on the fastest computer ci and the slowest computer cj, minTs = minTis and maxTs =
minTjs, respectively. A time unit (tu) shows the minimum time to perform the smallest process. We assume it takes at least one time unit [tu] to perform a process on any computer, i.e., 1 ≤ minTs ≤ maxTs. The average computation rate (ACR) Fis of a process ps on a computer ci is defined as follows:

$$F_{is} = 1 / T_{is} \quad [1/tu]. \tag{1}$$
Here, 0 < Fis ≤ 1/minTis ≤ 1. The maximum ACR maxFis is 1/minTis. Fis shows what percentage of the total amount of computation of a process ps is performed in one time unit. maxFs = max(maxF1s, ..., maxFns) and minFs = min(maxF1s, ..., maxFns). maxFs and minFs show the maximum ACRs maxFis and maxFjs of the fastest computer ci and the slowest computer cj, respectively. The more processes are performed on a computer ci, the longer it takes to perform each of the processes on the computer ci. Let αi(t) indicate the degradation rate of a computer ci at time t (0 ≤ αi(t) ≤ 1) [1/tu]. αi(t1) ≤ αi(t2) ≤ 1 if Ni(t1) ≤ Ni(t2) for every pair of different times t1 and t2. We assume αi(t) = 1 if Ni(t) ≤ 1 and αi(t) < 1 if Ni(t) > 1. For example, suppose it takes 50 [msec] to exclusively perform a process ps on a computer ci. Here, minTis = 50 and Fis = maxFis = 1/50 [1/msec]. Suppose it takes 75 [msec] to perform the process ps while other processes are performed on the computer ci. Here, Fis = 1/75 [1/msec]. Hence, αi(t) = 50/75 ≈ 0.67. We define the normalized computation rate (NCR) fis(t) of a process ps on a computer ci at time t as follows:

$$f_{is}(t) = \alpha_i(t) \cdot \frac{maxF_{is}}{maxF_s} = \alpha_i(t) \cdot \frac{minT_s}{minT_{is}} \quad [1/tu]. \tag{2}$$

For the fastest computer ci, fis(t) = 1 if αi(t) = 1, i.e., Ni(t) = 1. If a computer ci is faster than another computer cj and the process ps is exclusively performed on ci and cj at times ti and tj, respectively, fis(ti) > fjs(tj). If a process ps is exclusively performed on a computer ci, αi(t) = 1 and fis(t) = maxFis/maxFs. The maximum NCR maxfis is maxFis/maxFs, and 0 ≤ fis(t) ≤ maxfis ≤ 1. The NCR fis(t) shows how many steps of a process ps are performed on a computer ci at time t. The average computation rate (ACR) Fis depends on the size of the process ps, while fis(t) depends on the speed of the computer ci. Next, suppose that a process ps is started and terminated on a computer ci at times stis and etis, respectively. Here, the total computation time Tis is etis − stis. The following formulas hold for the degradation rate αi(t) and the NCR fis(t):

$$\int_{st_{is}}^{et_{is}} \frac{\alpha_i(t)}{minT_{is}}\, dt = 1. \tag{3}$$

$$\int_{st_{is}}^{et_{is}} f_{is}(t)\, dt = minT_s \cdot \int_{st_{is}}^{et_{is}} \frac{\alpha_i(t)}{minT_{is}}\, dt = minT_s. \tag{4}$$

If there is no other process, i.e., αi(t) = 1 on the computer ci, fis(t) = maxFis/maxFs = minTs/minTis. Hence, Tis = etis − stis = minTis. If other processes are performed, Tis = etis − stis > minTis. Here, minTs shows the total amount of computation to be performed by the process ps.
Figure 1 shows the NCRs fis(t) and fjs(t) of a process ps which is exclusively performed on a pair of computers ci and cj, respectively. Here, the computer ci is the fastest in the computer set C. The NCR fis(t) = maxfis = 1 for stis ≤ t ≤ etis and Tis = etis − stis = minTs. On the slower computer cj, fjs(t) = maxfjs < 1 and Tjs = etjs − stjs > minTs. Here, maxfis · minTis = minTs = maxfjs · minTjs from equation (4). The areas under fis(t) and fjs(t) have the same size minTs (= Tis). Figure 2 shows the NCR fis(t) of a process ps on a computer ci at time t, where multiple processes are performed concurrently with the process ps. fis(t) is smaller than maxfis if other processes are concurrently performed on the computer ci. Here, Tis = etis − stis > minTs and $\int_{st_{is}}^{et_{is}} f_{is}(t)\, dt = minT_s$.

Fig. 1. Normalized computation rates (NCRs)

Fig. 2. Normalized computation rate fis(t)
Next, we define the computation laxity Lis(t) [tu] of a process ps on a computer ci at time t as follows:

$$L_{is}(t) = minT_s - \int_{st_{is}}^{t} f_{is}(x)\, dx. \tag{5}$$
The laxity Lis(t) shows how much computation the computer ci still has to spend to finish the process ps at time t. Lis(stis) = minTs and Lis(etis) = 0. If the process ps were exclusively performed on the computer ci, the process ps would be expected to terminate at time etis = t + Lis(t).

2.2 Simple Computation Model

There are various types of computers with respect to performance. First, we consider a simple computation model, in which a computer ci satisfies the following properties:

[Simple computation model]
1. maxfis = maxfiu for every pair of different processes ps and pu performed on a computer ci.
2. $$\sum_{p_s \in P_i(t)} f_{is}(t) = maxf_i. \tag{6}$$
The maximum normalized computation rate (NCR) maxfi of a computer ci is maxfis for any process ps. This means the computer ci performs any process at the maximum clock frequency. Pi(t) shows the set of processes being performed on a computer ci at time t. In the simple computation model, we assume the degradation factor αi(t) = 1. On a computer ci, each process ps starts at time stis and terminates at time etis. We discuss how the NCR fis(t) of each process ps changes in the presence of multiple processes on a computer ci. A process ps is referred to as preceding another process pu on a computer ci iff etis < stiu. A process ps is interleaved with another process pu on a computer ci iff etiu ≥ etis ≥ stiu. The interleaving relation is symmetric but not transitive. A process ps is referred to as connected with another process pu iff (1) ps is interleaved with pu, or (2) ps is interleaved with some process pv and pv is connected with pu. The connected relation is symmetric and transitive. A schedule schi of a computer ci is a history of the processes performed on the computer ci. Processes in schi are partially ordered by the precedence relation and related by the connected relation. Here, let Ki(ps) be the closure subset of the processes in the schedule schi which are connected with a process ps, i.e., Ki(ps) = {pu | pu is connected with ps}. Ki(ps) is an equivalence class of the connected relation, i.e., Ki(ps) = Ki(pu) for every process pu in Ki(ps). Ki(ps) is referred to as a knot in schi. The schedule schi is divided into knots Ki1, ..., Kili which are pairwise disjoint. Let pu and pv be the processes in a knot Ki(ps) whose starting time stiu is the minimum and whose termination time etiv is the maximum, respectively. That is, the process pu is performed first and the process pv finishes last in the knot Ki(ps). The execution time TKi of the knot Ki(ps) is etiv − stiu. Let KPi(t) be the current knot, i.e., the set of current or previous processes which are connected with at least one current process in Pi(t) at time t. In the simple model, it is straightforward that the following theorem holds from the simple model properties:

[Theorem] Let Ki be a knot in a schedule schi of a computer ci. The execution time TKi of the knot Ki is $\sum_{p_s \in K_i} minT_{is}$.

Fig. 3. Execution time of knot

Let us consider a knot Ki of three processes p1, p2, and p3 on a computer ci as shown in Figure 3 (1). Here, Ki = {p1, p2, p3}. First, suppose that the processes p1, p2, and p3 are serially performed, i.e., eti1 = sti2 and eti2 = sti3. Here, the execution time TKi is eti3 − sti1 = minTi1 + minTi2 + minTi3. Next, suppose the three processes p1, p2, and p3 start at time st and terminate at time et as shown in Figure 3 (2). Here, the execution time TKi = minTi1 + minTi2 + minTi3. Lastly, let us consider a knot Ki where the processes are concurrently performed. The processes p1, p2, and p3 start at the same time, sti1 =
sti2 = sti3, are concurrently performed, and the process p3 terminates last at time eti3, after p1 and p2, as shown in Figure 3 (3). Here, the execution time TKi of the knot Ki is eti3 − sti1 = minTi1 + minTi2 + minTi3. The current knot KPi(t1) is {p1, p2, p3} at time t1 and KPi(t2) is {p1, p2} at time t2. It depends on the scheduling algorithm how large each NCR fis(t) in equation (6) is: fis(t) = αis · maxfi, where $\sum_{p_s \in P_i(t)} \alpha_{is} = 1$. In the fair scheduler, each fis(t) is the same as the others, i.e., αis = 1/|Pi(t)|:

$$f_{is}(t) = maxf_i / |P_i(t)|. \tag{7}$$
2.3 Estimated Termination Time

Suppose there are a set P of processes {p1, ..., pm} and a set C of computers {c1, ..., cn} in a system S. We discuss how to allocate a process ps in the process set P to a computer ci in the computer set C. Here, we assume the system S is heterogeneous, i.e., some pairs of computers ci and cj have different specifications and performance. Suppose a process ps is started on a computer ci at time stis, and a set Pi(t) of current processes is being performed on the computer ci at time t.

[Computation model] Let KPi(t) be a current knot {pi1, ..., pili} of processes whose starting time is st. The total execution time T(st, t) of the processes in the current knot KPi(t) is given as:

$$T(st, t) = minT_{i1} + minT_{i2} + \cdots + minT_{il_i}. \tag{8}$$
In Figure 3 (3), t1 shows the current time. A process p1 is first initiated at time sti1 and terminates before time t1 on a computer ci. A pair of processes p2 and p3 are currently performed at time t1. Here, KPi(t1) is the current knot {p1, p2, p3} at time t1, and T(sti1, t1) = minTi1 + minTi2 + minTi3. The execution time from time sti1 to t1 is t1 − sti1. At time t1, we can estimate that the processes p2 and p3, which are concurrently performed, terminate at time t1 + T(sti1, t1) − (t1 − sti1) = sti1 + T(sti1, t1). sti1 is referred to as the starting time of the current knot KPi(t): no process is performed in some period before sti1, and some process is performed at every time from sti1 to t. The estimated termination time ETi(t) of the current processes on a computer ci is the time when every process current at time t terminates if no other process is performed after time t. ETi(t) is given as follows:

$$ET_i(t) = t + T(st_{is}, t) - (t - st_{is}) = st_{is} + T(st_{is}, t). \tag{9}$$
Suppose a new process ps is started at the current time t. Using equation (9), we can obtain the estimated termination time ETi(t) of the current processes on each computer ci at time t. From the computation model, the estimated termination time ETis(t) of a new process ps starting on a computer ci at time t is given as follows:

$$ET_{is}(t) = ET_i(t) + minT_{is}. \tag{10}$$
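A minimal sketch of how a load balancer might evaluate equations (8)–(10) from its bookkeeping data follows; the data structures are our own illustrative choices, not from the paper.

```cpp
#include <vector>

// Per-computer bookkeeping for the current knot KP_i(t) (illustrative
// structure of our own, not from the paper).
struct Computer {
    double knot_start;          // st: starting time of the current knot
    std::vector<double> min_t;  // minT_is of every process in the knot
};

// Equation (8): T(st, t) = sum of minT_is over the current knot.
double knotExecTime(const Computer& c) {
    double total = 0.0;
    for (double mt : c.min_t) total += mt;
    return total;
}

// Equation (9): ET_i(t) = st + T(st, t).
double estimatedTermination(const Computer& c) {
    return c.knot_start + knotExecTime(c);
}

// Equation (10): ET_is(t) = ET_i(t) + minT_is of the newly issued process.
double estimatedTerminationWith(const Computer& c, double new_min_t) {
    return estimatedTermination(c) + new_min_t;
}

int main() {
    Computer c{0.0, {2.0, 3.0}};                  // knot started at time 0
    double et = estimatedTerminationWith(c, 5.0); // 0 + (2+3) + 5 = 10 [tu]
    return et > 0 ? 0 : 1;
}
```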
3 Simple Power Consumption Model

Suppose there are n (≥ 1) computers c1, ..., cn and m (≥ 1) processes p1, ..., pm. In this paper, we assume the simple computation model for each computer, i.e., the maximum clock frequency is stable for each computer ci. Let Ei(t) show the electric power consumption of a computer ci at time t [W/tu] (i = 1, ..., n). maxEi and minEi indicate the maximum and minimum electric power consumption of a computer ci, respectively. That is, minEi ≤ Ei(t) ≤ maxEi. maxE and minE show max(maxE1, ..., maxEn) and min(minE1, ..., minEn), respectively. Here, minEi shows the power consumption of a computer ci which is in the idle state. We define the normalized power consumption rate (NPCR) ei(t) [1/tu] of a computer ci at time t as follows:

$$e_i(t) = E_i(t) / maxE \ (\le 1). \tag{11}$$

Let maxei and minei show the maximum power consumption rate maxEi/maxE and the minimum one minEi/maxE of the computer ci, respectively. If the fastest computer ci maximally spends the electric power at the maximum clock frequency, ei(t) = maxei = 1. For a lower-speed computer cj, i.e., maxfj < maxfi, ej(t) = maxej < 1. We propose two types of power consumption models for a computer ci, the simple and multi-level models. In the simple model, the NPCR ei(t) depends on how many processes are performed, as follows:

$$e_i(t) = \begin{cases} maxe_i & \text{if } N_i(t) \ge 1, \\ mine_i & \text{otherwise.} \end{cases} \tag{12}$$

This means that if at least one process is performed on a computer ci, the electric power is maximally consumed on the computer ci; even if more than one process is performed, the maximum power is consumed. A personal computer with one CPU satisfies the simple model, as discussed in the experiments of the succeeding section. The total normalized power consumption TPCi(t1, t2) of a computer ci from time t1 to time t2 is given as follows:

$$TPC_i(t_1, t_2) = \int_{t_1}^{t_2} e_i(t)\, dt. \tag{13}$$
Note that TPCi(t1, t2) ≤ t2 − t1. For the fastest computer ci, TPCi(t1, t2) = maxei · (t2 − t1) = t2 − t1 if at least one process is performed at every time from t1 to t2 in the simple model. Let Ki be a knot of a computer ci whose starting time is sti and whose termination time is eti. The normalized total power consumption of the computer ci to perform every process in the knot Ki is TPCi(sti, eti). In the simple model, $TPC_i(st_i, et_i) = \int_{st_i}^{et_i} maxe_i\, dt = (et_i - st_i) \cdot maxe_i = \sum_{p_s \in K_i} minT_{is} \cdot maxe_i$.
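As a worked example with hypothetical numbers: if a knot $K_i$ contains three processes with $minT_{i1} = 2$, $minT_{i2} = 3$ and $minT_{i3} = 5$ [tu] on a computer with $maxe_i = 0.8$, the knot occupies the computer for $2 + 3 + 5 = 10$ [tu] and its normalized power consumption is $TPC_i(st_i, et_i) = 10 \cdot 0.8 = 8$ [tu].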
4 Process Allocation Algorithms

4.1 Round-Robin Algorithms

We consider two types of algorithms, the weighted round robin (WRR) [20] and weighted least connection (WLC) [21] algorithms. For each of the WRR and WLC algorithms,
we consider two cases, Per (performance) and Pow (power). In Per, the weight is given in terms of the performance ratio of the servers: the higher the performance a server supports, the more processes are allocated to the server. In Pow, the weight is defined in terms of the power consumption ratio of the servers: the less power a server consumes, the more processes are allocated to the server.

4.2 Laxity-Based Algorithm

Some applications have a deadline constraint TCs on a process ps issued by the application, i.e., the process ps has to terminate by the deadline. Here, a process ps has to be allocated to a computer ci so that the process ps can terminate by the deadline TCs. Cs(t) denotes the set of computers which satisfy the condition TCs, i.e., Cs(t) = {ci | ETis(t) ≤ TCs}. That is, on a computer ci in Cs(t), the process ps is expected to terminate by the deadline TCs. Hence, if the process ps is allocated to one computer ci in Cs(t), the process ps can terminate before TCs. Next, we assume that the normalized power consumption rate (NPCR) ei(t) of each computer ci is given as in equation (12), according to the simple model. We can estimate the total power consumption laxity leis(t) of a process ps between time t and ETis(t) when the process ps is allocated to the computer ci [Figure 4]. leis(t) of the computer ci is given as equation (14):

$$le_{is}(t) = maxe_i \cdot (ET_{is}(t) - t). \tag{14}$$
Suppose a process ps is issued at time t. A computer ci in the computer set C is selected for the process ps with the constraint TCs at time t as follows:

Alloc(t, C, ps, TCs) {
  Cs = φ; NoCs = φ;
  for each computer ci in C {
    if ETis(t) ≤ TCs, Cs = Cs ∪ {ci};
    else /* ETis(t) > TCs */ NoCs = NoCs ∪ {ci};
  }
  if Cs ≠ φ { /* candidate computers are found */
    computer = ci such that leis(t) is the minimum in Cs;
    return(computer);
  } else { /* Cs = φ */
    computer = ci such that ETis(t) is the minimum in NoCs;
    return(computer);
  }
}

Cs and NoCs are the sets of computers which can and cannot satisfy the constraint TCs, respectively. Here, Cs ∪ NoCs = C and Cs ∩ NoCs = φ. In the procedure Alloc, if there is at least one computer which can satisfy the time constraint TCs of the process ps, one of the computers with the minimum power consumption laxity is selected. If there is no computer which can satisfy the application time constraint TCs, one of the computers which can terminate the process ps the earliest is selected from the computer set C. A runnable sketch of this procedure is given below.
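For concreteness, a minimal, self-contained C++ sketch of Alloc follows; the bookkeeping structure extends the illustrative one from section 2.3 with the power rate maxe_i and is our own choice, not from the paper.

```cpp
#include <vector>
#include <limits>

// Illustrative bookkeeping per computer.
struct Computer {
    double knot_start;      // st_i: start of the current knot
    double knot_exec_time;  // T(st_i, t): sum of minT over the knot
    double min_t_new;       // minT_is of the process being allocated
    double max_e;           // maxe_i: normalized power consumption rate
};

// ET_is(t) = st_i + T(st_i, t) + minT_is (equations (9) and (10)).
static double et(const Computer& c) {
    return c.knot_start + c.knot_exec_time + c.min_t_new;
}

// Laxity-based allocation (procedure Alloc): among computers meeting the
// deadline tc, pick the one with minimum power consumption laxity
// le_is(t) = maxe_i * (ET_is(t) - t) (equation (14)); if none meets tc,
// pick the earliest-terminating computer.
int alloc(double t, const std::vector<Computer>& C, double tc) {
    int best = -1;
    double best_le = std::numeric_limits<double>::max();
    double best_et = std::numeric_limits<double>::max();
    bool deadline_ok = false;  // has any computer in Cs been seen?
    for (int i = 0; i < static_cast<int>(C.size()); ++i) {
        double e = et(C[i]);
        if (e <= tc) {                        // c_i belongs to Cs
            double le = C[i].max_e * (e - t); // power consumption laxity
            if (!deadline_ok || le < best_le) {
                deadline_ok = true; best = i; best_le = le;
            }
        } else if (!deadline_ok && e < best_et) {  // c_i belongs to NoCs
            best = i; best_et = e;
        }
    }
    return best;  // index of the selected computer, -1 if C is empty
}

int main() {
    // ET = 8 > tc for the first computer, ET = 6 <= tc for the second,
    // so the second one is selected.
    std::vector<Computer> C = {{0, 5, 3, 0.9}, {0, 2, 4, 0.6}};
    return alloc(/*t=*/4.0, C, /*tc=*/7.0) == 1 ? 0 : 1;
}
```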
[Fig. 4. Estimation of power consumption: plots of the clock frequency f_i(t) (up to maxf_i) and the power consumption rate e_i(t) (between mine_i and maxe_i) while the processes p1, p2, p3, and p_s run from st_i1, each taking its minimum time minT_is, until the expected termination time ET_is; the area maxe_i \cdot (ET_is(t) - t) is the estimated power consumption laxity le_is(t).]
5 Evaluation

5.1 Environment

We measure how much electric power computers consume for Web applications. We consider a cluster system composed of Linux Virtual Server (LVS) systems which are interconnected in gigabit networks, as shown in Figure 5. The NAT-based routing system VS-NAT [12] is used as the load balancer K. The cluster system includes three servers s1, s2, and s3, in each of which Apache 2.0 [11] is installed, as shown in Figure 5. The load generator server L first issues requests to the load balancer K. Then, the load balancer K assigns each request to one of the servers according to some allocation algorithm. On receipt of a request from the load generator server L, each server s_i compresses the reply file by using the Deflate module [13]. We measure the peak consumption of electric power and the average response time of each server s_i (i = 1, 2, 3). The power consumption ratio of the servers s1, s2, and s3 is 0.9 : 0.6 : 1, as shown in Figure 5. On receipt of a Web request, each server s_i finds the reply file of the request and compresses it by using the Deflate module.
[Fig. 5. Cluster system: the load generation server and the load balancer are connected to the real servers 1-3 through Gbit switches. The server specifications are as follows.]

                                      Server 1               Server 2                Server 3
Number of CPUs                        1                      1                       2
Number of cores                       1                      1                       2
CPU                                   Intel Pentium 4        AMD Athlon 1648B        AMD Opteron 2216HE
                                      (2.8GHz)               (2.7GHz)                (2.7GHz)
Memory                                1,024MB                4,096MB                 4,096MB
Maximum computation rate              1                      1.2                     4
Maximum power consumption rate maxe_i 0.9                    0.6                     1
The size of the original reply file is 1 Mbyte, and the compressed reply file is 7.8 Kbytes in size. The Apache benchmark software [10] is used to generate Web requests, where a total of 10,000 requests are issued and 100 requests are concurrently issued to each server. The performance ratio of the servers s1, s2, and s3 is 1 : 1.2 : 4, as shown in Figure 5. The server s3 is the fastest and consumes the most electric power. The server s1 is slower than s3 but consumes more electric power than the server s2.

5.2 Experimental Results

If the weight is based on the performance ratio (Per), the requests are allocated to the servers s1, s2, and s3 with the ratio 1 : 1.2 : 4, respectively. On the other hand, if the weight is based on the power consumption ratio (Pow), the requests are allocated to the servers s1, s2, and s3 with the ratio 0.9 : 0.6 : 1, respectively. Here, by using the Apache benchmark software, the load generation server L transmits a total of 100,000 requests to the servers s1, s2, and s3, where six requests are concurrently issued to the load balancer K. The total power consumption of the cluster system and the average response time of a request from a Web server are measured. We consider a static Web server where the size of a reply file for a request is not dynamically changed, i.e. the compressed version of the same HTML reply file is sent back to each user. In this experiment, the original HTML file and the compressed file are 1,025,027 bytes and 13,698 bytes in size, respectively. On the load balancer K, four types of process allocation algorithms are adopted: the weighted round-robin (WRR) [20] algorithms, WRR-Per and WRR-Pow, and the weighted least connection (WLC) [21] algorithms, WLC-Per and WLC-Pow.
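As a quick illustration of how the two weight ratios translate into request shares, the sketch below performs an idealized proportional split of the 100,000 requests; actual WRR/WLC scheduling is dynamic, so the real shares differ:

```python
# Expected request shares under the Per and Pow weightings (idealized
# proportional split; rounding makes the sums approximate).

def expected_shares(weights, total_requests):
    s = sum(weights)
    return [round(total_requests * w / s) for w in weights]

per = expected_shares([1, 1.2, 4], 100_000)     # ~[16129, 19355, 64516]
pow_ = expected_shares([0.9, 0.6, 1], 100_000)  # ~[36000, 24000, 40000]
```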
[Fig. 6. Power consumption: the total power consumption [W] of the cluster system (y-axis, 340 to 500) over the execution time [min] (x-axis, 0 to 17) for WRR-Performance, WLC-Performance, WRR-Power, and WLC-Power.]
Figure 6 shows the total power consumption [W] of the cluster system over time. WRR-Per and WLC-Per show the total power consumption of the servers in the WRR and WLC algorithms with the performance-based weight (Per), respectively. WRR-Pow and WLC-Pow indicate the power consumption of the WRR and WLC algorithms with the power-consumption-based weight (Pow), respectively. In WRR-Per and WLC-Per, the total execution time and peak power consumption are almost the same. In addition, the total execution time and peak power consumption are almost the same in WRR-Pow and WLC-Pow. This experimental result shows that the total power consumption and total execution time are almost the same for the two allocation algorithms if the same weight ratio is used. If the weight of the load-balancing algorithm is given in terms of the performance ratio (Per), the peak power consumption is higher than with the power consumption ratio (Pow). However, with Per the total execution time is shorter than with Pow. Here, the total power consumption is calculated as the product of the execution time and the power consumption. The experiment shows that the total power consumption is reduced by using the performance-based weight (Per).
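A tiny numerical illustration of this product: the power and time values below are hypothetical, chosen only to show why a higher peak power can still yield a lower total power consumption when the execution finishes sooner.

```python
# Total power consumption (energy) = power level x execution time.
# The numbers here are hypothetical, not measurements from Figure 6.

def total_energy(avg_power_w, exec_time_min):
    return avg_power_w * exec_time_min  # watt-minutes

per_energy = total_energy(avg_power_w=450, exec_time_min=14)  # 6300 W*min
pow_energy = total_energy(avg_power_w=400, exec_time_min=17)  # 6800 W*min
# Per runs at a higher level but finishes sooner, so its total is lower.
```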
6 Concluding Remarks

In this paper, we discussed the simple power consumption model of computers. We discussed the laxity-based algorithm to allocate a process to a computer so that the deadline constraint is satisfied and the total power consumption is reduced on the basis of the laxity concept. We obtained experimental results on the electric power consumption of Web servers. We evaluated the simple model through experiments on a PC cluster and showed that the PC cluster follows the simple model. We are now considering other types of applications, like database transactions, and measuring the power consumption of multi-CPU servers.
References

1. Akyildiz, I.F., Kasimoglu, I.H.: Wireless Sensor and Actor Networks: Research Challenges. Ad Hoc Networks Journal (Elsevier) 2, 351–367 (2004)
2. AMD, http://www.amd.com/
3. Intel, http://www.intel.com/
4. Green IT, http://www.greenit.net
5. Heath, T., Diniz, B., Carrera, E.V., Meira, W., Bianchini, R.: Energy Conservation in Heterogeneous Server Clusters. In: PPoPP 2005: Proceedings of the Tenth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 186–195 (2005)
6. Lynch, N.A.: Distributed Algorithms, 1st edn. Morgan Kaufmann Publishers, San Francisco (1997)
7. Montresor, A.: A Robust Protocol for Building Superpeer Overlay Topologies. In: Proc. of the 4th International Conference on Peer-to-Peer Computing, pp. 202–209 (2004)
8. Bianchini, R., Rajamony, R.: Power and Energy Management for Server Systems. IEEE Computer 37(11) (November 2004); special issue on Internet data centers
9. Rajamani, K., Lefurgy, C.: On Evaluating Request-Distribution Schemes for Saving Energy in Server Clusters. In: Proc. of the 2003 IEEE International Symposium on Performance Analysis of Systems and Software, pp. 111–122 (2003)
10. ab - Apache HTTP server benchmarking tool, http://httpd.apache.org/docs/2.0/programs/ab.html
11. Apache 2.0, http://httpd.apache.org/
12. VS-NAT, http://www.linuxvirtualserver.org/
13. Apache Module mod_deflate, http://httpd.apache.org
14. Aron, M., Druschel, P., Zwaenepoel, W.: Cluster Reserves: A Mechanism for Resource Management in Cluster-Based Network Servers. In: Proceedings of the International Conference on Measurement and Modeling of Computer Systems, pp. 90–101 (2000)
15. Bevilacqua, A.: A Dynamic Load Balancing Method on a Heterogeneous Cluster of Workstations. Informatica 23(1), 49–56 (1999)
16. Bianchini, R., Carrera, E.V.: Analytical and Experimental Evaluation of Cluster-Based WWW Servers. World Wide Web Journal 3(4) (December 2000)
17. Heath, T., Diniz, B., Carrera, E.V., Meira Jr., W., Bianchini, R.: Self-Configuring Heterogeneous Server Clusters. In: Proceedings of the Workshop on Compilers and Operating Systems for Low Power (2003)
18. Rajamani, K., Lefurgy, C.: On Evaluating Request-Distribution Schemes for Saving Energy in Server Clusters. In: Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, pp. 111–122 (2003)
19. Colajanni, M., Cardellini, V., Yu, P.S.: Dynamic Load Balancing in Geographically Distributed Heterogeneous Web Servers. In: Proceedings of the 18th International Conference on Distributed Computing Systems, p. 295 (1998)
20. Weighted Round Robin (WRR), http://www.linuxvirtualserver.org/docs/scheduling.html
21. Weighted Least Connection (WLC), http://www.linuxvirtualserver.org/docs/scheduling.html
Power Modeling of Solid State Disk for Dynamic Power Management Policy Design in Embedded Systems

Jinha Park1, Sungjoo Yoo1, Sunggu Lee1, and Chanik Park2

1 Department of Electronic and Electrical Engineering, POSTECH (Pohang University of Science and Technology), 790-784 Hyoja-dong, Nam-gu, Pohang, Korea
{litrine,sungjoo.yoo,slee}@postech.ac.kr
2 Flash Software Team, Memory Division, DS Solution, Samsung Electronics, Hwasung, Gyeonggi-do, Korea
[email protected]

Abstract. Power consumption has now become the most critical performance-limiting factor for solid state disks (SSDs) in embedded systems. It is imperative to devise design methods and architectures for power-efficient SSD designs. In our work, we present the first step towards low power SSD design, i.e., power estimation of SSD. We present a practical approach to SSD power estimation which tries to keep the advantage of real measurement, i.e., accuracy, while overcoming its limitations, i.e., long execution time and lack of repeatability (and high cost), by means of a trace-based simulation. Since it is based on real measurements, it takes into account the power consumption of the SSD controller as well as that of the Flash memories. We show the effectiveness of the presented method in designing a dynamic power management policy for SSD.

Keywords: Solid state disk, power consumption, measurement, trace-based simulation, dynamic power management, low power states.
1 Introduction

Flash-based storage is becoming more and more popular in embedded systems such as smart cards, smart phones, netbooks, laptops, etc., as well as in desktop PCs and servers. Compared with the conventional non-volatile storage medium, namely the hard disk drive (HDD), Flash-based storage offers several advantages, e.g., high performance, low power consumption, reliability, etc. In particular, the solid state disk (SSD) is becoming a major non-volatile storage medium, replacing the HDD in smart phones (e.g., iPhone 3GS with 32GB Flash memory) and netbooks as well as in notebook PCs and servers [1].

1.1 Low Power SSD Design Problem

SSD is inherently more power efficient than HDD, since HDD requires large driving power for running its mechanical parts, i.e., the motors for the spindle and head, as well as for performing I/O operations (between disk, DRAM buffer, and host). SSD does not require the mechanical parts, but consumes power only for the electrical ones.
SSD can achieve higher performance than HDD mostly through parallel accesses to relatively low-speed Flash devices (e.g., achieving a throughput higher than 240MB/s by accessing 8 Flash devices of 33Mbytes/sec each). High performance SSD inherits the same level of power consumption constraints that the traditional HDD has in embedded systems. For instance, SSD has a peak current budget of about 1A and an average power consumption budget of about 1.2W in typical notebook PCs [2], and is expected to have a much lower power budget in smart phones. Further performance improvement in SSD will require more power consumption, especially due to more aggressive parallel accesses to Flash devices. Such an aggressive parallel scheme is not easily applicable, however, because little room remains in the peak and average power budgets of embedded systems to be used, by aggressively parallel accesses, for further performance improvement of SSD. Power consumption has thus become the most critical performance-limiting factor in SSD, and it is imperative to devise design methods and architectures for power-efficient SSD designs. There has been little work on low power SSD designs. In our work, we present the first step towards low power SSD design, i.e., power estimation of SSD. We also report our application of the power estimation method to a low power SSD design where the parameter of dynamic power management is explored in a fast and cost-effective way with the help of the presented power estimation method.

1.2 Power Estimation of SSD

The design space for low power SSD design is huge due to the various possible design choices, e.g., parameter sets in dynamic power management (e.g., time-out parameters, DPM policies, etc.), Flash Translation Layer (FTL) algorithms, and SSD controller architectural parameters (e.g., I/O frequency in Flash devices, DRAM buffer size, and caching/prefetch methods in the controller, etc.).¹ When exploring the design space, there are two typical methods of evaluating the power consumption of design candidates: real measurement and full-system simulation with a power model. Real measurement, which gives accurate power information, is in use in SSD product designs. There are two critical problems with real measurements: a long design cycle (e.g., hours of SSD execution are required for the evaluation of one design candidate) and changing battery characteristics over repeated runs [3].² These two problems prevent designers from performing extensive design space exploration, which may require evaluating numerous design candidates. Thus, it is impractical to evaluate all the choices with real SSD executions due to the long execution time and the high cost of batteries.³ The second candidate, cycle-level full-system simulation with a power model, is prohibitive due to too long a simulation runtime.

¹ In this paper, we consider only software-based solutions. There can be hardware design candidates such as the number of channels, ways/channel, DRAM capacity/speed, number of CPUs, etc.
² Battery lifetime measurements require a procedure of fully charging and then completely discharging the battery while running the system. The battery characteristics degrade significantly after several (~10) such procedures. Thus, a new battery needs to be used for subsequent battery lifetime measurements.
³ Statistical approximation and optimization methods, e.g., response surface models, can also be applied to reduce the number of real executions.
Assuming a 200MHz SSD controller and a simulation speed of ~100 Kcycles/sec, it may take 125 days to simulate less than 1.5 hours of real SSD execution (1.5 hours of execution corresponds to 5,400 s × 200M cycles/s ≈ 1.08×10¹² cycles; at 100 Kcycles/sec, simulating these takes about 1.08×10⁷ seconds, i.e., roughly 125 days). Thus, SSD power estimation based on a detailed simulation model may not be practical in real SSD designs.

1.3 Our Contribution

In this paper, we present a practical approach to SSD power estimation which tries to keep the advantage of real measurement, i.e., accuracy, while overcoming its limitations, i.e., long execution time and lack of repeatability (and high cost). The power estimation method takes as input real power measurements and an SSD access trace. Then, it performs a trace-based simulation of SSD operation, gathering the information on power consumption. Finally, it gives as output a power profile (power consumption over time). Since it is based on real measurements, it takes into account the power consumption of the SSD controller as well as that of the Flash memories. The presented method of SSD power estimation gives fast estimation, via trace-based simulation, and accuracy, based on real power measurements. We also present an application of our method to designing a DPM policy for SSD. The remainder of this paper is organized as follows. Section 2 reviews related work. Section 3 introduces Flash memory and SSD operations. Section 4 explains the power estimation method. Section 5 reports experimental results, including the application to DPM policy design. Section 6 concludes the paper.
2 Related Work

There have been several works on the power characterization and optimization of HDD [4][5][6][7][8][9][10]. Hylick et al. explain that the mechanical parts incur large power consumption overheads, especially when the HDD starts to run the spindle and head [4]. Zedlewski et al. present a performance and power model of HDD based on a disk simulation model, DiskSim [5]. The HDD power model considers the entire HDD as a single entity and has four active-mode power states (seeking, rotation, reading, and writing) and two idle-mode ones. IBM reports that fine-grained power states enable further energy reduction compared with conventional coarse-grained ones, by making it possible to exploit short idle periods to enter low power states more frequently and stay there for a longer period of time [6]. Lu et al. compare several DPM policies for HDD [7]. Douglis et al. offer a DPM policy where the time-out (if there is no new access during the time-out, a low power state is entered) is determined adaptively based on the accuracy of previous time-out predictions [8]. Helmbold et al. present a machine learning-based disk spin-down method [9]. Bisson et al. propose an adaptive algorithm to calculate the time-out for disk spin-down, utilizing multiple time-out parameters and considering the spin-up latency cost [10]. Regarding SSD, a performance model was presented only recently in [11]. However, there is little work on power characterization and modeling for SSD. In terms of power optimization in Flash-based storage, there are two categories of approaches, depending on whether the optimization target is active or idle-mode power consumption. Joo et al. present a low power coding scheme for MLC (multi-level cell) Flash memory, which has value-dependent power characteristics (e.g., in the case of 2-bit MLC, the power consumption of coding the two two-bit data values 00 and 01 differs) [12]. Recently, a low power
solution was presented for 3D die stacking of Flash memory, where multiple Flash dies share a common charge pump circuit [13]. Regarding idle-mode power reduction, in commercial SSD products [14], a fixed time-out and DIPM (device-initiated power management) are utilized: when the SSD detects an idle period, it asks the host for a grant to transition to the low power state.
3 Preliminary: Flash Memory Operation and SSD Architecture

Typically, a Flash memory device contains, in a single package, multiple silicon Flash memory dies in a 3D die stack [13]. The Flash memory dies share the I/O signals of the device in a time-multiplexed way. We call the I/O signals of the package the channel and each memory die a way.⁴ A single Flash memory device can support a data throughput of up to data width × I/O frequency, e.g., 33MBps at 33MHz. In order to support higher bandwidth, we need to access Flash memory dies (ways) and devices (channels) in parallel. Fig. 1 shows an example SSD architecture consisting of two channels (i.e., two Flash devices) and four ways (i.e., four Flash memory dies on a single device) per channel. The controller takes commands from the host (e.g., a smartphone or notebook CPU) and performs its Flash Translation Layer (FTL) algorithm to find the physical page address(es) to access. Then, it accesses the corresponding channels and ways, if needed, in parallel. In terms of available parallelism, each way can run in parallel performing internal operations, e.g., internal read and program operations that transfer data between the internal memory cell array and its I/O buffer. However, the controller can access, at a time, only one way on each channel, i.e., the I/O signals of the package, in order to transfer data between the I/O buffer of the Flash memory die and the controller. Thus, the peak throughput is determined by the number of channels × I/O frequency, as the short sketch following Fig. 1 illustrates. One of the salient characteristics of Flash memory is that no update is allowed on already written data. When a memory cell needs an update, it first needs to be erased before new data is written to it. We call such a constraint "erase before write". In order to overcome the low performance due to "erase before write", log buffers (often called update blocks) are utilized and the new write data is written to the log buffers. The Flash Translation Layer (FTL) on the SSD controller maintains the address mapping between the logical data and the physical data [15][16][17]. In reality, the controller (to be specific, the FTL) determines the performance and power consumption, especially of random reads/writes. Thus, the controller overhead needs to be taken into account in the power estimation of SSD.
[Fig. 1. Multi-channel/multi-way SSD controller architecture: the host issues commands to the controller, which accesses eight NAND dies over two channels (Ch 1 and Ch 2), four ways per channel.]
⁴ We use the two terms (channel and the I/O signals of a device; way and die) interchangeably throughout this paper.
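As a quick check of this bound, the sketch below computes the peak throughput from the channel count and the per-channel I/O bandwidth, using the example figures mentioned in this paper; the function name is our own.

```python
# Peak SSD throughput: only one way per channel can drive the channel I/O
# at a time, so the peak is (number of channels) x (per-channel bandwidth).

def peak_throughput_mbps(channels: int, io_mbps_per_channel: float) -> float:
    return channels * io_mbps_per_channel

peak_throughput_mbps(2, 33)  # Fig. 1 architecture: 66 MB/s
peak_throughput_mbps(8, 33)  # 8 channels: 264 MB/s (> 240 MB/s, as in Sect. 1.1)
```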
4 SSD Power Estimation

The power estimation requires as input:

- Performance and power measurement data (read/write latency for each of the read/write data sizes of 1/2/4/8/16/32/64/128/256 sectors, power consumption of sequential reads/writes, power consumption per power state (idle, partial, and slumber), and power state transition delay values),
- Information on the SSD architecture (number of channels, ways/channel, tR/tRC/tPROG/tERASE), and
- An SSD access trace obtained from a real execution on the host system (one possible organization of these inputs is sketched below).
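The following is one way to organize these inputs as data structures; the field names and layout are our own illustrative choices, and only the contents mirror the three input categories listed above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class MeasurementData:
    latency_by_size: Dict[Tuple[str, int], float]   # ('read'|'write', sectors) -> seconds
    sequential_power: Dict[str, float]              # 'read'/'write' -> watts
    state_power: Dict[str, float]                   # 'idle'/'partial'/'slumber' -> watts
    transition_delay: Dict[Tuple[str, str], float]  # (from_state, to_state) -> seconds

@dataclass
class SsdArchitecture:
    channels: int = 8
    ways_per_channel: int = 8
    timing: Dict[str, float] = field(default_factory=dict)  # tR, tRC, tPROG, tERASE

@dataclass
class AccessTrace:
    # (arrival time, 'read'|'write', logical address, size in sectors)
    commands: List[Tuple[float, str, int, int]] = field(default_factory=list)
```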
We perform a trace-based simulation of the entire SSD (controller as well as Flash memories), considering critical architectural resource conflicts at the channel level. Then, we obtain a power profile over time as well as an output execution trace (e.g., the status of each channel/way and the latency of each SSD access). The trace-based simulation also allows for the simulation of a DPM policy, where power state transitions are performed based on the given DPM policy.

4.1 Performance and Power Modeling

Resource conflict modeling is critical in performance modeling. Given the performance data per SSD access size and the information on the SSD architecture, we model the critical resource conflict at the channel level. To do that, we decompose the measured latency into two parts in terms of resource parallelism: the latency of Flash I/O and that of the controller and Flash internal operation. We model the channel-level resource conflict with the Flash I/O latency, since only one Flash I/O operation can be active at a time on each channel. However, the controller and Flash internal operations (read from cell array to I/O buffer, program, and erase) can be performed in parallel. Fig. 2 (a) illustrates the decomposition of latency for a single one-sector SSD write operation (i.e., a SATA write command of one sector in size) in the case of the SSD architecture in Fig. 1. At time t0, the controller transfers, via the corresponding channel, one sector of data to the I/O buffer of the target Flash die. After the I/O operation is finished at time t1, we model that the controller and Flash program operations take the other portion of the latency, from time t1 to t2. For the power modeling of active-mode operation, we decompose the power consumption into two parts: a baseline part and an access-dependent part. The baseline part corresponds to the power consumption of the idle state, when there is no SSD access in progress while the SSD controller is turned on. We measure the power consumption of the idle state and use it as the baseline part. The access-dependent part is obtained by subtracting the baseline from the measured power consumption of sequential reads/writes. The access-dependent part is further decomposed into the power consumption of per-way operations. Fig. 2 (b) illustrates the decomposition: it shows the power profile for the case of Fig. 2 (a). The baseline, i.e. the power consumption of the idle state, is consumed over the entire period in the figure (to be specific, until a transition to a low power state is made). The access-dependent part for a single write operation (its derivation will be explained later in this section) is added to the total power between times t0 and t2, during which the write operation, including Flash I/O, program, and controller execution, is executed.
[Fig. 2. Performance and power modeling: (a) trace-based simulation of a single write on Flash dies 0-7 (Flash I/O from t0 to t1, program and controller operation from t1 to t2); (b) power profile of a single write (baseline P_idle plus the per-way power P_per_way_write); (c) trace-based simulation of sequential writes; (d) power profile of sequential writes, whose plateau corresponds to the measured power consumption level P_sequential_write.]
Due to the limitation in measuring power consumption, we do not make a further decomposition to separately handle each of Flash I/O, read, program, erase, and controller execution. We expect that more detailed power measurement will enable such a fine-grained decomposition to give more accurate power estimation, which is left for our future work. Figs. 2 (c) and (d) illustrate how we derive the power consumption of a per-way operation from the access-dependent part of sequential reads/writes. Fig. 2 (c) shows the result of the trace-based simulation for the case of 8-page sequential writes to the SSD of Fig. 1. At times t0, t1, t2, and t3, the controller starts to access one Flash die on each channel to transfer a page of data from the controller to the I/O buffer of the Flash die. Then, it initiates a program operation in the Flash die. After the latency of the program and controller operations, the next Flash I/O operations start at times t5, t6, t7, and t8, respectively. Fig. 2 (d) shows the corresponding power profile. First, the baseline portion covers the entire period. Then, we add up the contribution of each Flash die to the total power, as the figure shows. Thus, we see the peak plateau from t3 to t10, when all eight Flash dies and the controller are active. The measured power consumption of sequential writes corresponds to the power consumption at the plateau. Thus, we obtain the per-way power consumption of a write operation (P_per_way_write) as follows:

P_per_way_write = (P_sequential_write - P_idle) / (number of active Flash dies at the plateau)

The per-way power consumption of a read operation is calculated in the same way.
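The derivation above amounts to a one-line computation; a minimal sketch, where the function name and measurement values are our own hypothetical examples:

```python
# Per-way active power derived from the measured plateau.

def per_way_power(p_sequential: float, p_idle: float, active_dies: int) -> float:
    """P_per_way = (P_sequential - P_idle) / number of dies active at the plateau."""
    return (p_sequential - p_idle) / active_dies

# Hypothetical measurements: 2.6 W sequential-write plateau, 1.0 W idle
# baseline, 8 dies active at the plateau -> 0.2 W per way.
per_way_power(2.6, 1.0, 8)
```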
4.2 Trace-Based Simulation of Performance, Power and DPM Policy

The trace-based simulation covers the DPM policy as well as performance and power consumption. Fig. 3 (a) illustrates how a given DPM policy is simulated in the trace-based simulation. In the figure, we assume that a time-out (TO)-based DPM policy is simulated: the SSD enters a low power state after an idle period of TO since the completion of the previous access. In the figure, at time t13, an idle period starts and a TO timer starts to count down. At the same time, the power consumption drops to the level of the idle-state power consumption. At t14, the TO timer expires and a low power state is entered. The figure shows that the total power consumption drops again, down to the level of the entered low power state. At t15, a read command for 8 pages arrives. However, since the SSD is in a low power state, it takes a wakeup time, Twakeup, to make a state transition to the active state. Thus, the read operations start at time t16, after the wakeup delay. Fig. 3 (b) shows the power profile obtained in the trace-based simulation. Fig. 4 shows the pseudo code of the trace-based simulation. The simulation is event-driven and handles three types of events: host command arrival (e.g., a SATA read/write command), start/end of Flash operations (I/O operations and read/program/erase operations), and power state transition (e.g., when a time-out counter expires). The simulation advances the simulation time, Tnow, to the time point when the next event occurs (line 2 in the figure). At a time point when there is any event (line 3), we select the event in the order end → start → state transition, where 'end' and 'start' represent events for the end and start of a Flash operation, respectively (line 4). If the selected event is a host command, then we run the FTL algorithm to find the corresponding physical page addresses (PPAs) (line 6). Note that, if there is no available log buffer space, the FTL can invoke garbage collection, which schedules data move and erase operations on the corresponding channels and ways. The garbage collection method is specific to the FTL algorithm. Note also that the power consumption and runtime of the FTL algorithm are included in the access-dependent part of the per-way power consumption and the controller latency. If the current state is a low power state (line 8), we need a state transition to the active state. Thus, during the wakeup period, we mark the power state as an intermediate state of state transition (PState = Transition2Active), set the total power to the idle-state power consumption (Ptotal = Pidle), and remove, if any, a future event for a transition to a low power state (lines 8-12).
[Fig. 3. Trace-based simulation of DPM policy: (a) timeline of Flash dies 0-7, with an idle period starting at t13, the TO timer expiring at t14, a read command arriving at t15, and operations resuming at t16 after the wakeup delay Twakeup; (b) the corresponding power profile, dropping from P_idle to P_low_power_state and rising again upon wakeup.]
A host command can create 2 to 128 event pairs for the start/end of Flash operations. If it is a read or write command for a data size less than or equal to the page size, it creates two event pairs: one pair, <start, end>, for the controller and Flash internal read or write operation, and the other pair, <start, end>, for the Flash I/O operation. The 128 event pairs are created by a host command for a 64-page (256-sector) read or write. In the trace-based simulation, the created event pairs are scheduled by a function, Schedule_events_for_Flash_operations(PPAs, Tinit), where PPAs is the list of physical page addresses obtained from the FTL algorithm run (on line 6). The function performs an ASAP (as soon as possible) scheduling algorithm to schedule the event pairs. Thus, it schedules each event pair at the earliest time point when the corresponding Flash channel (for a Flash I/O operation) or way (for Flash internal operations and controller operation) becomes available. If the current state is a low power state, the scheduled time of the first event pair is adjusted to account for the wakeup delay (lines 11 and 13). If the new event (selected on line 4) is a start event and the current power state is a low power state, then the power state is set to the active state (line 17). Then, the power consumption of the newly started operation is added to the total power consumption (line 18). If the new event is an end event (line 19), then the power consumption of the just finished operation is subtracted from the total power consumption (line 20). If there is no more future event for a Flash operation, then there is no active Flash channel or way, and an idle period starts. Thus, we can insert at this point the function of the DPM policy under development. Since we assume a simple TO-based DPM policy in Fig. 4, we schedule a TO event at Tnow + TO (line 23).

1   while (Tnow < end of simulation) {
2     Advance time Tnow to the next event
3     while (any event at time Tnow) {
4       new_event = pop(event_list(Tnow))  // pop the events of end of Flash operation first
5       If (new_event == host command)
6         Run FTL to find the corresponding PPAs
7         Tinit = Tnow
8         If (current status == low power state)
9           PState = Transition2Active
10          Ptotal = Pidle
11          Tinit = Tnow + Twakeup
12          Clear future events for transition to low power state
13        Schedule_events_for_Flash_operations(PPAs, Tinit)
14      Else if (new_event == start or end of Flash operation)
15        If (new_event == start of Flash operation), then
16          If (PState == Transition2Active), then
17            PState = Active
18          Add the power consumption of the newly started operation to Ptotal
19        Else, // new_event == end of Flash operation
20          Subtract the power consumption of the just finished operation from Ptotal
21          If there is no more future event for Flash operation, then
22            // Insert DPM policy here. The following is a TO-based DPM policy example
23            Schedule a TO event at Tnow + TO
24      Else // power state transition event
25        // TO event for a power state transition in the DPM policy example
26        PState = LowPowerState
27        Ptotal = Plow_power_state
28        // If there is any lower power state, then schedule a TO event here
29    } // end of "any event at time Tnow"
30  }
Fig. 4. Pseudo code of trace-based simulation algorithm
If the new event (selected on line 4) is a power state transition event (a TO event in the DPM policy example), then the power state is set to the low power state (line 26) and the total power consumption is set to that of the low power state (line 27). If there is any lower power state, i.e., in the case of a TO-based DPM policy with more than one low power state, then we can schedule another TO event (at line 28). The entire trace-based simulation continues until all the input SSD accesses are simulated.
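The following is a compact, runnable Python sketch of the TO-based DPM behavior simulated in Fig. 4. It deliberately abstracts away channels, ways, the FTL, and the partial/slumber distinction: each host command is modeled as one busy interval of fixed service time, and a single low power state is assumed. All names and parameter conventions are our own illustrative choices, not the paper's implementation (which was written in Matlab).

```python
def simulate(arrivals, service, timeout, wakeup, p_active, p_idle, p_low):
    """arrivals: command arrival times (s). Returns (energy in J, finish time in s)."""
    energy, t = 0.0, 0.0
    for arrival in sorted(arrivals):
        gap = max(0.0, arrival - t)     # idle period before this command
        idle_part = min(gap, timeout)   # time spent in the idle state
        low_part = gap - idle_part      # time spent in the low power state (after TO)
        energy += idle_part * p_idle + low_part * p_low
        t = max(t, arrival)
        if low_part > 0:                # the device slept: pay the wakeup delay
            energy += wakeup * p_idle
            t += wakeup
        energy += service * p_active    # serve the command
        t += service
    return energy, t
```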
5 Experiments

We designed the power estimator in Matlab. For the experiments, we used a Samsung SSD (2.5", 128GB, SATA2)⁵ [18]. As the input performance and power consumption data, we used measurements obtained from real usage of the SSD on a notebook PC running Windows Vista. For performance, we used the measured latency of read/write commands for data sizes of 1, 2, 4, 8, 16, 32, 64, 128, and 256 sectors, respectively. We also used the measured power consumption of sequential reads/writes and that of the low power states (power consumption was measured for the idle state and for two low power states called partial and slumber). We collected the input traces of SSD accesses from the notebook PC by running three scenarios of MobileMark 2007: Reader, Productivity, and DVD [19]. We also used PCMark05 to collect an SSD access trace as a heavy SSD usage case [20]. The accuracy of the performance/power estimation was evaluated by comparing the estimation results with the corresponding measurement data. The comparison showed that the trace-based simulation gives the same estimation results as the measurement data in both power consumption (of sequential reads/writes) and latency (for all the read/write data sizes). We applied the trace-based simulation to a time-out (TO)-based DPM policy design for an SSD with two low power states (partial and slumber). TO-based DPM is used in HDD [6] and SSD [14] products. DPM in SSD differs from that in HDD, since DPM in SSD can exploit short idle periods (much shorter than a second) which could not be utilized in HDD due to the high wakeup delay of its mechanical parts (in seconds). Thus, DPM in SSD can give more reduction in energy consumption by entering low power states more frequently. A TO-based DPM policy requires selecting a suitable TO which gives the minimum energy consumption over various scenarios. Fig. 5 shows performance estimation results obtained by sweeping the TO value for each of the three MobileMark scenarios. In the TO sweep, for simplicity, we set the two TO parameters (one for the transition from the active state to the first low power state, partial, and the other for the transition from partial to the second low power state, slumber) to the same value. A single run of the trace-based simulation takes 1~20 minutes depending on the number of accesses and the TO value, which is 6~100+ times faster than real SSD runs.⁶

⁵ We assumed MLC, 4KB/page, 33MHz I/O frequency, 8 channels and 8 ways/channel based on the peak performance and capacity.
⁶ We expect that faster trace-based simulation can be achieved by implementing the algorithm in C/C++ rather than in Matlab.
[Fig. 5. MobileMark 2007 (a)~(c) and PCMark05 results (d): energy consumption and performance drop as the TO value is swept.]
Fig. 5 shows that there can be a trade-off between energy reduction and performance drop. The general trend is that as TO increases, energy consumption increases, since fewer idle periods are utilized for energy reduction, while the performance penalty (due to accumulated wakeup delay) decreases, since there are fewer wakeups. As shown in the figure, in the case of the MobileMark scenarios, the sensitivity of the performance drop to TO is most significant in the Productivity scenario. This is because the Productivity scenario has 5.34 and 6.26 times the SSD accesses of the DVD and Reader scenarios, respectively. However, the absolute level of performance drop is not significant in the case of the Reader and DVD scenarios and moderate in the case of the Productivity scenario. This is because the MobileMark runtime is dominated by idle periods, which occupy about 95% of the total runtime. Thus, the performance impact of the DPM policy is not easily visible with the MobileMark traces. However, users may experience performance loss due to aggressive DPM policies (e.g., a short TO such as 1ms) when the SSD is heavily accessed. PCMark05 represents such a scenario, where the host runs continuously, thus accessing the SSD more frequently than MobileMark. Fig. 5 (d) shows the result of PCMark05, which incurs up to a 16.9% performance drop in the case of an aggressive DPM policy (TO=1ms). This is mainly because PCMark05 issues 9.7 times more SSD accesses (per minute) than the Productivity scenario. Considering the results in Fig. 5, there is a trade-off between energy reduction and performance drop which designers need to investigate in low power SSD design. As a final remark, Fig. 5 shows that there is still a large gap between the result of the Oracle DPM (where we obtain the maximum energy reduction without performance drop) and that of the optimal single-TO case. Our power estimation method will contribute to the design of sophisticated DPM policies that exploit this trade-off while closing the gap.
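To illustrate how such a TO sweep can be driven, the snippet below reuses the simulate() sketch from the end of Section 4; the trace and the power/delay values are hypothetical, chosen only to expose the energy/performance trade-off.

```python
# Sweep the time-out over a synthetic trace: a burst of closely spaced
# commands followed by sparse ones (all values illustrative).
arrivals = [i * 0.05 for i in range(1000)] + [60 + i * 2.0 for i in range(30)]
for to in (0.001, 0.01, 0.1, 1.0):
    e, finish = simulate(arrivals, service=0.002, timeout=to, wakeup=0.01,
                         p_active=2.5, p_idle=1.0, p_low=0.1)
    print(f"TO={to:>6}s  energy={e:8.2f} J  finish={finish:7.2f} s")
```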
6 Conclusion

In this paper, we presented a power estimation method for SSD in embedded systems. It takes as input an SSD access trace and the measured performance and power consumption data of SSD accesses, and gives as output a power profile that accounts for the DPM policy under development. The presented method achieves accuracy in power consumption, being based on real measurement data, and fast simulation, by applying a trace-based approach. We also presented a case study of applying the method to designing a DPM policy for SSD. As future work, we will perform an extensive analysis of the accuracy of the power estimation, with comparisons against real measurements on various scenarios.
Acknowledgement

This work was supported in part by Samsung Electronics.
References

1. Kim, B.: Design Space Surrounding Flash Memory. In: International Workshop on Software Support for Portable Storage, IWSSPS (2008)
2. Creasey, J.: Hybrid Hard Drives with Non-Volatile Flash and Longhorn. In: Windows Hardware Engineering Conference (WinHEC), Microsoft (2005)
3. Communications with Samsung engineers
4. Hylick, A., Rice, A., Jones, B., Sohan, R.: Hard Drive Power Consumption Uncovered. ACM SIGMETRICS Performance Evaluation Review 35(3), 54–55 (2007)
5. Zedlewski, J., Sobti, S., Garg, N., Zheng, F., Krishnamurthy, A., Wang, R.: Modeling Hard-Disk Power Consumption. In: The USENIX Conference on File and Storage Technologies (FAST), pp. 217–230. USENIX Association (2003)
6. IBM: Adaptive Power Management for Mobile Hard Drives. IBM (1999), http://www.almaden.ibm.com/almaden/mobile_hard_drives.html
7. Lu, Y., De Micheli, G.: Comparing System-Level Power Management Policies. IEEE Design & Test of Computers 18(2), 10–19 (2001)
8. Douglis, F., Krishnam, P., Bershad, B.: Adaptive Disk Spin-down Policies for Mobile Computers. In: 2nd Symposium on Mobile and Location-Independent Computing, pp. 121–137. USENIX Association (1995)
9. Helmbold, D., Long, D., Sconyers, T., Sherrod, B.: Adaptive Disk Spin-Down for Mobile Computers. Mobile Networks and Applications 5(4), 285–297 (2000)
10. Bisson, T., Brandt, S.: Adaptive Disk Spin-Down Algorithms in Practice. In: The USENIX Conference on File and Storage Technologies (FAST). USENIX Association (2004)
11. Dirik, C., Jacob, B.: The Performance of PC Solid-State Disks (SSDs) as a Function of Bandwidth, Concurrency, Device Architecture, and System Organization. In: International Symposium on Computer Architecture, pp. 279–289. ACM, New York (2009)
12. Joo, Y., Cho, Y., Shin, D., Chang, N.: Energy-Aware Data Compression for Multi-Level Cell (MLC) Flash Memory. In: Design Automation Conference, pp. 716–719. ACM, New York (2007)
13. Ishida, K., Yasufuku, T., Miyamoto, S., Nakai, H., Takamiya, M., Sakurai, T., Takeuchi, K.: A 1.8V 30nJ Adaptive Program-Voltage (20V) Generator for 3D-Integrated NAND Flash SSD. In: International Solid-State Circuits Conference, pp. 238–239. IEEE, Los Alamitos (2009)
14. Intel: X25-M and X18-M Mainstream SATA Solid-State Drives. Intel (2009), http://www.intel.com/design/flash/nand/mainstream/index.htm
15. Lee, S., Park, D., Jung, T., Lee, D., Park, S., Song, H.: A Log Buffer-Based Flash Translation Layer Using Fully-Associative Sector Translation. ACM Transactions on Embedded Computing Systems (TECS) 6(3) (2007)
16. Kang, J., Cho, H., Kim, J., Lee, J.: A Superblock-Based Flash Translation Layer for NAND Flash Memory. In: The 6th ACM & IEEE International Conference on Embedded Software (EMSOFT), pp. 161–170. ACM, New York (2006)
17. Lee, S., Shin, D., Kim, Y., Kim, J.: LAST: Locality-Aware Sector Translation for NAND Flash Memory-Based Storage Systems. ACM SIGOPS Operating Systems Review 42(6) (2008)
18. Samsung SSD, Samsung (2009), http://www.samsung.com/global/business/semiconductor/products/flash/ssd/2008/product/pc.html
19. Business Applications Performance Corporation: MobileMark 2007. BAPCo (2007), http://www.bapco.com/products/mobilemark2007/
20. Futuremark Co.: PCMark05. Futuremark (2009), http://www.futuremark.com/products/pcmark05/
Optimizing Mobile Application Performance with Model-Driven Engineering

Chris Thompson, Jules White, Brian Dougherty, and Douglas C. Schmidt

Department of Electrical Engineering and Computer Science, Vanderbilt University, Nashville, TN, USA
{jules,briand,schmidt}@dre.vanderbilt.edu,
[email protected]

Abstract. Future embedded and ubiquitous computing systems will operate continuously on mobile devices, such as smartphones, with limited processing capabilities, memory, and power. A critical aspect of developing future applications for mobile devices will be ensuring that the application provides sufficient performance while maximizing battery life. Determining how a software architecture will affect power consumption is hard because the impact of software design on power consumption is not well understood. Typically, the power consumption of a mobile software architecture can only be determined after the architecture is implemented, which is late in the development cycle when design changes are costly. Model-driven Engineering (MDE) is a promising solution to this problem. In an MDE process, a model of the software architecture can be built and analyzed early in the design cycle to identify key characteristics, such as power consumption. This paper describes current research in developing an MDE tool for modeling mobile software architectures and using the models to generate synthetic emulation code to estimate power consumption properties. The paper provides the following contributions to the study of mobile software development: (1) it shows how models of a mobile software architecture can be built, (2) it describes how instrumented emulation code can be generated to run on the target mobile device, and (3) it discusses how this emulation code can be used to glean important estimates of software power consumption and performance.
1 Introduction

Emerging trends and challenges. Mobile devices, such as smartphones, mobile internet devices, and web-enabled media players, are becoming pervasive. These devices possess limited resources, such as battery capacity, which requires developers to carefully manage resource consumption. To optimize resource utilization, mobile application developers must understand the trade-offs between performance and battery life. It is hard to predict the effects of architectural optimizations in mobile devices until a system has been completely implemented, which makes it difficult to test power consumption and performance until late in the software lifecycle [14], e.g., during implementation and testing. Changes made at this point usually result in far-reaching consequences to the overall design and cost much more compared to
those made during earlier software lifecycle phases [12], e.g., during architectural design and analysis. Conventional techniques for developing mobile device software are not well suited to identifying performance and power consumption trade-offs during the earlier phases of the software lifecycle. These limitations stem largely from the difficulty of comparing the power consumption of one architectural design versus another without implementing and testing each on the target device. Moreover, for each function an application performs, there are often multiple possible designs for accomplishing the same task, each differing in terms of operational speed, battery consumption, and accuracy. Even though these design variations can significantly impact device performance, there are too many permutations to implement and test each one. For example, if a mobile application communicates with a server, it can do so via several protocols, such as HTTP, HTTPS, or other socket connections. Developers can also elect to have the application and/or mobile device infrastructure submit data immediately or in a batch at periodic intervals. Each design option can result in a differing power consumption profile [13]. If the developer elects to use HTTPS over HTTP, the developer is provided with additional security. The overhead associated with key exchange and the encryption/decryption process, however, incurs additional processing time and increases the amount of information that must be transmitted over the network. Both of these require more power and time than would be required if standard HTTP were used. The combination of these architectural options results in too many possible variations to implement and test each one within a reasonable budget and production cycle. A given application could have hundreds or thousands of viable configurations that satisfy the stated requirements.

Solution approach → Emulation of application behavior through model-driven testing and auto-generated code. Model-driven engineering (MDE) [15] provides a promising solution to the challenges described above. MDE relies on modeling languages, such as domain-specific modeling languages (DSMLs) [16], to visually represent various aspects of application and system design. These models can then be utilized for code generation and performance analysis. By creating a model of candidate solution architectures early in the design phase, instrumented architectural emulation code can be generated and then run on actual mobile devices. This MDE-based approach allows developers to quickly emulate a multitude of possible configurations and provides them with actual device performance data without investing the time and effort of manually writing application code. The generated code emulates the modeled architecture by consuming sensor data, computational cycles, and memory as specified in the model, as well as transmitting/receiving faux data over the network. Since wireless transmissions consume most of the power on mobile devices [3] and network interaction is a key performance bottleneck, large-scale power consumption and performance trends can be gleaned by executing the emulation code. Moreover, as the real implementation is built, the actual application logic can be used to replace the faux resource-consuming code blocks to refine the accuracy of the model.

This MDE-based solution has been utilized previously to eliminate some inherent flaws with serialized phasing in layered systems, specifically as they apply to system QoS, and to identify design flaws early in the
This MDE-based solution has been utilized previously to eliminate some inherent flaws with serialized phasing in layered systems, specifically as they apply to system QoS and to identify design flaws early in the
38
C. Thompson et al.
software production life-cycle [9]. Some prior work [8] also employs model-driven analysis to conduct what-if analysis on potential application architectures. By utilizing MDE-based analysis, mobile software developers can quantitatively evaluate key performance and power consumption characteristics earlier in the software lifecycle (e.g., at design time) rather than later (e.g., during and after implementation), thereby significantly reducing software refactoring costs due to design flaws. MDE provides this by not only allowing developers to generate emulation code, but also by providing them with a high-level understanding of their application that is easy to modify on the fly. Changes can be made at design time by simply moving model elements around rather than rewriting code. Moreover, since emulation code is automatically generated from the model, developers can quickly understand key performance and power consumption characteristics of potential solution architectures without investing the time and effort to implement them. This paper describes emerging R&D efforts that seek to provide developers of mobile applications with an MDE-based approach to optimizing application resource consumption across a multitude of platforms at design time. This paper also describes a methodology for increasing battery longevity in mobile devices through applicationlayer modifications. By focusing on the application layer, developers can still reap the benefits of advanced SDKs and compilers that shield the developer from hardwarecentric decisions. Paper organization. The remainder of this paper is organized as follows: Section 2 presents a sample mobile application running on Google’s Android platform and introduces several challenges associated with resource consumption optimization and mobile application development; Section 3 discusses our current research work developing an MDE tool to allow developers to predict software architecture performance and power consumption properties earlier in the development process; Finally, Section 4 presents concluding remarks and lessons learned.
2 Motivating Example This section presents a motivating mobile application running on Google’s Android platform and describes several challenges associated with resource consumption optimization and mobile application development. 2.1 Overview of Wreck Watch Managing system resources properly can significantly affect device performance and battery life. For instance, reducing CPU instructions not only speeds performance but also reduces the time the process is in a non-idle state thereby reducing power consumption; reducing network traffic also speeds performance and reduces the power supplied to the radio. To demonstrate the importance of proper resource management and the value of model-based resource analysis, we present the following example mobile application, called Wreck Watch, shown in Figure 1.
Optimizing Mobile Application Performance with Model–Driven Engineering
39
Fig. 1. Wreck Watch Behavior
Wreck Watch runs on Google Android smartphones to detect car accidents (1) by analyzing data from the device’s GPS receiver and accelerometer and looking for sudden acceleration events from a high velocity indicative of a collision. Car accident data is then posted to an HTTP server where (2) it can be retrieved by other devices in the area to help alleviate traffic congestion, notify first responders, and (3) provide accident photos to an emergency response center. Users of Wreck Watch can also elect to have certain people contacted in the event of an accident via SMS message or a digital PBX. Figure 1 shows this behavior of Wreck Watch. Since the Wreck Watch application runs continuously in the background, it must conserve its power consumption. The application needs to run at all times and consume a great deal of sensor information to accurately detect wrecks. If not designed properly, therefore, these characteristics could result in a substantial decrease in battery life. In our testing, for example, the Wreck Watch application was able to completely drain the device battery in less than an hour simply through its use of sensors and network connectivity. In the case of Wi-Fi, the radio represents nearly 70% of device power consumption [2] and in extreme cases can consume 100 times the power of one CPU instruction to transmit one byte of data [3]. The amount of power consumed by the network adapter is generally proportional to the amount of information transmitted [1]. The framing and overhead associated with each protocol can therefore significantly affect the power consumption of the network adapter. Prior work [5] demonstrated that significant power savings could be achieved by modifying the MAC layer to minimize collisions and maximize time spent in the idle state. This work also recognized that network operations generally involved only the CPU and transceiver and by reducing client-side processing, they could substantially reduce power consumed by network transactions. Similarly, other work [7] demonstrated that such power savings could also be achieved through transport layer modifications.
40
C. Thompson et al.
Although MAC and transport layer modifications are typically beyond the scope of most software projects, especially mobile application development, the data transmitted on the network can be optimized so it is as lightweight as possible, thereby accomplishing, on a much smaller scale, some of the same effects. The remainder of this paper uses the Wreck Watch application to showcase key design challenges that developers face when building power-aware applications for mobile devices. 2.2 Design and Behavioral Challenges of Mobile Application Development Despite the ease with which mobile applications can be developed via advanced SDKs (such as Google Android and Apple iPhone) developers still face many challenges related to power consumption. If developers do not fully understand the implications of their designs, they can substantially reduce device performance. Battery life represents a major metric used to compare devices and can be influenced significantly by design decisions. Designing mobile applications while remaining cognizant of battery performance presents the following challenges to developers: Challenge 1: Accurately predicting battery consumption of arbitrary architectural decisions is hard. Each instruction executed can result in the consumption of an unknown amount of battery power. Accurately predicting the power consumed for each line of code is hard given the level of abstraction present in modern SDKs, as well as the complexity and numerous variations between physical devices. Moreover, disregarding language commonalities between completely unrelated devices, mobile platforms, such as Android, are designed to operate on a plethora of hardware configurations, which may affect the power consumption of a given configuration. Challenge 2: Trade-offs between performance and battery life are not readily apparent. Although performance and power consumption are generally design tradeoffs, the actual relationship between the two metrics is not readily apparent. For example, when comparing two networking protocols, plain HTTP and SOAP, plain HTTP might operate much faster requiring only 10 ms to transmit the data SOAP requires 50 ms to transmit. At the same time, HTTP might consume .5 mW, while SOAP consumes 1.5 mW. Without the context of real-world performance in a physical device it would be difficult to predict the overhead associated with SOAP. Moreover, this data may vary from one device to the next. Challenge 3: Effects of transmission medium on power consumed are largely device, application, and environment specific. Wireless radios consume a substantial amount of device power relative to other mobile-device components [6], where the power consumed is directly proportional to the amount of information transmitted [1]. Each radio also provides differing data rates, as well as power consumption characteristics. Depending on the application, developers must choose the connection medium best suited to application requirements, such as medium availability and transmission rate. The differences between transmission media are generally subtle and may even depend on environmental factors [10], such as network congestion that are impossible to accurately predict. To deterministically and accurately quantify performance, therefore, testing must be performed in environmentally-accurate situations.
Challenge 4: It is hard to accurately predict the effects of reducing sensor data consumption rates on power utilization. To provide the most accurate readings and results, device sensors would be polled as frequently as they sample data. This method consumes the most power, however, by not only requiring that the sensor be enabled constantly, but also by increasing the amount of data the device must process. Conversely, reducing the time that the sensor is active significantly reduces the effectiveness and accuracy of the readings. Determining the exact amount of power saved by a reduction in polling rate or another change in sensor accuracy is difficult without profiling such a change on a device.
Challenge 5: Accurately assessing the effects of different communication protocols on performance is hard without real-world analysis. Each communication protocol has a specific overhead associated with it that directly affects its overall throughput. The natural choice would be to select the protocol with the lowest overhead. While this decision yields the highest performance, it also results in a tightly coupled architecture [11] and substantially increases production time. Such a protocol would only be useful for the specific data set for which it was designed, in contrast to a standardized protocol, such as HTTP. Standardized protocols often support features that are unnecessary for many mobile applications, however, making the additional data required for HTTP transactions pure overhead. It is challenging to predict how much of a trade-off in performance is required to select the more extensible protocol, because the power cost of such protocols cannot be known without profiling them in a real-world usage scenario.
Discussions of performance optimization have often focused on hardware- or firmware-level changes and ignored potential application-layer enhancements [3,5,6]. Interestingly, this corresponds to the level of abstraction present in each layer: device drivers and hardware have little or no abstraction, while software applications are often more thoroughly abstracted. It is this level of abstraction, however, that makes such optimizations challenging, because the developer often has little or no control over the final machine code. Application code thus cannot be benchmarked until it has been fully developed and compiled. Moreover, problems identified after the code is developed are substantially more costly to correct than those that can be identified at design time. Optimizing the performance of an application before any code is written is therefore of great value. Moreover, because power consumption is generally hardware-specific [1], such optimizations result in a tightly coupled architecture that requires the developer to rewrite code to benchmark other configurations.
3 Model-Based Testing and Performance Analysis
This section describes our current work in developing a modeling language extension to the Generic Eclipse Modeling System (GEMS) (www.eclipse.org/gmt/gems) [17], called the System Power Optimization Tool (SPOT), for optimizing the performance and power consumption of mobile applications at design time. GEMS is an MDE tool for building Domain-Specific Modeling Languages (DSMLs) for the Eclipse platform. The goal of SPOT is to allow developers to rapidly model potential application architectures and
obtain feedback on the performance and power consumption of the architecture without manual implementation. The performance data is produced by generating instrumented architectural emulation code from the architectural model, which is then run on the target hardware. After execution, cumulative results can be downloaded from the target device for analysis. This section describes the modeling language, emulation code generation, and performance measurement infrastructure that we are developing to address the five challenges described in Section 2.2.
3.1 Mobile Application Architecture Modeling and Power Consumption Estimation with SPOT
To accurately model mobile device applications, SPOT provides a domain-specific modeling language (DSML) with components that (1) represent key, resource-consuming aspects of a mobile application's architecture and (2) allow developers to specify visual diagrams of a mobile application architecture, as shown in the workflow diagram in Figure 2. SPOT's DSML, called the System Power Optimization Modeling Language (SPOML), allows developers to build architectural specifications from the following types of model elements:
• CPU consumers, which represent computationally intensive code segments, such as location-based notifications that require distance calculations on hundreds of points.
• Memory consumers, which represent sections of application code that incur heavy memory operations, reducing performance and increasing power consumption, e.g., displaying an image, stored on disk, on the screen.
• Sensor data consumers, which poll device sensors at user-defined intervals.
• Network consumers, which periodically utilize network resources, emulating actual application traffic.
• Screen drawing agents, which interact with device graphics libraries, such as OpenGL, to consume power by rendering images to the display.
The sensor and network data consumers operate independently of application logic and simply present an interface through which their data can be accessed. The CPU consumer, however, needs to incorporate application-specific logic, as well as logic from other aspects of the application. The CPU consumer module also allows developers to integrate actual application logic as it becomes available, replacing the emulation code generated from SPOML.
To provide the software developer with the most flexibility and extensibility possible, SPOML exposes the key power-consuming architectural options that would be present if developers were actually writing device code. For example, if the device presents 10 possible options for the granularity of GPS readings, SPOML provides all 10 possibilities via visual elements, such as drop-down menus and check boxes. SPOML also provides constraint checking that warns developers at design time if certain configuration options are unlikely to work together. Ultimately, SPOT gives developers the ability to rapidly modify design characteristics and model their system without any application-specific logic, as well as a means to incorporate actual application code.
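As an illustration of what a sensor data consumer could expand to, the following sketch polls a sensor at the interval taken from the model and exposes the latest reading through a simple interface. The class and method names are our own assumptions, not SPOT's actual generated code.

class SensorConsumer implements Runnable {
    private final long pollIntervalMs;   // interval taken from the SPOML model
    private volatile float lastReading;

    SensorConsumer(long pollIntervalMs) { this.pollIntervalMs = pollIntervalMs; }

    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            lastReading = readSensor();            // hypothetical device access
            try {
                Thread.sleep(pollIntervalMs);      // poll at the modeled rate
            } catch (InterruptedException e) {
                return;
            }
        }
    }

    float latest() { return lastReading; }         // interface for other consumers
    private float readSensor() { return 0f; }      // placeholder for the device API
}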
Fig. 2. SPOT Analysis Cycle
3.2 Architectural Emulation Code Generation
Due to the difficulty of estimating power consumption for an arbitrary device and software architecture, it is essential to evaluate application performance on the actual physical hardware under production conditions. To accomplish this task, SPOT can automatically generate instrumented code that performs the functions outlined by the architecture modeled in SPOML. This code generation is done by traversing the in-memory object graph of the model and outputting optimized code that performs the resource-intensive operations specified in the model.
The architectural emulation code is constructed from several basic building blocks, as described above. The sensor consumers remain largely the same between applications and require little input from the user developing the model. The only variable in their construction is the rate at which they poll the sensor. They present an interface through which their data can be accessed. The network consumer itself consists of several modules: a protocol, a transmission scheme, and a payload interface. The payload interface defines methods that allow other components of the application to utilize the network connection and, for the purposes of emulation and analysis, this interface also helps define the structure of the data to transmit. The protocol module allows the developer to select from a set of predefined protocols (e.g., HTTP or SOAP) or create a custom protocol with a small amount of code. The transmission scheme defines a set of behaviors for how to
transmit data back to the server, which allows developers to specify whether the application should transmit as soon as data is available, wait until a certain amount of data is available, or even wait until a certain connection medium is available (such as Wi-Fi or EDGE). Finally, the screen rendering agent allows users to specify the interval at which the screen is refreshed or invalidated for a given view.
Each module described above relies almost entirely on prewritten and optimized code. Of greater complexity for users are the CPU and memory consumers. Users may elect to utilize prewritten code that closely resembles the functionality they wish to provide. Alternatively, they can write their own code to use in these modules to profile their architecture more accurately. This iterative approach allows developers to quickly model their architecture without writing detailed application logic and then, as this code becomes available, refine their analysis to better represent the performance and behavior of the ultimate system.
3.3 Performance and Resource Consumption Management
When generating emulation code, SPOT also generates instrumentation code to record device performance and track power consumption. This code writes these metrics to a file on the device that can later be downloaded to a host machine for analysis after testing. This approach allows developers to quantitatively compare metrics such as application responsiveness (by way of processor idle time, etc.), network utilization and throughput, and battery longevity. These comparisons provide the developer with a means to quickly and accurately design a system that minimizes power consumption without sacrificing performance. In some instances, this analysis could even highlight simple changes, such as reducing the size of XML tags to cut the overhead associated with downloading information from a server.
For each challenge presented in Section 2.2 we establish that, with current methods, certain characteristics of a design can only be fully understood post-implementation. Additionally, with newer platforms such as Google's Android, the mobile device has become an embedded multi-application system. Since each device has significantly fewer resources than its tethered brethren, however, individual applications must be cognizant of their resource consumption. The value of understanding a given application's power consumption profile is thus greatly increased. The solutions to each of these challenges lie within the same space: utilization of a model that can be used to accurately assess battery life. SPOT addresses mobile application performance analysis through the use of auto-generated code specified by a DSML, which allows users to estimate performance and power consumption early in the development process. Moreover, developers can perform continuous integration testing by replacing faux code with application logic as it is developed.
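The instrumentation described in Section 3.3 can be pictured with a small sketch. This is our own simplification with assumed names, not SPOT's actual generated code: metrics are appended as timestamped records to a device-local file that is later pulled to a host machine for analysis.

class MetricsLogger {
    private final java.io.PrintWriter out;

    MetricsLogger(String path) throws java.io.IOException {
        // Append mode, so records accumulate across test runs.
        out = new java.io.PrintWriter(new java.io.FileWriter(path, true));
    }

    void log(String metric, long value) {
        // One record per line: timestamp, metric name, value
        // (e.g., bytes transmitted, processor idle time).
        out.printf("%d,%s,%d%n", System.currentTimeMillis(), metric, value);
        out.flush();
    }
}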
4 Concluding Remarks
The capabilities of mobile devices have increased substantially over the last several years and, with platforms such as Apple's iPhone and Google's Android, will no doubt continue to expand. These platforms have ushered in a new era of applications and have presented developers with a wealth of new opportunities. Unfortunately,
with these new opportunities have come new challenges that developers must overcome to make the most of these cutting-edge platforms. In particular, predicting the performance characteristics of a given design is hard, especially those characteristics associated with power consumption. A promising approach to address these challenges is to enhance model-driven engineering (MDE) tools so that developers can quickly understand the consequences of architectural decisions. These conclusions can be drawn long before implementation, significantly reducing production costs and time while substantially increasing battery longevity and overall system performance. From our experience developing SPOT, we have learned the following lessons:
• By utilizing MDE it becomes possible to quantitatively compare design decisions and deliver some level of optimization with regard to power consumption,
• Developing applications for platforms such as Android requires extensive testing, as hardware configurations can greatly influence performance, and
• It is impossible to completely profile a system configuration, because ultimate device performance and power consumption depend on user interaction, network traffic, and other applications on the device.
The Wreck Watch application is available under the Apache open-source license and can be downloaded at http://vuphone.googlecode.com.
References
1. Feeney, L., Nilsson, M.: Investigating the energy consumption of a wireless network interface in an ad hoc networking environment. In: IEEE INFOCOM, vol. 3, pp. 1548–1557 (2001)
2. Liu, T., Sadler, C., Zhang, P., Martonosi, M.: Implementing software on resource-constrained mobile sensors: experiences with Impala and ZebraNet. In: Proceedings of the 2nd International Conference on Mobile Systems, Applications, and Services, pp. 256–269 (2004)
3. Pering, T., Agarwal, Y., Gupta, R., Want, R.: CoolSpots: Reducing the power consumption of wireless mobile devices with multiple radio interfaces. In: Proceedings of the Annual ACM/USENIX International Conference on Mobile Systems, Applications and Services, MobiSys (2006)
4. Poole, J.: Model-driven architecture: Vision, standards and emerging technologies. In: Workshop on Metamodeling and Adaptive Object Models, ECOOP (2001)
5. Chen, J., Sivalingam, K., Agrawal, P., Kishore, S.: A comparison of MAC protocols for wireless local networks based on battery power consumption. In: IEEE INFOCOM 1998, Seventeenth Annual Joint Conference of the IEEE Computer and Communications Societies (1998)
6. Krashinsky, R., Balakrishnan, H.: Minimizing energy for wireless web access with bounded slowdown. Wireless Networks 11, 135–148 (2005)
7. Kravets, R., Krishnan, P.: Application-driven power management for mobile communication. Wireless Networks 6, 263–277 (2000)
8. Paunov, S., Hill, J., Schmidt, D., Baker, S., Slaby, J.: Domain-specific modeling languages for configuring and evaluating enterprise DRE system quality of service. In: 13th Annual IEEE International Symposium and Workshop on Engineering of Computer Based Systems, ECBS 2006 (2006)
9. Hill, J., Tambe, S., Gokhale, A.: Model-driven engineering for development-time QoS validation of component-based software systems. In: Proceedings of the International Conference on Engineering of Component Based Systems (2007)
10. Carvalho, M., Margi, C., Obraczka, K., Garcia-Luna-Aceves, J.J.: Modeling energy consumption in single-hop IEEE 802.11 ad hoc networks. In: Thirteenth International Conference on Computer Communications and Networks (ICCCN 2004), pp. 367–377 (2004)
11. Gay, D., Levis, P., Culler, D.: Software design patterns for TinyOS. ACM, New York (2007)
12. Boehm, B.: A spiral model of software development and enhancement. In: Software Engineering: Barry W. Boehm's Lifetime Contributions to Software Development, Management, and Research, vol. 21, p. 345. Wiley-IEEE Computer Society Press (2007)
13. Tan, E., Guo, L., Chen, S., Zhang, X.: PSM-throttling: Minimizing energy consumption for bulk data communications in WLANs. In: IEEE International Conference on Network Protocols, ICNP 2007, pp. 123–132 (2007)
14. Kang, J., Park, C., Seo, S., Choi, M., Hong, J.: User-centric prediction for battery lifetime of mobile devices. In: Proceedings of the 11th Asia-Pacific Symposium on Network Operations and Management: Challenges for Next Generation Network Operations and Service Management, pp. 531–534 (2008)
15. Kent, S.: Model driven engineering. In: Butler, M., Petre, L., Sere, K. (eds.) IFM 2002. LNCS, vol. 2335, pp. 286–298. Springer, Heidelberg (2002)
16. Lédeczi, A., Bakay, A., Maroti, M., Völgyesi, P., Nordstrom, G., Sprinkle, J., Karsai, G.: Composing domain-specific design environments. Computer, 44–51 (2001)
17. White, J., Schmidt, D.C., Mulligan, S.: The Generic Eclipse Modeling System. In: Model-Driven Development Tool Implementer's Forum at the 45th International Conference on Objects, Models, Components and Patterns, Zurich, Switzerland (June 2007)
A Single-Path Chip-Multiprocessor System

Martin Schoeberl, Peter Puschner, and Raimund Kirner

Institute of Computer Engineering, Vienna University of Technology, Austria
[email protected], {peter,raimund}@vmars.tuwien.ac.at
Abstract. In this paper we explore the combination of a time-predictable chip-multiprocessor system with the single-path programming paradigm. Time-sliced arbitration of main memory access provides time-predictable memory load and store instructions. Single-path programming avoids control-flow-dependent timing variations. To keep the execution time of tasks constant, even in the case of shared memory access by several processor cores, the tasks on the cores are synchronized with the time-sliced memory arbitration unit.
1 Introduction
As more and more speedup features are added to modern processors and we are moving from single-core to multi-core processor systems, the analysis of the timing of the applications running on these systems is getting increasingly complex. The timing of single tasks per se is difficult to understand and to analyze. Besides that, task timing can no longer be considered an isolated issue in such systems, as the competition for shared resources and interferences via the state of the shared hardware lead to mutual dependencies of the progress and timing of different tasks.
We are convinced that the only way of making these highly complex processing systems time-predictable is to impose some restrictions on their architecture and on the way in which the mechanisms of the architecture are used. So far we have worked along two main lines of research aiming at real-time processing systems with predictable timing:
On the software side we have conceived the single-path execution strategy [1]. The single-path approach allows us to translate task code in a way that the resulting code has exactly one execution trace that all executions of the task have to follow. To this end, the single-path conversion eliminates all input-dependent control flow decisions – by applying a set of code transformations [2] and if-conversion [3] it translates all input-dependent alternatives (i.e., code with if-then-else semantics) into straight-line predicated code. Loops with input-dependent termination are converted into loops that are semantically equivalent but whose iteration count is fully determined at system construction time.
Architecture-wise we have been working on time-predictable processors and chip-multiprocessor (CMP) systems. We have developed the JOP prototype of a time-predictable processor [4] and built a CMP system with a number of JOP cores [5]. In this multiprocessor system a static time-division multiple access (TDMA) arbitration scheme controls the accesses of the cores to the common memory. The pre-planning of
memory access schedules eliminates the need for dynamic conflict resolution and guarantees the temporal isolation that is necessary to allow for an independent progression of the computations on the CMP cores.
So far, we have dealt with each of the two topics in separation. This paper is the first that describes our work on combining the concepts of the single-path approach and our time-predictable CMP architecture. We thus present an execution environment that provides both temporal predictability to the highest degree and the performance benefits of parallel code execution on multiple cores. By generating deterministic single-path code, running this code on predictable processor cores, and using a rigid, pre-planned scheme to access the global memory, we manage to achieve completely stable, and therefore predictable, execution times for each single task in isolation as well as for entire applications consisting of multiple cooperating tasks running on different cores. To the best of our knowledge this has not been achieved for any other state-of-the-art CMP system so far.
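As a concrete illustration of the single-path conversion sketched above, the following Java fragment is our own example (not taken from [1,2]) of how a search loop with input-dependent termination can be rewritten with a constant iteration count and predicated updates; on JOP the ternary updates would be realized with the constant-time conditional move described in Section 3.1.

// Original form: terminates early, so its timing depends on the input.
// for (int i = 0; i < data.length; i++)
//     if (data[i] == key) return i;
// return -1;

// Single-path form: always iterates data.length times, so every
// execution follows the same instruction trace.
static int findFirst(int[] data, int key) {
    int result = -1;
    boolean found = false;
    for (int i = 0; i < data.length; i++) {
        boolean hit = !found & (data[i] == key); // non-short-circuit predicate
        result = hit ? i : result;               // predicated move (condMove on JOP)
        found = found | hit;                     // non-short-circuit update
    }
    return result;
}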
2 The Single-Path Chip-Multiprocessor System
The main goal of our approach is to build an architecture that provides a combination of good performance and high temporal predictability. We rely on chip-multiprocessing to achieve the performance goal and on an offline-planning approach to make our system predictable. The idea of the latter is to take as many control decisions as possible before the system is actually run. This reduces the number of branching decisions that need to be taken during system operation, which, in turn, reduces the number of possible action sequences with possibly different timings that need to be considered when planning and evaluating the system's timely operation.
2.1 System Overview
We consider a CMP architecture that hosts n processor cores, as shown in Figure 1. On each core the execution of simple tasks is scheduled statically as a cyclic executive. All cores' schedulers have the same major cycle, which is synchronized to the shared memory arbiter. Each of the processors has a small local method cache (M$) for storing recently used methods, a local stack cache (S$), and a small local scratchpad memory (SPM) for storing temporary data. The scratchpad memory can be mapped to thread-local scopes [6] for integration into the Java programming language. All caches contain only thread-local data and therefore no cache coherence protocol is needed. To avoid cache conflicts between the different cores our CMP system does not provide a shared cache. Instead, the cores of the time-predictable CMP system access the shared main memory via a TDMA based memory arbiter with fine-grained, statically scheduled access.
2.2 TDMA Memory Arbiter
The TDMA based memory arbiter provides a static schedule for memory access. Therefore, the access time to memory is independent of tasks running on other cores. In the default configuration each processor core has an equally sized slot for the memory
Fig. 1. A JOP based CMP system with core local caches (M$, S$) and scratchpad memories (SPM), a TDMA based shared memory arbiter, and the memory controller
access. The TDMA schedule can also be optimized for different utilizations of the processing cores. In [7] we optimized the TDMA schedule to distribute the slack time of tasks to other tasks with tighter deadlines.
The worst-case execution time (WCET) of memory load and store instructions can be calculated by considering the worst-case phasing of the memory access pattern relative to the TDMA schedule [8]. With single-path programming, and the resulting static memory access pattern, the execution time of tasks on a TDMA based CMP system is almost constant. The only jitter results from different phases of the task start time relative to the TDMA schedule. The maximal execution-time jitter, due to different phases between the task start time and the TDMA schedule, is the length of the TDMA round minus one. Thus, the TDMA arbiter supports time-predictable program execution very well. The maximal jitter due to TDMA delays is bounded and relatively small. If one wants to avoid even this short, bounded execution-time jitter, this can be achieved by synchronizing the task start with the TDMA schedule, using the deadline instruction described in Section 3.2.
2.3 Tasks
All tasks in our system are periodic. Tasks are considered to be simple tasks according to the simple-task model introduced in [9]:¹ Task inputs are assumed to be available when
¹ More complex task structures can be simulated by splitting tasks into sets of cooperating simple tasks.
a task instance starts, and outputs become ready for further processing upon completion of a task execution. Within its body a task is purely functional, i.e., it neither accesses common resources nor includes delays or synchronization operations. To realize the simple-task abstraction, a task implementation actually consists of a sequence of three parts: read inputs – execute – write outputs. While the application programmer must provide the code for the execute part (i.e., the functional part), the first and the third part can be automatically generated from the description of the task interface. These read and write parts of the task implementation copy data between the shared state and task-local copies of that state. The local copies can reside in the common main memory or in the processor-local scratchpad memory. The placement depends on the access frequency and the size of the local state.
Care must be taken to schedule the data transfers between the local state copy and the global, shared state such that all precedence and mutual exclusion constraints between tasks are met. This scheduling problem is very similar to the problem of constructing static scheduling tables for distributed hard real-time computer systems with TDMA message scheduling, in which task execution has to be planned such that task-order relations are obeyed and the message and task sequencing guarantees that all communication constraints are met. A solution to this scheduling problem can be found in [10].
Following our strategy to achieve predictability by minimizing the number of control decisions taken at runtime, all tasks are implemented in single-path code. This means that we apply the single-path transformation described in [1,2] to (a) serialize all input-dependent branches and (b) transform all loops with input-dependent termination into loops with a constant iteration count. In this way, each instance of a task executes the same sequence of instructions and has the same temporal access pattern to instructions and data.
2.4 Mechanisms for Performance and Time Predictability
By executing tasks on different cores with some local cache and scratchpad memory we manage to increase the system's performance over a single-processor system. The following mechanisms make the operation of our system highly predictable:
– Tasks on a single core are executed in a cyclic executive, avoiding cache influences due to preemption.
– Accesses to the global shared memory are arbitrated by a static TDMA memory arbitration scheme, thus leaving no room for unpredictable conflict resolution schemes and unknown memory access times.
– The starting point of all task periods and the starting point of the TDMA cycle for memory accesses are synchronized, and each task execution starts at a pre-defined offset within its period. Further, the single-path task implementation guarantees a unique trace of instruction and memory accesses. All these properties taken together allow for an exact prediction of instruction execution times and memory access times, thus making the overall task timing fully transparent and predictable.
– As the read and write sections of the tasks may need more than a single TDMA slot for transferring their data between the local and the global memory, read and write operations are pre-planned and executed in synchrony with the global execution cycle of all tasks.
Besides its support for predictability, our planning-based approach allows for the following optimizations of the TDMA schedules for global memory accesses. These optimizations are based on the knowledge available at planning time:
– The single-path implementation of tasks allows us to spot exactly which parts of a task's execute part need a higher and which parts need a lower bandwidth for accessing the global memory (e.g., a task does not have to fetch instructions from global memory while executing a method that it has just loaded into its local cache). This information can be used to adapt the memory access schedule to optimize the overall performance of memory accesses. While an adaptation of memory-access schedules to the bandwidth requirements of different processing phases has been proposed before [11,12], it seems that this technique can provide its maximum benefit when applied to single-path code – only the execution of single-path code yields a unique, and therefore fully predictable, sequence and timing of memory accesses.
– A similar optimization can be applied to the timing of memory accesses during the read and write sections of the task implementations. These sections access shared data and should therefore run under mutual exclusion. Mutual exclusion is guaranteed by the static, table-driven execution regime of the system. Still, the critical sections should be kept short. The latter could be achieved by an adaptation of the TDMA memory schedule that assigns additional time slots to tasks at times when they perform memory-transfer operations.
Our target is a time-deterministic system, which means that not only the value of a function is deterministic, but also its execution time. It is desirable to know exactly which instruction is executed at each point in time. Execution time shall be a repeatable and predictable property of the system [13].
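To make the arbiter timing of Section 2.2 concrete, the following small sketch (our own simplification, not JOP code) computes the TDMA round length and the resulting bound on the execution-time jitter of an unsynchronized task start.

// Simplified model of the TDMA arbiter timing: with n cores and a slot
// of s cycles, the round is n * s cycles, and a task whose start is not
// aligned with the schedule can observe a jitter of up to round - 1
// cycles (cf. Section 2.2).
static int tdmaRound(int cores, int slotCycles) {
    return cores * slotCycles;
}

static int maxStartJitter(int cores, int slotCycles) {
    return tdmaRound(cores, slotCycles) - 1;
}
// Example: tdmaRound(3, 6) == 18 cycles, maxStartJitter(3, 6) == 17,
// matching the configuration evaluated in Section 4.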
3 Implementation
The proposed design is evaluated in the context of the CMP system [5] based on the Java optimized processor (JOP) [4]. We have extended JOP with two instructions: a predicated move instruction for single-path programming in Java and a deadline instruction to synchronize application tasks with the TDMA based memory arbiter.
3.1 Conditional Move
Single-path programming replaces control decisions (if-then-else) with predicated move instructions. To avoid execution-time jitter, the predicated move has to have a constant execution time. On JOP we have implemented a predicated move for integer values and references. This instruction represents a new, system-specific Java virtual machine (JVM) bytecode. This new bytecode is mapped to a native function for access from Java code. The semantics of the function

result = Native.condMove(x, y, b);
is equivalent to
result = b ? x : y;
without the need for any branch instruction. The following listing shows the usage of conditional moves for integer and reference data types. The program will print 1 and true.

String a = "true";
String b = "false";
String result;
int val;
boolean cond = true;

val = Native.condMove(1, 2, cond);
System.out.println(val);
result = (String) Native.condMoveRef(a, b, cond);
System.out.println(result);
The representation of the conditional move as a native function call has no call overhead. The function call is substituted by the system-specific bytecode at link time (similar to function inlining).
3.2 Deadline Instruction
In order to synchronize a task with the TDMA schedule, a wait instruction with a resolution of single clock cycles is needed. We have implemented a deadline instruction as proposed in [14]. The deadline instruction stalls the processor pipeline until the desired time in clock cycles. To avoid a change in the execution pipeline, we have implemented a semantic equivalent to the deadline instruction: instead of changing the instruction set of JOP, we have implemented an I/O device for the cycle-accurate delay. The time value for the absolute delay is written to the I/O device, and the device delays the acknowledgment of the I/O operation until the cycle counter reaches this value. This simple device is independent of the processor and can be used in any architecture where an I/O request needs an acknowledgment.
I/O devices on JOP are mapped to so-called hardware objects [15]. A hardware object represents an I/O device as a plain Java object. Field reads and writes are actual I/O register reads and writes. The following code shows the usage of the deadline I/O device.

SysDevice sys = IOFactory.getFactory().getSysDevice();
int time = sys.cntInt;
time += 1000;
sys.deadLine = time;
The first instruction requests a reference to the system device hardware object. This object (sys) is accessed to read the current value of the clock cycle counter. The deadline is set to 1000 cycles after the current time, and the assignment sys.deadLine = time writes the deadline time stamp into the I/O device and blocks until that time.
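Building on this device, a periodic task can phase-lock its iterations to the TDMA schedule by advancing the deadline by a constant period each round. This is our own sketch: the period value and doTaskBody() are illustrative assumptions, with the period chosen as a multiple of the TDMA round as discussed in Section 4.

SysDevice sys = IOFactory.getFactory().getSysDevice();
final int PERIOD = 1800;     // assumed period: a multiple of the TDMA round
int time = sys.cntInt;       // current clock cycle count

for (;;) {
    time += PERIOD;          // next release time, phase-locked to the arbiter
    sys.deadLine = time;     // blocks until the release time is reached
    doTaskBody();            // hypothetical single-path task code
}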
4 Evaluation
We evaluate our proposed system within a Cyclone EP1C12 field-programmable gate array that contains three processor cores and 1 MB of shared memory. The shared memory is an SRAM with a read access time of 2 cycles and a write access time of 3 cycles. Some bytecode instructions contain several memory accesses (e.g., an array access needs three memory reads: a read of the array size for the bounds check, an indirection through a forwarding handle,² and the actual read of the array value). For several bytecode instructions the WCET is minimized with a slot length of 6 cycles. The resulting TDMA round for three cores is 18 cycles.
As a first experiment we measured the execution time of a short program fragment with access to the main memory. Without synchronizing the task start with the TDMA arbiter we expect some jitter. To provoke all possible phase relations between the task and the TDMA schedule, the deadline instruction was used to shift the task start relative to the TDMA schedule. The resulting execution time varies between 342 and 359 clock cycles. Therefore, the maximum observed execution-time jitter is the length of the TDMA round minus one (17 cycles). With the deadline instruction we make each iteration of the task start at multiples of the TDMA round (18 clock cycles in our example). In that case each task executes for a cycle-accurate, constant duration. This little experiment shows that single-path programming on a CMP system, synchronized with the TDMA based memory arbitration, results in repeatable execution time [13].
4.1 A Sample Application
To validate our programming model for cycle-accurate real-time computing, we developed a controller application that consists of five communicating tasks. This case study demonstrates that cycle-accurate computing is possible on a CMP system. Further, it gives us some insights into the practical aspects of using the proposed programming model.
The architecture of the sample application is given in Figure 2. The application is demonstrative because of its rather complex inter-process communication pattern, which shows the need for precise scheduling decisions to meet the different precedence constraints. The application consists of the following tasks:
– τ1 and τ2 are the sampling tasks that read from sensors. τ1 samples the reference value and τ2 samples the system value. These two tasks share the same code base and run at double the frequency of the controller task to allow low-pass filtering by averaging the sensor values.
– τ3 is the proportional-integral-derivative controller (PID controller) that gets the reference value from τ1 and the feedback of the current system value from τ2.
– τ4 is a system guard, similar to a watchdog timer, that monitors the liveness of τ1, τ2, and τ3. Whenever the write phase of τ1, τ2, and τ3 has not been executed between two subsequent activations of τ4, the system is set into an error state.
² The forwarding handle is needed for the implementation of the real-time garbage collector.
Fig. 2. Sample application: control application

Fig. 3. Communication directions of the control application
– τ5 is a monitoring task that periodically collects the sensor values (from τ1 and τ2) and the control value (from τ3). The write part of τ5 is currently empty, but it can be used to include code for transferring the collected system state to a host computer.
The inter-task communication of the sample application is summarized in Figure 3. It shows that this small application has a relatively complex communication pattern: each task communicates with almost all other tasks. The communication pattern has a direct influence on the system schedule. The resulting precedence constraints have to be taken into account when scheduling the read, execute, and write phases of each task. And of course, since this is a CMP system, some of the task phases are executed in parallel, which complicates the search for a tight schedule.
Tasks τ1–τ5 are implemented in single-path code, thus their execution time does not depend on control-flow decisions. Since the scheduler also has a single-path implementation, the system executes exactly the same instruction sequence in each scheduling round.
Table 1. Measured single-path execution times in clock cycles

Task      Read   Execute   Write   Total
τ1, τ2     594       774     576    1944
τ3         864     65250     576   66690
τ4       26604       324   28422   55350
τ5        1368       324     324    2016
All tasks are synchronized on each activation with the same phase of the TDMA based memory arbiter. Therefore, their execution time does not have any jitter due to different phase alignments with the memory arbiter. With such an implementation style it is possible on JOP to determine the WCET of each task directly by a single execution-time measurement (by enforcing either a cache hit or a cache miss of the method). Table 1 shows the observed WCET values for each task, given separately for the read, execute, and write parts. The absolute WCET values themselves are less important than the fact that the execution time of each task is deterministic and does not depend on the input data.
To summarize the practical aspects of the programming model: even this relatively simple application results in a scheduling problem that is rather tricky to solve without tool support. For the purpose of this paper we solved it manually, using a graphical visualization of the relative execution times to determine the activation times of each task. However, to successfully use this programming model for industrial production code, the use of a scheduling tool is highly advisable [10]. With respect to generating a tight schedule, the predictable execution time of all tasks proved very helpful.
5 Related Work
Time-predictable multi-threading is developed within the PRET project [14]. The processor cores are based on a RISC architecture. Chip-level multi-threading for up to six threads eliminates the need for data forwarding, pipeline stalling, and branch prediction. The access of the individual threads to the shared main memory is scheduled, similarly to our TDMA arbiter, with the so-called memory wheel. The PRET architecture implements the deadline instruction to perform time-based, instead of lock-based, synchronization for access to shared data. In contrast to our simple-task model, where synchronization is avoided due to the three different execution phases, the PRET architecture performs time-based synchronization within the execution phase of a task.
The approach most closely related to our work is presented in [11,12]. The proposed CMP system is also intended for tasks according to the simple-task model [9]. Furthermore, the local cache loading for the cores is performed from a shared main memory. Similar to our approach, a TDMA based memory arbitration is used. The papers deal with the optimization of the TDMA schedule to reduce the WCET of the tasks. The design also considers changes of the arbiter schedule during task execution to optimize the execution time. We think that this optimization can be best performed when the
access pattern to the memory is statically known – which is only possible with single-path programming. Therefore, this approach to TDMA schedule optimization should be combined with our single-path based CMP system.
Optimization of the TDMA schedule of a CMP based real-time system has also been proposed in [7]. The described system proposes a single core per thread to avoid the overhead of thread preemption. It is argued that future systems will contain many cores and the limiting resource will be the memory bandwidth. Therefore, the memory access is scheduled instead of the processing time.
6 Conclusion
A statically scheduled chip-multiprocessor system with single-path programming and a TDMA based memory arbitration delivers repeatable timing. The repeatable and predictable timing of the system simplifies the safety argument: measurement of the execution time can be used instead of WCET analysis. We have evaluated the idea in the context of a time-predictable Java chip-multiprocessor system. The cycle-accurate measurements showed that the approach is sound.
For the evaluation of the system we chose a TDMA slot length that was optimal for the WCET of individual bytecodes. Whether this slot length is also optimal for single-path code is an open question. In future work we will evaluate different slot lengths to optimize the execution time of single-path tasks. Furthermore, changing the TDMA schedule at predefined points in time is another option we want to explore.
Acknowledgments
The research leading to these results has received funding from the European Community's Seventh Framework Programme [FP7/2007-2013] under grant agreement numbers 214373 (ArtistDesign) and 216682 (JEOPARD).
References
1. Puschner, P., Burns, A.: Writing temporally predictable code. In: Proc. 7th IEEE International Workshop on Object-Oriented Real-Time Dependable Systems, pp. 85–91 (January 2002)
2. Puschner, P.: Transforming execution-time boundable code into temporally predictable code. In: Kleinjohann, B., Kim, K.K., Kleinjohann, L., Rettberg, A. (eds.) Design and Analysis of Distributed Embedded Systems, pp. 163–172. Kluwer Academic Publishers, Dordrecht (2002); IFIP 17th World Computer Congress – TC10 Stream on Distributed and Parallel Embedded Systems (DIPES 2002)
3. Allen, J., Kennedy, K., Porterfield, C., Warren, J.: Conversion of control dependence to data dependence. In: Proc. 10th ACM Symposium on Principles of Programming Languages, pp. 177–189 (January 1983)
4. Schoeberl, M.: A Java processor architecture for embedded real-time systems. Journal of Systems Architecture 54(1-2), 265–286 (2008)
5. Pitter, C., Schoeberl, M.: A real-time Java chip-multiprocessor. Trans. on Embedded Computing Sys. (accepted for publication, 2009)
6. Wellings, A., Schoeberl, M.: Thread-local scope caching for real-time Java. In: Proceedings of the 12th IEEE International Symposium on Object/Component/Service-Oriented Real-Time Distributed Computing (ISORC 2009), Tokyo, Japan. IEEE Computer Society, Los Alamitos (2009)
7. Schoeberl, M., Puschner, P.: Is chip-multiprocessing the end of real-time scheduling? In: Proceedings of the 9th International Workshop on Worst-Case Execution Time (WCET) Analysis, Dublin, Ireland, OCG (July 2009)
8. Pitter, C.: Time-predictable memory arbitration for a Java chip-multiprocessor. In: Proceedings of the 6th International Workshop on Java Technologies for Real-time and Embedded Systems, JTRES 2008 (2008)
9. Kopetz, H.: Real-Time Systems. Kluwer Academic Publishers, Dordrecht (1997)
10. Fohler, G.: Joint scheduling of distributed complex periodic and hard aperiodic tasks in statically scheduled systems. In: Proceedings of the 16th Real-Time Systems Symposium, pp. 152–161 (December 1995)
11. Andrei, A., Eles, P., Peng, Z., Rosen, J.: Predictable implementation of real-time applications on multiprocessor systems on chip. In: Proceedings of the 21st Intl. Conference on VLSI Design, pp. 103–110 (January 2008)
12. Rosen, J., Andrei, A., Eles, P., Peng, Z.: Bus access optimization for predictable implementation of real-time applications on multiprocessor systems-on-chip. In: Proceedings of the Real-Time Systems Symposium (RTSS 2007), pp. 49–60 (December 2007)
13. Lee, E.A.: Computing needs time. Commun. ACM 52(5), 70–79 (2009)
14. Lickly, B., Liu, I., Kim, S., Patel, H.D., Edwards, S.A., Lee, E.A.: Predictable programming on a precision timed architecture. In: Altman, E.R. (ed.) Proceedings of the International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES 2008), Atlanta, GA, USA, pp. 137–146. ACM, New York (2008)
15. Schoeberl, M., Korsholm, S., Thalinger, C., Ravn, A.P.: Hardware objects for Java. In: Proceedings of the 11th IEEE International Symposium on Object/Component/Service-Oriented Real-Time Distributed Computing (ISORC 2008), Orlando, Florida, USA. IEEE Computer Society, Los Alamitos (2008)
Towards Trustworthy Self-optimization for Distributed Systems

Benjamin Satzger, Florian Mutschelknaus, Faruk Bagci, Florian Kluge, and Theo Ungerer

Department of Computer Science, University of Augsburg, Germany
{satzger,bagci,kluge,ungerer}@informatik.uni-augsburg.de
http://www.informatik.uni-augsburg.de/sik
Abstract. The increasing complexity of computer-based technical systems requires new ways to control them. The initiatives Organic Computing and Autonomic Computing address exactly this issue. They demand that future computer systems adapt dynamically and autonomously to their environment and postulate so-called self-* properties. These are typically based on decentralized autonomous cooperation of the system's entities. Trust can be used as a means to enhance cooperation schemes by taking into account trust facets such as reliability. The contributions of this paper are algorithms to manage and query trust information. It is shown how such information can be used to improve self-* algorithms. To quantify our approach, evaluations have been conducted.
Keywords: Trust, self-*, self-optimization, Organic Computing, Autonomic Computing.
1 Introduction
The evolution of computer systems from mainframes towards ubiquitous distributed systems has progressed rapidly. Common to early systems is the need for human administrators. Future systems, however, should act to a large extent autonomously in order to keep them manageable. The investigation of techniques that allow complex distributed systems to self-organize is therefore of high importance. The initiatives Autonomic Computing [14,5] and Organic Computing [3] propose systems with life-like properties and the ability to self-configure, self-optimize, self-heal, and self-protect.
Safety and security play a major role in information technology and especially in the area of ubiquitous computing. Nodes in such a system are restricted to a local view, and typically no central instance can be responsible for the control and organization of the whole network. Trust and reputation can serve as a means to build safe and secure distributed systems in a decentralized way. With appropriate trust mechanisms, the nodes of a system have a clue about which nodes
to cooperate with. This is very important to improve reliability and robustness in systems which depend on the cooperation of autonomous nodes.
In this paper we adopt the definition that trust is a peer's belief in another peer's trust facet. There are many facets of trust in computer systems; such facets may concern, for instance, availability, reliability, functional correctness, and honesty. The related term reputation emphasizes that trust information is based on recommendation. The development of trustworthy self-* systems concerns aspects of (1) the generation of trust values based on direct experiences, (2) the storage, management, and retrieval of trust values, and (3) the usage of this information to enhance the trustworthiness of the overall system. Generating direct trust values strongly depends on the trust facet. Direct experiences concerning the facet availability may be gathered by using heartbeat messages. The facet functional correctness of a sensor node may be estimated by comparison with the measured values of sensors nearby. In this paper we will focus on (2) and (3), i.e., how to manage, access, and use trust information.
An instance of a distributed ubiquitous system which exploits self-* properties is our Smart Doorplate Project [11]. This project envisions the use of smart doorplates within an office building. The doorplates are, among other things, able to display current situational information about the office owner and to direct visitors to his current location based on a location-tracking system. A middleware called Organic Computing Middleware for Ubiquitous Environments (OCµ) [10] serves as the common platform for all included devices. The middleware system OCµ was developed to offer self-configuration, self-optimization, self-healing, and self-protection capabilities. It is based on the assumption that applications are composed of services, which are distributed to the nodes of the network. Service distribution is performed during the initial self-configuration phase, considering the available resources and the requirements of the services. At runtime, resource consumption is monitored. In former work we developed a self-optimization mechanism [13,12] to balance the resource consumption (load) between nodes by transferring services to other nodes. OCµ is an open system and designed to allow services and nodes from different manufacturers to interact. In this work we incorporate a trust mechanism into our middleware to allow network entities to decide how far to cooperate with other nodes/services. This is used to enhance the self-optimization algorithm.
The paper is organized in seven sections. Section 2 gives an overview of the state of the art of trust in distributed systems. Section 3 presents the basic self-optimization algorithm we have developed. Section 4 introduces different algorithms to build a trust management layer which is able to provide the functionalities covered by (2) as mentioned above. In Section 5 we present the trustworthy self-optimization, which extends the basic self-optimization and takes trust into account. This refers to (3). Then, Section 6 describes measurements of an implementation of the algorithm. Finally, Section 7 concludes the paper.
2 Related Work
There are many approaches to incorporate trust into distributed systems. In this section some relevant papers are highlighted. Repantis et al. [7] describe a middleware based on reputation. In their model, nodes can request data and services (objects) and may receive several responses from different nodes. In this case the object from the most trustworthy provider is chosen. The information about the reputation of nodes is stored on their direct neighbors and appended to response messages. The nodes define thresholds for any object they request; thus, only providers with a higher reputation are taken into account. After reception of an object the provider is rated based on the satisfaction of the requester. In [7] nodes share one common reputation value, which means that all nodes have the same trust in a certain node. In contrast, Theodorakopoulos et al. [9] describe a model based on a directed graph in which the vertices are the network's nodes and the weighted edges describe trust relations. The weight of an edge (u, v) describes the trust of node u in v. Any weight consists of a trust value and the confidence in its correctness.
The TrustMe [8] protocol focuses on the anonymity of network members. It represents a technique to store and access trust information within a peer-to-peer network; the mining of trust information plays a minor role. An asymmetric encryption technique is used to allow for protection against attacks. In contrast to many trust management systems which support very limited anonymity or assume anonymity to be an undesired feature, TrustMe emphasizes the importance of anonymity. The protocol provides anonymity for both the trust provider and the trust requester. Cornelli et al. [4] present a common approach to request trust values: a node A interested in the reputation of node B sends a broadcast message and receives responses from all nodes which have a trust value about B. The latter messages are encrypted with the public key of A. After reception of an encrypted answer, the node contacts the responder to identify bogus messages. In [6], a special approach is used to store trust information. It uses Distributed Hash Tables (DHTs) to store the trust value of a node on a number of parent nodes. These are identified by hash functions applied to the id of the child. A node requesting the trust value of a network member uses the hash function to calculate the parents which hold the value and sends a request to them.
Aberer et al. [2] present a reputation system to gather trust values. An interesting point is that trust values are mutual, while traditionally nodes judge independently; the hope is to achieve an advantage through the cooperation. This idea is integrated into the calculation of the global trust value of a node: the interactions triggered by the node itself as well as the requested interactions are taken into account. Global trust values are binary, i.e., nodes are considered trustworthy or not. If nodes cheat during an interaction they are considered globally untrustworthy. If a node detects a cheating communication partner it files a complaint. With the number of interactions with different nodes, the probability rises that a liar is unmasked. In this model reputation values are stored within the network in a distributed way. This is done using the so-called P-Grid [1].
3 Basic Self-optimization Algorithm
The basic self-optimization algorithm [13,12] is inspired by the human hormone system. Its task is to balance the load of a distributed system based on services. This artificial hormone system consists of metrics which calculate a reaction (service transfer), nodes producing digital hormones which indicate their load, receptors collecting hormones and handing them over to the metrics, and finally the digital hormones holding load information. To minimize overhead, the digital hormone value encodes both the activator and the inhibitor hormone: if the value of the digital hormone is above a given level it activates, while a lower value inhibits the reaction. To further reduce overhead, hormones are piggybacked onto application messages and do not result in additional messages. The basic idea behind the self-optimization is: when a heavily loaded node receives a message containing a hormone which states that the sender is lightly loaded, services are transferred to this sender. The metrics used to decide whether to balance the load between two nodes of the network are named transfer strategies because they decide on the transfer of a service.
Our self-optimization has the ability to improve a system during runtime. It yields very good results in load balancing while using only local decisions and incurring minimal overhead. However, it does not consider whether a service is transferred to a trustworthy node. Bogus nodes (e.g., nodes running a malicious middleware) might attract services in a systematic way and could induce a breakdown of the system. Unreliable, faulty nodes might not have the ability to properly host services. Conversely, one would want to utilize particularly reliable, trustworthy nodes for important services. Therefore, we propose to incorporate trust information into the transfer decision.
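A minimal sketch of such a transfer strategy is shown below. It is our own illustration with assumed types and threshold, not OCµ code: on receipt of a piggybacked hormone, the receiver compares the sender's load with its own and, if it is much more heavily loaded, transfers one of its services to the sender.

class NodeId { }
class Service { }

class TransferStrategy {
    private static final double TRANSFER_THRESHOLD = 0.3; // assumed value

    // Called when a hormone piggybacked on an application message arrives.
    void onHormoneReceived(NodeId sender, double senderLoad) {
        double ownLoad = currentLoad();                  // local load in [0, 1]
        if (ownLoad - senderLoad > TRANSFER_THRESHOLD) { // sender is lightly loaded
            Service s = pickTransferableService();       // hypothetical helper
            if (s != null) transfer(s, sender);          // move the service there
        }
    }

    double currentLoad() { return 0.0; }                 // placeholder
    Service pickTransferableService() { return null; }   // placeholder
    void transfer(Service s, NodeId target) { }          // placeholder
}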
4 Trust Management
As mentioned, a trust system needs a component which generates trust values based on direct experiences, and it needs a component which is able to process this information. The generation of direct trust values depends strongly on the trust facet and the domain. Since a network can be seen as a graph, each node has a set of direct neighbors. In order to be able to estimate the trust facet availability, direct neighbors may periodically check each other's availability status. But depending on the domain and the application, the mining of trust values differs. In a sensor network the measured data of a node can be compared with the measurements of its neighbors; in this way nonconforming sensors can be identified. Since we do not focus on the generation of trust values by observation but on the generic management of trust information, we simply assume that any node has a trust value about its direct neighbors. This trust value T(k1, k2) lies within [0, 1] and reflects the subjective trust of node k1 in node k2 based on its experiences. T(k1, k2) = 0 means k1 does not trust k2 at all, while a value of 1 stands for 'full' trust.
We assume that direct neighbors monitor each other directly to determine a trust value. This trust value might be inadequate due to insufficient monitoring
data; the estimate may be either too optimistic or too pessimistic. With continuous monitoring of the neighbors, their trust can be estimated better and better. In the following, three trust algorithms are presented which manage trust information in order to make it useful for the network's entities. The algorithms are explained as used together with an algorithm that spreads information via hormones, like our self-optimization algorithm; however, the trust management is not limited to such a usage. It is assumed that nodes do not alter messages they forward. In a real-world application this should be ensured by security techniques like encryption.

4.1 Forwarder
Only direct neighbors have the possibility to directly measure trust. Nodes that are not in the direct neighborhood need to rely on some kind of reputation; direct neighbors can use reputation as additional information as well. This first approach to propagating trust is quite simple. When a node B sends an application message to a node F, the direct neighbor D which forwards the message appends its trust value T(D, B) to it. Node F thus receives a message containing a hormone (with, e.g., load information) and the trust of D in B, as shown in Figure 1. This trust value is very subjective, as it is measured by only one node, but the approach introduces no additional messages for trust retrieval: immediately after the receipt of an application message, the receiver has information about the trust of the sender.
Fig. 1. Forwarder algorithm
4.2 Distant Neighborhood
In this approach, not only the trust value of one direct neighbor is considered, but the trust values of all neighbors of the node in question. A node sends an application message with an appended hormone to the receiver. If the receiver needs to know about the trust of the sender, it sends a trust request to all neighbors of the sender, which reply with a trust response message. Finally, the trust in the sender is calculated as the average of all received trust values. This method produces many more messages than the Forwarder algorithm, but it also provides more reliable information. A further advantage is the ability to detect diverging trust values, which might be used to identify bogus nodes. In Figure 2, node B sends an application message together with its capacity and load to node J, which afterwards asks A, C, D, and E for their trust in node B.

Fig. 2. Distant Neighborhood algorithm
4.3 Near Neighborhood
In this variant (see Fig. 3) a node spreads trust requests to query trust information, e.g., after the receipt of an application message. These trust requests carry a hop counter. First, the hop counter is set to one, which means the node first asks its own direct neighbors. If they have information about the target node, they respond with their trust value; otherwise, they send a negative reply. If the node receives positive answers, it averages the corresponding values. If it receives only negative answers, it increments the hop counter and repeats the trust request. At the latest when the request reaches a direct neighbor of the target node, a trust value is returned. The requesting node stores the resulting trust value in order to answer other requests. Note that a node executes this algorithm to update and refine its data even if it already has trust information about a target node; in that case, the existing trust value is integrated into the averaging process. Initially, this algorithm produces many more messages than the ones described above. However, as trust values are distributed within the network, the number of messages decreases over time, because it becomes more and more likely that only a few hops are needed until a node with information about the target node is found. The same holds for the accuracy of the trust values, which increases with the runtime of the algorithm.

Fig. 3. Near Neighborhood algorithm
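The hop-counter logic can be sketched in a few lines of Python. This is a minimal sketch with hypothetical helpers (query_at_distance and trust_cache stand in for the actual message exchange, which the text does not prescribe in detail); the same averaging step, applied in one shot to the replies of the sender's direct neighbors, yields the Distant Neighborhood algorithm.

    def near_neighborhood_trust(node, target, max_hops=10):
        # Expanding-ring trust query; returns an averaged trust value in [0, 1].
        hops = 1
        while hops <= max_hops:
            # Ask all nodes exactly `hops` hops away for their trust in `target`;
            # nodes without information reply negatively (modeled here as None).
            answers = [a for a in node.query_at_distance(target, hops) if a is not None]
            if answers:
                # A locally stored value, if present, joins the averaging process.
                if target in node.trust_cache:
                    answers.append(node.trust_cache[target])
                value = sum(answers) / len(answers)
                node.trust_cache[target] = value  # kept to answer other requests
                return value
            hops += 1  # only negative replies: widen the search by one hop
        return None    # no information found within max_hops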
5 Trustworthy Self-optimization
The self-optimization reallocates services during runtime to ensure a uniform distribution of load. The version described in Section 3 does not take the trust of nodes into account. In the following, an approach is presented to incorporate trust into the self-optimization; any of the three trust management algorithms presented above can be used for this purpose. We assume that different services have different priorities: if a service is of high importance for the functionality of the system or collects sensitive data, its priority is high. The prioritization may be defined at design time or adapted dynamically during runtime. The trustworthy self-optimization aims at load-sharing with additional consideration of the services' priorities, i.e., it avoids hosting services with high priority on nodes with low trust. A self-optimization step is performed strictly locally, with no global control. As in the basic self-optimization algorithm, each node piggybacks hormones containing information about its current load and its capacity onto each application message. This enables the receiver to compare this load value with its own load. Additionally, the sender's trust can be queried via the proposed trust management layer. A transfer strategy takes this information and decides whether to transfer a service or not. If such a service is identified, an attempt is made to transfer it to the sender; this is only possible if the sender still has enough free resources to host the service, which, due to the dynamics of the network, is not always guaranteed. The basic idea of the transfer strategy is to find a balance between pure load-balancing and trustworthiness of the service distribution. The parameter α determines which aspect to focus on: a higher value for α emphasizes transferring services to nodes with optimal trust values, while a lower value for α results in a focus on pure load-balancing. All services that B is able to host, for which A's trust in B is higher than their priority, and whose relocation would balance the load significantly are considered for transfer, as sketched below.
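A possible shape of such a transfer strategy is the following sketch. The two hard conditions (trust above the service's priority, enough free capacity at the sender) are taken from the text; the combined score weighting the trust margin against the load gain via α is a hypothetical choice, as the paper does not specify the exact weighting.

    def services_to_transfer(own_load, sender_load, sender_free, trust, services, alpha):
        # Return local services worth transferring to the message sender.
        candidates = []
        for s in services:
            if trust <= s.priority:     # trust must exceed the service's priority
                continue
            if s.demand > sender_free:  # sender must have enough free resources
                continue
            load_gain = (own_load - sender_load) - s.demand
            if load_gain <= 0:          # transfer must actually balance the load
                continue
            # Hypothetical score: a higher alpha emphasizes the trust margin,
            # a lower alpha emphasizes pure load-balancing.
            score = alpha * (trust - s.priority) + (1 - alpha) * load_gain
            candidates.append((score, s))
        candidates.sort(key=lambda pair: pair[0], reverse=True)
        return [s for _, s in candidates]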
6 Evaluation
Several test scenarios have been investigated in order to evaluate the trustworthy self-optimization. Each scenario consists of 100 nodes with random resource capacities (e.g., RAM), and a random global but hidden trust value is assigned to each node. In real-world applications the trust of a node must be measured; as this strongly depends on the trust facet and the application, we have chosen a theoretical approach to simulate trust estimation by direct observation. It is based on the assumption that nodes are able to estimate the trust of a node better over time. In the simulation, the trust of a node in its direct neighbor converges to the true hidden trust value with an increasing number of mutual interactions, as shown in Figure 4.

Fig. 4. Simulation of direct trust monitoring

In this example, the true trust value of the node is 0.5. Initially, the node is only able to estimate the trust very roughly, while the error decreases statistically with the interactions. Formally, the trust of node r in node k after n interactions is modeled by Tn(r, k) = t(k) + ρn. In this formula, t(k) is the true global but hidden trust value of k and ρn is a random value which falsifies t(k). With further interactions the possible range of ρn decreases, i.e., |ρn| > |ρn+1|. This random value simulates the inability to precisely estimate trustworthiness.

In the simulation, the nodes send dummy application messages to random nodes. These are used to piggy-back the information necessary for self-optimization, as described above. After the reception of such a message, the node determines the trust of the sender using one of the trust algorithms and then decides whether a service is transferred or not. Initially, each node obtains a random number of services. Resource consumption and priority of a service are chosen randomly, while the sum of all service weights is not allowed to exceed a node's capacity. The proposed trustworthy self-optimization is used for load-balancing and additionally tries to assign services with high priority to highly trusted nodes. Rating functions are used to evaluate the fitness of a network configuration concerning trust and equal load-sharing. The main idea of the rating function for trusted service distribution fT is to reward services with high priority and resource consumption running on very trustworthy nodes:
fT = Σn∈N ( t(n) · Σs∈S(n) c(s) · p(s) )
N is the set of all nodes, S(n) is the set of services of a node n, t(n) is its true trust value, and c(s) and p(s) denote the resource consumption and priority of a service s.
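In code, fT is a direct double sum over the true (hidden) trust values; the accessor names are illustrative.

    def rating_trust_distribution(nodes):
        # f_T = sum over all nodes n of t(n) * sum over services s on n of c(s) * p(s)
        return sum(
            n.true_trust * sum(s.consumption * s.priority for s in n.services)
            for n in nodes
        )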
The rating function for load-sharing fL compares the current service distribution with the global theoretical optimal distribution. For each simulation, the network consists of 100 nodes. At the beginning, the network is rated by fT and fL. Then the simulation is started, and after each step the network is rated again. Within one step, 100 application messages are sent to random destinations; this means that some nodes may send more than one message and others may send none, reflecting the asynchronous character of the distributed system. The receipt of an application message may result in a service transfer, depending on the used trust algorithm and the node's load. Additionally, it is measured how many services are transferred. Each evaluation scenario has been tested 250 times with randomly generated networks, and the values have been averaged.

Figure 5 shows the gains in trustworthiness of the service distribution with regard to the rating function fT. Distant Neighborhood reached the best results, followed by Forwarder and Near Neighborhood; however, Forwarder introduces no additional messages for trust value distribution. Without consideration of trust in the self-optimization, services are transferred to random nodes and the overall trust is not improved; it even declines slightly.

Fig. 5. Trustworthy service distribution (fT)

Figure 6 shows the network's load-sharing with regard to the function fL. Compared to the initial average load-distribution of 75% of the theoretical optimum, the self-optimization combined with any trust algorithm improves the load-balance. This means that the consideration of trust does not prevent the self-optimization from balancing the load within the system. However, the quality of pure load-sharing cannot be reached by any trustworthy algorithm.

Fig. 6. Load-sharing (fL)

Distant Neighborhood performs best in the conducted measurements. It improves the trustworthiness of the service distribution by about 20% while causing a deterioration of the load-sharing by about 4% (compared to load-sharing with no consideration of trust). This is considered a beneficial trade-off, as slightly worse load-sharing has only a weak impact on the whole system, whereas running important services on unreliable or malicious nodes may result in very poor system performance. Forwarder avoids overhead due to queries, as trust values are appended to application messages; this algorithm improves the trustworthy service distribution by about 13% and decreases load-sharing by about 5%. Near Neighborhood shows results similar to Forwarder, but its explicit trust queries cause a higher overhead. However, due to its working principle, it might show better results over a longer experiment duration.
7 Conclusion
This paper presents an approach for a trust management layer and shows how the information provided by this layer can be used to improve a self-optimization algorithm. Three trust management algorithms have been introduced. With Forwarder, direct neighbors automatically append their directly observed trust values to each application message. Distant Neighborhood explicitly asks all direct neighbors of a node for a trust value. The last trust algorithm distributes trust values within the network for faster gathering of trust information, especially after a longer runtime. The trustworthy self-optimization considers not only pure load-balancing but also ensures that services are transferred only to nodes regarded as sufficiently trustworthy. A feature of the self-optimization is not to use explicitly sent messages but to append information in the form of hormones to application messages, minimizing overhead. Three different transfer strategies have been proposed which determine whether a service is transferred to another node or not, based on load and trust.
The presented techniques have been evaluated. The results show that trust aspects can be integrated into the system with little restriction on load-balancing. The proposed trust mechanisms describe a way to increase the robustness of self-* systems with cooperating nodes.
References

1. Aberer, K.: P-Grid: A self-organizing access structure for P2P information systems. In: Batini, C., Giunchiglia, F., Giorgini, P., Mecella, M. (eds.) CoopIS 2001. LNCS, vol. 2172, p. 179. Springer, Heidelberg (2001)
2. Aberer, K., Despotovic, Z.: Managing trust in a peer-2-peer information system. In: CIKM, pp. 310–317. ACM, New York (2001)
3. Allrutz, R., Cap, C., Eilers, S., Fey, D., Haase, H., Hochberger, C., Karl, W., Kolpatzik, B., Krebs, J., Langhammer, F., Lukowicz, P., Maehle, E., Maas, J., Müller-Schloer, C., Riedl, R., Schallenberger, B., Schanz, V., Schmeck, H., Schmid, D., Schröder-Preikschat, W., Ungerer, T., Veiser, H.-O., Wolf, L.: Organic Computing - Computer- und Systemarchitektur im Jahr 2010 (in German). VDE/ITG/GI position paper (2003)
4. Cornelli, F., Damiani, E., di Vimercati, S.D.C., Paraboschi, S., Samarati, P.: Choosing reputable servents in a P2P network. In: WWW, pp. 376–386 (2002)
5. Horn, P.: Autonomic computing: IBM's perspective on the state of information technology (2001)
6. Kamvar, S.D., Schlosser, M.T., Garcia-Molina, H.: EigenRep: Reputation management in P2P networks. In: Proceedings of the 12th International World Wide Web Conference, WWW 2003 (2003)
7. Repantis, T., Kalogeraki, V.: Decentralized trust management for ad-hoc peer-to-peer networks. In: Terzis, S. (ed.) MPAC. ACM International Conference Proceeding Series, vol. 182, p. 6. ACM, New York (2006)
8. Singh, A., Liu, L.: TrustMe: Anonymous management of trust relationships in decentralized P2P systems. In: Shahmehri, N., Graham, R.L., Caronni, G. (eds.) Peer-to-Peer Computing, pp. 142–149. IEEE Computer Society, Los Alamitos (2003)
9. Theodorakopoulos, G., Baras, J.S.: Trust evaluation in ad-hoc networks. In: WiSe 2004: Proceedings of the 3rd ACM Workshop on Wireless Security, pp. 1–10. ACM, New York (2004)
10. Trumler, W.: Organic Ubiquitous Middleware. PhD thesis, Universität Augsburg (July 2006)
11. Trumler, W., Bagci, F., Petzold, J., Ungerer, T.: Smart Doorplate. In: First International Conference on Appliance Design (1AD), Bristol, GB, May 2003, pp. 24–28 (2003)
12. Trumler, W., Pietzowski, A., Satzger, B., Ungerer, T.: Adaptive self-optimization in distributed dynamic environments. In: Di Marzo Serugendo, G., Martin-Flatin, J.-P., Jélasity, M., Zambonelli, F. (eds.) First IEEE International Conference on Self-Adaptive and Self-Organizing Systems (SASO 2007), Cambridge, Boston, Massachusetts, pp. 320–323. IEEE Computer Society, Los Alamitos (2007)
13. Trumler, W., Thiemann, T., Ungerer, T.: An artificial hormone system for self-organization of networked nodes. In: Pan, Y., Rammig, F.J., Schmeck, H., Solar, M. (eds.) Biologically Inspired Cooperative Computing, Santiago de Chile, pp. 85–94. Springer, Heidelberg (2006)
14. Weiser, M.: The computer for the 21st century (1995)
An Experimental Framework for the Analysis and Validation of Software Clocks

Andrea Bondavalli¹, Francesco Brancati¹, Andrea Ceccarelli¹, and Lorenzo Falai²

¹ University of Florence, Viale Morgagni 65, I-50134, Firenze, Italy
{bondavalli,francesco.brancati,andrea.ceccarelli}@unifi.it
² Resiltech S.r.l., Via Filippo Turati 2, 56025, Pontedera (Pisa), Italy
[email protected]

Abstract. The experimental evaluation of software clocks requires the availability of a high-quality clock to be used as reference time, and particular care in order to be able to immediately compare the value provided by the software clock with the reference time. This paper focuses i) on the definition of a proper evaluation process and consequent methodology, and ii) on the assessment of both the measuring system and the results. These aspects of experimental evaluation activities are mandatory in order to obtain valid results and reproducible experiments, including the comparison of different realizations or prototypes. As a case study to demonstrate the framework, we describe the experimental evaluation performed on a basic prototype of the Reliable and Self-Aware Clock (R&SAClock), a recently proposed software clock for resilient time information that provides both the current time and the current synchronization uncertainty (a conservative and self-adaptive estimation of the distance from an external reference time).

Keywords: experimental framework and methodology, assessment and measurements, software clocks, R&SAClock, NTP.
1 Introduction

Experimental evaluation (i.e., testing [2]) offers the possibility to observe a system in its actual execution environment and to perform fault forecasting and removal (e.g., through fault injection [1]). During experimental evaluation, measurement results (the data) are collected and allow one to gain insight into the tested system. It is mandatory that these results are valid, i.e., that they are not altered by an intrusive set-up, a badly designed experiment, or measurement errors. To achieve valid results it is necessary to have i) a proper evaluation process that is carefully designed, with its objective pinpointed and the relevant quantities carefully addressed, and ii) an assessment of the measuring system (the set of instruments used to perform the measurements) and of the results [13]. The focus of this paper is on the process and methodology for the experimental validation of software clocks. The main issues to address in order to properly cope with this problem are the provision of a high-quality clock to be used as reference time, and particular care when designing the measuring system so as to be able to immediately compare the value provided by the software clock with the reference time [15]. In this
paper, the evaluation process is carefully planned, and the validity of the measuring system and of the results is investigated and assessed through principles of measurement theory. As a further benefit, the experimental set-up and the whole evaluation process can be easily adapted and reused for the evaluation of different types of software clocks and for comparisons of different implementations or prototypes within the same category. The paper illustrates the experimental process and set-up by showing the evaluation of a prototype of the Reliable and Self-Aware Clock (R&SAClock [5]), a recently proposed software clock. R&SAClock exploits services and data provided by any chosen synchronization mechanism (for external clock synchronization) to provide both the current time and the current synchronization uncertainty (an adaptive and conservative estimation of the distance of the local clock from the reference time). The rest of this paper is organized as follows. In Section 2 we introduce our case study: the R&SAClock prototype that will be analyzed. Section 3 describes our experimental process and the measuring sub-system. Section 4 presents the results obtained by the planned experiments and their analysis. Conclusions are in Section 5.
2 The Reliable and Self-Aware Clock

2.1 Basic Notions of Time and Clocks

Let us consider a distributed system composed of a set of nodes. We define the reference time as the unique time view shared by the nodes of the system, the reference clock as the clock that always holds the reference time, and the reference node as the node that owns the reference clock. Given a local clock c and any time instant t, we define c(t) as the time value read from local clock c at time t. The behavior of the local clock is characterized by the quantities offset, accuracy and drift. The offset Θc(t) = t − c(t) is the actual distance of local clock c from the reference time at time t [9]; this distance may vary through time. The accuracy Ac is an upper bound on the offset [10]; it is often adopted in the definition of system requirements and is therefore targeted by clock synchronization mechanisms. The drift ρc(t) describes the rate of deviation of a local clock c at time instant t from the reference time [10]. Unfortunately, accuracy and offset are usually of little practical use for systems, since the accuracy is usually a high value that is not a representative estimation of the current distance from the reference time, and the offset is difficult to measure exactly at any time t. Synchronization mechanisms typically compute an estimated offset Θ̃c(t) (and an estimated drift ρ̃c(t)), without offering guarantees and only at synchronization instants. Instead of the static notion of accuracy, a dynamic conservative estimation of the offset provides more useful information. For this purpose, the notion of uncertainty as used in metrology [4], [3] can provide such an estimation: we define the synchronization uncertainty Uc(t) as an adaptive and conservative evaluation of the offset Θc(t) at any time t, that is, Ac ≥ Uc(t) ≥ |Θc(t)| ≥ 0 [5]. Finally, we define the root delay RDc(t) as the transmission delay (one-way or round trip, depending on the synchronization mechanism), including all system-related delays, from the node that holds local clock c to the reference node [9].
2.2 Basic Specifications of the R&SAClock

R&SAClock is a new software clock for external clock synchronization (a unique reference time is used as the target of the synchronization) that provides to users (e.g., system processes) both the time value and the synchronization uncertainty associated to that time value [5]. R&SAClock is not a synchronization mechanism; it acts as a new software clock that exploits services and data provided by any chosen synchronization mechanism (e.g., [9], [11]). When a user asks R&SAClock for the current time (by invoking the function getTime), R&SAClock provides an enriched time value [likelyTime, minTime, maxTime, FLAG]. LikelyTime is the time value obtained by reading the local clock, i.e., likelyTime = c(t). MinTime and maxTime are computed using the synchronization uncertainty provided by the internal mechanisms of R&SAClock. More specifically, for a clock c at any time instant t, we extend the notion of synchronization uncertainty Uc(t), distinguishing between a right (positive) synchronization uncertainty Ucr(t) and a left (negative) synchronization uncertainty Ucl(t), such that Uc(t) = max[Ucr(t); −Ucl(t)]. The values minTime and maxTime are respectively a left and a right bound on the reasonable values that can be attributed to the actual time: minTime is set to c(t) + Ucl(t) and maxTime is set to c(t) + Ucr(t). The user of the R&SAClock can impose an accuracy requirement, that is, the largest synchronization uncertainty that the user can accept in order to work correctly. Correspondingly, R&SAClock sets its output FLAG, a Boolean value that indicates whether the current synchronization uncertainty is within the accuracy requirement or not. The main core of R&SAClock is the Uncertainty Evaluation Algorithm (UEA), which equips the R&SAClock with the ability to compute the synchronization uncertainty. Different implementations of the UEA may lead to different versions of R&SAClock (for example, in [5] and [12] two different versions are shown). Besides the R&SAClock specification shown, we identify the following two non-functional requirements:

REQ1. The service response time provided by R&SAClock is bounded: there exists a maximum reply time ∆RT from a getTime request made by a user to the delivery of the enriched time value (the probability that the getTime is not served within ∆RT is negligible).

REQ2. For any minTime and maxTime in any enriched time value generated at time t, it must hold that minTime ≤ t ≤ maxTime with a coverage ∆CV (by coverage we mean the probability that this inequality is true).

2.3 R&SAClock Prototype and Target System

Here we describe the prototype of the R&SAClock and the system in which it executes: this is the target system used for the subsequent experimental evaluations. The R&SAClock prototype works with Linux and with the NTP synchronization mechanism. The UEA implemented in this prototype computes symmetric left and right synchronization uncertainties with respect to likelyTime, i.e., −Ucl(t) = Ucr(t) and Uc(t) = Ucr(t) [5]. Using functionalities of both NTP and Linux, the UEA obtains i) c(t), by querying the local clock, and ii) the root delay RDc(t) and the estimated offset Θ̃c(t), by
monitoring the NTP log file (NTP refreshes the root delay and the estimated offset when it performs a synchronization). The behavior of the UEA is as follows. First, the UEA reads from a configuration file an upper bound δc on the clock drift, fixed to 50 parts per million (ppm) in the experiments, and listens on the NTP log file. When NTP updates the log file, the UEA reads the estimated offset and the root delay and starts a counter called TSLU, which represents the Time elapsed Since the Last (most recent) Update of root delay and estimated offset. Given t, the most recent time instant at which root delay and estimated offset have been updated, at any time t1 ≥ t the synchronization uncertainty Uc(t1) is computed as:
| + RD(t) + (δc · TSLU).
(1)
The basic idea of (1) is that, given Uc(t) = |Θ̃c(t)| + RDc(t) ≥ |Θc(t)| at time t, we have Uc(t1) ≥ |Θc(t1)| at any t1 ≥ t (a detailed discussion is in [5]). The target system is depicted in Fig. 1. The hardware components are a Toshiba Satellite laptop, which we call PC_R&SAC, and the NTP servers connected to PC_R&SAC through a high-speed Internet connection. The software components are the R&SAClock prototype, the NTP client (a daemon process) and the software local clock of PC_R&SAC. The NTP client synchronizes the local clock using information from the NTP servers.
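To illustrate, the UEA of this prototype can be restated in a few lines of Python. This is a sketch, not the actual implementation: reading the local clock via time.time() and the update hook standing in for the NTP log monitoring are simplifications.

    import time

    DRIFT_BOUND = 50e-6  # delta_c: upper bound on the clock drift, 50 ppm

    class UEA:
        def __init__(self):
            self.est_offset = 0.0           # estimated offset, from the NTP log
            self.root_delay = 0.0           # root delay, from the NTP log
            self.last_update = time.time()  # instant of the last NTP log refresh

        def on_ntp_update(self, est_offset, root_delay):
            # Called whenever NTP refreshes the log; restarts the TSLU counter.
            self.est_offset, self.root_delay = est_offset, root_delay
            self.last_update = time.time()

        def get_time(self, accuracy_requirement):
            likely = time.time()              # c(t): read the local clock
            tslu = likely - self.last_update  # time since the last update
            u = abs(self.est_offset) + self.root_delay + DRIFT_BOUND * tslu  # (1)
            flag = u <= accuracy_requirement  # FLAG of the enriched time value
            return (likely, likely - u, likely + u, flag)  # [likelyTime, minTime, maxTime, FLAG]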
Fig. 1. The target system: R&SAClock for NTP and Linux
3 The Experimental Evaluation Process and Methodology

The process of our experimental evaluation starts by identifying the goals, then designing the experimental set-up (composed of injectors, probes and the experiment control subsystem), then planning the experiments to conduct, and finally defining the structure and organization of the data related to the experiments.

3.1 Objective

The objective of our analysis is, in this case, to validate an R&SAClock prototype, verifying whether and how well it is able to fulfill its requirements under varying operating conditions, especially nominal and faulty ones. We aim to assign values to ∆RT and ∆CV.
3.2 Planning of the Experimental Set-Up

The experimental set-up is described by the grey components of Fig. 2. Its hardware components are an HP Pavilion desktop, which we call PC_GPS, and a high-quality GPS (Global Positioning System [8]) receiver. Through this receiver, PC_GPS is able to construct a clock tightly synchronized to the reference time, which is used as the reference clock. Obviously, this reference clock does not hold the exact reference time, but it is orders of magnitude closer to the reference time than the clock of the target system: it is sufficiently accurate for our experiments. PC_GPS contains a software component for the control of the experiment (Controller hereafter), which is composed of an actuator that commands the execution of the workload and faultload to the Client (e.g., requests for the enriched time value, which the Client forwards to the R&SAClock), and of a monitor that receives from the Client the information on the completion of services, accesses the reference clock and writes data to the disk. The Client is a software component located on the target system: it performs injection functions, to inject the faultload and to generate the workload, and probe functions, to collect the relevant quantities and write these data to the disk. An experimental set-up in which the target system and the Controller are placed on the same PC would require two software clocks, thus introducing perturbations that are hard to address.
Fig. 2. Measuring system and target system
Given the description of the target system (an implementation of an R&SAClock) and of the experimental set-up, we now describe how the relevant measures can be collected. To verify requirements REQ1 and REQ2, our measuring system implements solutions which are specific to R&SAClocks but general for any instance of them. To evaluate REQ1, the Client logs, for each request for the enriched time value, the time Client.start at which the Client queries the R&SAClock and the time Client.end at which it receives the answer from the R&SAClock (these two values are collected by reading the local clock of PC_R&SAC). To verify REQ2 (see Fig. 3), the measuring system computes a time interval [Controller.start, Controller.end] that contains the actual time at which the enriched time value is generated. This time interval is collected by the Controller's monitor, which reads the reference clock: Controller.start is the time instant at which a request for the enriched time value is sent by the Controller to the Client, and Controller.end is the time instant at which the enriched time value is received by the Controller. REQ2 is satisfied if [Controller.start, Controller.end] is within [minTime, maxTime].
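Once the log entries of the two machines are paired by ID, both requirements reduce to simple checks over the collected samples. The field names below mirror the text, not an actual log format.

    def req1_delta_rt(entries):
        # REQ1: Delta_RT must bound Client.end - Client.start over the runs.
        return max(e.client_end - e.client_start for e in entries)

    def req2_holds(entry):
        # REQ2: the reference-clock interval [Controller.start, Controller.end]
        # must lie within [minTime, maxTime] of the produced enriched time value.
        return entry.min_time <= entry.controller_start and \
               entry.controller_end <= entry.max_time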
Fig. 3. The time interval [Controller.start, Controller.end] that allows to evaluate REQ2
3.3 Instrumentation of the Experimental Set-Up

PC_R&SAC is a Linux PC connected to (one or more) NTP servers by means of an Internet connection, and to PC_GPS (another Linux-based PC) by means of an Ethernet crossover cable. The Controller and the Client are two high-priority processes that communicate using a socket. Fig. 4 shows their interactions when executing the workload. The Client waits for the Controller's commands. At periodic time intervals, the Controller sends a message containing a getTime request and an identifier ID to the Client, and logs the ID and Controller.start. When the Client receives the message, it logs the ID and Client.start and performs a getTime request to the R&SAClock. When the Client receives the enriched time value from the R&SAClock, it logs the enriched time value, Client.end and the ID, and sends a message containing an acknowledgment and the ID to the Controller. The Controller receives the acknowledgment and logs the ID and Controller.end. At the termination of the experiment, the log files created on the two machines are combined by pairing entries with the same ID. Controller and Client interact to execute the faultload as follows: the Controller sends to the Client the commands to inject the faults (e.g., to close the Internet connection or to kill the NTP client) and logs the related event; the Client executes the received command and logs the related event. Data logging is handled by NetLogger [6], a tool for logging data in distributed systems that guarantees negligible system perturbation.
Fig. 4. Controller, Client and R&SAClock interactions to execute the workload
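One round of this exchange can be paraphrased as follows. The socket and log calls are schematic stand-ins (the real set-up logs through NetLogger), and the message strings follow the figure.

    def controller_step(sock, log, ref_clock, msg_id):
        log.append((msg_id, "Controller.start", ref_clock()))  # reference clock
        sock.send(f"get:time ID={msg_id}".encode())
        sock.recv(64)                                          # "ack ID=<id>"
        log.append((msg_id, "Controller.end", ref_clock()))

    def client_step(sock, log, local_clock, rsaclock):
        msg = sock.recv(64).decode()                           # "get:time ID=<id>"
        msg_id = msg.split("ID=")[1]
        log.append((msg_id, "Client.start", local_clock()))
        etv = rsaclock.get_time()                              # enriched time value
        log.append((msg_id, "Client.end", local_clock(), etv))
        sock.send(f"ack ID={msg_id}".encode())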
3.4 Planning of the Experiments

Here the execution scenarios, the faultload, the workload and the experiments need to be defined. Our framework allows one to easily define and modify these aspects of the experimental planning as desired or as needed by the objectives of the analysis. In this case of the basic R&SAClock prototype, with the objective of showing how our set-up works, the selection has been quite simple and not particularly demanding (for the target system) or rich (for a complete assessment).

Execution scenarios. Two execution scenarios are considered, corresponding to the two most important operating conditions of the R&SAClock: i) beginning of synchronization (the NTP client goes through an initial transient phase and starts to synchronize to the NTP servers, and PC_R&SAC is connected to the network), and ii) nominal operating condition of the NTP client (representing a steady-state phase: the NTP client is active and fully working, holding information on the local clock).

Faultload. Besides the situation with no faults, the following situations are considered: i) loss of the connection between the NTP client and the servers (thus making the NTP servers unreachable), and ii) failure of the NTP client (the Controller commands to shut down the NTP client). These are the most common scenarios that the R&SAClock is expected to face during its execution.

Workload. The selected workload is simply composed of getTime requests to the R&SAClock, sent once per second. This workload does not allow us to observe the behavior of the target system under overload or stress conditions, and must be replaced by a more demanding one if one wants to thoroughly evaluate the behavior of the target system.

Experiments. Combining the scenarios, the faultload and the workload, we identify four significant experiments: i) beginning of synchronization, no faults injected; ii) nominal operating condition, no faults injected; iii) nominal operating condition, failure of the NTP client; iv) nominal operating condition, loss of connection. The duration of each experiment is 12 hours; the rationale (confirmed by the collected evidence) is that 12 hours are more than sufficient to observe all the relevant events in each experiment.
76
A. Bondavalli et al.
Fig. 5. Structure of the data related to the experiments, organized following a star schema
This model allows one to structure and highlight the objectives, the results and the key elements of each evaluation; consequently, it helps to reason on, and keep clear, the purposes and contexts of the analysis.
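A minimal rendering of this star schema, created here through Python's sqlite3 module, may look as follows; only the table names follow Fig. 5 (with the '&' dropped from R&SAClock_Results for SQL validity), while the metric and key columns are illustrative.

    import sqlite3

    con = sqlite3.connect("seus_experiments.db")
    con.executescript("""
    -- Dimension tables: one per characteristic of the experimental set-up.
    CREATE TABLE Scenario      (scenario_id   INTEGER PRIMARY KEY, description TEXT);
    CREATE TABLE Workload      (workload_id   INTEGER PRIMARY KEY, description TEXT);
    CREATE TABLE Faultload     (faultload_id  INTEGER PRIMARY KEY, description TEXT);
    CREATE TABLE Experiment    (experiment_id INTEGER PRIMARY KEY, duration_h INTEGER);
    CREATE TABLE Target_System (system_id     INTEGER PRIMARY KEY, description TEXT);

    -- Fact table: one entry per experiment run, referencing the dimensions
    -- and holding the values observed for the relevant metrics.
    CREATE TABLE RSAClock_Results (
        run_id            INTEGER PRIMARY KEY,
        scenario_id       INTEGER REFERENCES Scenario,
        workload_id       INTEGER REFERENCES Workload,
        faultload_id      INTEGER REFERENCES Faultload,
        experiment_id     INTEGER REFERENCES Experiment,
        system_id         INTEGER REFERENCES Target_System,
        max_reply_time_ms REAL,  -- observation relevant to REQ1
        req2_coverage     REAL   -- observation relevant to REQ2
    );
    """)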
4 Analysis of Results

We subdivide the offline process of investigating the collected results into three activities: i) data staging on the collected raw data, to populate a database where the data can be easily analyzed, ii) investigation of the validity of the results, and iii) presentation and discussion of the results.

4.1 Data Staging

We organize the data staging in three steps: log collection, log parsing and database loading. In log collection, the events are merged into a unique log file using NetLogger's API (Application Programming Interface). In log parsing, we use an AWK script (AWK is a programming language for processing text-based data) to parse the raw data and create CSV (Comma Separated Value: a standard data file format used for the storage of data structured in table form) files, which are easier to handle than the raw data. In database loading, we create SQL (Structured Query Language) queries from the content of the CSV files to populate the database.

4.2 Quality of the Measuring System and Results

We assess the quality of the measuring system along the principles of experimental validation and fault injection [1], [14], and the confidence in the results through principles of measurement theory [3], [4]; we focus on the uncertainty of the results and the intrusiveness of the measuring system [3], [4]. Furthermore, repeatability [4] is discussed to identify to what extent the results can be replicated in other experiments.
Intrusiveness (perturbation). Although the Client is a high-priority thread, its execution does not perturb the target system and does not affect the results. In fact, the only other relevant thread that, if delayed, could induce a change in the system behavior is the R&SAClock thread, responsible for the generation of the enriched time value; this thread is the one with the highest priority in the target system.

Uncertainty. The actual time instant at which the enriched time value is computed lies within the interval [Controller.start, Controller.end], whose length is constituted by the length of the interval [Client.start, Client.end] plus the delay due to the communications between the Client and the Controller. The sampled duration of this interval is within 1.7 ms in 99% of the cases; analyzing the other 1% of the executions, we discovered that they were affected by large communication delays, and we decided to discard these runs. We set the time instant at which the enriched time value is computed to the middle of the interval [Controller.start, Controller.end], that is, (Controller.end + Controller.start) / 2. This value is affected by an uncertainty of (Controller.end − Controller.start) / 2 = 0.85 ms (milliseconds) with confidence 1 [4]. Since the time instant at which the enriched time value is computed is the time instant at which the likelyTime is generated, we can attribute the same uncertainty to the likelyTime. As a consequence, the measured offset (the difference between the reference time and the likelyTime) suffers from the same uncertainty. In principle, the resolution of our measurement system should also contribute to the final uncertainty; in our case, the Linux clock resolution is 1 µs (microsecond) and its contribution is irrelevant to the computation of the uncertainty.

Repeatability. Re-executing the same experiment will almost certainly produce different data; repeatability in deterministic terms as defined in [4] is not achievable. However, re-executing the same set of experiments will lead to statistically compatible results and to the same conclusions.
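Per sample, the uncertainty treatment above amounts to the following small computation (times in seconds):

    def generation_instant(controller_start, controller_end, max_len=0.0017):
        # Attribute a time instant and its uncertainty to an enriched time value;
        # runs whose interval exceeds 1.7 ms (about 1% of the executions, affected
        # by large communication delays) are discarded.
        if controller_end - controller_start > max_len:
            return None
        midpoint = (controller_end + controller_start) / 2
        uncertainty = (controller_end - controller_start) / 2  # at most 0.85 ms
        return midpoint, uncertainty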
Fig. 6. A sample trace of execution of the R&SAClock
4.3 Results

With the help of Fig. 6, we explain how to read the results shown in Fig. 7-9. The reference time is on the x-axis. The central dashed line labeled likelyTime is the distance between likelyTime and the reference time: it is the offset of the local clock of PC_R&SAC, which may vary during the execution of the experiments. The two external lines represent the distances of minTime and maxTime from the reference time; these distances also vary during the execution. If the NTP client performs a synchronization to the NTP servers at time t, the synchronization uncertainty Uc(t) is set to |Θ̃c(t)| + RDc(t), according to (1). After the synchronization, the synchronization uncertainty grows steadily at a rate of 50 ppm until the next synchronization. The time interval between two adjacent synchronizations varies depending on the behavior of the NTP client.
Fig. 7. a) Exp. 1: beginning of synchronization. b) Exp. 2: nominal operating condition.
Experiment 1. Fig. 7a shows the behavior of the R&SAClock prototype at the beginning of synchronization. The initial offset of the local clock of PC_R&SAC is 100.21 ms. At the beginning of the experiment, the NTP client performs frequent synchronizations to correct the local clock. After 8 hours, the offset is close to zero and consequently the NTP client performs less frequent synchronizations: this behavior affects Uc(t), which increases. The reference time is always within [minTime, maxTime].

Experiment 2. Fig. 7b shows the behavior of the R&SAClock prototype when the target system is in the nominal operating condition and no faults are injected. The offset is close to zero and the local clock of PC_R&SAC is stable: the NTP client performs rare synchronization attempts. The reference time is always within [minTime, maxTime]. Uc(t) varies from 65.34 ms to 281.78 ms; the offset is at worst 4.22 ms.

Experiment 3. Fig. 8a shows the behavior of the R&SAClock prototype when the target system is in the nominal operating condition and the NTP client has failed (the figure does not include the beginning of synchronization, about 8 hours). LikelyTime drifts away from the reference time (the NTP client does not discipline the local clock): after 12 hours, the offset is close to 500 ms. Since the actual drift of the local clock is smaller than δc, the reference time is always within [minTime, maxTime].
An Experimental Framework for the Analysis and Validation of Software Clocks 2.5
1.5
distance from reference time (seconds)
distance from reference time (seconds)
2
maxTime
1 0.5 0 -0.5
likelyTime
-1 -1.5 -2
minTime
-2.5 -3 0
79
1
2
3
4
5
6
7
8
9
10
11
12
2
maxTime
1.5 1 0.5
likelyTime
0 -0.5 -1 -1.5
minTime
-2 -2.5
0
1
2
3
4
5
6
7
8
9
10
11
12
hours
hours
a)
b)
Fig. 8. a) Exp. 3: failure of the NTP client. b) Exp. 4: unavailability of the NTP servers.
Experiment 4. Fig. 8b shows the behavior of the R&SAClock prototype when the target system is in the nominal operating condition and the Internet connection is lost (the NTP client is unable to communicate with the NTP servers and consequently does not perform synchronizations). The NTP client disciplines the local clock using information from the most recent synchronization. After 12 hours the offset is 26.09 ms: the NTP client, thanks to stable environmental conditions, succeeds in keeping likelyTime relatively close to the reference time. The reference time is within [minTime, maxTime].

Assessment of REQ1 and REQ2. The time intervals [Client.start, Client.end] from all the samples collected in the experiments are shown in Fig. 9. The highest value is 1.630 ms, thus REQ1 is satisfied simply by setting ∆RT ≥ 1.630 ms. However, the multimodal distribution shows that the response time of a getTime varies significantly depending on the current system activity and possible overloads of system resources. This suggests the possibility of building a new, improved prototype with a reduced ∆RT and less variance in the interval [Client.start, Client.end] (e.g., implementing the R&SAClock within the OS layer and the getTime as an OS call).
Fig. 9. Intervals [Client.start, Client.end]
In the experiments shown, REQ2 is always satisfied (∆CV = 1). However, the interval [minTime, maxTime] is often very large, even when the offset is continuously close to zero. The results of the experiments suggest that different (more efficient) UEAs, which predict the oscillator drift behavior using statistical information on past values, may be developed and used; a preliminary investigation is in [12].
5 Conclusions and Future Work

In this paper we described a process and set-up for the experimental evaluation of software clocks. The main issues addressed have been the provision of a high-quality clock (resorting to a high-quality GPS receiver) to be used as reference time in the experimental set-up, and particular care in designing the measuring system, which has allowed us to assess the validity of the measuring system and of the results. Besides the design and planning of the experimental activities, the paper illustrated the experimental process and set-up by showing the evaluation of a prototype of the R&SAClock [5], a recently proposed software clock. Even the simple experiments described allowed us to gain insight into the major deficiencies of the considered prototype and to identify directions for improvements.

Acknowledgments. This work has been partially supported by the European Community through the project IST-FP6-STREP-26979 (HIDENETS - HIghly DEpendable ip-based NETworks and Services).
References

1. Hsueh, M., Tsai, T.K., Iyer, R.K.: Fault Injection Techniques and Tools. Computer 30(4), 75–82 (1997)
2. Avizienis, A., Laprie, J., Randell, B., Landwehr, C.: Basic Concepts and Taxonomy of Dependable and Secure Computing. IEEE Trans. on Dependable and Secure Computing 1(1), 11–33 (2004)
3. BIPM, IEC, IFCC, ISO, IUPAC, OIML: Guide to the Expression of Uncertainty in Measurement (2008)
4. BIPM, IEC, IFCC, ISO, IUPAC, OIML: ISO International Vocabulary of Basic and General Terms in Metrology (VIM), Third Edition (2008)
5. Bondavalli, A., Ceccarelli, A., Falai, L.: Assuring Resilient Time Synchronization. In: Proceedings of the 27th IEEE Symposium on Reliable Distributed Systems (SRDS), pp. 3–12. IEEE Computer Society, Washington (2008)
6. Gunter, D., Tierney, B.: NetLogger: a Toolkit for Distributed System Performance Tuning and Debugging. In: IFIP/IEEE Eighth International Symposium on Integrated Network Management, pp. 97–100 (2003)
7. Kimball, R., Ross, M., Thornthwaite, W.: The Data Warehouse Lifecycle Toolkit. J. Wiley & Sons, Inc., Chichester (2008)
8. Dana, P.H.: Global Positioning System (GPS) Time Dissemination for Real-Time Applications. Real-Time Systems 12(1), 9–40 (1997)
9. Mills, D.: Internet Time Synchronization: the Network Time Protocol. IEEE Trans. on Communications 39, 1482–1493 (1991)
10. Verissimo, P., Rodriguez, L.: Distributed Systems for System Architects. Kluwer Academic Publishers, Dordrecht (2001)
11. Cristian, F.: Probabilistic Clock Synchronization. Distributed Computing 3, 146–158 (1989)
12. Bondavalli, A., Brancati, F., Ceccarelli, A.: Safe Estimation of Time Uncertainty of Local Clocks. In: IEEE Symposium on Precision Clock Synchronization for Measurement, Control and Communication, ISPCS (to appear, 2009)
13. Bondavalli, A., Ceccarelli, A., Falai, L., Vadursi, M.: Foundations of Measurement Theory Applied to the Evaluation of Dependability Attributes. In: Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, pp. 522–533 (2007)
14. Arlat, J., Aguera, M., Amat, L., Crouzet, Y., Fabre, J.-C., Laprie, J.-C., Martins, E., Powell, D.: Fault Injection for Dependability Validation: a Methodology and Some Applications. IEEE Trans. on Software Engineering 16(2), 166–182 (1990)
15. Veitch, D., Babu, S., Pàsztor, A.: Robust Synchronization of Software Clocks across the Internet. In: Proceedings of the 4th ACM SIGCOMM Conference on Internet Measurement, pp. 219–232 (2004)
Towards a Statistical Model of a Microprocessor's Throughput by Analyzing Pipeline Stalls

Uwe Brinkschulte, Daniel Lohn, and Mathias Pacher

Institut für Informatik
Johann Wolfgang Goethe Universität Frankfurt, Germany
{brinks,lohn,pacher}@es.cs.uni-frankfurt.de
Abstract. In this paper we model a thread's throughput, the instructions-per-cycle rate (IPC rate), on a general microprocessor as used in common embedded systems. Our model is not limited to a particular microprocessor; our aim is to develop a general model which can be adapted to fit different microprocessor architectures. We include stalls caused by different pipeline obstacles like data dependencies, branch misprediction, etc. These stalls involve latency clock cycles blocking the processor. We describe each kind of stall in detail and develop a statistical model for the throughput covering the entire processor pipeline.
1 Introduction
Nowadays, the development of embedded and ubiquitous systems is strongly advancing. We find microprocessors embedded and networked in all areas of life, e.g., in cell phones, cars, planes, and household aids. In many of these areas the microprocessors need special capabilities, e.g., guaranteeing execution time bounds for real-time applications like a control task on an autonomous guided vehicle. Therefore, we need models of the timing behavior of these microprocessors by which execution time bounds can be computed. In this paper we develop a statistical model for the IPC rate of a general-purpose multi-threaded microprocessor to predict timing behavior, thus improving the real-time capability. We consider both effects like data dependencies and processor speed-up techniques like branch and branch target prediction, or caches. The model is a transfer function computing the IPC rate. By analyzing this model we obtain bounds for the IPC rate, which can be used to compute bounds for the execution time of user applications. Another important use of a model like this is to control the IPC rate, similarly to [1,2,3]. Controlling the IPC rate in pipelined microprocessors is one of the long-term goals of our work: if we develop precise statistical models of the throughput, we are able to adjust the controller parameters in a very fine-grained way. In addition, we can compute estimations of the applications' time bounds, which is necessary for real-time systems.
The paper is structured as follows: Section 2 presents related work and similar approaches. In Section 3 we discuss modern scalar and multi-threaded microprocessors, and in Section 4 we present our model, which is validated by an example. Section 5 concludes the paper and gives an outlook on future work.
2 State of the Art
Many approaches for Worst Case Execution Time (WCET) analysis are known. Most of them examine the semantics of the program code with respect to the pipeline used for the execution, resulting in a cycle-accurate analysis of the program code. One example is the work in [4]: the authors examine the WCET of the Motorola ColdFire 5307 and study, e.g., cache interferences occurring during loop execution. In [5], the authors discuss WCET analysis for an out-of-order execution processor. They approach the WCET analysis problem by computing and examining the execution graph (whose nodes represent tuples consisting of an instruction identifier and a pipeline stage) of the program code to be executed. The authors of [6] consider WCET analysis for processors with branch prediction. They classify each control transfer instruction with respect to branch prediction and use a timing analyzer to estimate the WCET according to the instruction classification. A WCET analysis with respect to industrial requirements is discussed in [7]. A good review of existing WCET analysis techniques is given in [8]; the authors also present a generic framework for WCET analysis. In [9], the author proposes to split the cache into several independent caches to simplify the WCET analysis and obtain tighter upper bounds. The authors of [10] design a model of a single-issue in-order pipeline for static WCET analysis and consider time dependencies between instructions.

The papers presented above mostly provide cycle-accurate techniques for the WCET analysis of program code. This differs from our approach, as we use a probabilistic approach based on the pipeline structure: the characteristics of the program code are generalized by statistical values like the probability of a misprediction, etc. As a result, our model is not necessarily cycle-accurate, but we are able to use analytical techniques to examine the throughput behavior for individual programs as well as for program classes. Furthermore, as mentioned in Section 1, our long-term goal is to control the IPC rate. Using control theory as a basis, a model of a processor as proposed in this paper is necessary not only to analyze, but also to improve and guarantee real-time behavior by a closed control loop.
3 Pipeline Stalls in Microprocessors
Techniques like long pipelines, branch prediction and caches were developed to improve the average performance of modern super-scalar microprocessors, but the worst-case-oriented real-time behavior suffers for various reasons like branch
misprediction, cache misses, etc. Besides, there is only one set of registers in single-threaded processors, producing context switch costs of several clock cycles in the case of a thread switch. On the application level, thread synchronization also introduces latency clock cycles if one thread has to wait for a synchronization event; this problem depends on the programming model and is not affected by the architecture of the microprocessor used for its execution. Multi-threaded processors suffer from the same problems with respect to real time as single-threaded processors. However, there are several differences which make them an interesting research platform for modeling the IPC rate. Contrary to single-threaded processors, multi-threaded processors mostly have separate internal resources, like program counters, status and general-purpose registers, etc., for each thread. This decreases the interdependencies of different threads caused by the processor hardware; the remaining thread interdependencies depend only on the programming model, and, e.g., the context switching time between different threads is eliminated. In addition, if a scheduling strategy like Guaranteed Percentage Scheduling (GP scheduling, see [11]) is used, a controller is able to control each thread in a fine-grained way. In GP scheduling, a requested number of clock cycles is assigned to each thread, and this assignment is guaranteed within a certain time period (e.g., 100 clock cycles). Fig. 1 gives an example of three threads with a GP rate of 30% for thread A, 20% for thread B, and 40% for thread C, and a time period of 100 clock cycles. This means that thread A gets 30 clock cycles, thread B gets 20 clock cycles, and thread C gets 40 clock cycles within the 100-clock-cycle time period.
Fig. 1. Example for a GP schedule
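The assignment underlying Fig. 1 can be computed directly. The helper below only derives the per-period cycle budget from the GP rates; it does not model the hardware scheduling itself.

    def gp_schedule(period, rates):
        # rates: guaranteed fraction of the period per thread, e.g. 0.30 for A.
        plan = {thread: int(period * rate) for thread, rate in rates.items()}
        assert sum(plan.values()) <= period  # the guarantees must fit the period
        return plan

    print(gp_schedule(100, {"A": 0.30, "B": 0.20, "C": 0.40}))
    # -> {'A': 30, 'B': 20, 'C': 40}, repeated every 100-clock-cycle period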
4 Modeling
In this section we present a statistical model of a microprocessor, deriving the parameters influencing the throughput. Our first approach considers a processor with
one core and a simple scalar multi-threaded pipeline. Our analysis of throughput hazards starts at the lowest hardware level, the pipeline. Every instruction to be executed passes through the pipeline; therefore, we have to consider the first pipeline stage, the instruction fetch unit. This stage exists in almost all modern microprocessors [11,12]. The instruction set of a processor can be divided into different classes: an instruction is either controlflow related or data related, and we can compute a probability for the occurrence of instructions from these classes. The probability of the occurrence of a controlflow class instruction in the interval n is denoted by pa(n), while pb(n) represents the probability of a data related instruction in the interval n. We assume the probabilities to be time dependent, because they may change with the currently executed program code.

First, we consider controlflow related instructions like unconditional and conditional branches. These instructions may lead to a lower IPC rate, caused by delay cycles in the pipeline. Therefore, it is necessary to identify them as early as possible and handle them appropriately.¹ This is done with the help of a branch target buffer (BTB) [11]. The BTB contains the target addresses of unconditional branches, and some additional prediction bits for conditional branches to predict the branch direction. Whenever the target address of an instruction cannot be found in the BTB, the pipeline has to be stalled until the target address has been computed. Therefore, we model these delay cycles by a penalty Datarget, while patarget(n) is the probability that such a stall event occurs in the time interval n. If a conditional branch is fetched, the predictor may fail. In this case the pipeline has to be flushed and the instructions of the other branch direction have to be fed into the pipeline, mostly leading to a long pipeline stall; the actual number of delay cycles depends on the length of the pipeline [11]. We call pamp(n) the probability that a branch is mispredicted in the interval n and Damp the penalty in delay cycles for flushing the pipeline.

Now, we consider data related instructions, because data dependencies also influence the IPC rate. There are three different kinds of dependencies [11]. The anti dependency or write-after-read hazard (WAR) is the easiest one, because it does not affect the execution in an in-order pipeline at all: if, for example, an instruction writes to a register that was read earlier by another instruction, this does not influence the execution in any way. Output dependencies or write-after-write hazards (WAW) can be solved by register renaming, thus not affecting the IPC rate either. True dependencies or read-after-write hazards (RAW) are the worst kind of data dependencies. Their impact on the IPC rate can be reduced by hardware (forwarding techniques [12]) or by software (instruction reordering [12]). However, in several cases instructions have to wait in the reservation stations, and several delay cycles have to be inserted into the pipeline until the dependency is resolved. pbd(n) denotes the statistical probability of a pipeline-stalling data dependency and Dbd the average penalty in clock cycles.
¹ A modern microprocessor is able to detect controlflow related instructions already in the instruction fetch stage of its pipeline.
The following formula (1) computes the IPC rate I of a microprocessor, including the above-mentioned pipeline obstacles, in the interval n:

I(n) = G(n) / (1 + X(n))
X(n) = pa(n) · (patarget(n) · Datarget + pamp(n) · Damp) + pb(n) · pbd(n) · Dbd    (1)
The IPC rate I(n) of the executed thread in the interval n is the Guaranteed Percentage rate G(n) divided by one plus a penalty term X(n), where X(n) is the expected value of all inserted penalty delay cycles. If we assume perfect branch prediction and no pipeline stalls caused by data dependencies, the probabilities for pipeline-stalling events are zero, turning the whole term X(n) into zero. The resulting IPC rate then equals the Guaranteed Percentage rate, since no latency cycles occur. However, if a data dependency cannot be solved by forwarding, pbd(n) is not zero and X(n) contains the penalty for the data dependency, so the IPC rate suffers. Figure 2 shows the impact of these pipeline hazards on the IPC rate.

The next step is to consider the effects of caches on the IPC rate, ignoring any delay cycles from other pipeline stages. Since we have no out-of-order execution, every cache miss leads to a pipeline stall until the required data or instruction is available. The statistical probability of a cache miss occurring in the interval n is denoted by pc(n) and Dc is the average penalty in delay cycles. The resulting formula is quite similar to formula (1):

I(n) = G(n) / (1 + Y(n))
Y(n) = pc(n) · Dc    (2)

Fig. 2. Impact of pipeline hazards on the IPC rate
Y(n) is the expected value of all delay cycles in the interval n, lowering the IPC rate I(n). Figure 3 shows the effects of cache misses on the IPC rate.

Our final goal in this paper is to combine the pipeline hazard and the cache miss effects in one formula. As there is no dependency between cache misses and pipeline hazards, all the inserted delay cycles can simply be added, resulting in a final penalty of Z(n). Thus, we can bring together the effects of pipeline hazards and cache misses, leading to the following formula (3):

I(n) = G(n) / (1 + Z(n))
Z(n) = X(n) + Y(n)
X(n) = pa(n) · (patarget(n) · Datarget + pamp(n) · Damp) + pb(n) · pbd(n) · Dbd
Y(n) = pc(n) · Dc    (3)

Figure 4 shows the resulting IPC rate, taking into account all effects at the hardware level. Now, we show that formula (3) is an adequate model of a simple microprocessor. To this end, we examine a short code fragment of ten instructions executed in the time interval i:

1. data instruction
2. controlflow instruction (jump target not known)
3. data instruction
4. data instruction (with dependency)
5. data instruction
Fig. 3. Impact of cache misses on the IPC rate
Fig. 4. The final IPC rate including pipeline hazards and cache misses
6. data instruction (with dependency)
7. controlflow instruction
8. data instruction (cache miss)
9. data instruction
10. data instruction
We assume our microprocessor has a five-stage pipeline and runs two different threads, with a Guaranteed Percentage value of 0.5 granted to each of them. Furthermore, we assume a penalty of 2 clock cycles for an unknown branch target, 5 clock cycles for flushing the pipeline after a mispredicted branch, 1 clock cycle for an unresolved data dependency and 30 clock cycles for a cache miss. Analyzing the code fragment produces the following probability values (patarget, pamp and pbd are conditional on a controlflow or data instruction being fetched, respectively):

pa(i) = 0.2, patarget(i) = 0.5, pamp(i) = 0, pb(i) = 0.8, pbd(i) = 0.25, pc(i) = 0.1

Having these values, we are able to compute the IPC rate according to our model:

X(i) = 0.2 · (0.5 · 2 + 0) + 0.8 · 0.25 · 1 = 0.4
Y(i) = 0.1 · 30 = 3
Z(i) = 0.4 + 3 = 3.4
I(i) = 0.5 / (1 + 3.4) ≈ 0.114
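For illustration, the model of formula (3) can be evaluated mechanically. The following Python sketch is our addition, not part of the original paper; the function and parameter names are chosen ad hoc to mirror the text:

# Evaluate I(n) = G(n)/(1 + Z(n)) of formula (3) for one interval.
def ipc_rate(G, p_a, p_atarget, p_amp, p_b, p_bd, p_c,
             D_atarget, D_amp, D_bd, D_c):
    X = p_a * (p_atarget * D_atarget + p_amp * D_amp) + p_b * p_bd * D_bd
    Y = p_c * D_c
    Z = X + Y
    return G / (1.0 + Z)

# Values of the ten-instruction example in interval i:
print(round(ipc_rate(G=0.5, p_a=0.2, p_atarget=0.5, p_amp=0.0,
                     p_b=0.8, p_bd=0.25, p_c=0.1,
                     D_atarget=2, D_amp=5, D_bd=1, D_c=30), 3))  # 0.114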
To verify the model, we examine what happens at the pipeline level. At the beginning of interval i it takes five clock cycles to fill the pipeline. At the 6th clock cycle the first instruction is completed, and then the pipeline is stalled for two cycles to compute the branch target of instruction 2. So instruction 2 finishes at the 9th clock cycle. Instruction 3 is completed at the 10th clock cycle and instruction 4 at the 12th clock cycle, because the unresolved data dependency of instruction 4 leads to a pipeline stall of one cycle. At the 13th clock cycle instruction 5 is finished, and at the 15th and 16th clock cycles instructions 6 and 7, respectively (instruction 6 again stalls one cycle). Because a cache miss happens during the execution of instruction 8, it finishes at the 47th clock cycle. The last two instructions finish at the 48th and 49th clock cycles. Since the thread has a GP value of 0.5, we have to double the execution time. This means the execution of the code fragment would take 98 clock cycles on the real processor, which is already very close to our model (within about 10%). If we neglect the first cycles needed to fill the pipeline, we even get exactly the IPC rate I(i) = 10/88 ≈ 0.114. Since real programs consist of many instructions, the time for the first pipeline fill can easily be neglected, thus enabling our model to predict the correct value of the IPC rate.
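The cycle-level count can be replayed the same way. Again a sketch of ours; the per-instruction penalty cycles are those of the example above:

# Penalty cycles for instructions 1..10: branch target (2),
# two unresolved data dependencies (1 each), one cache miss (30).
penalties = [0, 2, 0, 1, 0, 1, 0, 30, 0, 0]
gp = 0.5
fill = 5  # the text counts five initial cycles to fill the pipeline

print((fill + len(penalties) + sum(penalties)) / gp)  # 98.0 cycles on the real processor
no_fill = (len(penalties) + sum(penalties)) / gp      # 88.0 cycles without pipeline fill
print(len(penalties) / no_fill)                       # 10/88 ≈ 0.114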
5 Conclusion and Future Work
In this paper we developed a statistical model of a simple multi-threaded microprocessor to compute the throughput of a thread. We started by considering the influence of hardware effects like pipeline hazards and cache misses on the IPC rate. First, we considered each hardware effect on its own; then we combined them all into a single formula (formula 3). We showed with the help of an example that our model adequately describes the IPC rate. Future work will concern further improvements of the model, taking into account more advanced hardware techniques like multi-core or out-of-order execution. As already mentioned above, our future work will concern not only computing the throughput of a thread, but also controlling and stabilizing it at a given IPC rate by closed control loops. Therefore, we want to develop a model with which we are able to identify the most important parameters for the IPC rate.
References
1. Brinkschulte, U., Pacher, M.: A Control Theory Approach to Improve the Real-Time Capability of Multi-Threaded Microprocessors. In: ISORC, pp. 399–404 (2008)
2. Pacher, M., Brinkschulte, U.: Implementing Control Algorithms Within a Multithreaded Java Microcontroller. In: Beigl, M., Lukowicz, P. (eds.) ARCS 2005. LNCS, vol. 3432, pp. 33–49. Springer, Heidelberg (2005)
3. Brinkschulte, U., Pacher, M.: Improving the Real-time Behaviour of a Multithreaded Java Microcontroller by Control Theory and Model Based Latency Prediction. In: WORDS 2005, Tenth IEEE International Workshop on Object-oriented Real-time Dependable Systems, Sedona, Arizona, USA (2005)
4. Langenbach, M., Thesing, S., Heckmann, R.: Pipeline modeling for timing analysis. In: Hermenegildo, M.V., Puebla, G. (eds.) SAS 2002. LNCS, vol. 2477, pp. 294–309. Springer, Heidelberg (2002)
5. Li, X., Roychoudhury, A., Mitra, T.: Modeling out-of-order processors for software timing analysis. In: RTSS 2004: Proceedings of the 25th IEEE International Real-Time Systems Symposium, Washington, DC, USA, pp. 92–103. IEEE Computer Society, Los Alamitos (2004)
6. Colin, A., Puaut, I.: Worst case execution time analysis for a processor with branch prediction, vol. 18(2/3), pp. 249–274. Kluwer Academic Publishers, Norwell (2000)
7. Ferdinand, C.: Worst case execution time prediction by static program analysis, vol. 3, p. 125a. IEEE Computer Society, Los Alamitos (2004)
8. Kirner, R., Puschner, P.: Classification of WCET analysis techniques. In: Proc. 8th IEEE International Symposium on Object-oriented Real-time Distributed Computing, May 2005, pp. 190–199 (2005)
9. Schoeberl, M.: Time-predictable cache organization (2009), http://www.jopdesign.com/doc/tpcache.pdf
10. Engblom, J., Jonsson, B.: Processor pipelines and their properties for static WCET analysis. In: Sangiovanni-Vincentelli, A.L., Sifakis, J. (eds.) EMSOFT 2002. LNCS, vol. 2491, pp. 334–348. Springer, Heidelberg (2002)
11. Brinkschulte, U., Ungerer, T.: Mikrocontroller und Mikroprozessoren, 2nd edn. Springer, Heidelberg (2007)
12. Hennessy, J.L., Patterson, D.A.: Computer architecture: a quantitative approach, 4th edn. Elsevier, Amsterdam (2007)
Joining a Distributed Shared Memory Computation in a Dynamic Distributed System

Roberto Baldoni¹, Silvia Bonomi¹, and Michel Raynal²

¹ Università La Sapienza, Via Ariosto 25, I-00185 Roma, Italy
² IRISA, Université de Rennes, Campus de Beaulieu, F-35042 Rennes, France
{baldoni,bonomi}@dis.uniroma1.it, [email protected]

Abstract. This paper is on the implementation of high-level communication abstractions in dynamic systems (i.e., systems where the entities can enter and leave arbitrarily). Two abstractions are investigated, namely the read/write register and the add/remove/get set data structure. The paper studies the join protocol that a process has to execute when it enters the system, in order to obtain a consistent copy of the (register or set) object despite the uncertainty created by the net effect of concurrency and dynamicity. It presents two join protocols, one for each abstraction, with provable guarantees.

Keywords: Churn, Dynamic system, Provable guarantee, Regular register, Set object, Synchronous system.
1 Introduction

Dynamic systems. The passage from statically structured distributed systems to unstructured ones is now a reality. Smart environments, P2P systems and networked systems are examples of modern systems where the application processes are not aware of the current system composition. Because they run on top of a dynamic distributed system, these applications have to accommodate constant changes of their membership (i.e., churn) as a natural ingredient of their life. As an extreme, an application can cease to run when no entity belongs to the membership, and can later have a membership formed by thousands of entities. Considering the family of state-based applications, the main issue consists in maintaining their state despite membership changes. This means that a newcomer has to obtain a valid state of the application before joining it (state transfer operation). This is a critical operation, as too high a churn may prevent the newcomer from obtaining such a valid state. The shorter the time taken by the join procedure to transfer a state, the higher the churn rate the join protocol is able to cope with.

Join protocol with provable guarantees. This paper studies the problem of joining a computation that implements a distributed shared memory on top of a message-passing dynamic distributed system. The memory we consider is made up of the noteworthy object abstractions that are the regular registers and the sets. For each of them, a notion of admissible value is defined. The aim of that notion is to give a precise meaning to the object value a process can obtain in the presence of concurrency and dynamicity.
The paper proposes two join protocols (one for each object type) that provide the newcomer with an admissible value. To that end, the paper considers an underlying synchronous system where, while processes can enter and leave the application, their number always remains constant. While the regular register object is a well-known shared memory abstraction introduced by Lamport [10], the notion of a set object in a distributed context is less familiar. The corresponding specification given in this paper extends the notion of weak set introduced by Delporte-Gallet and Fauconnier in [5].

Roadmap. The paper is made up of 5 sections. First, Section 2 introduces the register and set objects (high-level communication abstractions), and Section 3 presents the underlying computation model. Then, Sections 4 and 5 present two join protocols, each suited to a specific object.
2 Distributed Shared Memory Paradigm

A distributed shared memory is a programming abstraction, built on top of a message passing system, that allows processes to communicate and exchange information by invoking operations that return or modify the content of shared objects, thereby hiding the complexity of the message exchange needed to maintain them. One of the simplest shared objects that can be considered is a register. Such an object provides the processes with two operations called read and write. Objects such as queues and stacks are more sophisticated. It is assumed that every process is sequential: it invokes a new operation on an object only after receiving an answer from its previous object invocation. Moreover, we assume a global clock that is not accessible to the processes. This clock can be seen as measuring the real time as perceived by an external observer that is not part of the system.

2.1 Base Definitions

An operation issued on a shared object is not instantaneous: it takes time. Hence, two operations executed by two different processes may overlap in time. Two events (denoted invocation and response) are associated with each operation. They occur at the beginning (invocation time) and at the end (return time) of the operation. Given two operations op and op' having invocation times tB(op) and tB(op'), and return times tE(op) and tE(op'), respectively, we say that op precedes op' (op ≺ op') iff tE(op) < tB(op'). If op does not precede op' and op' does not precede op, then they are concurrent (op || op').

Definition 1 (Execution History). Let H be the set of all the operations issued on a shared object O. An execution history Ĥ = (H, ≺) is a partial order on H satisfying the relation ≺.

Definition 2 (Sub-history Ĥt of Ĥ at time t). Given an execution history Ĥ = (H, ≺) and a time t, the sub-history Ĥt = (Ht, ≺t) of Ĥ at time t is the sub-set of Ĥ such that: (i) Ht ⊆ H, and (ii) ∀op ∈ H such that tB(op) ≤ t, op ∈ Ht.
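For illustration, the relations of Section 2.1 can be encoded directly. The following Python sketch is ours (representing an operation by its invocation and return times is an assumption made only for the example):

from collections import namedtuple

# An operation: kind ("read", "write", "add", ...), value, invocation tB, return tE.
Op = namedtuple("Op", "kind value tB tE")

def precedes(op1, op2):
    # op1 ≺ op2 iff op1 returns before op2 is invoked
    return op1.tE < op2.tB

def concurrent(op1, op2):
    # op1 || op2 iff neither operation precedes the other
    return not precedes(op1, op2) and not precedes(op2, op1)

def sub_history(H, t):
    # Ht of Definition 2: operations of H invoked no later than t
    return [op for op in H if op.tB <= t]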
2.2 Regular Register: Definition

A register object R has two operations. The operation write(v) defines the new value v of the register, while the operation read() returns a value from the register. The semantics of a register is given by specifying which values are returned by its read operations. Without loss of generality, we consider that no two write operations write the same value. This paper considers a variant of the regular register abstraction as defined by Lamport [10]. In our case, a regular register can have any number of writers and any number of readers [13]. The writes appear as if they were executed sequentially, this sequence complying with their real-time occurrence order (i.e., if two writes w1 and w2 are concurrent they can appear in any order, but if w1 terminates before w2 starts, w1 has to appear as being executed before w2). As far as a read operation is concerned, we have the following. If no write operation is concurrent with a read operation, that read operation returns the last value written in the register. Otherwise, the read operation returns any value written by a concurrent write operation, or the last value of the register before these concurrent writes.

Definition 3 (Admissible value for a read() operation). Given a read() operation op, a value v is admissible for op if:
– ∃ write(v) : write(v) ≺ op ∨ write(v) || op, and
– ∄ write(v') (with v' ≠ v) : write(v) ≺ write(v') ≺ op.

Definition 4 (Admissible value for a regular register at time t). Given an execution history Ĥ = (H, ≺) of a regular register R and a time t, let Ĥt = (Ht, ≺t) be the sub-history of Ĥ at time t. An admissible value at time t for R is any possible admissible value v for an instantaneous read operation op executed at time t.

2.3 Set Data Structure: Definition

A set object S can be accessed by processes by means of three operations: add() and remove(), which modify the content of the set, and get(), which returns the current content of the set. The add(v) operation takes an input parameter v and returns the value ok when it terminates. Its aim is to add the element v to S. Hence, if {x1, x2, ..., xk} are the values belonging to S before the invocation of add(v), and if no remove(v) operation is executed concurrently, the value of the set will be {x1, x2, ..., xk, v} after its invocation. The remove(v) operation takes an input parameter v and returns the value ok. Its aim is to delete the element v from S if v belongs to S; otherwise the remove operation has no effect. The get() operation takes no input parameter. It returns a set containing the current content of S, without modifying the content of the object.

In a concurrency-free context, every get() operation returns the current content of the set: the content of the set is well-defined when the operations occur sequentially. In order to state without ambiguity the value returned by a get() operation in a concurrency context, let us introduce the notion of admissible values for a get() operation op (i.e., Vad(op)) by defining two sets, denoted sequential set (Vseq(op)) and concurrent set (Vconc(op)).
Fig. 1. Vseq and Vconc in distinct executions: (a) Vseq = {1, 2}, Vconc = ∅; (b) Vseq = ∅, Vconc = ∅; (c) Vseq = ∅, Vconc = {1}
Definition 5 (Sequential set for a get() operation). Given a get() operation op executed on S, the set of sequential values for op is a set (denoted Vseq(op)) that contains all the values v such that:
1. ∃ add(v) : add(v) ≺ op, and
2. if remove(v) exists, then add(v) ≺ op ≺ remove(v).

As an example, let us consider Figure 1(a). The sequential set Vseq(op) = {1, 2}, because there exist two operations adding the values 1 and 2, respectively, that terminate before the get() operation starts, and there is neither remove(1) nor remove(2) before get() terminates. Differently, Vseq(op) = ∅ in Figure 1(b).

Definition 6 (Concurrent set for a get() operation). Given a get() operation op executed on S, the set of concurrent values for the get() operation is a set (denoted Vconc(op)) that contains all the values v such that:
1. ∃ add(v) : add(v) || op, or
2. ∃ add(v), remove(v) : (add(v) ≺ op) ∧ (remove(v) || op), or
3. ∃ add(v), remove(v) : add(v) || remove(v) ∧ add(v) ≺ op ∧ remove(v) ≺ op.

When considering the execution of Figure 1(c), Vconc(op) = {1}, due to item 1 of the previous definition.

Definition 7 (Admissible set for a get() operation). Given a get() operation op, a sequential set Vseq(op) and a concurrent set Vconc(op), a set Vad(op) is an admissible set of values for op if Vseq(op) ⊆ Vad(op) ∧ Vad(op) \ Vseq(op) ⊆ Vconc(op).

As an example, let us consider the executions depicted in Figure 1. For Figure 1(a) and Figure 1(b), there exists only one admissible set Vad for the get operation: respectively, Vad(op) = {1, 2} and Vad(op) = ∅. Differently, for Figure 1(c) there exist two different admissible sets for the get operation, namely Vad(op) = ∅ and Vad(op) = {1}.
Fig. 2. Sub-history Ĥt at time t
Note that in the execution depicted in Figure 1(c), if there was another get() operation after the add() and remove() operations, these two get() operations could return different admissible sets. In order to take this point into consideration, consistency criteria have to be defined.

Definition 8 (Admissible sets of values at time t). An admissible set of values at time t for S (denoted Vad(t)) is any possible admissible set Vad(op) for a get() operation op that would occur instantaneously at time t.

As an example, consider the scenario depicted in Figure 2. The sub-history at time t is the partial order of all the operations started before t (i.e., the operations belonging to the set Ht are add(4) and get() executed by pi; get(), remove(4) and add(1) executed by pj; and add(3) and remove(3) executed by pk). The instantaneous get operation op is concurrent with add(1) executed by pj and remove(3) executed by pk. The sequential set Vseq(op) is ∅, because for both add operations preceding op there exists a remove not following op, while the concurrent set Vconc(op) is {1, 3}. The possible admissible sets for op (and thus the possible admissible sets at time t) are then (i) ∅, (ii) {1}, (iii) {3} and (iv) {1, 3}.
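Definitions 5–8 admit a direct executable reading. The sketch below is our own illustration; it reuses the Op, precedes and concurrent helpers sketched in Section 2.1 and reproduces the sets of Figure 1(a):

def v_seq(H, get_op):
    # Definition 5: v was added before the get, and no remove(v) precedes the get.
    vals = set()
    for a in (op for op in H if op.kind == "add" and precedes(op, get_op)):
        removes = [r for r in H if r.kind == "remove" and r.value == a.value]
        if all(precedes(get_op, r) for r in removes):
            vals.add(a.value)
    return vals

def v_conc(H, get_op):
    # Definition 6: values whose add/remove overlap the get.
    vals = set()
    for a in (op for op in H if op.kind == "add"):
        rems = [r for r in H if r.kind == "remove" and r.value == a.value]
        if concurrent(a, get_op):                                               # item 1
            vals.add(a.value)
        elif precedes(a, get_op) and any(concurrent(r, get_op) for r in rems):  # item 2
            vals.add(a.value)
        elif any(concurrent(a, r) and precedes(a, get_op) and precedes(r, get_op)
                 for r in rems):                                                # item 3
            vals.add(a.value)
    return vals

# Execution of Figure 1(a): add(1) and add(2) terminate before the get() starts.
H = [Op("add", 1, 0, 1), Op("add", 2, 2, 3), Op("get", None, 4, 5)]
print(v_seq(H, H[-1]), v_conc(H, H[-1]))  # {1, 2} set()

Any admissible set Vad then satisfies v_seq(H, op) ⊆ Vad ⊆ v_seq(H, op) ∪ v_conc(H, op).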
3 Joining a Computation in a Dynamic Distributed System

3.1 System Model

The distributed system is composed, at each time, of a fixed number (n) of processes that communicate by exchanging messages. Processes are uniquely identified by their indexes and they may join and leave the system at any point in time. The system is synchronous in the following sense. The processing times of local computations are negligible with respect to communication delays, so they are assumed to be equal to 0. In contrast, messages take time to travel to their destination processes, but their transmission time is upper bounded. Moreover, we assume that processes can access a global clock (this is for ease of presentation; as we are in a synchronous system, such a global clock could be implemented by synchronized local clocks). We assume
that there exists an underlying protocol (implemented at the connectivity layer) that keeps processes connected.

3.2 The Problem

Given a shared object O (e.g., a register or a set), it is possible to associate with it, at each time t, a set of admissible values. Processes continuously join the system along time, and every process pi that enters the computation has no information about the current state of the object, with the consequence of being unable to perform any operation. Therefore every process pi that wishes to enter the computation needs to retrieve an admissible value for the object O from the other processes. This problem is captured by adding a join() operation that has to be invoked by every joining process. This operation is implemented by a distributed protocol that builds an admissible value for the object.

3.3 Distributed Computation

A distributed computation is defined, at each time, by a subset of processes. A process p, belonging to the system, that wants to participate in the distributed computation has to execute the join() operation. Such an operation, invoked at some time t, is not instantaneous. But, from time t, the process p can receive and process messages sent by any other process that belongs to the system and participates in the computation. Processes participating in the computation implement a shared object. A process leaves the computation in an implicit way. When it does, it leaves the computation forever and no longer sends messages. (From a practical point of view, if a process wants to re-enter the system, it has to enter it as a new process, i.e., with a new name.) We assume that no process crashes during the computation (i.e., it does not crash from the time it joins the system until it leaves). In order to formalize the set of processes that participate actively in the computation, we give the following definition.

Definition 9. A process is active from the time it returns from the join() operation until the time it leaves the system. A(t) denotes the set of processes that are active at time t, while A([t1, t2]) denotes the set of processes that are active during the interval [t1, t2].

3.4 Communication Primitives

Two communication primitives are used by processes belonging to the distributed computation: point-to-point and broadcast communication.

Point-to-point communication. This primitive allows a process pi to send a message to another process pj as soon as pi knows that pj has joined the computation. The network is reliable in the sense that it does not lose, create or modify messages. Moreover, the synchrony assumption guarantees that if pi invokes "send m to pj" at time t, then pj receives that message by time t + δ' (if it has not left the system by that time). In that case, the message is said to be "sent" and "received".
Broadcast. Processes participating in the distributed computation are equipped with an appropriate broadcast communication sub-system that provides the processes with two operations, denoted broadcast() and deliver(). The former allows a process to send a message to all the processes in the distributed system, while the latter allows a process to deliver a message. Consequently, we say that such a message is "broadcast" and "delivered". These operations satisfy the following property.

– Timely delivery: Let t be the time at which a process p belonging to the computation invokes broadcast(m). There is a constant δ (δ ≥ δ') (known by the processes) such that if p does not leave the system by time t + δ, then all the processes that are in the system at time t and do not leave by time t + δ deliver m by time t + δ.

Such a pair of broadcast operations has first been formalized in [8] in the context of systems where processes can suffer crash failures. It has been extended to the context of dynamic systems in [7].

3.5 Churn Model

The phenomenon of continuous arrival and departure of nodes in the system is usually referred to as churn. In this paper, the churn of the system is modeled by means of the join distribution λ(t), the leave distribution μ(t) and the node distribution N(t) [3]. The join and the leave distributions are discrete functions of time that return, for any time t, respectively the number of processes that have invoked the join operation at time t and the number of processes that have left the system at time t. The node distribution returns, for every time t, the number of processes inside the system. We assume, at the beginning, n0 processes inside the system, and we assume λ(t) = μ(t) = c·n0 (where c ∈ [0, 1] is a percentage of the nodes of the system), meaning that at each time unit the number of processes that join the system equals the number of processes that leave, i.e., the number of processes inside the system N(t) is always equal to n0.
4 Joining a Register Computation

4.1 The Protocol

Local variables at a process pi. Each process pi has the following local variables.
– Two variables denoted registeri and sni; registeri contains the local copy of the regular register, while sni is the associated sequence number.
– A boolean activei, initialized to false, that is switched to true just after pi has joined the system.
– Two set variables, denoted repliesi and reply_toi, that are used during the period in which pi joins the system. The local variable repliesi contains the 3-uples <id, value, sn> that pi has received from other processes during its join period, while reply_toi contains the processes that are joining the system concurrently with pi (as far as pi knows).

The local variables of each process pk (of the n processes that compose the initial set of processes) are such that registerk contains the initial value of the regular register (say the value 0), snk = 0, activek = true, and repliesk = reply_tok = ∅.
operation join(i):
(01) registeri ← ⊥; sni ← −1; activei ← false; repliesi ← ∅; reply_toi ← ∅;
(02) wait(δ);
(03) if (registeri = ⊥) then
(04)     repliesi ← ∅; broadcast INQUIRY(i); wait(2δ);
(05)     let <id, val, sn> ∈ repliesi such that (∀ <−, −, sn'> ∈ repliesi : sn ≥ sn');
(06)     if (sn > sni) then sni ← sn; registeri ← val end if
(07) end if;
(08) activei ← true;
(09) for each j ∈ reply_toi do send REPLY(<i, registeri, sni>) to pj end for;
(10) return(ok).
—————————————————————————————
(11) when INQUIRY(j) is delivered:
(12)     if (activei) then send REPLY(<i, registeri, sni>) to pj
(13)     else reply_toi ← reply_toi ∪ {j}
(14)     end if.
(15) when REPLY(<j, value, sn>) is received: repliesi ← repliesi ∪ {<j, value, sn>}.

Fig. 3. The join() protocol for a register object in a synchronous system (code for pi)
The join() operation. When a process pi enters the system, it first invokes the join operation. The algorithm implementing that operation, described in Figure 3, involves all the processes that are currently present (be they active or not). The interested reader will find a proof in [3].

First, pi initializes its local variables (line 01) and waits for a period of δ time units (line 02); this waiting period is explained later. If registeri has not been updated during this waiting period (line 03), pi broadcasts (with the broadcast() operation) an INQUIRY(i) message to the processes that are in the system (line 04) and waits for 2δ time units, i.e., the maximum round trip delay (line 04)¹. When this period terminates, pi updates its local variables registeri and sni to the most up-to-date values it has received (lines 05-06). Then, pi becomes active (line 08), which means that it can answer the inquiries it has received from other processes, and does so if reply_toi ≠ ∅ (line 09). Finally, pi returns ok to indicate the end of the join() operation (line 10).

When a process pi receives a message INQUIRY(j), it answers pj by sending back a REPLY(<i, registeri, sni>) message containing its local variables if it is active (line 12). Otherwise, pi postpones its answer until it becomes active (line 13 and lines 08-09). Finally, when pi receives a message REPLY(<j, value, sn>) from a process pj, it adds the corresponding 3-uple to its set repliesi (line 15).
¹ The statement wait(2δ) can be replaced by wait(δ + δ'), which provides a more efficient join operation; δ is the upper bound for the dissemination of a message sent with the reliable broadcast, which is a one-to-many communication primitive, while δ' is the upper bound for a response that is sent to a process whose id is known, using a one-to-one communication primitive. So, wait(δ) is related to the broadcast, while wait(δ') is related to point-to-point communication. We use the wait(2δ) statement to make the presentation easier.
Fig. 4. Why wait(δ) is required: (a) without wait(δ); (b) with wait(δ)
Why the wait(δ) statement at line 02 of the join() operation? To motivate the wait(δ) statement at line 02, let us consider the execution of the join() operation depicted in Figure 4(a). At time τ, the processes pj, ph and pk are the three processes composing the system, and pj is the writer. Moreover, the process pi executes join() just after τ. The value of the copies of the regular register at pj, ph and pk is 0, while registeri = ⊥. The "timely delivery" property of the broadcast invoked by the writer pj ensures that ph and pk deliver the new value v = 1 by τ + δ. But, as it entered the system after τ, there is no such guarantee for pi. Hence, if pi does not execute the wait(δ) statement at line 02, its execution of lines 03-07 can provide it with the previous value of the regular register, namely 0. If, after obtaining 0, pi issues another read, it obtains 0 again, while it should obtain the new value v = 1 (because 1 is the last value written and there is no write concurrent with this second read issued by pi). The execution depicted in Figure 4(b) shows that this incorrect scenario cannot occur if pi is forced to wait for δ time units before inquiring to obtain the last value of the regular register.
5 Joining a Set Computation

5.1 The Protocol

Local variables at process pi. Each process pi has the following local variables.
– Two variables denoted seti and sni; seti is a set variable and contains the local copy of the set, while sni is an integer variable that counts how many update operations have been executed by process pi on the local copy of the set.
– A FIFO set variable lastopsi used to maintain a history of the update operations executed by pi. This variable contains all the 3-uples <val, op_type, id>, each one characterizing an operation of type op_type ∈ {add, remove} of the value val issued by a process with identity id.
– A boolean activei, initialized to false, that is switched to true just after pi has joined the system.
– Three set variables, denoted repliesi, reply_toi and pendingi, that are used in the period during which pi joins the system. The local variable repliesi contains the
3-uples <set, sn, ops> that pi has received from other processes during its join period, while reply_toi contains the processes that are joining the system concurrently with pi (as far as pi knows). The set pendingi contains the 3-uples <val, op_type, id>, each one characterizing an update operation executed concurrently with the join.

Initially, n processes compose the system. The local variables of each of these processes pk are such that setk contains the initial value of the set (without loss of generality, we assume that, at the beginning, every process pk has nothing in its variable setk), snk = 0, activek = true, and pendingk = repliesk = reply_tok = ∅.

The join() operation. The algorithm implementing the join operation for a set object is described in Figure 5, and involves all the processes that are currently present (be they active or not). First, pi initializes its local variables (line 01) and waits for a period of δ time units (line 02); the motivations for this waiting period are basically the same as described for the regular register: it is needed to avoid that pi loses some updates. After this waiting period, pi broadcasts (with the broadcast() operation) an INQUIRY(i) message to the processes that are in the system and waits for 2δ time units, i.e., the maximum round trip delay (line 02). When this period terminates, pi first updates its local variables seti, sni and lastopsi to the most up-to-date values it has received (lines 03-04) and then executes all the operations concurrent with the join contained in pendingi and not yet executed (lines 05-13). Then, pi becomes active (line 14), which means that it can answer the inquiries it has received from other processes, and does so if reply_toi ≠ ∅ (line 15). Finally, pi returns ok to indicate the end of the join() operation (line 16).

When a process pi receives a message INQUIRY(j), it answers pj by sending back a REPLY(<seti, sni, lastopsi>) message containing its local variables if it is active (line 18). Otherwise, pi postpones its answer until it becomes active (line 19 and line 15). Finally, when pi receives a message REPLY(<set, sn, ops>) from a process pj, it adds the corresponding 3-uple to its set repliesi (line 21).

5.2 The add() and remove() Protocols

These protocols are trivially executed by sending an update message using the broadcast primitive (i.e., their execution time is bounded by δ). At the receipt of the update message, every process pi checks its state. If pi is active, it simply adds or removes the value from its local copy of the set. If pi is not active (i.e., it is still executing the join() protocol), it buffers the operation in its local set pendingi by adding the 3-uple <val, op_type, id>. Such a tuple is made up of (i) the value val to be updated, (ii) the type op_type of the operation (add or remove), and (iii) the id of the process that issued the update. Every operation in the set pendingi will then be executed by pi at the end of the join() protocol (lines 05-13 of Figure 5).
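The update handler just described admits a compact sketch. The following is our own illustration (the paper gives no listing for it; maintaining sni and lastopsi in the active branch is our assumption, kept consistent with lines 05-13 of Figure 5):

from dataclasses import dataclass, field

@dataclass
class ProcState:
    active: bool = False
    sn: int = 0
    content: set = field(default_factory=set)   # local copy set_i
    lastops: set = field(default_factory=set)   # history of executed updates
    pending: set = field(default_factory=set)   # updates buffered during join

def on_update(p, val, op_type, issuer_id):
    # Executed when an add/remove update message is delivered at process p.
    if p.active:
        if op_type == "add":
            p.content.add(val)
        else:
            p.content.discard(val)
        p.sn += 1
        p.lastops.add((val, op_type, issuer_id))
    else:
        # Buffered updates are replayed at the end of join() if not
        # already reflected in the state received from active processes.
        p.pending.add((val, op_type, issuer_id))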
operation join(i):
(01) sni ← 0; lastopsi ← ∅; seti ← ∅; activei ← false; pendingi ← ∅; repliesi ← ∅; reply_toi ← ∅;
(02) wait(δ); broadcast INQUIRY(i); wait(2δ);
(03) let <set, sn, ls> ∈ repliesi such that (∀ <−, sn', −> ∈ repliesi : sn ≥ sn');
(04) seti ← set; sni ← sn; lastopsi ← ls;
(05) for each <val, op_type, id> ∈ pendingi do
(06)     if (<val, op_type, id> ∉ lastopsi) then
(07)         sni ← sni + 1;
(08)         lastopsi ← lastopsi ∪ {<val, op_type, id>};
(09)         if (op_type = add) then seti ← seti ∪ {val};
(10)         else seti ← seti \ {val};
(11)         end if
(12)     end if
(13) end for;
(14) activei ← true;
(15) for each j ∈ reply_toi do send REPLY(<seti, sni, lastopsi>) to pj end for;
(16) return(ok).
—————————————————————————————
(17) when INQUIRY(j) is delivered:
(18)     if (activei) then send REPLY(<seti, sni, lastopsi>) to pj
(19)     else reply_toi ← reply_toi ∪ {j}
(20)     end if.
(21) when REPLY(<set, sn, ops>) is received: repliesi ← repliesi ∪ {<set, sn, ops>}.

Fig. 5. The join() protocol for a set object in a synchronous system (code for pi)
5.3 Correctness Proof

Due to page limitation, this section only states two lemmas and the main theorem. Their proofs can be found in [4].

Lemma 1. Let c < 1/(3δ). ∀t : |A([t, t + 3δ])| ≥ n(1 − 3δc) > 0.

Lemma 2. Let t0 be the time at which the computation of a set object S starts, Ĥ = (H, ≺) an execution history of S, and Ĥt1+3δ = (Ht1+3δ, ≺) the sub-history of Ĥ at time t1 + 3δ. Let pi be a process that invokes join() on S at time t1 = t0 + 1. If c < 1/(3δ), then at time t1 + 3δ the local copy seti of S maintained by pi will be an admissible set at time t1 + 3δ.

Theorem 1. Let Ĥ = (H, ≺) be the execution history of a set object S, and pi a process that invokes join() on the set S at time t. If c < 1/(3δ), then at time t + 3δ the local copy seti of S maintained by pi will be an admissible set at time t + 3δ.
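The bound of Lemma 1 is easy to evaluate numerically; for instance (our illustration, with arbitrarily chosen parameter values):

# Lower bound on the processes active during the whole interval [t, t+3δ].
n, delta, c = 100, 1, 0.05
assert c < 1 / (3 * delta)          # hypothesis of Lemma 1
print(n * (1 - 3 * delta * c))      # 85.0: at least 85 of 100 processes stay active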
References
1. Aguilera, M.K.: A Pleasant Stroll Through the Land of Infinitely Many Creatures. ACM SIGACT News, Distributed Computing Column 35(2), 36–59 (2004)
2. Baldoni, R., Bonomi, S., Kermarrec, A.M., Raynal, M.: Implementing a Register in a Dynamic Distributed System. In: Proc. 29th IEEE Int'l Conference on Distributed Computing Systems (ICDCS 2009), pp. 639–647. IEEE Computer Society Press, Los Alamitos (2009)
3. Baldoni, R., Bonomi, S., Raynal, M.: Regular Register: an Implementation in a Churn Prone Environment. In: SIROCCO 2009. LNCS, vol. 5869. Springer, Heidelberg (2009)
4. Baldoni, R., Bonomi, S., Raynal, M.: Joining a Distributed Shared Memory Computation in a Dynamic Distributed System. Tech Report 5/09, MIDLAB, Università di Roma La Sapienza (Italy) (July 2009), http://www.dis.uniroma1.it/~midlab/publications
5. Delporte-Gallet, C., Fauconnier, H.: Two Consensus Algorithms with Atomic Registers and Failure Detector Ω. In: Garg, V., Wattenhofer, R., Kothapalli, K. (eds.) ICDCN 2009. LNCS, vol. 5408, pp. 251–262. Springer, Heidelberg (2009)
6. Dolev, S., Gilbert, S., Lynch, N., Shvartsman, A., Welch, J.: GeoQuorums: Implementing Atomic Memory in Mobile Ad Hoc Networks. In: Fich, F.E. (ed.) DISC 2003. LNCS, vol. 2848, pp. 306–320. Springer, Heidelberg (2003)
7. Friedman, R., Raynal, M., Travers, C.: Abstractions for Implementing Atomic Objects in Distributed Systems. In: Anderson, J.H., Prencipe, G., Wattenhofer, R. (eds.) OPODIS 2005. LNCS, vol. 3974, pp. 73–87. Springer, Heidelberg (2006)
8. Hadzilacos, V., Toueg, S.: Reliable Broadcast and Related Problems. Distributed Systems, 97–145 (1993)
9. Ko, S., Hoque, I., Gupta, I.: Using Tractable and Realistic Churn Models to Analyze Quiescence Behavior of Distributed Protocols. In: Proc. 27th IEEE Int'l Symposium on Reliable Distributed Systems, SRDS 2008 (2008)
10. Lamport, L.: On Interprocess Communication, Part 1: Models, Part 2: Algorithms. Distributed Computing 1(2), 77–101 (1986)
11. Leonard, D., Yao, Z., Rai, V., Loguinov, D.: On lifetime-based node failure and stochastic resilience of decentralized peer-to-peer networks. IEEE/ACM Transactions on Networking 15(3), 644–656 (2007)
12. Merritt, M., Taubenfeld, G.: Computing with Infinitely Many Processes. In: Herlihy, M.P. (ed.) DISC 2000. LNCS, vol. 1914, pp. 164–178. Springer, Heidelberg (2000)
13. Shao, C., Pierce, E., Welch, J.: Multi-writer consistency conditions for shared memory objects. In: Fich, F.E. (ed.) DISC 2003. LNCS, vol. 2848, pp. 106–120. Springer, Heidelberg (2003)
BSART (Broadcasting with Selected Acknowledgements and Repeat Transmissions) for Reliable and Low-Cost Broadcasting in the Mobile Ad-Hoc Network

Ingu Han¹,*, Kee-Wook Rim², and Jung-Hyun Lee¹

¹ Dept. of Computer Science & Information Technology, Inha University
² Dept. of Computer & Information Science, Sunmoon University
[email protected], [email protected], [email protected]
Abstract. In this paper, we suggest an enhanced broadcasting method, named 'BSART (Broadcasting with Selected Acknowledgements and Repeat Transmissions)', which reduces the broadcast storm and ACK implosion problems in mobile ad-hoc networks equipped with switched beam antenna elements that enable directional communication. To reduce the broadcast storm, we use the DPDP (Directional Partial Dominant Pruning) method. To control the ACK implosion problem that arises in ACK-based reliable transmission, when the number of nodes required to receive a message through an antenna element exceeds a threshold, each node retransmits the message a constant number of times without ACK, taking into account the transmission success probability through the related antenna elements (R-method). Otherwise, when the number of receiving nodes is below the threshold, each node verifies message reception with ACKs on these antenna elements (A-method). We thus suggest a mixed R/A method: it can control not only the number of message-transmitting nodes but also the number of ACKs for each antenna element. By simulations, we show that our method provides a higher transmission rate than legacy systems and reduces both broadcast messages and ACKs.
Keywords: selected broadcasting, mobile ad-hoc network, node selection.
1 Introduction

Because every node acts not only as a host but also as a router, a broadcasting method is indispensable in wireless ad-hoc networks for searching a specific node's positioning information or identifying the existence of any node. To control the broadcast storm problem, in which heavily duplicated messages occur when nodes perform broadcasting, it is useful to use a method in which only a few nodes forward the received message [1][2][3]. The CDS (connected dominating set) can serve as the forward node set for such a network,
* This research was supported by the MKE (Ministry of Knowledge Economy), Korea, under the ITRC (Information Technology Research Center) Support program supervised by the IITA (Institute of Information Technology Advancement) (IITA-2009-C1090-0902-0020).
but it has been proved that finding the lowest-cost CDS is an NP-complete problem. There are various heuristic methods to search for a CDS: one method is source-independent broadcasting, which constructs a single CDS for the whole network; another is source-dependent broadcasting, which constructs one CDS per source; and a third mixes the source-independent and source-dependent methods [2][3][5][6]. In general, the former method can reduce the number of selected nodes, while the latter can support node mobility and also split up the overall traffic.

The wireless ad-hoc network environment suffers a higher transmission error rate than a wired network, and the probability of message loss is high because signals interfere and collide with each other. One solution to these problems is ACK transmission; another is selective flooding, in which partially overlapped messages can be received [4][7][8]. But if all nodes that receive a broadcast message respond with an ACK message, this may cause ACK implosion, in which many ACK messages occur simultaneously and lead to congestion [9]. Furthermore, link performance can degrade when an ACK message is lost, because nodes must retransmit messages [1][9]. Related research shows that, in wireless ad-hoc networks with omni-directional antennas, an ACK response method can be applied to the nodes required to receive and forward, while dead-end nodes that are only required to receive can obtain duplicated messages from neighboring nodes [2][11]. But these methods compulsorily select forwarding nodes neighboring all dead-end nodes, which may increase the number of forwarding nodes. This means that the numbers of broadcast messages and ACK messages increase, so these methods are not an appropriate solution for broadcast storm or ACK implosion. Methods for reducing duplicated messages in wireless ad-hoc networks with directional antennas include message forwarding in the MAC layer, directional self pruning, three-hop horizon pruning, etc. But this research did not consider reliable transmission or the ACK implosion problem, though it attempts to reduce broadcast messages [5][8][12]. Most research considered just one of the two problems, but Lou-Wu considered both [1][2][3][6][10][11][14][15].

In this paper, we suggest a low-cost, reliable broadcasting method with switched beam antennas which enable directional transmission on a mobile ad-hoc network. Our method manages the broadcast storm with DPDP and, against ACK implosion, applies SART selectively. By simulation, we show that our method greatly reduces both broadcast storm and ACK implosion and enables reliable transmission with directional antennas on mobile ad-hoc networks.
2 System Model

The mobile ad-hoc network discussed in this paper is divided into K non-overlapping sectors, and we suppose that each node is equipped with switched beam antenna elements, one controlling each sector. Let Go be the transmission gain using an omni-directional antenna and Gd the transmission gain using a directional antenna; in general, Gd > Go. For example, while an omni-directional antenna using 10 dBm of power reaches 250 m, the same antenna with its beam angle set to 60° reaches 450 m [16]. With a switched beam antenna that uses only one antenna element at a time, omni-directional broadcasting can be realized by a sequential sweeping process [16]: clockwise, antenna elements 0, 1, 2, ..., K−1 transmit the message with a constant delay. If only a selected group of antenna elements transmits, selective flooding can be realized as well. Let dd = λ·do (where λ > 1, dd is the reaching distance using a directional antenna, and do the reaching distance using an omni-directional antenna); the area reached using a directional antenna is λ² times larger than the area reached using an omni-directional antenna, so we can regard the network model as having λ² times more nodes per neighborhood.

The mobile ad-hoc network can be described by a unit disk graph G = (V, E), where V is the set of wireless mobile nodes and E is the set of edges. An edge (u, v) ∈ E means a wireless link between node u and node v, which can reach each other. We suppose that all wireless links (u, v) satisfy the symmetry property: if u can transmit messages to v, then v can transmit to u, too. We call u's neighbors the nodes that u can reach, and denote u's neighbor set by N(u). By definition, u ∈ N(u). If we denote u's 2-hop neighbor set by N(N(u)) or N2(u), the inclusions {u} ⊆ N(u) ⊆ N2(u) hold, and N(v) ⊆ N2(u) follows if v ∈ N(u). If we denote by Nh(u) the set of nodes within h hops from u and by Hh(u) the set of nodes exactly h hops from u, the following equation holds: Nh(u) = Nh−1(u) ∪ Hh(u), where h ≥ 1 and N0(u) = H0(u) = {u}. For convenience, we omit the subscript when h = 1.
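These neighbor sets are straightforward to compute on an adjacency representation. The following sketch is our own, on a hypothetical toy topology rather than the network of Fig. 2:

def neighbors(adj, u):
    # N(u): u's direct neighbors plus u itself
    return adj[u] | {u}

def n_hop(adj, u, h):
    # Nh(u) = Nh-1(u) ∪ Hh(u), starting from N0(u) = {u}
    reached = {u}
    for _ in range(h):
        reached = set().union(*(neighbors(adj, v) for v in reached))
    return reached

adj = {1: {2, 3}, 2: {1, 4}, 3: {1}, 4: {2}}  # hypothetical 4-node network
print(n_hop(adj, 1, 1))  # {1, 2, 3}
print(n_hop(adj, 1, 2))  # {1, 2, 3, 4}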
Fig. 1. Omni-directional antenna and directional antenna; the directional antenna consists of 6 antenna elements (K = 6)
Fig. 2. An example using 4 antenna elements
Fig. 2 illustrates N2(1) = N(1) ∪ H2(1) = {1,2,3,4,5,9,10} ∪ {6,7,8,11} = {1,2,3,4,5,6,7,8,9,10}. The set of nodes that can communicate directly with node u via antenna element i, i.e., the 1-hop neighbors reached through one of the K non-overlapping antenna elements, is denoted Ni→(u). Then Ni→(u) ⊆ N(u) and N(u) = N0→(u) ∪ N1→(u) ∪ ... ∪ NK−1→(u) ∪ {u}. The degree of node u is |N(u)| − 1 = |N0→(u)| + |N1→(u)| + ... + |NK−1→(u)|, where |Ni→(u)| is the number of nodes that belong to Ni→(u). We suppose that the antenna element orientation of every node maintains a fixed direction, e.g., by using a magnetic compass. Because radio waves travel in straight lines, a diagonal relationship is established between the antenna elements that u and v (where u ∈ N(v)) use to communicate with each other. In other words, if antenna element j (where 0 ≤ j ≤ K−1) transmits a message from u to v, the antenna element that v uses must be (j + K/2) mod K. In Fig. 2, node 2 transmits messages to node 8 via antenna element 1, so node 8 receives the message from node 2 via antenna element 3. If Dv→u = {i | u ∈ Ni→(v)}, then Dv→V = ∪w∈V Dv→w, where V is a node set satisfying V ⊆ N(v). For example, in Fig. 2, D8→2 = {3}, N(10) = {1,2,4,9}, and D10→N(10) = D10→1 ∪ D10→2 ∪ D10→4 ∪ D10→9 = {0} ∪ {1} ∪ {0} ∪ {1} = {0,1}.

In this paper, we suppose that every node u broadcasts a HELLO message periodically to obtain its neighbor nodes' state information. In other words, a node v that receives a HELLO from u transmits a HELLO to u via piggybacking, in order to communicate with its 1-hop neighbor set N(v).
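The diagonal antenna relationship and the Dv→V sets can be expressed in a few lines. The sketch below is ours; the element assignment simply restates the D10→ values quoted above:

K = 4  # number of antenna elements, as in Fig. 2

def opposite(j, k=K):
    # Element on which v must listen to receive what u sends on element j.
    return (j + k // 2) % k

def d_set(elem_of, v, V):
    # Dv→V: union over w in V of the elements v uses to reach w.
    return {elem_of[(v, w)] for w in V}

elem_of = {(10, 1): 0, (10, 2): 1, (10, 4): 0, (10, 9): 1}
print(d_set(elem_of, 10, {1, 2, 4, 9}))  # {0, 1}
print(opposite(1))                        # 3: node 8 hears node 2 on element 3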
3 BSART: Broadcasting with Selected Acknowledgements and Repeat Transmissions

Suppose that node v obtains its forward node set F(v) and dead-end set D(v) using DPDP. Then, for each antenna element i (where i is the antenna element's ID and 0 ≤ i ≤ K−1), v computes the set Ti of nodes to which the message must be transmitted, classified into F(v) and D(v): Ti = Ni→(v) ∩ {F(v) ∪ D(v)}. In case |Ti| exceeds a constant threshold, the nodes retransmit the message a constant number of times (for convenience, we call this the R-method); otherwise, the nodes verify message reception via ACK (for convenience, we call this the A-method). For example, to prevent receiving 3 or more messages per antenna element simultaneously, set c = 3. Congestion caused by ACKs and messages generated simultaneously can increase if ACK identification (the A-method), just as in Lou-Wu, and a scheme that gives nodes at least two reception opportunities from their neighbors are applied at the same time to an area where F(v) and D(v) are mixed; in a network with directional antennas, however, each antenna element i can be controlled separately [10]. And if one node receives ACKs heavily, ACK implosion occurs, and this situation causes not only performance decrease but also extreme delay.

Let M(v, s, seq#, F(v), mode, DATA) be the message to broadcast, where v is the ID of the forwarding node, s is the broadcast message's source node ID, and seq# is the sequence number of the broadcast messages generated by s; s and seq# are used for identifying duplicates. DATA is the transmission payload, and F(v) is the forward node set acquired by DPDP. Besides, every node v must maintain Rv, the set of antenna elements to which the R-method is applied, and Av, the set of antenna elements to which the A-method is applied. Then, for every antenna element i where i ∈ {0, 1, ..., K−1}, v calculates Ti = Ni→(v) ∩ {F(v) ∪ D(v)}. If |Tj| …

Listing 1. Sensor data
<Device>
  <Home ID>1</Home ID>
  <Location>Kitchen</Location>
  <Sensor>
    <Temperature>
      <Temp value>26.5</Temp value>
    </Temperature>
  </Sensor>
</Device>
This listing represents data produced by a temperature sensor, located in the kitchen of our home, that is identified by Home ID 1. When this data is passed to the Mapper component, it considers the list of attributes it contains and retrieves the corresponding mappings from the internal storage. In this case the mappings are:

/Device/Home ID ⇒ [1]
/Device/Location ⇒ ["Kitchen", "Dining Room", "Toilet"]
/Device/Sensor/Temperature/Temp value ⇒ [10, 15, 20, 25, 30, 35]

The Home ID and Location attributes are thus mapped to their corresponding values. The Temp value attribute is mapped to the representative value 25, as the real value 26.5 is included in the range 25−30. Now that the mapper knows the values for the mapped attributes, it can build the mapped data:

Listing 2. Mapped data
<Device>
  <Home ID>1</Home ID>
  <Location>Kitchen</Location>
  <Sensor>
    <Temperature>
      <Temp value>25</Temp value>
    </Temperature>
  </Sensor>
</Device>
The mapped data is then used to calculate, through the hash function, the key representing the sensor data in the key-value storage component. Note how all data generated by temperature sensors located in the kitchen of house 1, and whose sensed value is in the range 25−30, is stored under the same key.

Op 2: querying temperature data. Suppose now that an external software component needs to interrogate the repository and obtain data from all temperature sensors in the house whose last reading reported a temperature value greater than 25.5°C. This kind of query could be useful, for example, to control the heating subsystem in order to maintain a constant temperature in the house. The query looks like the following:

Listing 3. A query submitted to the repository
<Device>
  <Home ID>1</Home ID>
  <Location>*</Location>
  <Sensor>
    <Temperature>
      <Temp value>&gt;25.5</Temp value>
    </Temperature>
  </Sensor>
</Device>
Note how the values of some attributes have been substituted by constraints; a star "*" wildcard is used to match any possible value of the attribute, meaning that the query is not interested in restricting the search to a specific location, while the constraint ">25.5" limits the range of temperatures that are returned. The mapper receives the query and splits it into its components. It retrieves the mappings for all three attributes and starts to match the constraints against the intervals. All intervals in the mapping associated with attribute Location are matched, given the "*" constraint. The Home ID attribute, defined with value 1, leads to a 1-to-1 mapping. Finally, the constraint ">25.5" defined for attribute Temp value is matched against the intervals contained in the corresponding mapping; a match is positive if there is a non-empty intersection between the set of values defined by the constraint and the set of values contained in one of the intervals defined by the mapping; the matching operation thus returns the values 25, 30 and 35. All these matched values are combined in all possible ways to obtain the mapped queries. One of the 9 mapped queries generated by the Mapper component is the one presented as Listing 2. This mapped query is indistinguishable from the mapped data we showed in the previous section. This means that this query generates at least one key that corresponds to the key previously generated to store the sensor data. This is correct, indeed, as the sensor data perfectly matches the query. Note that a slightly different query, e.g., a query requiring all data from temperature sensors exposing a value greater than 29°C, would have produced the same mapped query. In this case the previously stored sensor data would constitute a false positive returned in the response, due to the lack of precision in the attribute mapping.
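The mapping and key-generation steps can be prototyped in a few lines. The sketch below is our own illustration; the SHA-1 hash, the canonical string encoding and the fixed interval width of 5 are assumptions, not the repository's actual implementation:

import hashlib
from itertools import product

MAPPINGS = {
    "/Device/Home ID": [1],
    "/Device/Location": ["Kitchen", "Dining Room", "Toilet"],
    "/Device/Sensor/Temperature/Temp value": [10, 15, 20, 25, 30, 35],
}

def map_temp(value, bounds):
    # Map a reading to the representative lower bound of its interval (26.5 -> 25).
    return max(b for b in bounds if b <= value)

def key_for(mapped):
    # Hash one fully mapped attribute assignment into a storage key.
    canon = "|".join(f"{k}={mapped[k]}" for k in sorted(mapped))
    return hashlib.sha1(canon.encode()).hexdigest()

# Op 1: store the kitchen reading 26.5 under its mapped key.
temps = MAPPINGS["/Device/Sensor/Temperature/Temp value"]
datum_key = key_for({"/Device/Home ID": 1,
                     "/Device/Location": "Kitchen",
                     "/Device/Sensor/Temperature/Temp value": map_temp(26.5, temps)})

# Op 2: expand the query (Home ID 1, any location, Temp value > 25.5).
matched = {
    "/Device/Home ID": [1],
    "/Device/Location": MAPPINGS["/Device/Location"],       # the "*" wildcard
    "/Device/Sensor/Temperature/Temp value":
        [v for v in temps if v + 5 > 25.5],                  # intervals [v, v+5) meeting (25.5, inf)
}
names = sorted(matched)
query_keys = [key_for(dict(zip(names, combo)))
              for combo in product(*(matched[n] for n in names))]
print(len(query_keys), datum_key in query_keys)  # 9 True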
4 Related Work

Even though the possibility to store and retrieve data is fundamental in all smart home applications, to the best of our knowledge, issues related to the design of embedded repositories for such applications have hardly been tackled in the state of the art. Probably this is due to the fact that previous works in this area often considered simple centralized approaches to the problem. To find information about the distributed storage problem in this kind of environment, we have to look at works addressing problems related to context-awareness, which typically need access to a reliable repository.

Schmidt in [13] explains that context-aware systems are computing systems that provide relevant services and information to users based on their situational conditions. This status of the system must be stored reliably in order to be queried and explored. From this point of view, the availability of a reliable distributed repository can be very useful to a context-aware system deployed in a smart home. In our work we focused on how such a reliable distributed repository could be realized.

Khungar and Riekki introduced Context Based Storage (CBS) in [11], a context-aware storage system. The structure of CBS is designed to store all types of available data related to a user and provide mechanisms to access this data everywhere, using devices capable of retrieving and using the information. CBS provides a simple way to store documents using the context explicitly provided by the user. It allows users to retrieve documents from a ubiquitous storage using context related directly to the document, or context related to the user that is then linked to the document through timestamping methods. The main difference between CBS and our system is that in CBS special emphasis is given to group activities and access control, since CBS is designed for a ubiquitous environment. In our system, access rights are not considered, since we assume that within a closed home environment the set of users is well known and security is tackled when trying to access the system. Another great difference regards the way data is stored: the storage system in our architecture is completely distributed and built to provide reliable storage and access to devices deployed in the environment.

All the previous solutions assume the presence of a central database, a storage server that in some cases could be overloaded. In [14] the authors show a solution targeted at enhancing the process of data retrieval. The system is based on the idea of prefetching data from a central repository to improve responsiveness to user requests. The solution proposed in this paper tries to overcome these problems using a scalable distributed approach where all participating nodes are able to provide the same functionalities.

A solution similar to the one proposed in this paper has been previously adopted in the field of large-scale data diffusion to implement a content-based publish/subscribe system on top of a DHT [10]. The main difference between the two approaches is in the way data is accessed: while publish/subscribe systems assume that queries (subscriptions) are generated before data (publications) is diffused, our system targets a usage pattern closer to the way classical storage systems are accessed.
5 Conclusions

This paper presented the architecture of a distributed repository for smart home application environments characterized by the presence of a large number of heterogeneous devices. The repository bases its reliability and scalability properties on an underlying DHT that is used to store and retrieve data. The limitation imposed by the DHT lookup primitive is overcome by introducing a mapping component able to correctly map queries and matching data. The authors plan to start experimenting with this idea through an initial prototype that will be adopted for testing purposes. The aim of these tests will be to evaluate the adaptability of the proposed architecture to different application scenarios. A further improvement planned as future work will consist in modifying the system to automatically adapt the mapping definitions at run-time, in order to reduce the number of false positives returned by the repository in response to queries without adversely affecting its performance.
References

1. AMX, http://www.amx.com/
2. BTicino “My Home”, http://www.myhome-bticino.it/
3. KNX, http://www.knx.org/
4. LonWorks, http://www.echelon.com/
5. Lutron Electronics Co., Inc., http://www.lutron.com/
6. Philips Dynalite, http://www.dynalite-online.com/
7. The UPnP forum, http://www.upnp.org/
8. Web Ontology Language (OWL), http://www.w3.org/2004/OWL/
9. Smart Homes for All: An embedded middleware platform for pervasive and immersive environments for-all. EU STREP Project: FP7-224332 (2008)
10. Baldoni, R., Marchetti, C., Virgillito, A., Vitenberg, R.: Content-based publish-subscribe over structured overlay networks. In: Proceedings of the International Conference on Distributed Computing Systems, pp. 437–446 (2005)
11. Khungar, S., Riekki, J.: A context based storage system for mobile computing applications. SIGMOBILE Mob. Comput. Commun. Rev. 9(1), 64–68 (2005)
12. Rowstron, A., Druschel, P.: Pastry: Scalable, decentralized object location, and routing for large-scale peer-to-peer systems. In: Guerraoui, R. (ed.) Middleware 2001. LNCS, vol. 2218, pp. 329–350. Springer, Heidelberg (2001)
13. Schmidt, A.: Ubiquitous Computing - Computing in Context. PhD thesis, Lancaster University (2002)
14. Soundararajan, G., Mihailescu, M., Amza, C.: Context-aware prefetching at the storage server. In: Proceedings of the 33rd USENIX Technical Conference, pp. 377–390 (2008)
Fine-Grained Tailoring of Component Behaviour for Embedded Systems Nelson Matthys, Danny Hughes, Sam Michiels, Christophe Huygens, and Wouter Joosen IBBT-DistriNet, Department of Computer Science, Katholieke Universiteit Leuven, B-3001, Leuven, Belgium {firstname.lastname}@cs.kuleuven.be
Abstract. The application of run-time reconfigurable component models to networked embedded systems has a number of significant advantages such as encouraging software reuse, adaptation to dynamic environmental conditions and management of changing application demands. However, reconfiguration at the granularity of components is inherently heavy-weight and thus costly in embedded scenarios. This paper argues that in some cases component-based reconfiguration imposes an unnecessary overhead and that more fine-grained support for the tailoring of component functionality is required. This paper advocates a high-level policy-based approach to tailoring component functionality. To that end, we introduce a lightweight framework that supports fine-grained adaptation of component functionality based upon high-level policy specifications. We have realized and evaluated a prototype of this framework for the LooCI component model.
1 Introduction
Run-time reconfigurable component models provide an attractive programming model for Wireless Sensor Networks (WSN). As WSN environments are typically highly dynamic, run-time reconfigurable component models allow this dynamism to be effectively managed through the deployment of new functionality or the modification of existing compositions. WSNs are also increasingly expected to support multiple applications in the long-term perspective. In response, reconfigurable component models allow system functionality to evolve to meet changing application requirements. Run-time reconfigurable component models also promote reuse, which is essential in resource-constrained WSN environments. A number of run-time reconfigurable component models have been developed for embedded systems, most notably OpenCOM [4], RUNES [3], and OSGi [13]. These component models address the problems of dynamism, evolution and reuse by offering developers:

– Concrete interfaces that promote the reuse of components between applications.
– On demand component deployment that can be used to manage dynamism and evolution through the injection of new functionality.
– Component rewiring that can be used to modify component compositions on the fly and thus offers a mechanism to manage dynamism and evolution. The ability to dynamically wire a third-party component into a composition also promotes reuse.

In sum, run-time reconfigurable component models allow for reconfiguration of system functionality through the introduction of new components, or the modification of relationships between existing components. However, component-based reconfiguration has two critical disadvantages:

– Coarse granularity: As reconfigurations may be enacted only by modifying relationships between components or deploying new components, component-based reconfiguration is a poor fit for enacting fine-grained changes. Thus, while component-based reconfiguration provides a generic mechanism for enacting changes, it is inefficient when that change may be represented by a few lines of code. This is particularly critical for embedded platforms, such as WSN nodes, where memory is limited and software updates are costly operations.
– Complexity of abstraction level: Component-based reconfiguration is complex and requires a domain expert to be enacted properly. This complexity prevents end-users from tailoring the functionality of the deployed system themselves. Furthermore, simple changes to a component-based system should be expressible at the abstraction level of the end-user.

This paper addresses the problems of coarse granularity and complexity through the introduction of a lightweight policy framework for adapting component behaviour. Policies for this framework are high-level and platform-independent, thus allowing end-users to more easily tailor component behaviour. The performance of this system is evaluated through a number of case studies.

The remainder of this paper is structured as follows: Section 2 provides background on component and policy frameworks for networked embedded systems, while Section 3 presents the design of a policy language and corresponding framework for tailoring component behaviour. An initial prototype of this framework is evaluated based on a case study in Section 4. Section 5 critically discusses advantages and shortcomings of our approach. Finally, Section 6 concludes and presents directions for future work.
2 Background
This section firstly discusses the state of the art in component models for networked embedded systems. Section 2.2 then discusses existing policy-based mechanisms for tailoring component functionality. Finally, Section 2.3 provides a brief overview of the LooCI component model.

2.1 Component Models for Networked Embedded Systems
NesC [6] is perhaps the best known component model for networked embedded systems and is used to implement the TinyOS [12] operating system. NesC provides an event-driven programming approach together with a static component model. NesC components cannot be dynamically reconfigured; however, the static approach of NesC allows for whole-program analysis and optimization.

Maté [11] extends NesC and provides a framework to build application-specific virtual machines. As applications are composed using specific virtual machine instructions, they can be represented concisely, which saves power that would otherwise be consumed by transmitting software modules. However, compared to component-based approaches, Maté has one critical shortcoming: compositions are limited by the functionality that is already deployed on each node, and thus it is not possible to inject new functionality into a Maté application without reflashing each node.

OpenCOM [4] is a general-purpose, run-time reconfigurable component model and, while it is not specifically targeted at networked embedded systems, it has been deployed in a number of WSN scenarios [7]. OpenCOM supports dynamic reconfiguration via a compact runtime kernel. Reconfiguration in OpenCOM is coarse-grained, being achieved through the deployment of new components and the modification of connections between components. The RUNES [3] component model brings OpenCOM functionality to more embedded devices. Along with a smaller footprint, RUNES adds a number of introspection API calls to the OpenCOM kernel. Like OpenCOM, RUNES allows for only coarse-grained component-based reconfiguration.

The OSGi component model [13] targets powerful embedded devices along with desktop and enterprise computers. OSGi provides a secure execution environment, support for run-time reconfiguration and life-cycle management. Unfortunately, while OSGi is suitable for powerful embedded devices, the smallest implementation, Concierge [15], consumes more than 80 kB, making it unsuitable for highly resource-constrained devices.

2.2 Policy Techniques for Tailoring Component Behaviour
Over the last decade, research on policy-based management [2] has primarily been applied to facilitate management tasks, such as component configuration, security, or Quality of Service in large-scale distributed systems. Policy-based management allows requirements about the intended behaviour of a managed system to be specified using a high-level policy language and then automatically enforced in the system. Furthermore, policies can be changed dynamically without having to modify the underlying implementation or requiring the consent or cooperation of the components being governed.

ESCAPE [16] is a component-based policy framework for programming sensor network applications using TinyOS [12]. Similar to our approach, ESCAPE advocates the use of policy rules to govern component behaviour. However, policies in ESCAPE are exclusively used to specify interactions between components, removing interaction code from the individual components, whereas in our approach we apply policy techniques to configure entire component compositions, including the existing information flow. In addition, ESCAPE is implemented on top of the static NesC component model [6], whereas our policy framework builds on top of a more flexible run-time reconfigurable component model.

Recently, the Service Component Architecture (SCA) defined a Policy Framework specification [14], which aims to use policies for describing capabilities and constraints that can be applied to service components or to the interactions between different service components. While not bound to a specific implementation technology, the SCA policy framework focuses on service-oriented environments such as OSGi [13], which may only be applied to relatively powerful embedded devices.

The approach this paper proposes is to combine the key benefits of a run-time reconfigurable component model (i.e. the ability to inject new functionality dynamically and to reason about distributed relationships between components) with the efficiency of policy-based tailoring of functionality. As we will show in Section 4, this reduces the burden on developers while also reducing performance overhead for simple reconfigurations. Furthermore, the policy language we propose is high-level and easy to understand, allowing end-users, as well as domain experts, to customize the functionality of component compositions.

2.3 LooCI: The Loosely-Coupled Component Infrastructure
The Loosely-coupled Component Infrastructure (LooCI) [8] is designed to support Java ME CLDC 1.1 platforms such as the Sun SPOT [17]. LooCI is comprised of a component model, a simple yet extensible networking framework and a common event bus abstraction. LooCI components support run-time reconfiguration, interface definitions, introspection and support for the rewiring of bindings. LooCI offers support for two component types, macrocomponents and microcomponents. Macrocomponents are coarse-grained and service-like, building upon the notion of Isolates inherent in embedded Java Virtual Machines such as Sentilla [1] or SQUAWK [18]. Isolates are process-like units of encapsulation and provide varying levels of control over their execution (exactly what is provided depends on the specific JVM). LooCI standardizes and extends the functionality offered by Isolates. Each macrocomponent runs in a separate Isolate and communicates with the runtime middleware via Inter Isolate RPC (IIRPC), which is offered by the underlying system. Unlike microcomponents, macrocomponents may use multiple threads and utility libraries. Microcomponents are fine-grained and self-contained. All microcomponents run in the master Isolate alongside the LooCI runtime. Unlike macrocomponents, microcomponents must be single threaded and self-contained, using no utility libraries. Aside from these restrictions, microcomponents offer identical functionality to macrocomponents in a smaller memory footprint. Unlike OpenCOM or RUNES, LooCI components are indirectly bound over a lightweight event bus. LooCI components define their provided interfaces as the set of LooCI events that they publish. The receptacles of a LooCI component are similarly defined as the events to which they subscribe. As bindings are indirect, they may be modified in a manner that is transparent to the composition.
Furthermore, as all events are part of a globally specified event hierarchy, it becomes easier to understand and modify data flows.
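To make this indirect binding model concrete, the following minimal sketch models provided interfaces as published event types and receptacles as subscriptions. The names are illustrative and do not reflect the actual LooCI API.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical component contract: the receptacle side of a binding.
interface Component {
    void onEvent(String type, Object payload);
}

class EventBus {
    private final Map<String, List<Component>> subscribers = new HashMap<>();

    // A receptacle is modelled as a subscription to an event type.
    void subscribe(String type, Component c) {
        subscribers.computeIfAbsent(type, t -> new ArrayList<>()).add(c);
    }

    // Rewiring a composition only touches subscriptions; publishers are unaware.
    void unsubscribe(String type, Component c) {
        List<Component> l = subscribers.get(type);
        if (l != null) l.remove(c);
    }

    // A provided interface is the set of event types a component publishes.
    void publish(String type, Object payload) {
        for (Component c : subscribers.getOrDefault(type, List.of()))
            c.onEvent(type, payload);
    }
}

Because publishers hold no references to subscribers, bindings can be rewired transparently to the composition; this is also the property that lets the policy framework of Section 3 interpose on events at a single point.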
3 A Policy-Based Component Tailoring Framework

3.1 Policy Language Design and Tool Support
The specification of policies to tailor component behaviour is accomplished by using policy rules following Event-Condition-Action (ECA) semantics, which correspond well to the event-driven nature of the target embedded platforms. An ECA policy consists of a description of the triggering events, an optional condition, which is a logical expression typically referring to external system aspects, and a list of actions to be enforced in response. In addition, our prototype policy language allows various functions to be called inside the condition and action parts of a policy. By using these policies, we offer end-users a simple, yet powerful method to tailor component behaviour.

In addition, we provide tool support to allow end-users to tailor system behaviour easily. Our tool allows the end-user to firstly select the components and interfaces that can be tailored. Secondly, after specification of the corresponding policies, the tool parses and analyzes each policy for syntactic consistency. Finally, the tool allows the end-user to choose which nodes to deploy the policy to. Concrete examples of the policy language can be found in Section 4.

3.2 Policy Framework Design
As illustrated in Figure 1, the policy framework is deployed on each sensor node and consists of three key components: the Policy Engine, the Rule Manager, and a Policy Distribution component.

Fig. 1. Overview of the policy framework

The Policy Engine is the main component in the framework and is responsible for intercepting events as they pass between two components and evaluating them against the set of policy rules on each node. In case of a match (i.e. a triggering event and a condition evaluating to true), the engine enforces the actions defined in the action part of the matching policy. Typical examples of actions are denying the event to pass, publishing a custom event, or invoking a particular function in the middleware runtime. Potential conflicts between multiple matching policies are handled by following a priority-based ordering of policies, whereby only the actions of the highest-priority policy are executed.

Distribution of policy files from the back-end to the sensor network is achieved using a Policy Distribution component hosted on each individual sensor node. After specification and analysis of a policy by our tool, the policy is transformed into a compact binary representation that can be efficiently disseminated to the sensor nodes. On reception of this binary policy representation, the policy distribution component passes it to the Rule Manager component.

The Rule Manager on each individual sensor node is responsible for storing and managing the set of policy rules on the node. After reception of a binary policy from the distribution component, the rule manager converts the policy into a data structure suitable for more efficient evaluation, which is then passed to the policy engine on a per-triggering-event basis. By retaining the ability to dynamically change the set of policies at run-time, the framework can be adapted according to evolving application demands.
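A minimal sketch of this evaluation logic is given below. The Java types are hypothetical and chosen for illustration; the real engine operates on the compact binary policy structures described above.

import java.util.Comparator;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Predicate;

class Event {
    final String type; final Object payload;
    Event(String type, Object payload) { this.type = type; this.payload = payload; }
}

class Policy {
    final String trigger;               // triggering event type ("*" matches any)
    final int priority;                 // higher value wins on conflict
    final Predicate<Event> condition;   // optional condition
    final Consumer<Event> actions;      // actions to enforce
    Policy(String trigger, int priority, Predicate<Event> condition, Consumer<Event> actions) {
        this.trigger = trigger; this.priority = priority;
        this.condition = condition; this.actions = actions;
    }
}

class PolicyEngine {
    private final List<Policy> rules; // maintained per node by the Rule Manager
    PolicyEngine(List<Policy> rules) { this.rules = rules; }

    // Intercept an event travelling between two components: enforce only the
    // actions of the highest-priority matching policy.
    void intercept(Event e) {
        rules.stream()
             .filter(p -> p.trigger.equals("*") || p.trigger.equals(e.type))
             .filter(p -> p.condition.test(e))
             .max(Comparator.comparingInt(p -> p.priority))
             .ifPresent(p -> p.actions.accept(e));
    }
}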
4 Case-Study Based Evaluation
This section presents a scenario that requires two archetypal reconfigurations of a distributed component composition: (i.) introduction of filtering functionality and (ii.) binding interception and monitoring. For each case, we compare the overhead of realizing reconfigurations using LooCI macrocomponents and microcomponents to that of realizing reconfiguration using the policy framework introduced in Section 3. Specifically, Section 4.1 describes our motivating application scenario. Section 4.2 describes how compositions may be modified through component reconfiguration and policy application. Section 4.3 then considers the overhead for developers inherent in each approach, while Section 4.4 analyzes the memory consumption of each approach. Finally, Section 4.5 explores the performance overhead of component-based versus policy-based reconfiguration.

4.1 Application Scenario
Consider a WSN-based warehouse monitoring scenario, in which a company STORAGE CO provides temperature-controlled storage of goods, wherein the temperature of stored packages is monitored using a WSN running the LooCI middleware. STORAGE CO offers two classes of service for stored goods: best effort temperature control and assured temperature control. The customers of STORAGE CO (CHOCOLATE CO and CHEMICAL CO) each have different storage requirements that evolve over time.

– Best effort temperature control: in this scheme, STORAGE CO sets temperature alarms, which alert warehouse employees if the temperature of a stored package has breached a specified threshold. As the scheme is alarm-based, it generates low levels of traffic, increasing battery life and reducing cost.
– Assured temperature control: in this scheme, STORAGE CO provides continuous data to warehouse employees, who may view detailed temperature data and take pre-emptive action to avoid package spoiling. As this scheme transmits continuous data, it decreases node battery life and increases costs.

Scenario 1. CHOCOLATE CO begins by requesting the assured temperature service level from STORAGE CO; however, due to tightening cost constraints, CHOCOLATE CO later requests their service level to be switched to best effort. CHEMICAL CO begins by requesting the low-cost best effort service; however, stricter government regulations require CHEMICAL CO to increase their coverage to assured temperature control.

Scenario 2. STORAGE CO wishes to perform a detailed analysis of how their WSN infrastructure is being used, and thus deploys functionality to monitor all component bindings in their WSN. This functionality includes accounting of all events that pass.

4.2 Component-Based Modification versus Policy-Based Modification
Scenario 1. This section explores how the changing requirements of the customers of both temperature monitoring schemes can be reflected using (i.) component-based modification of the compositions, and (ii.) a single composition customized using our policy-based approach.

Component-Based Tailoring of Functionality. The assured and best effort temperature monitoring schemes discussed in Section 4.1 may be represented by two distinct component compositions, as shown in Figure 2. In the assured monitoring scheme, a TEMP SENSOR exposes a single interface of type TEMP, which is wired to the matching receptacle of a TEMP MONITORING component. In the best effort temperature monitoring scheme, the TEMP SENSOR component is wired to the matching receptacle of a TEMP ALARM component, the ALARM interface of which is then wired to the matching interface of a TEMP MONITORING component.
Fig. 2. Component configurations
In the case of CHOCOLATE CO, switching from assured to best effort temperature monitoring, the existing TEMP SENSOR component will be unwired from the TEMP MONITORING component and rewired to a TEMP ALARM component, the ALARM interface of which is wired to the TEMP MONITORING component. In the case of CHEMICAL CO, switching from best effort to assured monitoring, the existing TEMP ALARM component will be unwired from the TEMP MONITORING and TEMP SENSOR components. Subsequently, the TEMP interface of the TEMP SENSOR component will be wired to the matching receptacle of the TEMP MONITORING component.

Policy-Based Modification. To enable CHOCOLATE CO to switch from assured to best effort monitoring, the developer needs to specify and enable the following policy with priority 1:

policy "assured-to-best-effort" "1" {
  on TEMP as t; // TEMP contains (source, dest, value)
  if (t.value > 20 && t.dest == TEMP_MONITORING_CHOC_CO) then (
    // publish an ALARM event to TEMP_MONITORING_CHOC_CO
    publish ALARM(t.source, TEMP_MONITORING_CHOC_CO, t.value);
    deny t; // and block TEMP event for further dissemination
  )
}
This policy specifies that the policy engine should intercept all TEMP events, convert those with a temperature value higher than 20 degrees Celsius into ALARM events destined for the TEMP MONITORING component, and block the original TEMP events from further dissemination. To enable CHEMICAL CO to switch from best effort to assured temperature monitoring, the developer needs to specify and enable the following policy:
policy " best - effort - to - assured " " 1 " { on TEMP as t ; if ( t . dest == TEMP_ALARM ) then ( // allow sending to TEMP_ALARM for threshold checking allow t ; t . dest = T E M P _ M O N I T O R I N G _ C H E M _ C O; // change destination publish t ; // assure sending to T E M P _ M O N I T O R I N G _ C H E M _ C O ) }
This policy changes the destination of TEMP events from the TEMP ALARM to the TEMP MONITORING CHEM CO component to enforce the assured monitoring scheme. In addition, it must not break the existing composition (i.e. TEMP events must also be sent to the TEMP ALARM component).

Scenario 2: Insertion of Global Monitoring Behaviour. The network-wide monitoring of component interactions described in Section 4.1 may also be implemented using a component-based or policy-based approach. In either case, the reception and transmission of an event should be logged to an ACCOUNTING component, which stores events for future retrieval and analysis. In order to implement logging or accounting using component-based modification, STORAGE CO would be required to continually probe the network to discover the state of compositions and then insert a BINDING MONITOR interception component into each discovered binding: clearly a resource-intensive process. In contrast, as the LooCI Event Manager provides a common point of interception for all events on each node, a single, generic policy may be inserted to perform equivalent monitoring. As all events are routed through the policy engine, such a configuration is agnostic to the component compositions executing on the WSN, and clearly entails significantly lower overhead. A policy to implement this is shown below:

policy "logging" "1" {
  on * as e; // all events have source, dest, data[] as payload
  then (
    // always do accounting of event occurrence
    invoke ACCOUNTING(e.source, e.dest, e.data[]);
    allow e; // do not block e, allow it to continue
  )
}
While this example is simple, we believe that the ability to install per-node, as well as per-binding, policies to enforce various non-functional concerns may reduce overhead in many scenarios.

4.3 Overhead for the Developer
In this section, we analyze the effort required to implement the TEMP ALARM component and compare this with the effort required to develop a functionally equivalent policy, as described in Section 4.2. Each implementation was analyzed in terms of Source Lines of Code (SLoC). The results are shown in Table 1.
Table 1. Development Effort Comparison

Micro-component  Macro-component  Policy
35 SLoC          35 SLoC          8 SLoC
Perhaps more critical than the conservation of development effort illustrated by the SLoC savings in Table 1 is the high-level and platform-independent nature of the policy specification language, which, unlike a Java-based LooCI component, could equally be applied to a TinyOS [12] or Contiki [5] software configuration where a suitable policy interpreter exists.

4.4 Memory Footprint
The size of the policy framework is 26 kB. Subsequently, we analyzed the static memory (size on disk) and dynamic memory (RAM) consumed by the software elements introduced in Section 4.2. As can be seen in Table 2, policy-based reconfiguration consumes significantly less memory than component-based reconfiguration, a critical advantage in memory-constrained environments like WSNs.

Table 2. Memory Consumption

         Micro-component  Macro-component  Policy
Static   1 kB             1 kB             103 bytes
Dynamic  3 kB             26 kB            376 bytes

4.5 Performance Overhead
We evaluated the performance of policy-based and component-based reconfiguration using a standard Sun SPOT node (180 MHz ARM9 CPU, 512 kB RAM, SQUAWK VM ‘BLUE’ version) and a 3 GHz Pentium 4 desktop with 1 GB of RAM running Linux 2.6 and Java 1.6. We first logged the time required to deploy and initialize the policy specification and component implementation required to achieve the reconfigurations described in Section 4.2. We then analyzed the time each took to handle an incoming TEMP event (i.e. process it and disseminate an ALARM event to the gateway). In each case, the SPOT node was deployed between 20 cm and 30 cm from the network gateway and we performed 50 experiments, the averaged results of which are shown in Table 3.

As can be seen from Table 3, not only is the overhead inherent in deploying and initializing a policy significantly lower than that of deploying and initializing a component, the ongoing per-event performance overhead caused by applying a policy to a binding is also lower than that caused by inserting a new macrocomponent, and equal to that of a microcomponent. In embedded environments where CPU and energy resources are scarce, we believe that policy-based reconfiguration provides concrete benefits over component-based reconfiguration for tailoring compositions, as it does not introduce additional overhead.
Table 3. Performance Comparison

                    Microcomponent  Macrocomponent  Policy
Deployment          11330 ms        11353 ms        200 ms
Initialization      8418 ms         7420 ms         6 ms
Execution overhead  28 ms           43 ms           28 ms

5 Discussion
The evaluation presented in the previous section clearly shows that policy-based modification of component compositions can have significant advantages in terms of: (i.) lowering development overhead, (ii.) reducing memory footprint and (iii.) improving performance. This leads to a critical question: when should component-based modification of functionality be applied, and when should policy-based tailoring of functionality be used? The policy-based approach is suited to enforcing non-functional concerns like accounting or security on component compositions, as these non-functional concerns are orthogonal to the composition and do not radically change the end-to-end information flow in the component composition.

Despite the concrete advantages of policy-based composition modification, this approach is not without drawbacks: it can reduce the reusability of components. In a pure (or functional) component composition, the functionality of each component is solely identified by its type along with the interfaces and receptacles it provides. As the application of policies to component bindings can modify functionality in a manner that is opaque, this can effectively render the component unreliable for use in other compositions and thus reduces the maintainability of the system. Managing long-term system evolution must therefore be done with care. We believe that policies should be used to efficiently realize transient modifications to compositions and to enforce non-functional concerns on compositions.
6 Conclusions
This paper has presented a policy-based framework that can be used to tailor the functionality of component compositions. We have presented a compact and lightweight prototype of this framework realized for the LooCI component model and, through evaluation, we have shown that policy-based tailoring can reduce overhead for developers, reduce memory consumption and improve the performance of reconfiguration when compared to purely component-based reconfiguration approaches. In the short term, future work will focus upon further researching the impact of policy-based modifications on component compositions. In addition, we plan to evaluate policy-based tailoring of functionality in a logistics scenario with concrete WSN end-users. In the longer term, we hope to improve the expressiveness of our policy language, implement prototypes of our policy engine for the OpenCOM [4] and OSGi [13] component models, and evaluate its performance there.
Acknowledgments. Research for this paper was partially funded by the Interuniversity Attraction Poles Programme Belgian State, Belgian Science Policy, Research Fund K.U.Leuven, and is conducted in the context of the IBBT-DEUS project [9] and IWT-SBO-STADiUM project No. 80037 [10].
References

1. Sentilla Perk Platform (July 2009), http://www.sentilla.com/
2. Boutaba, R., Aib, I.: Policy-based management: A historical perspective. J. Network Syst. Manage. 15(4), 447–480 (2007)
3. Costa, P., Coulson, G., Mascolo, C., Mottola, L., Picco, G.P., Zachariadis, S.: Reconfigurable component-based middleware for networked embedded systems. International Journal of Wireless Information Networks 14(2), 149–162 (2007)
4. Coulson, G., Blair, G., Grace, P., Taiani, F., Joolia, A., Lee, K., Ueyama, J., Sivaharan, T.: A generic component model for building systems software. ACM Trans. Comput. Syst. 26(1), 1–42 (2008)
5. Dunkels, A., Gronvall, B., Voigt, T.: Contiki - a lightweight and flexible operating system for tiny networked sensors. In: Proceedings of the 29th Annual IEEE International Conference on Local Computer Networks (LCN 2004), Washington, DC, USA, pp. 455–462. IEEE Computer Society, Los Alamitos (2004)
6. Gay, D., Levis, P., von Behren, R.V., Welsh, M., Brewer, E., Culler, D.: The nesC language: A holistic approach to networked embedded systems. In: PLDI 2003: Proceedings of the ACM SIGPLAN 2003 Conference on Programming Language Design and Implementation, pp. 1–11. ACM Press, New York (2003)
7. Hughes, D., Greenwood, P., Blair, G., Coulson, G., Grace, P., Pappenberger, F., Smith, P., Beven, K.: An experiment with reflective middleware to support grid-based flood monitoring. Conc. Comp.: Pract. Exper. 20(11), 1303–1316 (2008)
8. Hughes, D., Thoelen, K., Horré, W., Matthys, N., Del Cid, J., Michiels, S., Huygens, C., Joosen, W.: LooCI: a loosely-coupled component infrastructure for networked embedded systems. Technical Report CW 564, K.U.Leuven (September 2009)
9. IBBT-DEUS project (July 2009), https://projects.ibbt.be/deus
10. IWT STADiUM project 80037. Software technology for adaptable distributed middleware (July 2009), http://distrinet.cs.kuleuven.be/projects/stadium/
11. Levis, P., Culler, D.: Maté: a tiny virtual machine for sensor networks. In: Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems, New York, USA, pp. 85–95 (2002)
12. Levis, P., Madden, S., Gay, D., Polastre, J., Szewczyk, R., Woo, A., Brewer, E.A., Culler, D.E.: The emergence of networking abstractions and techniques in TinyOS. In: Proceedings of the 1st Symposium on Networked Systems Design and Implementation (NSDI 2004), March 2004, pp. 1–14 (2004)
13. OSGi Alliance. About the OSGi Service Platform, whitepaper, rev. 4.1 (June 2007)
14. OSOA. SCA Policy Framework. SCA Version 1.00 (March 2007)
15. Rellermeyer, J.S., Alonso, G.: Concierge: a service platform for resource-constrained devices. SIGOPS Oper. Syst. Rev. 41(3), 245–258 (2007)
16. Russello, G., Mostarda, L., Dulay, N.: ESCAPE: A component-based policy framework for sense and react applications. In: Chaudron, M.R.V., Szyperski, C., Reussner, R. (eds.) CBSE 2008. LNCS, vol. 5282, pp. 212–229. Springer, Heidelberg (2008)
17. Sun Microsystems. Sun SPOT world (July 2009), http://www.sunspotworld.com/
18. Sun Squawk Virtual Machine (July 2009), http://squawk.dev.java.net/
MapReduce System over Heterogeneous Mobile Devices Peter R. Elespuru, Sagun Shakya, and Shivakant Mishra Department of Computer Science University of Colorado, Campus Box 0430 Boulder, CO 80309-0430, USA
Abstract. MapReduce is a distributed processing algorithm which breaks up large problem sets into small pieces, such that a large cluster of computers can work on those small pieces in an efficient, timely manner. MapReduce was created and popularized by Google, and is widely used as a means of processing large amounts of textual data for the purpose of indexing it for search later on. This paper examines the feasibility of using smart mobile devices in a MapReduce system by exploring several areas, including quantifying the contribution they make to computation throughput, end-user participation, power consumption, and security. The proposed MapReduce System over Heterogeneous Mobile Devices consists of three key components: a server component that coordinates and aggregates results, a mobile device client for iPhone, and a traditional client for reference and to obtain baseline data. A prototypical research implementation demonstrates that it is indeed feasible to leverage smart mobile devices in heterogeneous MapReduce systems, provided certain conditions are understood and accepted. MapReduce systems could see sizable gains of processing throughput by incorporating as many mobile devices as possible in such a heterogeneous environment. Considering the massive number of such devices available and in active use today, this is a reasonably attainable goal and represents an exciting area of study. This paper introduces relevant background material, discusses related work, describes the proposed system, explains obtained results, and finally, discusses topics for further research in this area.

Keywords: MapReduce, iPhone, Android, Mobile Platforms, Apache, Ruby, PHP, jQuery, JavaScript, AJAX.
1 Introduction
Distributed computing has come into its own in the Internet age. Such a large computational pool has given rise to endeavors such as the SETI@Home [14] and Folding@Home [7] projects, which both allow any willing person to surrender a portion of their desktop computer or laptop to a much larger computational goal. In the case of SETI@Home, millions of users participate to analyze data in search of extraterrestrial signals therein, whereas Folding@Home is a bit more practical: its goal is “to understand protein folding, misfolding, and related diseases”. These systems, along with others mentioned later, are conceptually similar to what we propose, which is a system that allows people to participate freely in these kinds of massive, computationally bound problems so that results may be quickly obtained.

There are many similar approaches to solving large computationally intensive problems. One of the most famous of these is the problem of providing relevant search of the Internet itself [2]. Google has emerged as the superior provider of that capability, and a portion of that superiority comes by way of the underlying algorithms in use to make their process efficient, elegant, and reliable [13], namely MapReduce [4]. MapReduce is similar to other mechanisms employing parallel computations, such as parallel prefix schemes [12] and scan primitives [3], and is even fairly similar to blocked sort-based indexing algorithms [16].

We believe there exists a blatant disregard of certain capable devices [11] in the context of these kinds of distributed systems. Existing implementations have neglected the mobile device computation pool, and we suspect this is due to a number of factors which hamper most current mobile devices. It seems only smart phones are powerful enough, computation-wise, for most of these distributed workloads. There are many additional concerns as well that have been covered by prior work, such as power usage, security concerns [9] and potential interference with the device’s intended usage model as a phone. All of these factors limit the viability of incorporating mobile devices into a distributed system. It is our belief that despite these limitations, there are solutions that allow the inclusion of the massive smart phone population [6] in a distributed system.

One logical progression of MapReduce, and other such distributed algorithms, is toward smart mobile devices, primarily because there are so many of them, and they are largely untapped. Even a small-scale incorporation of this class of device can have an enormous impact on the systems at large and how they accomplish their goals. Increases in data volume underscore the need for additional computational power, as the world continues to create far more data than it can realistically and meaningfully process [5]. Using smart mobile devices, in addition to the more traditional set of servers, is one possible way to increase computational power for these kinds of systems, and is exactly what we attempt to prove and quantify in specific cases by leveraging prior work on MapReduce.

This paper explores the feasibility of using smart mobile devices in a MapReduce system by exploring several areas, including quantifying the contribution they make to overall computation throughput, end-user participation, power consumption, and security. We have implemented and experimented with a prototype of a MapReduce system that incorporates three types of devices: a standard Linux server, an iPhone, and an iPhone simulator. Preliminary results from our performance measurements support our claim that mobile devices can indeed contribute positively to a large heterogeneous MapReduce system, as well as to similar systems. Given that the number of smart phones is clearly on the rise, there is immense potential in using them to build computationally intensive parallel processing applications.
The rest of the paper is organized as follows. In Section 2, we briefly outline the MapReduce system. Section 3 touches on similar endeavors. In Section 4, we describe the design of our system, and in Section 5, we describe the implementation details. In Section 7, we discuss experimental results measured from our prototype implementation. Next, we discuss some optimizations in Section 8 and then finally conclude our paper in Section 10.
2 MapReduce
MapReduce [4] is an increasingly popular programming paradigm for distributed data processing, above and beyond merely indexing text. At the highest architectural level, MapReduce is comprised of a few critical pieces and processes. If you have a large collection of documents or text, that corpus must be broken into manageable pieces, called splits. Commonly, a split is one line of a document, which is the model we follow as well. Once split, a master node must assign splits to workers, who then process each piece, store some aspect of it locally, but ultimately return it to the master node or something else for reduction. The reduction then typically partitions the results for faster usage, accounting for statistics, document identification and so on.

We describe the MapReduce process in three phases: Map, Emit, and Reduce (see Figure 1). In our system, the Map phase is responsible for taking a large data set and chunking it into splits. The Emit phase entails the distributed processing nodes obtaining and working on the splits, and returning a processed result to another entity, the master node or job server that coordinates everything. Unlike most MapReduce implementations, the nature of mobile devices precludes us from using anything other than network communications to read and write data, as well as to assign jobs and process them. The final phase is Reduce, which in our case further minimizes the received results into a unique set of data that ultimately gets stored in a database for simplicity.

Fig. 1. High Level Map Reduce Explanation

For example, given a large set of plain text files over which you may wish to search by keyword, a MapReduce system begins with a master node that takes all those text files, splits them up line by line, and parcels them out to participants. The participating computation nodes find the unique set of keywords in each line of text they were given, and emit that set back to the master node. The master node, after getting all of the pieces back, aggregates all of the responses to determine the overall unique set of keywords for that whole set of data, and stores the result in a database, file, or some other persistent storage medium. It is at this point that the data can be analyzed and searched whatever way desired from within our web application.

One of the biggest strengths of MapReduce lies in its inherent distribution of phases, which results in an extremely high degree of reliable parallelism when implemented properly. MapReduce is both fault and slow-response tolerant, which are very desirable characteristics in any large distributed system.
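The keyword example above can be condensed into a few lines. The following sketch shows a map step that extracts the unique keywords from one split and a reduce step that merges the emitted sets into a keyword index; our prototype clients are written in PHP, Perl and Objective-C, so the Java here is purely illustrative.

import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class KeywordMapReduce {

    // Map: one split (a line of text) in, the set of unique keywords out.
    static Set<String> map(String line) {
        Set<String> keywords = new HashSet<>();
        for (String token : line.toLowerCase().split("\\W+"))
            if (!token.isEmpty()) keywords.add(token);
        return keywords;
    }

    // Reduce: merge per-split results into a keyword -> documents index.
    static void reduce(Map<String, Set<String>> index, String docId, Set<String> emitted) {
        for (String keyword : emitted)
            index.computeIfAbsent(keyword, k -> new HashSet<>()).add(docId);
    }

    public static void main(String[] args) {
        Map<String, Set<String>> index = new TreeMap<>();
        reduce(index, "doc1", map("The quick brown fox"));
        reduce(index, "doc2", map("The lazy dog"));
        System.out.println(index); // e.g. {brown=[doc1], dog=[doc2], ...}
    }
}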
3 Related Work
There have been a number of other explorations of heterogeneous MapReduce implementations and their performance [15], as well as some more unique expansions on the idea, such as using JavaScript in an entirely client-side browser processing framework [8] for MapReduce. None of this related work, however, focuses on using a mobile device pool as a major computation component. To complement these related works, we focus on mobile devices and, in particular, on the specifics of heterogeneity in the context of mobile devices mixed with more traditional computation resources.
4 The Heterogeneous Mobile Device MapReduce System
Our problem encompasses three areas: 1) Provide a mechanism for interested parties to participate in a smart phone distributed computational system, and ensure they are aware of the potential side effects; 2) Make use of this opt-in device pool to compute something and provide aggregate results; and 3) Provide meaningful results to interested parties, and summarize them in a timely fashion, considering the reliability of devices on wireless and cellular networks. Our solution is the Heterogeneous Mobile Device MapReduce System.

Fig. 2. System Summary

There are several key components in our system: 1) A server which acts as the master node and coordinator for MapReduce processing; 2) Server-side client code used to provide faster, more powerful client processing in conjunction with mobile devices; 3) The mobile device client, which implements MapReduce code to get, work on, and emit results of data from the master node; and finally 4) The BUI, or browser user interface (web application), which lets the results be searched (see Figure 2).

The MapReduce master node server leverages the Apache [17] web server for HTTP. To provide the MapReduce stack, we actually have two different implementations of our master node/job server code, one in Ruby [18] and one in PHP [19]. However, we primarily used the PHP variant during our testing. Once the master node has been seeded with some content to process, it is told to begin accepting participant connections. Once the process begins, clients of any type, mobile or traditional, may connect, get work, compute and return results. During processing, clients, whether they are mobile devices or processes running on a powerful server, can continually request work and compute results until nothing is left to do for a given collection. At that point, the server still responds to requests, but does not return work units, since the cycle is complete (see Figure 3).

Fig. 3. Client Flow

After all the data has been processed, our web application front end is used to search for keywords throughout the documents which were just processed. The web application was implemented in PHP and makes use of the jQuery [20] JavaScript framework to provide asynchronous (AJAX) page updates as workers complete units, in real-time. More can be seen in Figure 2. Further, Figure 4 illustrates exactly what the entire process looks like.
Fig. 4. Work Loop
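As an illustration of the loop in Figure 4, the sketch below shows a worker that repeatedly asks the master for a split, computes the unique keywords, and emits the result. The /get_work and /emit endpoints and the host name are invented for the example; our actual clients are written in Perl and Objective-C.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class WorkerLoop {
    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();
        String master = "http://master.example.com:8080"; // hypothetical job server

        while (true) {
            // Ask the master node for the next split.
            HttpResponse<String> work = http.send(
                HttpRequest.newBuilder(URI.create(master + "/get_work")).GET().build(),
                HttpResponse.BodyHandlers.ofString());
            if (work.body().isEmpty()) break; // collection fully processed

            String result = computeUniqueKeywords(work.body()); // the map step

            // Emit the processed result back to the master for reduction.
            http.send(HttpRequest.newBuilder(URI.create(master + "/emit"))
                          .POST(HttpRequest.BodyPublishers.ofString(result)).build(),
                      HttpResponse.BodyHandlers.discarding());
        }
    }

    static String computeUniqueKeywords(String split) {
        java.util.TreeSet<String> keywords = new java.util.TreeSet<>(
            java.util.Arrays.asList(split.toLowerCase().split("\\W+")));
        keywords.remove(""); // discard the empty token split() can produce
        return String.join(",", keywords);
    }
}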
5 System Development
There are a few additional aspects of developing this system that warrant discussion. Our experience with the development environment and the lessons learned are worth sharing as well.

5.1 Mobile Client Application Development Experience
We developed our mobile client application on the iPhone OS platform using the iPhone SDK, Cocoa Touch framework and Objective-C programming language. As part of the iPhone SDK, the XCode development environment was used for project management, source code editing and debugging. To run and test the MapReduce mobile client, we used the iPhone simulator in addition to actual devices. Apple’s Interface Builder provided a drag-and-drop tool to develop the user interface very rapidly. All in all, the experience was extremely positive [10].

5.2 Event Driven Interruption on iPhone
Event handling on the iPhone client proved rather interesting, due largely to the fact that certain events can override an application and take control of the device against the application’s will. While the iPhone is processing data, other events like an incoming phone call, an SMS message or a calendar alert can take control of the device. In the case of an incoming phone call, the application is paused. Once the user hangs up, the iPhone client is relaunched by iPhone OS, but it is up to the application to maintain state. While on the call, if the user goes back to the home screen or launches another application, the iPhone client does not resume, and again the application is responsible for maintaining state. When an SMS message or calendar event occurs, the computation continues in the background unless the user clicks on the message or views the calendar dialog. In the latter case, the behavior is the same as when there is a phone call. These events, which are entirely out of the control of the application, pose an interesting challenge and must be addressed during development.
6 End-User Participation
Participants fall largely into two different camps, captive and voluntary. For example, if a system such as ours were deployed in a large corporation where most employees have company-provided mobile devices, that company could require employees to allow their devices to participate in the system. These are what we consider captive users. Normal users, on the other hand, are true volunteers, and participate for different reasons. The key is to come up with methods which engage both of these types of users so that the overall experience is positive for everyone involved. There are a large number of possible solutions to entice both types of users. Both captive and voluntary users could be offered prizes for participation, or perhaps simply receive accolades for being the participant with the most computed work units. This is similar to what both SETI@Home and Folding@Home do, and has proven effective. The sense of competition and participation drives people to team up with the hope of being the most productive participant. This topic is discussed further later on as well.
7 Results
Our results were very interesting. We created several data sets composed of randomly generated text files of varying sizes. Data set sizes overall ranged from 5 MB to almost 50 MB. Within those data sets, each individual text document ranged in size as well, from a few kilobytes up to roughly 64 kilobytes each. Processing throughput was largely consistent, independent of both the overall data set size and the distribution of included document sizes.

Fig. 5. Client Type Comparison

Figure 5 illustrates exactly what we expected would be the case. The simulated iPhone clients were the fastest, followed by the traditional Perl clients, and lastly the real iPhone clients, which processed data at the slowest rate of all clients tested. The reason this behavior was expected is that the simulated iPhone clients ran on the same machine as the server software during our tests, while the Perl clients were executed on remote Linux machines. Interestingly though, mixing and matching client types didn’t seem to impact the contribution of any one particular client type. Perl clients processed data at roughly the same rate independent of whether a given test included only Perl clients, as did simulated and real iPhone clients.

Fig. 6. Min Max Average

Figure 6 presents another visualization that clearly shows there was a fair amount of variation in the different client types. Again, the simulated iPhone clients were able to process the most data, primarily because they were run on the same machine as the server component. The traditional Perl clients were not far behind, and the real iPhone clients were the laggards of the bunch.
7.1 Interpretation of the Results
Simulated iPhone clients processed an average of 1.64 MB/sec, Perl clients processed an average of 1.29 MB/sec, and finally, real iPhone clients processed an average of 0.12 MB/sec. The simulated iPhone clients can be thought of as another form of local client, and they help highlight the difference and overhead in the wireless connection and processing capabilities of the real phones. These results and averages were consistent across a variety of data sets, both in terms of size and textual content.

Fig. 7. Projected System Throughput

Our results show, very consistently, that the iPhones were capable of performing at roughly an order of magnitude slower than the traditional clients, which is a very exciting result. It implies that a large portion of processing could be moved to these kinds of mobile clients, if enough exist at a given time to perform the necessary workload. For example, a company could purchase one server to operate as the master node and farm all of the processing out to mobile devices within their own company, provided they have on the order of one hundred or more employees with such devices, which is a very likely scenario. This also suggests that this system could be particularly useful for non-time-sensitive computations. For example, if a company had a large set of text documents it needed processed, it could install a client on its employees’ mobile devices. Those devices in turn could connect and work on the data set over a long period of time, so long as they are capable of processing data faster than it is being created. Considering how easy it is to quantify the contribution each device type is capable of making, such a system could very easily monitor its own progress. In summary, there are a large number of problems to which this system is a viable and exciting solution.

We had a limited number of actual devices to test with (3 to be specific), but all performed consistently across all tests and data sets, so we feel comfortable projecting forward to estimate the impact of even more devices. As the number of actual devices increases, throughput should grow similarly to what is represented in Figure 7. If a system utilized 500 mobile devices, we expect that system would be capable of processing close to 60 MB/sec of textual data. Similarly, 10,000 devices would likely yield the ability to process 1,200 MB/sec (1.2 GB/sec!) of data. This certainly suggests our system warrants further exploration, but points to the fact that other components of the system would start becoming bottlenecks. For example, at those rates, it would take massive network bandwidth to even support the data transfer necessary for the processing to take place.
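These projections are simple linear extrapolations from the measured per-device rate; a minimal sketch of the arithmetic, deliberately ignoring server and network bottlenecks:

// Back-of-the-envelope projection: aggregate throughput scales linearly with
// device count at the measured 0.12 MB/sec per real iPhone.
public class ThroughputProjection {
    public static void main(String[] args) {
        final double perDeviceMBs = 0.12; // measured average for real iPhones
        for (int devices : new int[] {100, 500, 10000})
            System.out.printf("%5d devices -> %8.1f MB/sec%n", devices, devices * perDeviceMBs);
        // 500 -> 60.0 MB/sec, 10000 -> 1200.0 MB/sec, matching Figure 7
    }
}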
8 Optimizations
There are a few areas where certain aspects of this system could be improved to provide a more automatic and ideal experience. It is particularly important that the end-user experience be as automatic and elegant as possible.
8.1 Automatic Discovery
Currently, the client needs to know the IP address and port number of the server in order to participate. This requires prior knowledge of the server address information, which may be a barrier to entry for our implementation of MapReduce. In order to allow auto-discovery, we could run Bonjour (aka mDNS), a service discovery protocol, on the server and clients. Bonjour automatically broadcasts the service being offered for use. With Bonjour enabled on the server and clients, a WiFi network is not an absolute requirement. However, there are some limitations of using Bonjour as the service discovery protocol. Namely, all devices must be on the same subnet of the same local area network, which imposes maximum client limits that would minimize the viability of our system in those situations. A sketch of what this could look like on the server side is shown below.
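This sketch uses the open-source JmDNS library, a Java implementation of mDNS/DNS-SD; the service type, instance name and port are our own invented choices, not part of any standard.

import java.net.InetAddress;
import javax.jmdns.JmDNS;
import javax.jmdns.ServiceInfo;

public class DiscoverableMaster {
    public static void main(String[] args) throws Exception {
        JmDNS jmdns = JmDNS.create(InetAddress.getLocalHost());
        ServiceInfo info = ServiceInfo.create(
            "_mapreduce._tcp.local.", // service type (invented)
            "mapreduce-master",       // instance name (invented)
            8080,                     // port the job server listens on
            "path=/get_work");        // free-form properties
        jmdns.registerService(info);  // broadcast until unregistered
        System.out.println("Master node advertised via mDNS");
    }
}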
8.2 Device Specific Scaling
An important goal of this system is that it can be used on heterogeneous mobile devices. As such, not all mobile devices perform the same or have the same power usage characteristics. The system should ideally know about each type of device it can run on and maintain a profile of sorts, so that it can optimize itself. For example, on a Google Android device it would have one profile, and on an iPhone it would have another, and in each case the client application would tailor itself to the environment on which it is running. The ultimate goal is to maximize performance relative to power consumption; a sketch of this profile idea follows.
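In the sketch below, the client picks a work-unit size and a duty cycle suited to the hardware it finds itself on; the profile values are invented purely for illustration.

import java.util.Map;

public class DeviceProfiles {
    record Profile(int splitBytes, double maxCpuFraction) {}

    private static final Map<String, Profile> PROFILES = Map.of(
        "iPhone",  new Profile(16 * 1024, 0.5),   // smaller splits, half duty cycle
        "Android", new Profile(32 * 1024, 0.5),
        "desktop", new Profile(256 * 1024, 1.0)); // full-size splits, full speed

    // Fall back to a conservative profile for unknown devices.
    static Profile profileFor(String deviceModel) {
        return PROFILES.getOrDefault(deviceModel, new Profile(8 * 1024, 0.25));
    }
}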
8.3 Other Client Types
In addition to smart mobile devices of various types, and traditional clients, there are other kinds of clients which could be used in conjunction with the other two varieties. In particular, a JavaScript client would allow any web browser to connect and participate [8] in the system. The combination of these three types of clients would be formidable indeed, and form a potentially massive computational system.
9 Additional Considerations
There are a number of areas we did not explore as part of our implementation. However, the following topics would need to be considered in an actual production implementation.
9.1 Security
Security is a multi-faceted topic in this context. Our primary concerns are twofold: first, whether the client implementation can impact the security of end-users’ mobile devices, or in any way be used to compromise them; second, whether the data mobile devices receive in the course of participating is information that would be risky to have out in the open, were a device compromised by some other means. A production implementation would need to ensure that even if a mobile device is compromised by some other means, any data associated with this processing system is inaccessible. One way to accomplish this might be to improve the client to store its local results in encrypted form, as sketched below, and to transmit them via HTTPS to ensure the communication channel also minimizes the opportunity for compromise. Another consideration that must be made is whether to process sensitive information in the first place. In fact, certain regulations may even prevent it altogether.
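The following sketch shows one such hardening option using standard Java crypto: intermediate results are sealed with AES-GCM before they touch local storage. Key management (where the key lives on the device) is deliberately out of scope here.

import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

public class ResultProtection {
    public static void main(String[] args) throws Exception {
        // Generate a per-install AES key (storage of the key is out of scope).
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(128);
        SecretKey key = kg.generateKey();

        // Fresh random nonce for each sealed record.
        byte[] iv = new byte[12];
        new SecureRandom().nextBytes(iv);

        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
        byte[] sealed = cipher.doFinal("keyword1,keyword2".getBytes(StandardCharsets.UTF_8));
        // persist iv + sealed locally; transmit to the master node over HTTPS
        System.out.println(sealed.length + " bytes sealed");
    }
}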
9.2 Power Usage
Power usage is a very critical topic when considering a system such as this. The additional load placed on mobile devices will certainly draw more power, which could even be disastrous in some situations. For example, if the mobile device is an emergency phone, running down the battery to participate in a computation cycle is a very bad idea. Ultimately, power usage must be considered when deciding which devices to allow in the mix. There are a number of things which may be done to address these concerns, such as adding code to the mobile client that would prevent it from participating once the battery has dropped below a certain level. This may prove particularly tricky, however, since not all mobile platforms include API calls that allow an application to probe for that kind of low-level system information. A balance must be reached, and it is the responsibility of the client application implementation to ensure that balance is maintained.
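The guard described above might be sketched as follows; get_battery_percent() is a hypothetical stand-in for a platform API, and the conservative fallback covers platforms that expose no such call.

```c
/* Hypothetical platform call: returns 0..100, or -1 if unknown. */
int get_battery_percent(void);

int may_participate(int threshold_pct, int on_ac_power)
{
    if (on_ac_power)
        return 1;                 /* charging: always allowed */
    int level = get_battery_percent();
    if (level < 0)
        return 0;                 /* no battery info: be conservative */
    return level >= threshold_pct;
}
```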
9.3 Participation Incentives
Regardless of whether a participating end user is a captive corporate user or a truly voluntary end user, there should be an incentive structure that rewards participation in a manner that benefits all parties. There are several incentives that could be considered. One potential way would be to offer some kind of reward for participating, based for example on the number of work units completed. This would entice users to participate even more. A community or marketplace could even be set up around this concept. For example, companies could post documents they want processed and offer to pay some amount per work unit completed. Users could agree to participate for a given company and allow their device to churn out results as quickly as possible. This would have to be a small amount of money per work unit to be viable, perhaps a few cents each. Such a marketplace could easily become quite large and be very beneficial to all involved. Amazon has a similar concept in place with its Mechanical Turk, which
allows people to post work units which other people then work on for a small sum of money [1]. Another possibility would be to bundle the processing into applications where it runs in the background, such as a music player, so that work goes on continually while the media player is playing music. The incentive could be a discount of a few cents when purchasing songs through that application, relative to some number of successfully completed jobs. The possibilities are numerous.
10 Conclusions
As is clearly evident in our results, mobile devices can certainly contribute positively in a large heterogeneous MapReduce system. The throughput gain from even a few tens of mobile devices is substantial, and will only grow as more and more mobile devices participate. Assuming a good server implementation exists, the mobile client contribution should increase with each new mobile device added. It is expected there would be a point of diminishing returns relative to network communication overhead, but the potential benefit is still very real. If non-captive user bases could be properly motivated, there is a large potential here to process massive amounts of data for a wide range of uses. This is conceptually similar to existing cloud computing, except that the computation and storage resources happen to be mobile devices, or they interoperate between the traditional cloud and a new set of mobile cloud resources.
References

1. Amazon, Inc.: Amazon Mechanical Turk, https://www.mturk.com/mturk/welcome
2. Barroso, L.: Web Search for a Planet: The Google Cluster Architecture. IEEE Micro 23(2) (March 2003)
3. Blelloch, G.E.: Scans as Primitive Parallel Operations. IEEE Transactions on Computers 38(11) (November 1989)
4. Dean, J., Ghemawat, S.: MapReduce: Simplified Data Processing on Large Clusters. In: OSDI (2004)
5. Dubey, P.: Recognition, Mining, and Synthesis Moves Computers to the Era of Tera. Technology@Intel Magazine (February 2005)
6. Egha, G.: Worldwide Smartphone Sales Analysis, UK (February 2008)
7. Folding@Home: Folding@Home project, http://folding.stanford.edu/
8. Grigorik, I.: Collaborative MapReduce in the Browser (2008)
9. Hunkins, J.: Will Smartphones be the Next Security Challenge? (October 2008)
10. iPhone Developer Program: iPhone development, http://developer.apple.com/iphone/program/develop.html
11. Krazit, T.: Smartphones Will Soon Turn Computing on its Head. CNet (March 2008)
12. Ladner, R.E., Fischer, M.J.: Parallel Prefix Computation. Journal of the ACM 27(4) (October 1980)
13. Mitra, S.: Robust System Design with Built-in Soft-Error Resilience. IEEE Computer 38(2) (February 2005)
14. SETI@Home: SETI@Home Project, http://setiathome.ssl.berkeley.edu/
15. Zaharia, M., Konwinski, A., Joseph, A.D., Katz, R., Stoica, I.: Improving MapReduce Performance in Heterogeneous Environments. In: OSDI (2008)
16. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, Cambridge (2008)
17. Apache Web Server: Apache, http://httpd.apache.org/
18. Ruby Programming Language: Ruby, http://www.ruby-lang.org/en/
19. PHP Programming Language: PHP, http://www.php.net/
20. jQuery JavaScript Framework: jQuery, http://jquery.com/
Towards Time-Predictable Data Caches for Chip-Multiprocessors

Martin Schoeberl, Wolfgang Puffitsch, and Benedikt Huber
Institute of Computer Engineering, Vienna University of Technology, Austria
[email protected], [email protected], [email protected]

Abstract. Future embedded systems are expected to use chip-multiprocessors to provide the execution power for increasingly demanding applications. Multiprocessors increase the pressure on the memory bandwidth, and processor-local caching is mandatory. However, data caches are known to be very hard to integrate into worst-case execution time (WCET) analysis. We tackle this issue from the computer architecture side: provide a data cache organization that enables tight WCET analysis. Similar to the cache splitting between instruction and data, we argue to split the data cache for different data areas. In this paper we show cache simulation results for the split-cache organization, propose the modularization of the data cache analysis for the different data areas, and evaluate the implementation costs in a prototype chip-multiprocessor system.
1 Introduction

With respect to caching, memory is usually divided into instruction memory and data memory. This cache architecture was proposed in the first RISC architectures [1] to resolve the structural hazard of a pipelined machine where an instruction has to be fetched concurrently to a memory access. The independent caching of instructions and data has also enabled the integration of instruction cache hit classification into worst-case execution time (WCET) analysis [2]. While analysis of the instruction cache is a mature research topic, data cache analysis is still an open problem. After n accesses with unknown addresses to an n-way set-associative cache, the abstract cache state is lost.

In previous work we have argued about cache splitting in general [3]. We have argued that caches for data with statically unknown addresses shall be fully associative. In this paper we evaluate time-predictable data cache solutions in the context of the Java virtual machine (JVM). We provide simulation results for different cache organizations and sketch the resulting modular analysis. Furthermore, an implementation in the context of a Java processor shows the resource consumption and limitations of highly associative cache organizations. Access type examples are taken from the JVM implemented on the Java processor JOP [4]. Implementation details of other JVMs may vary, but the general classification of the data areas remains valid. Part of the proposed solution can be adapted to other object-oriented languages, such as C++ and C#, as well.

S. Lee and P. Narasimhan (Eds.): SEUS 2009, LNCS 5860, pp. 180–191, 2009.
© IFIP International Federation for Information Processing 2009
2 Data Areas and Access Instructions

The memory areas used by the JVM can be classified into five categories:

Method area. The instruction memory that contains the bytecodes for execution. On compiled Java systems this is the native code area.
Stack. Thread-local stack used for stack frames, arguments, and local variables.
Class information. A data structure representing the different types. Contains the type description, the method dispatch table, and the constant pool.
Heap. Garbage-collected heap of class instances. The object header, which contains auxiliary information, is stored on the heap or in a distinct handle area.
Class variables. Shared memory area for static variables.

Caching of the method area and the stack area has been covered in [5] and [6]. In this paper we are interested in a data cache solution for the remaining data areas. On standard cache architectures these memory areas and the stack memory share the same data cache.

2.1 Data Access Types

Data memory accesses (except stack accesses) can be classified as follows:

CLINFO. Type information, method dispatch table, and interface dispatch table. The method dispatch table is read on virtual and static method invocation and on the return from a method. The method dispatch table contains two words per method. Bytecodes: new, anewarray, multianewarray, newarray, checkcast, instanceof, invokestatic, invokevirtual, invokespecial, invokeinterface, *return.
CONST. Constant pool access; part of the class information. Bytecodes: ldc, ldc_w, ldc2_w, invokeinterface, invokespecial, invokestatic, invokevirtual.
STATIC. Access to static fields; the class variables area. Bytecodes: getstatic, putstatic.
HEADER. Dynamic type, array length, and fields for garbage collection. The type information on JOP is a pointer to the method dispatch table within CLINFO. On JOP each reference is accessed via one indirection, called the handle, to simplify the compacting garbage collection. The header information is part of the handle area. Bytecodes: getfield, putfield, *aload, *astore, arraylength, invokevirtual, invokeinterface.
FIELD. Object field access; part of the heap. Bytecodes: getfield, putfield.
ARRAY. Array access; part of the heap. Bytecodes: *aload, *astore.

2.2 Cache Access Types

The different types of data cache accesses can be classified into four classes w.r.t. the cache analysis:

– The address is always known statically. This is the case for static variables (STATIC), which are resolved at link time, and for the constant pool (CONST), which only depends on the currently executed method.
– The address depends on the dynamic type of the operand, but not on its value. Therefore, the set of possible addresses is restricted by the receiver types determined for the call site. The class info table, the interface table and the method table are in this category (CLINFO).
– The address depends on the value of the reference. The exact address is unknown, as some value on the managed heap is accessed, but in addition to the symbolic address a relative offset is known. Instance fields and array fields, both showing some degree of spatial locality, belong to this category (FIELD, ARRAY).
– The last category contains handles, references to the method dispatch table, and array lengths (HEADER). They reside on the heap as well, but we only know the symbolic address.

2.3 Cache Coherence

For a chip-multiprocessor system the cache coherence protocol is the major limiting factor on scalability. Splitting data caches also simplifies the cache coherence protocol. Most data areas are actually constant (CLINFO, CPOOL). Access into the handle area (HEADER) is pseudo-constant: the data is written into the header area during object creation and cannot be changed by a thread. However, the garbage collector can modify this area. To provide a coherent view of the handle area between the garbage collector and the mutators, a cache for the handle area has to be updated or invalidated appropriately. Data on the heap (FIELD, ARRAY) and in the static area (STATIC) is shared by all threads. With a write-through cache, coherence can be enforced by invalidating the cache on monitorenter and before reads from volatile fields.
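As an illustration of this rule, a sketch of the invalidation on monitorenter follows; the cache-control primitives are hypothetical, since the actual mechanism is part of the processor/JVM implementation.

```c
/* Illustrative sketch only: with write-through caches, making other
 * threads' writes visible on monitorenter just requires invalidating
 * the shared data areas. */
enum cache_area { C_STATIC, C_FIELD, C_ARRAY };

void cache_invalidate(enum cache_area a);   /* hypothetical primitive */
void lock_acquire(void *monitor);           /* hypothetical primitive */

void monitor_enter(void *monitor)
{
    lock_acquire(monitor);
    cache_invalidate(C_STATIC);
    cache_invalidate(C_FIELD);
    cache_invalidate(C_ARRAY);
    /* constant areas (CLINFO, CPOOL) need no invalidation */
}
```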
3 Cache Benchmarks

Before developing a new cache organization we run benchmarks to evaluate memory access patterns and possible cache solutions. Our assumption is that the average-case hit rate correlates with the hit classification in the WCET analysis when different access types are cached independently. Therefore, we can reason about useful cache sizes from the benchmark results. For the benchmarks we use two real-world embedded applications [7]: Kfl is one node of a distributed control application and Lift is a lift controller deployed in industrial automation. The Kfl example is a very static application, written in a conservative, procedural style. The application Lift was written in a more object-oriented style. Furthermore, two benchmarks from an embedded TCP/IP stack (UdpIp and Ejip) are used to collect performance data. Table 1 shows the access frequencies for the different memory areas for all benchmarks. There are no write accesses to the constant data areas and also no write accesses to the pseudo-constant area (HEADER). As we measure applications without object allocation at runtime, the data in the HEADER area is not mutated. The general trend is that load instructions dominate the memory traffic (between 89% and 93%).
Table 1. Data memory traffic to different memory areas (in % of all data memory accesses)

          Kfl          Lift         UdpIp        Ejip
        load store   load store   load store   load store
CLINFO  31.2   0.0    7.4   0.0   14.4   0.0   10.7   0.0
CONST   11.4   0.0    2.6   0.0   13.8   0.0   12.3   0.0
STATIC  28.3   7.6    2.6   0.6    8.8   1.1   12.3   3.4
HEADER  14.3   0.0   50.5   0.0   39.1   0.0   39.4   0.0
FIELD    0.0   0.0   24.9   0.8    6.3   1.8    6.4   1.0
ARRAY    3.9   3.2    4.7   5.7   10.7   4.0   10.6   4.0
For the Kfl application there are no field accesses (FIELD). Dominating accesses are to static fields (STATIC), static method invocation (CLINFO), and access to the constant pool (CONST). The rest of the accesses are related to array accesses (HEADER, ARRAY). The Lift application has a quite different access pattern: instance field accesses dominate all reads (FIELD and HEADER). Fewer methods are invoked than in the Kfl application and fewer static fields are accessed. The array access frequency of both applications is similar (4%–5%); for the TCP/IP benchmark it is considerably higher (11% loads), due to many buffer manipulations.

3.1 Cache Simulations

As a first step we simulate different cache configurations with a software simulation of JOP (JopSim) and evaluate the average-case hit count.

Handle Cache. As all operations on objects and arrays need an indirection through the handle, we first simulate a cache for the handle. The address of the handle is not known statically; therefore we assume a small fully-associative cache with LRU replacement policy. The results for this cache are shown in Table 2 for different sizes. The size is in single words. It is interesting to note that even a single-entry cache provides a hit rate for the handle indirection access of up to 72%. Caching a single handle should be simple enough that single-cycle hit detection, with the memory read starting in the same cycle, should be possible. In that case, even a uniprocessor JOP with a two-cycle memory read will gain some speedup. A size of just 8 entries results in a reasonable hit rate between 84% and 95%.

Constants and the Method Table. Mixing access to the method table and access to the constant pool in one direct-mapped cache is an option when the receiver types can be determined precisely. However, if the set of possible receiver types is large, the analysis becomes less precise. Therefore, we evaluate individual caches for the constant pool access (CPOOL) and the access to the method table (CLINFO). Table 3 shows that a small direct-mapped cache of 512 words (2 KB) gives a hit rate of 100%. Keeping the cache sizes small is important for our intended system. We are targeting chip-multiprocessor systems with private caches, even for accesses to constants, to keep the individual tasks time-predictable. A shared cache would not allow any cache analysis of individual tasks.
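The following C sketch shows the kind of fully-associative LRU hit/miss bookkeeping such a simulation performs (our reconstruction for illustration; JopSim itself is not shown in the paper).

```c
#include <string.h>

/* Fully-associative cache with true LRU replacement: entry 0 is the
 * most recently used, entry WAYS-1 the least recently used. */
#define WAYS 8

static unsigned lines[WAYS];   /* cached handle addresses */
static int      valid[WAYS];

/* Returns 1 on hit, 0 on miss. */
int access_handle(unsigned addr)
{
    int i, hit = 0;
    for (i = 0; i < WAYS; i++)
        if (valid[i] && lines[i] == addr) { hit = 1; break; }
    if (!hit)
        i = WAYS - 1;                 /* evict the LRU entry */
    /* shift entries down and place addr at the MRU position */
    memmove(&lines[1], &lines[0], i * sizeof lines[0]);
    memmove(&valid[1], &valid[0], i * sizeof valid[0]);
    lines[0] = addr;
    valid[0] = 1;
    return hit;
}
```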
Table 2. Hit rate of a handle cache, fully associative, LRU replacement

              Hit rate (%)
Size    Kfl   Lift   UdpIp   Ejip
   1     72     15      43     69
   2     82     20      80     78
   4     84     94      87     82
   8     88     95      91     84
  16     92     95      94     84
  32     95     95      96     86
Table 3. Hit rate of a constant pool cache, direct mapped

              Hit rate (%)
Size    Kfl   Lift   UdpIp   Ejip
  32     68     69      77     82
  64     96     69      79     95
 128     98     69      88     95
 256    100    100     100     95
 512    100    100     100    100

Table 4. Hit rate of a method table cache, direct mapped

              Hit rate (%)
Size    Kfl   Lift   UdpIp   Ejip
  32     64     83      62     49
  64     85     83      77     74
 128     91    100      85     93
 256    100    100      97     95
The hit rate of a direct-mapped cache for the method table (MTAB) shows similar behavior to the constant pool caching, as shown in Table 4. A size of 256 words gives a hit rate between 95% and 100%. It has to be noted that the method table is accessed by static and virtual methods. While the MTAB entry is known statically for static methods, the MTAB entry for virtual methods depends on the receiver type. If data-flow analysis can determine most of the receiver types, the combination of a single cache for the constant pool and the method table is an option to explore further.

Static Fields. Table 5 shows the results for a direct-mapped cache for static fields. For object-oriented programs (represented by Lift), this cache can be kept very small. Although the addresses are statically known, as are the addresses of the constants, a combination of these two caches is not useful: static fields need to be kept cache coherent, while constant pool entries are implicitly cache coherent. Cache coherence enforcement, with cache invalidation at synchronized blocks, limits the hit rate in UdpIp and Ejip.
Table 5. Hit rate of a static field cache, direct mapped

              Hit rate (%)
Size    Kfl   Lift   UdpIp   Ejip
  32     76    100      33     77
  64     85    100      33     77
 128     99    100      33     77
 256    100    100      33     77
Table 6. Hit rate of an instance field cache, fully associative, LRU replacement

              Hit rate (%)
Size    Kfl   Lift   UdpIp   Ejip
   1     84     17      47      9
   2     84     75      59     13
   4     84     86      65     18
   8     84     88      67     20
  16     84     88      67     20
  32     84     88      67     20
Object Fields. Addresses of object fields are unknown to the analysis. Therefore, we can only attack the analysis problem via high associativity. Table 6 shows hit rates of fully-associative caches with LRU replacement policy. For the Lift benchmark we observe a moderate hit rate of 88% for a very small cache of just 8 entries. UdpIp and Ejip saturate at 8 entries due to cache invalidation during synchronized blocks of code.

3.2 Summary

From the experiments with simulation of different caches for different memory areas we see that quite small caches can provide a reasonable hit rate. However, as the memory access latency for a CMP system with time-sliced memory arbitration can be quite high (our 8-core CMP prototype with a time slot of 6 cycles per core has a worst-case latency of 48 cycles), even moderate cache hit rates are a reasonable improvement.
4 Cache Analysis

In the following section we sketch the cache analysis as it will be performed in a future version of our WCET analysis tool [8]. We leverage the cache splitting of the data areas for a modular analysis, e.g., analysis of heap-allocated objects is independent from analysis of the cache for constants or the cache for static fields.
In multithreaded programs, it is necessary to invalidate the cache when entering a synchronized block or reading from volatile variables. (The semantics of volatile variables in the Java memory model is similar to that of synchronized blocks: the complete global state has to be locally visible before the read access; simply bypassing the cache for volatile accesses is not sufficient.) We require that accesses to shared data are properly synchronized, which is the correct way to access shared data in Java. In this case it is safe to assume that object references on the heap are not changed by another thread at arbitrary points in the program, resulting in a significantly more precise analysis. The effect of synchronization, namely invalidating some of the caches, has to be taken into account though.

The running example is illustrated in Fig. 1 and was taken from the Lift application. The figure comprises the source code of the method checkLevel and the corresponding control flow graph in static single assignment (SSA) form. Each basic block is annotated with the cache accesses it triggers.

4.1 Static and Type-Dependent Addresses

If we only deal with statically known addresses in a data cache, the standard cache hit/miss classification (CHMC) for instruction caches delivers precise results and is therefore a good choice [9]. In the example, there is only one static variable, LEVEL_POS. If we assume a direct-mapped cache for static variables, and a separate one for values on the heap, all but the first access to the field will be a cache hit every time checkLevel is executed.

When the address depends on the type of the operand, we have to deal with a set of possible addresses. The straightforward extension of CHMC to sets of memory addresses is to update the abstract cache state for each possible address and then join the resulting states. This leads to a very pessimistic classification when dynamic dispatch is used, however, and therefore is only acceptable if the exact address is known for most references.

4.2 Persistence Analysis

If dynamic types are more common, a more promising approach is to classify program fragments, or partitions, where it is known that one or all memory addresses are locally persistent. If this is the case, they will be missed at most once during one execution of the program fragment. For both direct-mapped and N-way set-associative caches with LRU replacement, a dataflow analysis for persistence has been introduced in [10]. For FIFO caches, the concept of persistence is useful as well, but it is not safe anymore to assume that a persistent address will be loaded at the first access.

Most work on persistence analysis focuses on dataflow equations and global persistence, leaving out some aspects which deserve more attention. Persistence for the whole program is rare and only of theoretical interest. We therefore identify a set of nested scopes [11], and identify for each scope which cache lines or cache sets are locally persistent. A scope is a subgraph of the control flow graph which represents a
Fig. 1. Source code of the method checkLevel (taken from the Lift application) and the corresponding control flow graph in SSA form; each basic block is annotated with the cache accesses it triggers
Fig. 1. Task and resource generators: (a) the task generator automaton; (b) the non-preemptive resource generator automaton
The task generator automaton uses a variable next_release to remember the time until the next task is released. At start-up this variable is initialized with the smallest R(i), and if after that next_release = 0 the automaton goes to the Ready location and selects a task τi for which R(i) = 0, updates next_release, sets the shared variable ready_task = i and sends the ready signal to the scheduler automaton. Once next_release becomes greater than 0, the generator moves to the Idle location where it waits for the next tick of the Timer. When the tick signal arrives, the transition to the Increment location is taken and inc_time() updates the values status(i), E(i) and R(i) as follows:

- for all tasks τi with status(i) = running, E(i) = E(i) + MIN and if E(i) = W(i) then status(i) = finished,
- for all tasks τi, R(i) = R(i) − MIN and next_release = min(R(i)),
- for all tasks τi running or ready for execution with E(i) < W(i) and P(i) − D(i) = R(i), status(i) is set to overrun.

Next, for all tasks τj that have finished, the variable finished_task is set to j and the finished signal is sent to the scheduler, which will free the resources used by these tasks. If any task τj has missed its deadline, an overrun signal notifies the scheduler, which as a result will go to an Error location. After signaling all task finish events the generator checks whether any task is ready for execution and goes back to the Ready location.
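A C rendering of these update rules might look as follows; the array names mirror the model's variables and are our assumption, not code taken from the authors' model.

```c
#include <limits.h>

#define NTASKS 8
enum { IDLE, READY, RUNNING, FINISHED, OVERRUN };

int status[NTASKS];
int E[NTASKS], R[NTASKS], W[NTASKS], P[NTASKS], D[NTASKS];
int MIN = 1;            /* timer granularity */
int next_release;

void inc_time(void)
{
    next_release = INT_MAX;
    for (int i = 0; i < NTASKS; i++) {
        if (status[i] == RUNNING) {          /* rule 1: advance E(i) */
            E[i] += MIN;
            if (E[i] == W[i]) status[i] = FINISHED;
        }
        R[i] -= MIN;                         /* rule 2: advance R(i) */
        if (R[i] < next_release) next_release = R[i];
        if ((status[i] == RUNNING || status[i] == READY) &&
            E[i] < W[i] && P[i] - D[i] == R[i])
            status[i] = OVERRUN;             /* rule 3: deadline miss */
    }
}
```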
3.2 Resource Generator Automaton
The task generator automaton presented above can be used to generate servers which act as resources for the component level. By adding just two signals - active and inactive - to notify the scheduler about the availability of the resources, the task generator automaton becomes a resource generator automaton with the property that those resources are preemptible. If resources are not preemptible (i.e. the scheduling windows of a time partition), the resource generator automaton is a simplified version of the task generator. Figure 1(b) presents the non-preemptive version of the resource generator automaton. The automaton keeps a discrete clock RE(k) for each resource rk. Also, RR(k) is used to remember the time until the next activation of resource rk, and two variables named next_release and next_finish hold the time until the next resource activation and deactivation, respectively. When resource rk is activated, RE(k) = L(k), where by L(k) we denote the length of the resource's activation period. At every tick signal received from the Timer, RE(k) is decreased by MIN for all active resources rk, and the variables next_release and next_finish are also decreased by the same value. When next_release reaches 0, all resources rk with RR(k) = 0 are activated. If next_finish becomes 0, then all resources rk with RE(k) = 0 are deactivated.
3.3 Scheduler Automaton
As can be seen from the definitions in the previous sections, the component scheduler and the application scheduler have rather similar behavior. Both of
them must schedule a set of periodic tasks/servers with deadlines less than or equal to their period. The component tasks are scheduled on execution time servers which may be active or inactive. It is possible for two or more servers to be active simultaneously, which implies that two or more tasks may run in parallel. For the application scheduler, the tasks to be scheduled are actually the servers used by the component scheduler as resources. The servers are scheduled for execution on the scheduling windows of a time partition. The scheduling windows represent the resources allocated to the application by the system. As more scheduling windows can be active simultaneously, parallel execution of the servers is also possible.

A scheduler automaton for a service (i.e. application or component) contract has the following characteristics:

- has a queue holding the tasks ready for execution,
- implements a preemptive scheduling policy Sch representing a sorting function for the task queue,
- maintains a map between active resources (servers or scheduling windows) and tasks using those resources, and
- has an Error location which is reached when a task misses its deadline.

To record the status of a resource, let rt_map(j) be a map where rt_map(j) = inactive denotes that resource j is inactive, rt_map(j) = active means that resource j is active but no task is executing on it, and rt_map(j) = i denotes that resource j is active and is currently used by task τi. Figure 2 shows the scheduler automaton. The locations of the automaton have the following interpretations:

1. Idle - denotes the situation when no task is ready for execution or no resources are active,
2. Prepare - a task has been released and a resource is active after a period during which either there were no tasks to schedule or no active resources,
3. Running - at least one task is currently executing,
active? res_no++ rt_map[ready_res]=ACTIVE
overrun?
inactive? res_no-task_no==0 active? res_no++
Error
res_no==0 ready? task_no++
finish? task_no--
Idle
task_no>0 activate? res_no++ activated_task=Sc h()
Prepare
task_ready==true activated_task=Sch() rt_map[rk]=activated_task go! res_no>0 ready? task_no++
rt_map[ready_res]=ti go!
task_ready==false && running==true
Running
Prio(ready_task)>Prio(tj) && res_no>0 enqueue(tj) preempt!
task_ready==false && running==true ready? task_no++ enqueue(ready_task)
inactive? res_no— running==false && res_no==0
AssignTask
AssignResource Check rt_map[finished_res]!=ACTIVE enqueue(rt_map[finished_res]) preempt! rt_map[finished_res]=INACTIVE
Fig. 2. The scheduler automaton
4. AssignTask - a task has just finished and as a result an active resource can be used to schedule another ready task,
5. AssignResource - a task has just been released or a resource has just become inactive, leaving its assigned task with no resource on which to execute; consequently the task has to be enqueued, and if it has the highest priority in the queue according to Sch then an active resource is assigned to it,
6. Check - a resource has become inactive,
7. Error - the task set is not schedulable with Sch.

The scheduler enters the Idle location when either there are no ready tasks, no active resources, or both. As long as new tasks are released for execution but there are no active resources on which the tasks can be executed (i.e. task_no > 0 and res_no == 0), or as long as there are available resources but no ready tasks (i.e. task_no == 0 and res_no > 0), the scheduler stays in the Idle location. If the scheduler receives a ready signal, meaning that task τready_task has been released, and res_no > 0, the scheduler goes to the Prepare location. Leaving the Prepare location for the Running location, it assigns the task to one of the active resources by setting rt_map(j) = i, sets the variable activated_task = i and sends a go signal to notify the task generator automaton that task τi is running. After the scheduler has reached the Running location, it will leave this location if one of the following situations happens:

- the resource rk becomes active (signaled by the active signal and activated_resource = k): this is marked by updating rt_map[k] = ACTIVE on the transition to the AssignTask location. If tasks are ready for execution, the scheduler will assign the highest priority task τj to resource rk by setting rt_map[k] = j and will notify the task generator with the signal go on a transition back to the Running location.
- a new task τi has been released (signaled by the ready signal and ready_task = i): the task is enqueued by setting status(i) to ready on the transition to the AssignResource location. If task τi is the highest priority released task and there are active resources then τi must start executing. If there is a free active resource then task τi is assigned to it; otherwise the lowest priority task is chosen from the running tasks and preempted, and the automaton goes to the AssignTask location. On the transition from AssignTask to Running the resource is assigned to τi and a go signal is sent to the task generator to notify it that task τi has started running.
- the resource rk becomes inactive (signaled by the inactive signal and deactivated_resource = k): this is marked by updating rt_map[k] = INACTIVE on the transition to the Check location. If the deactivated resource was free and there are still running tasks but no tasks in the queue, then the transition back to the Running location is taken. If a task τi was using resource rk, then the scheduler must set status(i) = ready and go to the AssignResource location. Should the resource rk be the last active resource, the scheduler would simply preempt task τi and go back to the Idle location; otherwise an active resource is searched for analogously to the situation when a new task is released.
- the task τj finishes (signaled by the finish signal and finished_task = j): the resource used until now by τj can be assigned to the highest priority task waiting in the queue, if there is such a task (see the sketch after this list).
- the task τi misses its deadline (signaled by overrun): the scheduler automaton goes into the Error location.
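The resource-to-task assignment step used by several of the transitions above can be sketched in C as follows; the queue and notification helpers are assumed for illustration and are not part of the authors' model.

```c
#define NRES 4
#define ACTIVE   (-1)
#define INACTIVE (-2)

int rt_map[NRES];            /* task index, or ACTIVE / INACTIVE */

int  dequeue_highest(void);  /* pops highest-priority ready task,
                                returns -1 if the queue is empty */
void notify_go(int task);    /* "go!" signal to the task generator */

/* Resource k has just become free (task finished or resource newly
 * active): hand it to the best queued task, if any. */
void assign_task(int k)
{
    int t = dequeue_highest();
    if (t >= 0) {
        rt_map[k] = t;       /* resource k now runs task t */
        notify_go(t);
    } else {
        rt_map[k] = ACTIVE;  /* active but currently unused */
    }
}
```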
4 Performance Analysis
This section presents an evaluation of the performance and scalability of model checking the contract-based scheduling model. The experiments were run on a machine with an Intel Core 2 Quad 2.40 GHz processor and 4 GB RAM running Ubuntu. The analysis of the model was automated using UPPAAL, and the utility program memtime [13] was used for measuring the model checking time and memory usage. Although the proposed model addresses scheduling at two levels, namely task level and server level, experiments were conducted only for the server level, as we consider the analysis of the task level to be a replica of the server level due to the similarities between the two levels. In all experiments, to verify schedulability we checked whether the property A[] not Error holds.

In order to observe the behavior of the model for different numbers of application servers we used randomly generated sets of servers with periods in the range [10, 100] and utilizations (i.e. budget/period) generated with a uniform distribution in the range [0.05, 1]. The offset of each server was set to the period multiplied by a randomly generated number in the interval [0, 0.3]. Also, the server sets were accommodated by a time partition with 9 scheduling windows and a total utilization of 4.5. Figure 3 shows how the model checking time and memory usage increase with the number of servers in the set. It can also be noticed that for the same server set size the model checking performance can vary within rather large limits (e.g. for sets of 30 servers the model checking time grows from 7 seconds to approximately 25 seconds). This is due to the size of the hyper-period of the server sets: the larger the hyper-period, the larger the model checking time and memory consumption.

Next, we analyzed the scalability and performance of model checking when the number of scheduling windows in the time partition accommodating the servers varies. For this, sets with 25 servers each and parameters in the same limits as for the first experiment were generated, and time partitions with 2, 3, 5, 7 and 9 scheduling windows were tested.
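The hyper-period blamed above for the growth is simply the least common multiple of the server periods; a small sketch of computing it:

```c
/* lcm of all server periods = length of the repeating schedule. */
long gcd(long a, long b) { return b ? gcd(b, a % b) : a; }

long hyper_period(const long *period, int n)
{
    long h = 1;
    for (int i = 0; i < n; i++)
        h = h / gcd(h, period[i]) * period[i];  /* lcm(h, period[i]) */
    return h;
}
/* Periods drawn from [10, 100] can easily yield hyper-periods of many
 * thousands of time units, which inflates the explored state space. */
```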
Fig. 3. Influence of server set size on model checking performance: (a) model checking time [sec] and (b) model checking memory usage [MB] vs. server set size [no. of servers]

Fig. 4. Influence of time partition size on model checking performance: (a) model checking time [sec] and (b) model checking memory usage [MB] vs. time partition size [no. of scheduling windows]

Fig. 5. Influence of scheduling policy on model checking time [sec] vs. task set size [no. of tasks]

Fig. 6. Schedulability of task sets: success rate [%] vs. task set utilization
In Figure 4 it can be seen that both the time for checking the model and the memory usage grow with the number of scheduling windows in the time partition.

In the first two experiments the server sets were scheduled using the Rate Monotonic (RM) priority scheduling policy. The goal of our next experiment is to determine the impact of the scheduling policy on the model checking time and peak memory usage. The same time partition configuration as in the first experiment was used, and sets of 5, 10, 15, 20, 25 and 30 servers were scheduled using the Rate Monotonic, the Earliest Deadline First (EDF), and the T-C (i.e. the higher the difference between the period and the budget of a server, the lower its priority) scheduling policies. As can be seen in Figure 5, the scheduling policy has little influence on the performance of the model checking.

In the last experiment we are interested in the influence of the task set utilization on the schedulability analysis. We used the same time partition as in the first experiment, with a total utilization of 4.5, and task sets of 20 and 30 tasks with utilizations between 1 and 4.5, scheduled using the Rate Monotonic policy. Figure 6 depicts the number of schedulable task sets identified by our analysis. It can be noticed that even if the total utilization of a task set is maximal with respect to the available resources, our analysis is able to determine its schedulability, which is a clear advantage over the pessimistic schedulability bounds presented in [4].
5 Conclusions
In this paper we have presented a compositional approach using the timed automata formalism for schedulability analysis of component-based real-time applications which utilize multi-processor resource partitions. Starting with the assumption that the resource requirements for each application and component are stipulated in a service contract, we have defined a timed automata model for specifying the contracts and shown how to use model checking as a technique for analyzing the preemptive schedulability of a hierarchy of such contracts. The performance analysis of our technique using the UPPAAL model checker showed that, even with just one real-time clock used for the entire model, the applicability of the technique is limited by the state-explosion problem.
Acknowledgment

This research is supported by eMuCo, a European project funded by the European Union under the Seventh Framework Programme (FP7) for research and technological development.
References

1. Lipari, G., Bini, E.: A methodology for designing hierarchical scheduling systems. Journal of Embedded Computing 1(2), 257–269 (2005)
2. Harbour, M.G.: Architecture and contract model for processors and networks. Technical Report D-AC1, Universidad de Cantabria (2006)
3. Brandenburg, B.B., Anderson, J.H.: Integrating hard/soft real-time tasks and best-effort jobs on multiprocessors. In: ECRTS 2007: Proceedings of the 19th Euromicro Conference on Real-Time Systems, Washington, DC, USA, pp. 61–70. IEEE Computer Society, Los Alamitos (2007)
4. Chang, Y., Davis, R., Wellings, A.: Schedulability analysis for a real-time multiprocessor system based on service contracts and resource partitioning. Technical Report YCS-2008-432, Computer Science Department, University of York (2008)
5. Kaiser, R.: Combining partitioning and virtualization for safety-critical systems. White Paper, SYSGO AG (2007)
6. Alur, R., Dill, D.L.: A theory of timed automata. Theoretical Computer Science 126(2), 183–235 (1994)
7. Larsen, K.G., Pettersson, P., Yi, W.: UPPAAL in a nutshell. International Journal on Software Tools for Technology Transfer 2(1), 134–152 (1997)
8. Cassez, F., Larsen, K.G.: The impressive power of stopwatches. In: Palamidessi, C. (ed.) CONCUR 2000. LNCS, vol. 1877, pp. 138–152. Springer, Heidelberg (2000)
9. Henzinger, T.A., Kopke, P.W., Puri, A., Varaiya, P.: What's decidable about hybrid automata? In: STOC 1995: Proceedings of the 27th Annual ACM Symposium on Theory of Computing, pp. 373–382. ACM, New York (1995)
10. Fersman, E., Krcal, P., Pettersson, P., Yi, W.: Task automata: Schedulability, decidability and undecidability. Information and Computation 205(8), 1149–1172 (2007)
11. Krcal, P., Stigge, M., Yi, W.: Multi-processor schedulability analysis of preemptive real-time tasks with variable execution times. In: Raskin, J.-F., Thiagarajan, P.S. (eds.) FORMATS 2007. LNCS, vol. 4763, pp. 274–289. Springer, Heidelberg (2007)
12. Guan, N., Gu, Z., Deng, Q., Gao, S., Yu, G.: Exact schedulability analysis for static-priority global multiprocessor scheduling using model-checking. In: Obermaisser, R., Nah, Y., Puschner, P., Rammig, F.J. (eds.) SEUS 2007. LNCS, vol. 4761, pp. 263–272. Springer, Heidelberg (2007)
13. Memtime utility, http://freshmeat.net/projects/memtime/
Exploring the Design Space for Network Protocol Stacks on Special-Purpose Embedded Systems

Hyun-Wook Jin and Junbeom Yoo
Department of Computer Science and Engineering, Konkuk University, Seoul 143-701, Korea
{jinh,jbyoo}@konkuk.ac.kr

Abstract. Many special-purpose embedded systems such as automobiles and aircraft consist of multiple embedded controllers connected through embedded network interconnects. Such network interconnects have particular characteristics and thus different communication requirements. Accordingly, we frequently need to implement new protocol stacks for embedded systems. Implementing new protocol stacks on embedded systems has significant design space, but it has not been explored in detail. In this paper, we aim to explore the design space of network protocol stacks for special-purpose embedded systems. We survey several design choices very carefully so that we can choose the best design for a given network with respect to performance, portability, complexity, and flexibility. More precisely, we discuss design alternatives for implementing new network protocol stacks over embedded operating systems, methodologies for verifying the network protocols, and designs for network gateways. Moreover, we perform case studies for the design alternatives and methodologies discussed in this paper.

Keywords: Embedded Networks, Embedded Operating Systems, Network Protocol Stacks, Formal Verification, Protocol Verification, Network Gateway.
1 Introduction

Many special-purpose embedded systems consist of multiple embedded controllers connected through network interconnects. For example, machine control systems such as automobiles and aircraft are equipped with hundreds of embedded controllers or boards, which collaborate by communicating with each other. Since such embedded systems use special network interconnects and have different communication requirements, there are many cases where new protocol stacks need to be implemented. Implementing new protocol stacks on embedded systems has significant design space, but it has not been explored in detail. Thus it is highly desirable to analyze the possible design alternatives and present case studies of them as references.

In this paper, we aim to explore the design space of network protocol stacks for special-purpose embedded systems. Legacy protocol stacks such as TCP/IP already have several implementations, which can help significantly to reduce the time frame of the design and implementation phases. Adding new protocol stacks, however, requires significant cost in terms of time and complexity from the industry perspective.

S. Lee and P. Narasimhan (Eds.): SEUS 2009, LNCS 5860, pp. 240–251, 2009.
© IFIP International Federation for Information Processing 2009
Therefore, we need to consider several design choices very carefully so that we can choose the best design for a given network with respect to performance, portability, complexity, and flexibility. In this paper, we present various design alternatives and compare them in several aspects. Moreover, we perform case studies for the design alternatives. The rest of the paper is organized as follows: Section 2 discusses the design alternatives for implementing new network protocol stacks over embedded operating systems. Section 3 describes the methodologies for verifying the network protocols. Section 4 addresses the network interoperability issue and discusses designs for the network gateway. Finally, we conclude the paper in Section 5.
2 Protocol Stacks on Embedded Nodes

In this section, we explore the design and implementation alternatives of network protocol stacks on embedded nodes. The designs can depend highly on operating systems and their task models, but we try to generalize this discussion as much as possible so that the designs described can be applied to most embedded operating systems. One of the most important issues when implementing new network protocol stacks is who takes charge of multiplexing and demultiplexing of network packets. Accordingly, we classify the possible designs into two: i) user-level design and ii) kernel-level design.

2.1 User-Level Design

In this design alternative, the protocol stacks are implemented as a user-level thread or process, which performs (de)multiplexing across networking tasks. The user-level protocol stacks can be portable across several embedded operating systems as far as they follow standard interfaces such as POSIX. The overall designs are shown in Figure 1.

As we have mentioned, the way to implement new network protocol stacks is dependent on the task models of operating systems. Many embedded operating systems such as VxWorks [18] and uC/OS-II [19] define thread-based tasks on top of a flat memory model, in which the user-level protocol stacks are implemented as a user thread. On the other hand, some other embedded operating systems such as Embedded Linux and QNX [20] define isolated memory spaces between tasks. In such systems, the user-level protocol stacks are implemented as a user process in general. Though most of these process-based task models also support multiple threads, the design of process-based protocol stacks is still attractive. This is because, in this task model, if we implement the protocol stacks as a thread, they can only support the threads belonging to the same process. That is, thread-based protocol stacks over process-based task models are not suitable to support multiple networking processes.

In either thread- or process-based design, the protocol stacks send the network packets by accessing the network device driver directly. Thus the device drivers have to provide the interfaces (e.g., APIs or system calls) for the network protocol stacks. The user-level tasks request the protocol stacks to send their packets through Inter-Process Communication (IPC). In the case of the thread-based design, since the protocol stacks share the memory space with other tasks, the network data can be directly accessed from the protocol thread without data copy as far as the synchronization is
guaranteed by an IPC such as a semaphore. On the other hand, the process-based protocol stacks need to pass the network data between the networking tasks and the protocol stacks explicitly by using an IPC such as a message queue. This can add message passing overhead because the messaging IPCs usually require memory copy operations to move data between two different memory spaces.

On the receiver side, how it works is similar to the sender side; however, there is a critical design issue of how to detect incoming new packets. Since the protocol stacks are implemented at the user level, there is no proper asynchronous signaling mechanism at the device driver to notify the user-level protocol stacks of new packet arrival. Thus, the interfaces provided by the device driver are the only way to check new packet arrival. However, if the interface has blocking semantics then the protocol stacks cannot handle other requests (e.g., sending requests) from the tasks while waiting for a new packet to arrive. There are two solutions to overcome this issue: one is to use an asynchronous interface and the other is to have multithreaded protocol stacks. The asynchronous interface is easy to use, but it is hard to come up with an optimal strategy of calling the interface in terms of when and how frequently. Thus it is likely to achieve lower performance than what the actual network can provide or to waste processor resources. Instead, the multithreaded protocol stacks can have separate threads to serve the sending and receiving operations respectively. That is, for both thread- and process-based designs, the protocol stacks consist of a set of threads. The only difference is that the multiple threads belong to the same process in the case of the process-based design. The receiving thread can block waiting for a new packet while the sending thread handles the requests from the tasks. Once the new packet has been recognized by returning from the blocked function, the receiving thread of the protocol stacks interprets the header and passes the packet to the corresponding process through an IPC.

Since the protocol stacks are implemented at the user level, they are scheduled like other user-level tasks by the task scheduler. If we give the protocol stacks the same priority as other tasks, the execution of the protocol stacks can get delayed, which results in high network latency. Thus it is desirable that the protocol stacks have higher priority than general user-level tasks and block waiting for received packets or sending requests, which allows other tasks to utilize the processor resources if there are no pending jobs for the protocol stacks.

As a case study we have implemented the Network Service of Media Oriented System Transport (MOST) [1] at the user level over Linux [2]. MOST is an automotive high-speed network to support multimedia data streaming. The current MOST standard specifies 25 Mbps ~ 150 Mbps network bandwidth with QoS support. To meet the demands of various automotive applications, MOST provides three different message channels: control, stream, and packet message channels. Network Service is the transport protocol for the control messages, which covers layer 3 to parts of layer 7 of the OSI 7 layers. In order to implement Network Service, we have applied the process-based design where the protocol stacks consist of sending and receiving threads. We have utilized the ioctl() system call to provide interfaces between the protocol stacks and the device driver. We have also implemented a library for applications, which provides interfaces to interact with the protocol stacks using a POSIX message queue. The performance results show a one-way latency of 0.9 ms with an 8-byte control message.
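For illustration, the send path of the process-based design could look like the following sketch; the queue name "/proto_tx" and the raw-payload message layout are our assumptions, not details of the MOST implementation.

```c
#include <stddef.h>
#include <fcntl.h>
#include <mqueue.h>

/* A networking task hands its payload to the protocol-stack process
 * through a POSIX message queue. */
int send_via_stack(const char *payload, size_t len)
{
    mqd_t q = mq_open("/proto_tx", O_WRONLY); /* created by the stack */
    if (q == (mqd_t)-1)
        return -1;
    int rc = mq_send(q, payload, len, 0);     /* copies into the queue */
    mq_close(q);
    return rc;  /* the protocol process mq_receive()s and builds packets */
}
```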
Fig. 1. User-level protocol stacks: (a) thread-based design and (b) process-based design
2.2 Kernel-Level Design

In this design alternative, the protocol stacks are implemented as part of the operating system. Thus we do not need to move data between the device driver and the protocol stacks, because both use the kernel memory space and can share the network buffer. In addition, since the kernel context has higher priority than the user context, kernel-level protocol stacks can guarantee the network performance. Accordingly, this design has more potential for achieving better performance than the user-level protocol stacks. It may, however, require modifications of the kernel, which are not portable across operating systems.

As shown in Figure 2, we classify the kernel-level design into bottom half based design and device driver based design, according to where the protocol stacks are implemented (especially for the receiver side). Traditional protocol stacks are generally implemented as a bottom half. In such a design, when a packet has been received from the network controller, the interrupt handler simply puts it on a queue shared with the bottom half. Then the bottom half takes care of most of the protocol processing, including demultiplexing. The bottom half is scheduled by the interrupt handler when there are no interrupts to be processed. On the other hand, in the device driver based design, the entire protocol stacks are implemented in the device driver. Therefore, if the protocol stacks are heavy like TCP/IP, the device driver based design may not be suitable.

In the case of the kernel-level design, the user tasks request a sending operation through a system call. The system call eventually passes the request to the device driver. On the sender side, the main difference between the two design alternatives is that, in the case of the bottom half based design, the kernel performs most of the protocol processing before passing down the user request to the device driver. It is to be noted that the data copy operation between the user and kernel spaces should be carefully designed. With either a synchronous or an asynchronous interface, we can copy the user data into the kernel and return immediately; however, this results in copy overhead. On the contrary, we can avoid the copy operation by delaying the notification of completion, but this can hinder the application's progress.
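A sketch of the eager-copy variant of this send path, in Linux style, is shown below; the protocol helper proto_enqueue_tx() is hypothetical.

```c
#include <linux/errno.h>
#include <linux/slab.h>
#include <linux/uaccess.h>

long proto_enqueue_tx(void *buf, size_t len);   /* hypothetical */

long proto_send(const void __user *ubuf, size_t len)
{
    void *kbuf = kmalloc(len, GFP_KERNEL);
    if (!kbuf)
        return -ENOMEM;
    /* Eager copy: costs one copy, but the call returns immediately
     * and the caller may reuse its buffer right away. */
    if (copy_from_user(kbuf, ubuf, len)) {
        kfree(kbuf);
        return -EFAULT;
    }
    return proto_enqueue_tx(kbuf, len);
}
```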
Fig. 2. Kernel-level protocol stacks: (a) bottom half based design and (b) device driver based design
On the receiver side, once a new packet comes in from the network controller, the interrupt handler performs urgent processing before passing it to the upper layer. In the bottom half based design, as we have mentioned earlier, the bottom half takes care of interpreting the header and demultiplexing. Some operating systems such as Embedded Linux provide an interface to insert a new bottom half (more precisely, a tasklet in Embedded Linux) without kernel modification. Microkernel-based operating systems such as QNX also allow adding new protocol stacks in a similar manner. In the device driver based design, the bottom half is not used at all: the protocol stacks are implemented in the system call and the interrupt handler. The distribution of work between the system call and the interrupt handler can vary in terms of which does more protocol processing, but usually the interrupt handler does the majority of it. This is because doing demultiplexing at the interrupt handler is more efficient; otherwise, the system call needs to search the incoming packet queue internally, which requires exhaustive searching time and locking overhead between tasks. However, doing more work in the interrupt handler is not desirable because it is supposed to finish its work very quickly. Therefore, this design is valuable when the overhead for protocol processing is low.

As a case study of the kernel-level design, we have implemented a device driver based protocol called RTDiP (Real-Time Direct Protocol) in Embedded Linux over Ethernet [3, 4]. RTDiP is a new transport protocol that can provide priority-aware communication, communication semantics for synchronization, and low communication overhead. In the synchronous semantics, the communication protocols do not queue the packets but keep only the last packet received, which is suitable for distributed synchronization over relatively small-area embedded networks. The performance results show that RTDiP achieves a 48 us one-way latency with an 8-byte message and provides better overhead prediction. We are currently implementing RTDiP over Controller Area Network (CAN) as well. In addition, we plan to implement it in another embedded operating system such as QNX.
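For illustration, a tasklet-based receive path could be hooked in roughly as follows, using the Linux tasklet API of that era; the packet queue and driver helpers (struct pkt, rx_queue_*, proto_demux, read_packet_from_nic) are hypothetical.

```c
#include <linux/interrupt.h>

struct pkt;
struct pkt *rx_queue_pop(void);
void rx_queue_push(struct pkt *p);
void proto_demux(struct pkt *p);
struct pkt *read_packet_from_nic(void *dev);

/* Bottom half: header parsing and demultiplexing run here, outside
 * interrupt context. */
static void proto_rx_bh(unsigned long data)
{
    struct pkt *p;
    while ((p = rx_queue_pop()) != NULL)
        proto_demux(p);
}
static DECLARE_TASKLET(proto_tasklet, proto_rx_bh, 0);

/* Interrupt handler: only the urgent part, then defer to the tasklet. */
static irqreturn_t proto_isr(int irq, void *dev)
{
    rx_queue_push(read_packet_from_nic(dev));
    tasklet_schedule(&proto_tasklet);
    return IRQ_HANDLED;
}
```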
3 Verification Methodologies

Protocol verification [5] is an activity to assure the correctness of network communication protocols. The design alternatives we have studied in Section 2 should be verified
thoroughly before proceeding to the implementation. Formal verification is known as a prominent but costly technique. This section introduces formal verification techniques for verifying network protocol stacks. We briefly overview formal verification techniques and then review the techniques from the aspect of network protocol stack verification. We then share our experience of verifying the protocol stacks of a system air conditioning system.

3.1 Formal Verification

Formal verification and formal specification are together called formal methods [6]. Formal specification [7] is a technique for specifying the system on the basis of mathematics and logic. It has various techniques and notations, e.g. algebra, logic, tables, graphics and automata. After completing the formal specification, we can apply formal verification techniques to the specification to prove that the system satisfies required properties.

There are two main approaches in formal verification: deductive reasoning and algorithmic verification. Deductive reasoning is a verification methodology using axioms and proof rules to establish the reasoning. Experts construct the proofs by hand, and it usually requires great expertise in mathematics and logic. Even though tools called theorem provers have been developed to provide a certain degree of automation, its inherent characteristics make it difficult to use widely for verifying recent network protocol stacks. The second methodology is algorithmic verification, usually called model checking [8]. Model checking is a technique for verifying finite state systems through exhaustively searching the entire state space to check whether a specified correctness condition is satisfied. It is carried out automatically with almost no expert intervention, but is restricted to the verification of finite state systems. Deductive reasoning, on the other hand, has no such limitations. With respect to protocol verification, the latter - model checking - is more efficient and cost-effective than the former - theorem proving. The former's main drawback, requiring considerable expertise, makes model checking techniques better suited for protocol stack verification. Indeed, as the performance of model checking techniques has increased rapidly, they can do various verifications more efficiently than when model checking was first proposed.

3.2 Formal Verification Techniques for Network Protocol Stacks

The formal verification techniques for network protocol stacks fall into several categories. General-purpose model checkers such as Cadence SMV [9] and SPIN [10] can verify protocols efficiently. General-purpose proof tools which are not model checkers but conduct formal verification, such as UPPAAL [11], are useful too. We can also use specialized protocol analysis tools (e.g., Meadows' NRL [12] and Cohen's TAPS [13]).

Formal specification should be prepared before conducting formal verification. Finite State Machine (FSM) based formal specification techniques have been widely used for specifying network protocols and stacks. An FSM mainly consists of a set of transition rules. In the traditional FSM model, the environment of the FSM consists of two finite and disjoint sets of signals, input signals and output signals. A number of papers
246
H.-W. Jin and J. Yoo
using FSM based formal specification have been reported. Especially, network protocols can be well specified using communicating FSM or extended FSM as reported in [14, 15]. With respect to the formal verification of network protocol stacks, we have to consider two tasks: specification of protocol stacks and modeling of the system implementing the protocol. In the first step, we have to model the protocol algorithm and stack hierarchy using a formal specification method. Then the modeling of the whole embedded network system which includes the implementations of the protocol stacks can proceed. Therefore, verifying the network protocol stacks requires not only the formal specification for the protocol stacks but also the encompassing environment where the protocol stacks are implemented and used. Formal verification for network protocol stacks totally depends on the formal specification developed beforehand. If we use FSM-based formal specifications (e.g., Statecharts [16] and Specification and Description Language (SDL) [17]), most general-purpose model checkers are available. In case that exact timing constraints should be preserved, timed automata based formal specification like UPPAAL is a good choice. We can also use specialized protocol verification tools, but it is not easy to model the whole system with them. Therefore, the combination of FSM based formal specification and general-purpose model checking tools will be more effective than others. 3.3 SDL-Based Verification of Protocol Stacks SDL is a formal specification language and tool suite widely used to model the system which consists of a number of independently communicating subsystems. The SDL specification can be translated into FSM forms, and then used as an input for general-purpose model checkers such as SMV and SPIN. Figure 3 describes the architecture of system air conditioning system. We performed the formal verification of the
Fig. 3. The architecture of system air conditioning system
Exploring the Design Space for Network Protocol Stacks
247
network protocol between distributed controllers called DMS (Distributed Management System) and a personal controller called MFC (Multi-Function Controller). A DMS controls all indoor air conditioners, outdoor compressors and network routers under its control. An MFC is a touch-screen based personal controller like PDA. In our experience, special-purpose embedded network system such as the above can be well specified with SDL and verified formally through general-purpose model checkers such as SPIN. We implemented an automatic translator from SDL into PROMELA, SPIN’s input program language, and conducted SPIN model checking. We verified several properties, categorized as feasibility test, responsiveness, environmental assumptions and consistency checking. In addition to the SPIN model checking, the SDL tool has its own validation tool, which checking syntax errors and completeness of the specification.
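To make the transition-rule view of FSM-based specification concrete, the following small sketch expresses a toy connection protocol as an explicit table of (state, input) to state rules, the form into which SDL or Statecharts models are ultimately translated before model checking. The protocol, states, and signals are invented for illustration:

```c
#include <stdio.h>

/* A toy connection protocol specified as an FSM: states, input
 * signals, and a table of transition rules (state, input) -> state. */
enum state { CLOSED, SYN_SENT, ESTABLISHED };
enum event { SND_SYN, RCV_ACK, RCV_RST };

struct rule { enum state from; enum event in; enum state to; };

static const struct rule rules[] = {
    { CLOSED,      SND_SYN, SYN_SENT    },
    { SYN_SENT,    RCV_ACK, ESTABLISHED },
    { SYN_SENT,    RCV_RST, CLOSED      },
    { ESTABLISHED, RCV_RST, CLOSED      },
};

static enum state step(enum state s, enum event in)
{
    for (size_t i = 0; i < sizeof rules / sizeof rules[0]; i++)
        if (rules[i].from == s && rules[i].in == in)
            return rules[i].to;
    return s;  /* unspecified (state, input) pairs are self-loops */
}

int main(void)
{
    enum state s = CLOSED;
    enum event trace[] = { SND_SYN, RCV_ACK, RCV_RST };

    for (size_t i = 0; i < sizeof trace / sizeof trace[0]; i++)
        s = step(s, trace[i]);
    printf("final state: %d\n", s);   /* expect CLOSED (0) */
    return 0;
}
```

A model checker such as SPIN explores all interleavings of such transition systems exhaustively; this sketch merely executes one input trace to show the transition-rule structure.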
4 Network Interoperability

Since various network interconnects can be utilized in a distributed embedded system, network interoperability is a critical requirement in such systems. For example, in modern automobile systems, several network interconnects such as CAN, LIN, FlexRay, and MOST are widely used in an integrated manner. In such systems, we need a gateway for interoperation between the different networks [2, 21, 22], similar to bridges or routers on the Internet. The gateway thus needs to understand several network protocols and convert one into another. In this section, we explore the design alternatives for embedded network gateways. In particular, we classify the gateway designs into two categories based on how the operating system on the gateway is organized.

4.1 Single OS Based Gateway

In this design alternative, the gateway architecture has a single Micro-Controller Unit (MCU), or multiple homogeneous MCUs, running a single operating system image. The MCU can include the network controllers for the different network interconnects supported by the gateway, or it can be connected to network controllers on the same board through buses such as the Serial Peripheral Interface (SPI), the Inter-Integrated Circuit (I2C) bus, etc. The protocol stacks can be designed and implemented as any of the design choices described in Section 2, but one layer of the protocol stacks is required to perform the gateway functions. If the network layer performs the gateway functions, it can be transparent to the networking processes running on the embedded nodes. The network protocols for embedded systems, however, usually have no strict distinction between the network and transport layers, because their network layers are not supposed to host arbitrary transport layers the way the Internet Protocol (IP) layer does. In addition, even if the gateway performs the protocol conversion at the network layer, in many cases it is hard to preserve the end-to-end communication semantics due to significant differences between the transport protocols of embedded networks.
Fig. 4. Single OS based gateway: (a) transport layer based design and (b) global addressing layer based design
Another solution is to introduce a gateway module at the transport layer, as shown in Figure 4(a). In this design alternative, the gateway must manage protocol conversion tables that map between the message headers of both the network and transport layers of the different networks. Since the transport layer translates the protocols internally, legacy applications do not need any modifications. A drawback of this design is its limited scalability. The number of possible header patterns can be numerous in some embedded systems, which can result in memory shortage on the gateway node. Therefore, this design is useful only when the number of entries in the protocol conversion tables is predictable. Fortunately, in many embedded systems, we can determine at the design phase the number of embedded devices that need to collaborate (i.e., communicate) with each other across different networks.
We can also add a new layer on top of the transport layer, as shown in Figure 4(b). The new layer defines global addressing and the APIs through which applications access the layer. If the gateway uses global addressing, the networking processes on every embedded node have to be aware of it. Thus the applications need to be modified, but once this is done they can run transparently on any network in which the additional layer is inserted. In this case, the gateway only has to manage the routing table, and thus the scalability in terms of memory space can be better than in the previous design. However, if most of the embedded nodes perform intra-network communication, the overhead of the additional layering can harm performance. Therefore, the choice among the design alternatives described here can vary based on the system requirements and characteristics.
As a case study of a single OS gateway, we have implemented a gateway between MOST and CAN networks based on the transport layer based design [2]. In this case study, we utilize the MOST Network Service implemented in Section 2.1. The communication semantics of MOST control messages are very different from the traditional send/receive semantics: a MOST control message invokes a function on a MOST device. CAN does not provide such communication semantics, while it provides multicast-like communication semantics that are not in the MOST Network Service. Thus, simple message forwarding with address translation at the network layer does not work. To provide transparent conversion of communication semantics we have suggested a gateway module. In addition, we have implemented the protocol conversion table and defined some entries for performance measurement. The performance results show that the suggested design hardly adds additional
overhead, which is about 15% of the pure communication overhead, and can deliver control messages very efficiently.

4.2 Multi-OS Based Gateway

Since the embedded nodes on different networks can have different requirements, the desirable operating systems can vary. For example, the automobile gateway node can have many kinds of peripheral interfaces, such as USB and wireless network interfaces, for supporting infotainment applications over MOST. Therefore, an operating system with a rich set of device drivers, such as Embedded Linux, is highly desirable. On the other hand, the electronic units such as chassis, powertrain, and body controllers connected to CAN or LIN must guarantee real-time requirements, and thus an RTOS is desirable. Since the gateway needs to meet such varied requirements, we can consider running multiple operating systems on the gateway node. The address translation issue discussed in Section 4.1 still applies in a similar manner in this design alternative. However, an efficient scheme for communication between the operating systems has to be taken into account.
A gateway node can be equipped with multiple heterogeneous MCUs that have different network controllers, as shown in Figure 5(a). Each MCU can run its own operating system that satisfies the requirements of the networks it is responsible for. The MCUs on the gateway node can collaborate by communicating with each other through a bus or a shared memory module. Since a single MCU may not have all the network controllers required for a specific embedded system, several MCUs can be needed, which makes the connection architecture between the MCUs very complicated. Thus the architecture based on multiple MCUs can be applied only in limited cases.
Another approach is to exploit virtualization technology, which allows running several operating systems on the same MCU, as shown in Figure 5(b). Virtualization technology can prevent a system fault from propagating to other operating systems and provide service assurance. In addition, state-of-the-art virtualization technologies enable low-overhead virtualization and better resource scheduling, which lead to high scalability. In addition to the existing optimization technologies, a lighter I/O virtualization can be suggested, because the network controllers on the gateway
Fig. 5. Multi-OS based gateway: (a) multiple MCUs based design and (b) system virtualization based design
node may not be shared between operating systems. An important issue is how efficiently the operating system domains can communicate with each other. In general, the portion of inter-domain communication on a virtualized node is not dominant compared with inter-node communication. On the gateway node, however, many of the network messages cause inter-domain communication, because they are supposed to be forwarded to another network interface that another operating system domain may take care of.
As a case study of a gateway with multiple operating systems, we are implementing a MOST-CAN gateway using the virtualization technology provided by Adeos [23]. Adeos provides a flexible environment for sharing hardware resources among multiple operating systems by forwarding hardware events to the appropriate operating system domain. We run Linux and Xenomai [24], a parasitic operating system to Linux, over Adeos. The Linux operating system takes charge of the MOST interface while Xenomai handles the CAN interface. The gateway processes run on each operating system and communicate with each other through the inter-domain communication interface provided by Xenomai. Since the protocol stacks for MOST and CAN run on different operating systems, we perform the protocol conversion above the transport layer, but we do not use global addressing. Instead, we define a protocol conversion table that maps network connections over the different networks.
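To illustrate the protocol conversion tables used in both case studies, the following sketch shows one possible shape of a statically sized MOST-to-CAN mapping with a simple lookup. The header fields, entry values, and function names are our own assumptions, not the actual gateway implementation:

```c
#include <stdint.h>
#include <stddef.h>

/* One entry of a protocol conversion table that maps a MOST control
 * message header to a CAN identifier. The number of entries is fixed
 * at design time, as discussed in Section 4.1. */
struct conv_entry {
    uint16_t most_dev;   /* MOST device address          */
    uint8_t  most_fblk;  /* MOST function block          */
    uint16_t most_func;  /* MOST function id             */
    uint32_t can_id;     /* corresponding CAN identifier */
};

static const struct conv_entry conv_table[] = {
    { 0x0100, 0x24, 0x0010, 0x123 },   /* illustrative entries */
    { 0x0100, 0x24, 0x0011, 0x124 },
};

/* Look up the CAN id for an incoming MOST control message;
 * returns 0 if the message is not meant to cross the gateway. */
static uint32_t most_to_can(uint16_t dev, uint8_t fblk, uint16_t func)
{
    for (size_t i = 0; i < sizeof conv_table / sizeof conv_table[0]; i++)
        if (conv_table[i].most_dev == dev &&
            conv_table[i].most_fblk == fblk &&
            conv_table[i].most_func == func)
            return conv_table[i].can_id;
    return 0;
}
```

Because the table is bounded and known at design time, the lookup cost and memory footprint stay predictable, which is the property the transport layer based design depends on.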
5 Conclusions

In this paper, we have explored the design space of network protocol stacks for special-purpose embedded systems. We have surveyed several design choices carefully so that the best design can be chosen for a given network with respect to performance, portability, complexity, and flexibility. More precisely, we have discussed design alternatives for implementing new network protocol stacks over embedded operating systems, methodologies for verifying the network protocols, and designs for network gateways. Moreover, we have performed case studies for the design alternatives and methodologies.

Acknowledgments. This work was partly supported by grants #NIPA-2009-C1090-0902-0026 and #NIPA-2009-C1090-0903-0004 by the MKE (Ministry of Knowledge Economy) under the ITRC (Information Technology Research Center) support program supervised by the NIPA (National IT Industry Promotion Agency), and by grant #R33-2008-000-10068-0 by MEST (Ministry of Education, Science and Technology) under the WCU (World Class University) support program.
References

1. MOST Cooperation: MOST Specification, Rev 3.0 (2008)
2. Lee, M.-Y., Chung, S.-M., Jin, H.-W.: Automotive Network Gateway to Control Electronic Units through MOST Network (2009) (under review)
3. Lee, S.-H., Jin, H.-W.: Real-Time Communication Support for Embedded Linux over Ethernet. In: International Conference on Embedded Systems and Applications (ESA 2008), pp. 239–245 (2008)
4. Lee, S.-H., Jin, H.-W.: Communication Primitives for Real-Time Distributed Synchronization over Small Area Networks. In: IEEE International Symposium on Object/component/service-oriented Real-Time distributed Computing (ISORC 2009), pp. 206–210 (2009)
5. Palmer, J.W., Sabnani, K.: A Survey of Protocol Verification Techniques. In: Military Communications Conference - Communications-Computers, pp. 1.5.1–1.5.5 (1986)
6. Peled, D.: Software Reliability Methods. Springer, Heidelberg (2001)
7. Wing, J.M.: A Specifier's Introduction to Formal Methods. IEEE Computer 23(9) (1990)
8. Clarke, E., Grumberg, O., Peled, D.: Model Checking. MIT Press, Cambridge (1999)
9. SMV, http://w2.cadence.com/webforms/cbl_software/index.aspx
10. SPIN, http://spinroot.com/spin/whatispin.html
11. UPPAAL, http://www.uppaal.com/
12. Meadows, C.: Analysis of the Internet Key Exchange Protocol using the NRL Protocol Analyzer. In: SSP 1999, pp. 216–231 (1999)
13. Cohen, E.: TAPS: A First-Order Verifier for Cryptographic Protocols. In: 13th IEEE Computer Security Foundations Workshop, pp. 144–158 (2000)
14. Aggarwal, S., Kurshan, R.P., Sabnani, K.: A Calculus for Protocol Specification and Verification. In: Int. Workshop on Protocol Specification, Testing and Verification (1983)
15. Sabnani, K., Wolper, P., Lapone, A.: An Algorithmic Procedure for Protocol Verification. In: Globecom (1985)
16. Harel, D.: Statecharts: A Visual Formalism for Complex Systems. Science of Computer Programming 8, 231–274 (1987)
17. SDL, http://www.telelogic.com/products/sdl/index.cfm
18. Wind River, http://windriver.com
19. Labrosse, J.: MicroC/OS-II: The Real-Time Kernel. CMP Books (1998)
20. QNX Software Systems, http://www.qnx.com
21. Hergenhan, A., Heiser, G.: Operating Systems Technology for Converged ECUs. In: 7th Embedded Security in Cars Conference (2008)
22. Obermaisser, R.: Formal Specification of Gateways in Integrated Architectures. In: Brinkschulte, U., Givargis, T., Russo, S. (eds.) SEUS 2008. LNCS, vol. 5287, pp. 34–45. Springer, Heidelberg (2008)
23. Yaghmour, K.: Adaptive Domain Environment for Operating Systems (2001), http://www.opersys.com/adeos
24. Xenomai, http://www.xenomai.org
HiperSense: An Integrated System for Dense Wireless Sensing and Massively Scalable Data Visualization

Pai H. Chou (1,2,4), Chong-Jing Chen (1,2), Stephen F. Jenks (2,3), and Sung-Jin Kim (3)

1 Center for Embedded Computer Systems, University of California, Irvine, CA
2 Electrical Engineering and Computer Science, University of California, Irvine, CA
3 California Institute for Telecommunications and Information Technology, Irvine, CA
4 Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan
Abstract. HiperSense is a system for sensing and data visualization. Its sensing part comprises a heterogeneous wireless sensor network (WSN) enabled by infrastructure support for handoff and bridging. Handoff support enables simple, densely deployed, low-complexity, ultra-compact wireless sensor nodes operating at non-trivial data rates to achieve mobility by connecting to different gateways automatically. Bridging between multiple WSN standards is achieved by creating virtual identities on the gateways. The gateways deliver the collected data over Fast Ethernet for post-processing and visualization. Data visualization is done on HIPerWall, a 200-megapixel display wall consisting of 5 rows by 10 columns of 30-inch displays. The system is designed to minimize complexity on the sensor nodes while retaining high flexibility and high scalability.
1 Introduction
Treating the physical world as part of the cyber infrastructure is no longer just a desirable feature. Cyber-physical systems (CPS) are now the mandate of many national funding agencies worldwide. CPS entails more than merely interfacing with the physical world. The goal is to form synergy between the cyber and the physical worlds by enabling cross-pollination of many more features. A wireless sensor network (WSN) is an example of a cyber-physical interface. A sensor converts a physical signal into a quantity that enables further processing and interpretation by a computing machine. However, it is still mostly an interface, rather than a system in the CPS sense. Most WSNs today lack the cyber part, which would leverage the vast amount of information available on the network to synthesize new views of data in ways never possible before. An example of a system that is one step towards CPS is SensorMap [1], which offers a GoogleEarth-style application augmented with sensor data collected at the corresponding positions. SensorMap provides an abstraction in the form of the sensor-to-map programming interface (API), so that data providers can
leverage the powerful cloud-computing backend without having to re-invent yet another tool for each highly specialized application. However, data visualization can be much more than merely superimposing data on geographical or topological maps by a cloud computing system to be rendered on a personal computer. In fact, the emergence of large, high-resolution displays, high-performance workstations, and high-speed interconnection interfaces gives rise to large display walls as state-of-the-art visualization systems. An example is the HIPerWall, a 200-megapixel tiled screen driven by 25 PowerMac G5s interconnected by two high-speed networks [2, 3]. Such a system has found use in ultra-high-resolution medical imaging and appears to be a prime candidate for the visualization of a wide variety of sensor data as well.
This paper describes work in progress on such a massive-scale sensing and visualization system, called HiperSense. On the sensing side, we aim to further develop scalable infrastructure support, called EcoPlex [4]. It consists of a tiered network, where the upper tier contains the gateways and the lower tier includes the sensor nodes. The gateways support handoff for mobility and bridging of identities for integrating heterogeneous radio and networking standards. On the data visualization side, we feed the data to the HIPerWall, which can then render the data in an interactive form across 50 screens acting as one logical screen. This paper reports on the technologies developed to date and discusses practical issues that we have encountered.
2 Related Work
Several multiple-access protocols that use multiple frequency channels have been proposed for wireless sensor networks [5, 6]. Some have been evaluated only by simulation, while others have been adopted by researchers in the ad-hoc networking domain. Many protocols for WSNs have been implemented on the popular MicaZ or TelosB platforms. Y-MAC [7] is a TDMA-based protocol that schedules the receivers in the neighborhood by assigning available receiving time slots. Its light-weight channel-hopping mechanism can enable multiple pairs of nodes to communicate simultaneously, and it has been implemented on the RETOS operating system running on a TmoteSky-class sensor node. However, a problem with Y-MAC is its relatively low performance due to time synchronization and the overhead of channel-switching time. Based on previous experimental results from another team [8] on the Chipcon CC2420 radio transceiver, which implements the IEEE 802.15.4 MAC as used on MicaZ and TelosB, the channel-switching time is nearly equal to the time it takes to transmit a 32-byte data packet. Therefore, changing to another frequency channel dynamically and frequently can become a damper on system performance.
Le et al. proposed and implemented another multi-channel protocol on the MicaZ [8]. Their protocol design does not require time synchronization among the sensor nodes. They also take the channel-switching time into consideration for sensor nodes that are equipped with only a half-duplex radio transceiver. A
distributed control algorithm is designed to assign accessible frequency channels to each sensor node dynamically so as to minimize the frequency-changing time. The compiled code is around 9.5 KB in ROM and 0.7 KB in RAM on top of TinyOS. Although it is smaller than other solutions, it is still too big to fit in either the Eco or the µPart sensor node, both of which have much smaller RAM and ROM.
After data collection, showing sensing data in a meaningful way in real time is another emerging issue. Several solutions have been proposed to integrate sensing data with a geographic map [9, 10]. SensorMap [1] from Microsoft Research displays sensor data from SenseWeb [11] on a map interface. They are able to provide tools to query sensor nodes and visualize sensing data in real time. Google Maps with traffic can show not only live traffic but also historical traffic by day and time [12]. However, these works currently assume limited screen resolution and have not been scaled to the 200-megapixel resolution of the HIPerWall.
3 System Architecture
Fig. 1 shows the architecture of HiperSense. It consists of HIPerWall as the visualization subsystem and EcoPlex as the sensing infrastructure. This section summarizes each subsystem and their interconnection.

Fig. 1. The HiperSense architecture: 50 tiled displays (5 rows × 10 columns) driven by 25 PowerMac G5s connected by Myrinet and Gigabit Ethernet; a front-end node; and EcoPlex, whose gateways for ZigBee and Eco nodes (with Fast Ethernet uplinks) serve ZigBee meshes and Eco nodes

3.1 HIPerWall
HIPerWall is a tiled display system for interactive visualization, as shown in Fig. 2(d). The version we use consists of fifty 30-inch LCD monitors arranged in five rows by ten columns. Each monitor has a pixel count of 2560 × 1600 at 100 dots per inch of resolution, and therefore the entire HIPerWall has a total resolution of 204.8 million pixels. The tiled display system is driven by 25 PowerMac G5 workstations interconnected by two networks to form a high-performance computing cluster. One network uses Myrinet for very high-speed data transfer, and the other network uses Gigabit Ethernet for control. The HIPerWall software is portable to other processors and operating systems, and it can be configured for a wide variety of screen dimensions.

Fig. 2. Components of HiperSense: (a) Eco node; (b) base station; (c) EZ-Gate; (d) HIPerWall

A user controls the entire HIPerWall from a separate computer called the front-end node. It contains a reduced view of the entire display wall, enabling the user to manipulate the display across several screens at a time. The front-end node also serves as the interface between the sensing subsystem and the visualization subsystem.

3.2 EcoPlex
EcoPlex is a tiered, infrastructure-based, heterogeneous wireless sensor network system. Details of EcoPlex can be found in another paper [4]; here we highlight its distinguishing features. At the bottom tier are the wireless sensor nodes. The top tier consists of a network of gateway nodes.
Lower Tier Sensor Nodes. EcoPlex currently supports two types of nodes with different communication protocols: ZigBee and Eco. Our platform allows other protocols such as Bluetooth and Z-Wave to be bridged without any inherent difficulty.
ZigBee is a wireless networking protocol standard primarily targeting low-duty-cycle wireless sensing applications [13]. In recent years, it has also been targeting the home automation domain. ZigBee is a network protocol that supports ad hoc mesh networking, although it also defines roles for not only end devices but also
routers and coordinators. ZigBee is built on top of the IEEE 802.15.4 media access control (MAC) layer, which is based on carrier-sense multiple access with collision avoidance (CSMA/CA). Currently, many wireless sensor applications are built on top of 802.15.4, though not necessarily with the ZigBee protocol stack, since the latter occupies about 64–96 KB of program memory.
Another type of wireless sensor node supported in EcoPlex is Eco [14, 15], our own ultra-compact wireless sensing platform, shown in Fig. 2(a). It is 1 cm³ in volume including the MCU, RF, antenna, and sensor devices. It contains 4 KB RAM and 4 KB EEPROM. The radio is based on Nordic VLSI's ShockBurst protocol at 1 Mbps, a predecessor of Wibree, also known as Bluetooth Low Energy Technology, a subset of Bluetooth 3.0 [16]. Eco is possibly the world's smallest self-contained, programmable, expandable platform to date. A flex-PCB connector enables an Eco node to be connected to other I/O devices and power. Eco is meant to complement ZigBee in that Eco nodes can be made much smaller and cheaper than ZigBee ones, and thus they can be deployed where ZigBee nodes cannot, especially in some highly wearable or size-constrained applications.
Upper Tier: Gateways. The upper tier of EcoPlex consists of a network of gateway nodes called EZ-Gates. An EZ-Gate is essentially a Fast Ethernet router based on an ARM-9-core network processor running Linux 2.6. It is augmented with the radio transceivers needed to support the protocols used by the wireless sensor nodes; in this case, one ZigBee transceiver and two Eco transceivers are added to each EZ-Gate. For ZigBee support, the EZ-Gate implements the protocol stack of a ZigBee coordinator. Since Eco is more resource-constrained and can afford to implement only a much simpler protocol, the gateway provides relatively more support in the form of handoff and virtual identity.
Eco nodes connect to the gateways and not to each other, the same way cellular phones connect to base stations but not to each other. Just as cellular towers perform handovers based on the proximity of the mobile, our EZ-Gates perform handoffs based on the link quality of the Eco nodes. Unlike cell phones, which are treated as independently operated units, EcoPlex supports the concept of clusters, which are groups of wireless sensor nodes that work together, are physically close to each other, and move as a group [17]. Instead of performing handoff for each node individually, cluster handoff relies on one node as a representative for the entire cluster and has been shown to be effective especially in dense deployments.
EZ-Gates also support bridging in the form of virtual identity. That is, for every Eco node connected to EcoPlex, the owner gateway maintains a node identity in the ZigBee space. This way, an Eco node appears just like any other ZigBee node and can communicate logically with other ZigBee nodes, without having to be burdened with the heavy ZigBee stack.
A simpler base station without the handoff and virtual identity support was also developed. It is based on the Freescale DEMO9S12NE64 evaluation board connected to a Nordic nRF24L01 transceiver module, as shown in Fig. 2(b). It has a Fast Ethernet uplink to the front-end node. It was used for the purpose
of developing code between the Ethernet and Eco sides before porting to the EZ-Gate for final deployment.
4 System Integration
HiperSense is more than merely connecting the EcoPlex sensing subsystem with the HIPerWall tiled display system. It entails the design of a communication and execution scheme to support the needs of sensing and visualization. This section first discusses considerations for HiperSense to support CPS-style visualization. Then, we describe the communication scheme for system integration.

4.1 Visualization Styles and Support
Visualization is the graphical rendering of data in a form that helps the user gain new insights into the system from which the data is collected. Unlike many other visualization systems that render only static data that has been collected and stored in advance, HiperSense is designed to support visualization of both static data files and live data streams. More importantly, we envision a visualization system that synthesizes views from static or live data and other sources available on the Internet.
As an example, consider a WSN that collects vibration data from sensor nodes on a pipeline network. The user is not interested in vibration per se but wants to non-invasively measure the propagation speed of objects traveling inside the pipeline based on the peak vibration. In this case, time-history plots of raw vibration data are not so meaningful to the user; instead, the data streams must be compared and processed to derive the velocity. The visualization system then renders the velocity by superimposing color-coded regions over high-resolution images of the pipeline network and its surroundings. Moreover, the user may want the ability to navigate not only spatially, as in GoogleEarth, but also temporally, by seeing how the peak velocity shifts over time.
To support smooth spatial navigation, HIPerWall relies on replication or prefetching of large data (e.g., patches of GoogleEarth photos) from adjacent nodes. The data could be information in the local database system, images, or videos. The data shown on the screens are treated as independent objects and can be zoomed in, zoomed out, or rotated arbitrarily. The front-end node simply sends out commands to every computing node to indicate which object should be displayed, the position of the object, and the other properties that control the object. This mechanism reduces the traffic between the front-end node and the cluster of computing nodes.
To support real-time access to data, supervisory control and data acquisition (SCADA) systems, which are commonly found in factories handling up to thousands of sensing and actuation channels per site, have used real-time databases for logging and retrieval of data. A similar setup can be built for HiperSense. Historic data can also be retrieved from the real-time database via a uniform interface. However, one difference between a conventional SCADA and
HiperSense is that the former is commonly handled by one or a small number of computers, while the latter relies on a cluster of 25 computers to parallelize the handling of the massive graphical bandwidth.
For the purpose of tiled visualization of live sensor data, we program the front-end node of the HIPerWall to also be a fat client for data collection from the wireless sensor nodes via the gateways in EcoPlex. The front-end node then broadcasts the collected data to the cluster of computing nodes inside HIPerWall. Every computing node decodes the whole data packet but shows only the portion that is visible on the two LCD screens that it controls. This broadcasting mechanism removes the need for time synchronization between the workstations and ensures that all sensing data can be shown on the tiled displays at the same time. If the intranet backbone and the cluster of workstations both support Jumbo Frames [18], we can increase the overall system performance and deploy more wireless sensor nodes at once.
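As a sketch of the filtering step just described, each computing node can decide locally whether a broadcast sample falls on the region of the wall it drives. The packet layout and the assumption that each node drives two vertically adjacent tiles are ours, for illustration only:

```c
#include <stdbool.h>

#define TILE_W 2560   /* pixels per 30-inch display */
#define TILE_H 1600

/* Assumed broadcast payload: a sample with wall coordinates. */
struct sample { int x, y; float value; };

/* Each of the 25 workstations is assumed here to drive two vertically
 * adjacent tiles at (row, col) and (row + 1, col) of the 5 x 10 wall.
 * A node renders a broadcast sample only if it lands on its tiles. */
static bool visible_on_node(const struct sample *s, int row, int col)
{
    int x0 = col * TILE_W;
    int y0 = row * TILE_H;

    return s->x >= x0 && s->x < x0 + TILE_W &&
           s->y >= y0 && s->y < y0 + 2 * TILE_H;
}
```

Because every node receives the same broadcast and applies only this local test, no coordination among the workstations is needed to keep the wall consistent.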
4.2 Protocols for the Tiers
EcoPlex currently supports both ZigBee and Eco as two complementary wireless protocols. The ZigBee standard is designed for sporadic, low-bandwidth communication in an ad hoc mesh network, whereas Eco is capable of high-bandwidth, data-regular communication on ultra-compact hardware in a star network. Of course, it is possible for each platform to implement the other's characteristics, but it would be less efficient. For the purpose of integration with HiperSense, ZigBee does not pose a real problem due to its lack of timing constraints. We therefore concentrate our discussion on the integration of Eco nodes and gateways.
On many wireless sensor nodes, the software complexity is dominated by the protocol stack. The code sizes of the typical protocol stacks of Bluetooth, ZigBee, and Z-Wave are 128 KB, 64–96 KB, and 32 KB, respectively, whereas the main program of a sensing application can be as little as a few kilobytes. Our approach to minimizing complexity on the Eco nodes is to externalize it: we implement only the protocol-handling mechanisms on Eco and move most policies out to the infrastructure. This can be accomplished by making the nodes passive in a thin-server, fat-client organization. That is, the sensor nodes are passive and the host actively pulls data from them. This effectively takes care of arbitration (there is no intra-network collision unless there is a malfunctioning node) and provides effective acknowledgment. The core mechanism can be implemented in under 40 bytes of code. After adding support for multi-gateway handoff and joining, channel switching, and a number of other performance enhancement policies (e.g., amortizing the pulling overhead by returning multiple packets), the code size can still be kept around 2 KB.
Once the reliable communication primitives are in place, we can add another layer of software for dynamic code update and execution [19] on these nodes. Our vision is host-assisted dynamic compilation, where the host or its delegate dynamically generates optimized code to be loaded into the node for execution. This will be much more energy efficient than a general-purpose protocol stack that must anticipate all possibilities. An example is the protocol stack
for network routing. Since our gateway, like many commercially available gateways, runs Linux and has plenty of storage available, the gateways should also be able to run a synthesizer, compiler, and optimizer without difficulty.
The front-end node of the HIPerWall acts as a client to query data from all gateways. Each gateway is connected to the front-end node via a wired interface with higher bandwidth than the wireless interface. At the beginning, all wireless sensor nodes communicate with the gateway via the control channel. The front-end node issues frequency-switching commands to the wireless sensor nodes based on the sampling rate of each wireless sensor node and the available bandwidth of each wireless frequency channel. Later on, the front-end node issues a command packet to the gateway to get data from the wireless sensor nodes controlled by the gateway. The gateway in turn broadcasts the command packet to all Eco sensor nodes on the gateway's own frequency channel. The gateway packs all pulled data together and forwards the data to the front-end node, which in turn broadcasts the data to the real-time databases for visualization. ZigBee nodes push data rather than being pulled. This way, the ZigBee network can coexist with the Eco network by sporadically taking up bandwidth only when necessary.
For HiperSense, the front-end node can resend the command packet to a wireless sensor node if it does not get any reply packet within a certain amount of time. In order to improve system performance, we place the retransmission mechanism inside the gateways instead of at the front-end node. A gateway resends the pulling command packet that it received from the front-end node if it does not receive any reply packet from a wireless sensor node within a pre-defined timeout period.
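The following user-space sketch models the pull-with-retransmission behavior described above: the gateway sends the pulling command, waits for a reply, and resends on timeout. The socket-based framing, timeout value, and retry count are illustrative assumptions rather than the EZ-Gate code:

```c
#include <sys/types.h>
#include <sys/select.h>
#include <sys/socket.h>

#define PULL_TIMEOUT_MS 20   /* assumed pre-defined timeout */
#define MAX_RETRIES     3

/* Pull data from one passive sensor node: send the command packet,
 * wait for the reply, and retransmit on timeout. Retransmission is
 * done here at the gateway rather than at the front-end node. */
ssize_t pull_node(int radio_fd, const void *cmd, size_t cmd_len,
                  void *reply, size_t reply_cap)
{
    for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
        send(radio_fd, cmd, cmd_len, 0);

        fd_set rfds;
        FD_ZERO(&rfds);
        FD_SET(radio_fd, &rfds);
        struct timeval tv = { 0, PULL_TIMEOUT_MS * 1000 };

        if (select(radio_fd + 1, &rfds, NULL, NULL, &tv) > 0)
            return recv(radio_fd, reply, reply_cap, 0);
        /* timeout: fall through and resend the pulling command */
    }
    return -1;  /* node did not answer; report failure upstream */
}
```

Handling retransmission locally keeps the wired link to the front-end node free of retry traffic and keeps the round-trip for a retry on the short wireless hop only.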
5 Evaluation
This section presents experimental results on a preliminary version of HiperSense. The experimental setup consists of 100 Eco nodes (Fig. 3(a)) and two gateways densely deployed in an area of 2 to 16 m². The larger setup is for a miniature-scale water-pipe monitoring system, where nodes measure vibration at different junctions. The gateways are connected to the HIPerWall's front-end node (Fig. 3(b)) via Fast Ethernet. We compare the performance of our system with other works in terms of measured throughput, latency, and code size for different sets of included features.

5.1 Latency and Throughput
Fig. 4 shows the measured aggregate throughput and per-node latency results for one and two gateways over different numbers of reply packets per pull. Returning more packets per pull enables amortization of the pulling overhead, though at the expense of increased latency. The lower and upper curves in each chart show the results for one and two gateways, respectively. In the case of one gateway, the throughput ranges from 6.9 KB/s for one reply packet per pull to 15.8 KB/s for 20 reply packets per pull, though the latency increases linearly from around 100 ms to over 1.5 seconds.
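This amortization behavior is consistent with a simple first-order cost model (our own illustrative formulation, not taken from the measurements): let $t_{\mathrm{cmd}}$ be the fixed per-pull command overhead, $t_{\mathrm{pkt}}$ the time to return one reply packet, and $P$ the payload per packet; then for $n$ reply packets per pull

\[
\mathrm{throughput}(n) = \frac{nP}{t_{\mathrm{cmd}} + n\,t_{\mathrm{pkt}}}, \qquad
\mathrm{latency}(n) \approx t_{\mathrm{cmd}} + n\,t_{\mathrm{pkt}},
\]

so throughput saturates toward $P/t_{\mathrm{pkt}}$ while latency grows roughly linearly in $n$, matching the shapes of the measured curves.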
Fig. 3. Experimental setup: (a) 100 Eco nodes; (b) front-end node and HIPerWall

Fig. 4. Performance results over the number of reply packets per pull (1–20), for one base station with one transceiver and for two base stations with two transceivers: (a) throughput (bytes/s); (b) response time (ms)
By doubling the number of gateways, the aggregate throughput ranges from 13.5 KB/s for one reply packet per pull to 27.9 KB/s for 20 reply packets per pull. In the latter case, the total throughput increases by 81% while the latency increases by only 9%. The data rate is rather low compared to the bandwidth within HIPerWall: with non-overlapping frequencies, EcoPlex can scale up to 50 nodes/gateway × 14 channels/gateway = 6200 nodes with only 0.1% utilization of the bandwidth of Gigabit Ethernet, or 1% of Fast Ethernet.

5.2 Comparison
The closest work to ours in the area of communication protocols for wireless sensor nodes is the Multi-Channel MAC (MCM) protocol proposed by Le et al. from the Cyber Physical Computing Group [8]. Their protocol was built on top of TinyOS v2.x, whose minimum code size is around 11 KB [20]. Depending
Table 1. Comparison between HiperSense for Eco nodes and the Multi-Channel MAC protocol for TinyOS 2 [8]

Code Size            HiperSense               MCM Protocol
MAC layer            31 bytes                 9544 bytes
Runtime support      1.1 KB                   11-20 KB
Dynamic execution    430 bytes                N/A
Total code size      1.56 KB + loaded code    20.5-29.5 KB
on the hardware and software configurations, the compiled code size of TinyOS v2.x could exceed 20 KB [20]. Table 1 shows the required code sizes in ROM after compilation. In HiperSense, the gateways and the front-end node handle most of the protocol policies originally handled by the sensor nodes. This enables the sensor nodes to be kept minimally simple, with only the essential routines. Moreover, the processing time on a sensor node can be shortened, and the firmware footprint in ROM is also minimized. We implemented a dynamic loading/dispatching layer, occupying 430 bytes, which enables the node to dynamically load and execute code fragments that can be highly optimized to the node in its operating context [19]. In contrast, the MCM protocol occupies 9544 bytes on top of TinyOS, which can increase the total code size to 29.5 KB. That is over an order of magnitude larger than our code size.
6 Conclusion and Future Work
This paper reports progress on the HiperSense sensing and visualization system. The sensing aspect is based on EcoPlex, an infrastructure-based, tiered network system that supports heterogeneous wireless protocols for interoperability and handoff for mobility. We keep node complexity low by implementing only the bare-minimum mechanisms and either externalizing the policies to the host side or making them dynamically loadable. The visualization subsystem is based on HIPerWall, a tiled display system capable of rendering 200 megapixels of display data. By feeding the data streams from EcoPlex to the front-end node of the HIPerWall and replicating them among the nodes within HIPerWall, we are making possible a new kind of visualization system. Unlike previous applications that use static data, we can now visualize both live and historic data. Scalability in a dense area was shown with 100 wireless sensor nodes in a 2 m² area. By utilizing all frequency channels, we expect HiperSense to handle 6200 independent streams of data. Applications include crowd tracking, miniature-scale pipeline monitoring, and a wide variety of medical applications.
Future work includes making the protocol more adaptive and power-manageable on the wireless sensor nodes. Dynamic code loading and execution has been implemented but still relies on manual coding, and it is a prime candidate for automatic code synthesis and optimization.
Acknowledgments. The authors would like to thank Seung-Mok Yoo, Jinsik Kim, and Qiang Xie for their assistance with this work on the Eco protocol, and Chung-Yi Ke, Nai-Yuan Ko, Chih-Hsiang Hsueh, and Chih-Hsuan Lee for their work on EcoPlex. The authors also would like to thank Duy Lai for his assistance with HIPerWall. This research project is sponsored in part by the National Science Foundation CAREER Grant CNS-0448668, UC Discovery Grant itl-com05-10154, the National Science Council (Taiwan) Grant NSC 96-2218-E-007-009, and Ministry of Economy (Taiwan) Grant 96-EC-17-A-04-S1-044. HIPerWall was funded through NSF Major Research Instrumentation award number 0421554. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
References

[1] Nath, S., Liu, J., Miller, J., Zhao, F., Santanche, A.: SensorMap: A web site for sensors world-wide. In: SenSys 2006: Proceedings of the 4th International Conference on Embedded Networked Sensor Systems, pp. 373–374. ACM, New York (2006)
[2] Kuester, F., Gaudiot, J., Hutchinson, T., Imam, B., Jenks, S., Potkin, S., Ross, S., Sorooshian, S., Tobias, D., Tromberg, B., Wessel, F., Zender, C.: HIPerWall: A high-performance visualization system for collaborative earth system sciences (2004), http://dust.ess.uci.edu/prp/prp_Kue04.pdf
[3] Jenks, S.: HIPerWall, http://hiperwall.calit2.uci.edu/
[4] Ke, C.Y., Ko, N.Y., Hsueh, C.H., Lee, C.H., Chou, P.H.: EcoPlex: Empowering compact wireless sensor platforms via roaming and interoperability support. In: Proceedings of the Sixth Annual International Conference on Mobile and Ubiquitous Systems: Computing, Networking, and Services (MobiQuitous 2009), Toronto, Canada, July 13-16 (2009)
[5] Wu, S.L., Lin, C.Y., Tseng, Y.C., Sheu, J.P.: A new multi-channel MAC protocol with on-demand channel assignment for multi-hop mobile ad hoc networks. In: ISPAN, p. 232 (2000)
[6] So, H.S.W., Walrand, J., Mo, J.: McMAC: A parallel rendezvous multi-channel MAC protocol. In: Wireless Communications and Networking Conference, March 2007, pp. 334–339 (2007)
[7] Kim, Y., Shin, H., Cha, H.: Y-MAC: An energy-efficient multi-channel MAC protocol for dense wireless sensor networks. In: IPSN 2008: Proceedings of the 2008 International Conference on Information Processing in Sensor Networks, Washington, DC, USA, pp. 53–63. IEEE Computer Society, Los Alamitos (2008)
[8] Le, H.K., Henriksson, D., Abdelzaher, T.: A practical multi-channel media access control protocol for wireless sensor networks. In: IPSN 2008: Proceedings of the 2008 International Conference on Information Processing in Sensor Networks, Washington, DC, USA, pp. 70–81. IEEE Computer Society, Los Alamitos (2008)
[9] Krause, A., Horvitz, E., Kansal, A., Zhao, F.: Toward community sensing. In: IPSN 2008: Proceedings of the 7th International Conference on Information Processing in Sensor Networks, Washington, DC, USA, pp. 481–492. IEEE Computer Society, Los Alamitos (2008)
[10] Ahmad, Y., Nath, S.: COLR-Tree: Communication-efficient spatio-temporal indexing for a sensor data web portal. In: IEEE 24th International Conference on Data Engineering, April 2008, pp. 784–793 (2008)
[11] Grosky, W., Kansal, A., Nath, S., Liu, J., Zhao, F.: SenseWeb: An infrastructure for shared sensing. IEEE Multimedia 14(4), 8–13 (2007)
[12] Google Traffic, http://maps.google.com/
[13] ZigBee Alliance, http://www.zigbee.org/
[14] Park, C., Chou, P.H.: Eco: Ultra-wearable and expandable wireless sensor platform. In: Third International Workshop on Body Sensor Networks (BSN 2006) (April 2006)
[15] Ecomote, http://www.ecomote.net/
[16] Bluetooth Low Energy Technology, http://www.bluetooth.com/Bluetooth/Products/low_energy.htm
[17] Lee, C.H.: EcoFlock: A clustered handoff scheme for ultra-compact wireless sensor platforms in EcoPlex network. Master's thesis, National Tsing Hua University (2009)
[18] Jumbo Frame, http://en.wikipedia.org/wiki/Jumbo_frame
[19] Hsueh, C.H.: EcoExec: A highly interactive execution framework for ultra compact wireless sensor nodes. Master's thesis, National Tsing Hua University (2009)
[20] Cha, H., Choi, S., Jung, I., Kim, H., Shin, H., Yoo, J., Yoon, C.: RETOS: Resilient, expandable, and threaded operating system for wireless sensor networks. In: IPSN 2007: Proceedings of the 6th International Conference on Information Processing in Sensor Networks, pp. 148–157. ACM, New York (2007)
Applying Architectural Hybridization in Networked Embedded Systems

Antonio Casimiro, Jose Rufino, Luis Marques, Mario Calha, and Paulo Verissimo

FC/UL (Faculdade de Ciências da Universidade de Lisboa)
{casim,ruf,lmarques,mjc,pjv}@di.fc.ul.pt
Navigators Home Page: http://www.navigators.di.fc.ul.pt. This work was partially supported by FCT through the Multiannual Funding and the CMU-Portugal Programs.

Abstract. Building distributed embedded systems in wireless and mobile environments is more challenging than doing so on fixed network infrastructures. One of the main issues is the increased uncertainty and lack of reliability caused by interference and fading in the communication, dynamic topologies, and so on. When predictability is an important requirement, the uncertainties created by wireless networks become a major concern. The problem may be even more stringent if safety-critical requirements are also involved. In this paper we discuss the use of hybrid models and architectural hybridization as one of the possible alternatives for dealing with the intrinsic uncertainties of wireless and mobile environments in the design of distributed embedded systems. In particular, we consider the case of safety-critical applications in the automotive domain, which must always operate correctly in spite of the existing uncertainties. We provide guidelines and a generic architecture for the development of these applications in the considered hybrid systems. We also address interface issues and describe a programming model that is "hybridization-aware". Finally, we illustrate the ideas and the approach presented in the paper using a practical application example.

1 Introduction

Over the last decade we have witnessed an explosive use of wireless technologies to support various kinds of applications. Unfortunately, when considering real-time systems, or systems that have at least some properties whose correctness depends on timely and reliable communication, the communication delay uncertainty and unreliability characteristic of wireless networks become a problem. It is not possible to ignore uncertainty and simply wait until a message arrives, hoping it will arrive soon enough.
Our approach to address this problem is to consider a hybrid system model, in which one part of the system is asynchronous, namely the part that encompasses the wireless networks and the related computational subsystems, and another part is always timely, with well-defined interfaces to the asynchronous
subsystem. In this paper we discuss applicability aspects of this hybrid model, considering in particular safety-critical applications in the automotive domain.
In vehicles, safety-critical functions related to automatic speed and steering control are implemented using real-time design approaches, with dedicated controllers that are connected to car sensors and actuators through predictable networks. Despite all the advances in wireless communication technologies, relying on wireless networks to collect information from external sources and using this information in the safety-critical control processes seems too risky. We argue that this may be possible if a hybrid system model and architecture are used. The advantage is the following: with the additional information it may be possible to improve some quality parameters of the control functions, possibly optimizing speed curves and fuel consumption or even improving the overall safety parameters.
One fundamental aspect of making the approach viable is to devise appropriate interfaces between the different parts of the architecture. On the other hand, special care must be taken when programming safety-critical applications, as we illustrate by providing the general principles of a "hybridization-aware" programming model. The presented ideas and principles have been explored in the HIDENETS European project [9], in which a proof-of-concept prototype platooning application was developed. We use this example to briefly exemplify the kind of benefits that may be achieved when using a hybrid system and architecture to build a networked embedded system.
The paper is structured as follows. Some related work is addressed in the next section. Then, Section 3 motivates the idea of using hybrid distributed system models and highlights their main advantages. In Section 4 we discuss the applicability of the model in the automotive context, and in Section 5 we address interface issues and introduce the hybridization-aware programming model. The platooning example is then provided in Section 6, and we end the paper with some conclusions and future prospects.
2 Related Work
The availability of varied and increasingly better technologies for wireless communication explains the pervasiveness of these networks in our everyday life. In the area of vehicular applications, new standards like the one being developed by the IEEE 802.11p Task Group for Wireless Access in Vehicular Environments (WAVE) will probably become the basis of many future applications. The 802.11p standard provides a set of seven different logical communication channels, among which one is a special dedicated control channel that specifically aims at allowing some more critical vehicular applications to be developed [1]. In fact, improving the baseline technologies and standards is one way to make it possible to implement safety-critical systems that operate over wireless networks. And there is a large body of research concerned with studying and proposing solutions to deal with the reliability and temporal uncertainties of wireless communication.
One line of research consists in devising new protocols at the MAC level using specific support at the physical level (e.g., Dedicated Short-Range Communications, DSRC) [17] or adopting decentralized coordination techniques, such as rotating tokens [6]. In fact, the possibility of characterizing communication delays with a reasonable degree of confidence is sufficient for a number of applications that provide safety-related information to the driver, for instance to avoid collisions [5,14]. However, these applications are not meant to autonomously control the vehicles, and therefore the involved criticality levels are only moderate. In general, and in spite of all improvements, we are still a few steps away from ensuring the levels of reliability and timeliness that are required for the most critical systems.
A recent approach that indeed aims at dealing with safety requirements and allowing autonomous control of vehicles in wireless and mobile environments is proposed in [3]. The approach relies on the cooperation and coordination between the involved entities, and defines a coordination model that builds on a real-time communication model designated the Space Elastic model [2]. The Space Elastic model is defined to represent the temporal uncertainties associated with real wireless communication environments. The work presented in [13] also addresses the coordination of automated vehicles in platoons. In particular, it focuses on the feasibility of coordination scenarios where vehicles are able to communicate with their closest neighbors. The authors argue that in these scenarios communication between a vehicle and its leader/follower is possible, as supported by the simulation results presented in [7]. In contrast with these works, we consider a hybrid system model, which accommodates both the asynchrony of wireless environments and the predictable behavior of the embedded control systems and local networks.
In the area of wireless sensor networks, efforts have also been made in devising architectures and protocols to address the temporal requirements of applications. One of the first examples is the RAP architecture [11], which defines query and event services associated with new network scheduling policies, with the objective of lowering deadline miss ratios. More recent examples include VigilNet, for real-time target tracking [8] in large-scale sensor networks, and TBDS [10], a mechanism for node synchronization in cluster-tree wireless sensor networks. Our focus is at a higher conceptual level, abstracting from the specific protocols, network topologies, and wireless technologies that are used.
3 Hybrid System Models
Classical distributed system models range from purely asynchronous to fully synchronous, and assume different failure models, from crash to Byzantine. But independently of the particular synchrony or failure model that is assumed, they are typically homogeneous, meaning that the assumed properties apply to the entire system and do not change over time. However, in many real systems and environments, we observe that synchrony or failure modes are not homogeneous: they vary with time or with the part of the system being considered.
Therefore, in the last few years we have been exploring the possibility of using hybrid distributed system models, in which different parts of the system have different sets of properties (e.g., synchronism [16] or security [4]). Using hybrid models has a number of advantages compared to approaches based on homogeneous models. The main advantages include more expressiveness with respect to reality, the provision of a sound theoretical basis for crystal-clear proofs of correctness, the possibility of being naturally supported by hybrid architectures, and, finally, the possibility of enabling concepts for building totally new algorithms.
One example of a hybrid distributed system model is the Wormholes model [15]. In essence, this model describes systems in which it is possible to identify a subsystem that presents exceptional properties, allowing fundamental limitations of the overall system to be overcome. For instance, a distributed system in which nodes are connected by a regular asynchronous network, but in which there is also a separate real-time network connecting a synchronization subsystem in each node, can be well described by the Wormholes model. Another very simple example, in which the wormhole subsystem is only local to a node, is a system equipped with a watchdog. Despite the possible asynchrony of the overall system, the watchdog is synchronous and always resets the system in a timely manner whenever necessary.
We must note that designing systems based on the Wormholes model is not just a matter of assuming that uncertainty is not ubiquitous or does not last forever. The design philosophy also builds on the principle that predictability must be achieved in a proactive manner; that is, the system must be built in order to make predictability happen at the right time and in the right place.
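The watchdog example can be sketched in a few lines: the application runs in the asynchronous part of the system, while the watchdog plays the role of a minimal, always-timely wormhole. The code below assumes a Linux-style /dev/watchdog device and is illustrative only:

```c
#include <fcntl.h>
#include <unistd.h>

/* Stand-in for the application's (possibly untimely) payload work. */
static void do_payload_work(void)
{
    usleep(100000);
}

/* A minimal local wormhole: the application belongs to the
 * asynchronous part, while a synchronous hardware watchdog
 * guarantees a timely reset if the loop ever stalls. Once opened,
 * the device must be "kicked" periodically. */
int main(void)
{
    int wd = open("/dev/watchdog", O_WRONLY);
    if (wd < 0)
        return 1;

    for (;;) {
        do_payload_work();
        (void)write(wd, "\0", 1);   /* kick the timely subsystem */
    }
}
```

The point of the sketch is the division of responsibilities: the timeliness guarantee lives entirely in the small synchronous component, not in the application code.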
4 Application in Automotive Context
The wormhole concept can in fact be instantiated in different ways, and here we discuss the possible application of the concept to car systems. Therefore, we first provide an overview of the system components that are found in modern cars, and then we explain how a hybrid architecture can be projected over these systems.
4.1 In-Vehicle Components
Modern cars include a wide set of functions performed by electronics and microcontrollers, complementing and in many cases totally replacing the traditional mechanical and/or hydraulic mechanisms. These functions include both hardware and software components and are usually structured around Electronic Control Units (ECUs), using the terminology of the automotive industry, which are subsystems composed of a microcontroller complemented with an appropriate set of sensors and actuators. The functions being replaced by these components are quite diverse and aim to assist the driver in the control of the vehicle. They range from critical functions associated with the control of the power train (e.g., engine and transmission related
functions), traction (e.g., driving torque), steering or braking, to less critical ones that control the different devices in the body of the vehicle, such as lights, wipers, doors and windows, seats and climate systems, just to name a few. Recently, a slightly different set of functions has also been incorporated, related to information, communication and entertainment (e.g., navigation systems, radio, audio and video, multimedia, integrated cellular phones, etc.). The implementation of these functions is supported by specialized ECUs. However, many of these functions span the car infrastructure, and thus need to be distributed over several ECUs that exchange information and communicate through in-vehicle networking. Furthermore, it may be necessary to exchange information between ECUs implementing different functions. For example, the vehicle speed obtained from a wheel rotation sensor may be required for gearbox control or for the control of an active suspension subsystem, but it may also be useful for other subsystems. Given that the different functional domains have different requirements in terms of safety, timeliness and performance guarantees, the interconnection of the different ECUs is structured along several networks, classified according to their available bandwidth and function. There are four classes of operation, including one (Class C) with strict requirements in terms of dependability and timeliness, and another (Class D) for high-speed data exchanges such as those required by mobile multimedia applications. The combination of the functions typically provided by each one of those four networking classes involves network interconnection through gateways, as illustrated in Figure 1.
Fig. 1. Typical In-Vehicle Networking
The in-vehicle ECUs provide support for the different functions implemented in today's cars. Each ECU is composed of a computing platform on which the ECU software is executed. The computing platform is typically complemented with some specific hardware: a set of sensors for gathering data from the system under control, and a set of actuators that allow it to act on the given car subsystem. The support of drive-by-wire functions, integrating a set of sensors (e.g., a proximity sensor) and actuators (e.g., speed and brake control), is just one example with relevance for the platooning application to which we refer in Section 6. Other ECUs may exhibit a slightly different architecture because they are intended to support different functions. One example is illustrated in Figure 2,
Fig. 2. Example of In-Vehicle Infotainment Functions
intended to support the integration of infotainment functions. In this case, the architecture of the computing platform is designed to interface with and integrate the operation of multiple gadgets (radio, cellular phone) and technologies.
4.2 Architectural Hybridization in Vehicles
Given the description provided above, it is clear that there is a separation between what may be called a general computing platform, able to run general-purpose local and distributed applications connected through wireless networks, and embedded systems dedicated to the execution of specific car functions. Interestingly, there exist gateways between these different subsystems, which allow information to flow across their boundaries. For example, the information provided by a proximity sensor in the car electronic subsystem may be highly relevant for a driver warning application running on the general computing platform. However, sending information from the general-purpose system to a critical control system is not so trivial and, as far as we know, is typically avoided. We argue that in this context it is interesting and useful to apply the wormholes hybrid model in order to explicitly assume the existence of a general (payload) system, asynchronous, but in which complex applications can be executed without special concern for timeliness, and a wormhole subsystem, which is timely and reliable, and in which it is possible to execute critical functions that support interactions with the payload system. The wormhole must provide at least a Timely Timing Failure Detection (TTFD) service, available to payload applications, to allow the detection of timing failures in the payload part or in the payload-wormhole interactions. This TTFD service must also be able to timely trigger the execution of fault handling operations for safety purposes. These handlers have to be implemented as part of the wormhole subsystem and will necessarily be application dependent. With these settings it is possible to deal with information flows from the payload side to the critical subsystems, thus allowing the development of applications that run on the general computing platform, which are able to exploit the availability of wireless communication, and which are still able to control critical systems
in a safe way. Of course, for this to be possible, the applications must be programmed in a way that is "hybridization-aware", explicitly using the TTFD service provided by the wormhole subsystem and being complemented by safety functions that must be executed on predictable subsystems. In the following section we describe the architectural components that constitute the hybrid system, focusing on these interfacing and programming issues.
5 Designing Applications in Hybrid Systems
5.1 Generic Architecture
In the proposed approach for the design of safety-critical applications in hybrid systems, the system architecture must necessarily encompass the two realms of operation: the asynchronous payload and the synchronous real-time subsystem, as illustrated in Figure 3.
Fig. 3. System architecture for asynchronous control. The asynchronous payload runs the asynchronous control task; the synchronous real-time subsystem contains the gateway (with an admission layer, a control task, a safety task, a TTFD task and a shared memory holding the timing-failure flag) and connects to the sensors and actuators.
A so-called asynchronous control task executes in the payload part, possibly interacting with external systems through wireless or other non-real-time networks. Interestingly, this asynchronous control task can perform complex calculations using varied data sources in order to achieve improved control decisions. In the real-time (or wormhole) part of the system, several tasks will be executed in a predictable way, always satisfying (by design) any required deadline. In order to exploit the synchronism properties of the wormhole part of the system, the interface used to access wormhole services must be carefully designed. The solution requires the definition of a wormhole gateway, much like the gateways between the different network classes that are defined in car architectures. This wormhole gateway includes an admission layer, which restricts the patterns of service requests as a means to secure the synchrony properties of the wormhole subsystem (we assume that the payload system can be computationally powerful, and the number of requests sent to the wormhole subsystem is not
bounded a priori). Some service requests may be delayed, rejected or simply not executed because of lack of resources. This behavior is admissible because, from the perspective of the asynchronous system, no guarantees are given anyway. Several interface functions may be made available, some of which are specifically related to the application being implemented (e.g., functions for control commands to be sent to actuators or ECUs, and for sensor information to be read). At a minimum, it is necessary to provide a set of functions to access and use the TTFD service. The role of the TTFD service is fundamental: in simple terms, it is a kind of "enhanced watchdog" programmed by the payload application, and it works as a switching device that gives control to the wormhole part when the payload becomes untimely. A more detailed description of the TTFD service and how it must be used is provided in Section 5.2. A control task is defined within the gateway, which implements the application-specific functions and also interacts with the TTFD service, forwarding start and stop commands received from the payload. This task may also decide whether an actuation command can effectively be applied or not, depending on the timeliness status of the payload. A safety task must also be in place, which is in charge of ensuring safe control whenever the asynchronous control task is prevented from performing the control. This safe control is done using only the locally available information, collected from local sensors. This task can be designed to keep the system in a safe state, but it will be a pessimistic control in the sense that it is based only on local information. The effective activation of this task is controlled by the TTFD service, using a status variable in a shared-memory structure, or some equivalent implementation. Quite naturally, each specific application must have its own associated safety task. Therefore, although the architecture is generically applicable to safety-critical applications in hybrid systems, some components must be instantiated on a case-by-case basis. In Figure 3 we also represent the sensors and actuators, which are necessarily part of the real-time subsystem. A possible realization of the admission layer is sketched below.
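Purely as an illustration, since the paper does not prescribe a concrete admission policy, the admission layer could be realized as a simple rate limiter that accepts at most one payload request per control period, so that the load imposed on the wormhole is bounded by construction; the period and all names below are hypothetical.

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical admission layer: at most one request per control period.
 * Requests arriving too early are rejected, which is admissible because
 * the asynchronous payload is given no guarantees anyway. */
#define PERIOD_MS 50                         /* assumed control period */

static uint64_t last_accepted_ms;            /* updated only by the gateway */

bool admit_request(uint64_t now_ms) {
    if (now_ms - last_accepted_ms < PERIOD_MS)
        return false;                        /* reject: preserves synchrony */
    last_accepted_ms = now_ms;
    return true;                             /* forward to the control task */
}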
5.2 Using the TTFD Service
A fundamental idea underlying the approach is to use the TTFD service to monitor the timeliness of a payload process. The TTFD service provides the following functions: startTFD, stopTFD and restartTFD. The startTFD function specifies an instant at which the monitoring of a timed action should start, and a maximum duration for that action. The handling functions that are executed when a timing failure is detected must be programmed a priori as part of the wormhole; a specific handler may be designated when starting a timing failure monitoring activity. The stopTFD function stops the on-going monitoring activity, returning an indication of whether the timed execution terminated on time or not. The restartTFD function allows the atomic execution of a stopTFD request followed by a startTFD request. A possible C rendering of this interface is sketched below.
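The following header is only our reading of the service description above; the concrete signatures, types and parameter units are assumptions, not the interface defined by the authors.

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical C interface for the TTFD service (names and types assumed). */
typedef uint32_t tfd_id_t;
typedef void (*tfd_handler_t)(void);   /* programmed a priori in the wormhole */

/* Start monitoring a timed action that begins at 'start' and must not last
 * longer than 'max_duration' (both in microseconds).  'on_failure' selects
 * one of the pre-installed wormhole handlers. */
tfd_id_t startTFD(uint64_t start, uint64_t max_duration, tfd_handler_t on_failure);

/* Stop the on-going monitoring; returns true iff the action was timely. */
bool stopTFD(tfd_id_t id);

/* Atomically stop the current activity and start a new one with the given
 * deadline; returns true iff the stopped action was timely. */
bool restartTFD(tfd_id_t id, uint64_t next_deadline, tfd_handler_t on_failure);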
Before starting a timed execution, the TTFD service is thus requested to monitor that execution, with a given deadline. If the execution is timely, the TTFD monitoring activity will be stopped before the deadline. Otherwise, when the deadline is reached the TTFD service will raise a timing failure condition (possibly a boolean variable in the shared memory, as shown in Figure 3). From a programmer's perspective, and considering that we are concerned with the development of asynchronous control applications, there are two important issues to deal with: a) determining the deadline values provided to the TTFD service; and b) using the available functions in a way that ensures that either the execution is timely (thus allowing control commands to be issued) or else a timing failure is always detected (so that the safety handler can at least be executed). The deadline must be such that the application is likely to be able to perform the necessary computations within it. In control applications, there is a trade-off between reducing the duration of the control cycle and the risk of not being able to compute a control decision within the allowed period; on the other hand, specifying large deadlines has a negative influence on the quality of control. The other restriction on the deadline is determined by safety rules and by the characteristics of the fail-safe real-time control task that is activated when the deadline is missed. The deadline must be such that the fail-safe control task, when activated, is still able to fulfill the safety requirements. The second issue concerns the way in which interactions between the payload and the wormhole must be programmed, which we discuss in what follows.
5.3 Payload-Wormhole Interactions
In the proposed architecture, TTFD requests are in fact directed to the control task, along with actuation commands. That is, when the asynchronous control task sends an actuation command, it implicitly finishes an on-going timed action, and it must explicitly start a new one by specifying a deadline for the next actuation instant. The idea is the following: when an actuation command is sent from the payload to the wormhole, it is supposed to arrive before a previously specified deadline. Therefore, when the command is received by the control task, this task first has to stop the on-going TTFD monitoring activity. Depending on the returned result, the control task will know if the actuation command is timely (and hence can be safely applied to the actuators) or if it is untimely (in which case it is simply discarded). In the latter case, the TTFD must already have triggered the failure indication. This indicator is used by the safety task to decide whether it should become active and take over the control of the system. As soon as a timing failure occurs, the indicator is activated, and the safety task will take over the next time it is released. This means that a late command received from the payload will be ignored, and it is ensured that the safety task will already be in control. In steady state, the asynchronous control task will be continuously sending commands to the wormhole, timely stopping the on-going TTFD monitoring activity, atomically restarting the TTFD for a future point in time (the next actuation deadline), and applying the control command. This interaction pattern is sketched below.
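Continuing with the hypothetical TTFD interface sketched in Section 5.2, the gateway control task could process an incoming actuation command as follows; the helper names and the header file are again assumptions made for illustration.

#include <stdbool.h>
#include <stdint.h>
#include "ttfd.h"   /* the hypothetical TTFD interface sketched in Sect. 5.2 */

extern void apply_to_actuators(int command);   /* application-specific */

/* Hypothetical gateway control task handler, run for each actuation
 * command received from the asynchronous payload. */
void on_actuation_command(tfd_id_t tfd, int command,
                          uint64_t next_deadline, tfd_handler_t on_failure)
{
    /* Atomically close the current timed action and open the next one. */
    bool timely = restartTFD(tfd, next_deadline, on_failure);

    if (timely)
        apply_to_actuators(command);  /* payload proved timely: safe to act */
    /* else: late command, simply discarded.  The TTFD service has already
     * raised the timing failure indication in shared memory, so the safety
     * task is about to take over (or already has). */
}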
6 Platooning Example
Let us consider the example of a platooning application, in which the objective is to achieve a better platoon behavior (keeping cars close together at the maximum possible speed) using not only the information available from local car sensors, but also information collected from other cars or from fixed stations along the road. The hybrid architecture will encompass an asynchronous platooning control task running on some on-board general-purpose computer, processing all the available information and producing control decisions that must be sent to the vehicle ECUs. The information exchanged between vehicles (through the wireless network) includes time, position and speed values. This information is relevant for the platooning control application, since it allows each of the other cars to be positioned on a virtual map and hence determines what to do regarding the car's own speed. Clocks are assumed to be synchronized through GPS receivers, and accelerations (positive and negative) are bounded. In this way, worst-case scenarios can be considered when determining the actuation commands (a sketch of this reasoning is given after Figure 4). Every car in the platoon periodically retrieves the relevant information from local sensors (through the wormhole interface), disseminates this information, and hopefully receives the same kind of information from the other cars. In the platooning application, failures in the communication will not have serious consequences. In fact, if a car does not receive information from the preceding car, it will use old information and will "see" that car closer than it is in reality. The consequence is that the follower car will stop, even if this is not necessary. Given the periodic nature of the payload message exchanges, the asynchronous control tasks may become aware of lost or very delayed messages (if timeouts are used) and refrain from sending actuation commands to the wormhole. In this case, or if the payload becomes too slow (remember that this is a general-purpose computing environment), the actuation commands expected by the wormhole will not be received or will arrive too late, and meanwhile the safety task is activated to take over the control of the car. From the platooning application perspective, the proposed implementation provides some clear improvements over a traditional implementation. The latter is pessimistic in the sense that it must ensure larger safety distances between cars, in particular at high speeds, since no information is available about the surrounding environment and in particular about the speed of the preceding car. In contrast, in the prototype we implemented it is possible to observe that, independently of the platoon speed, the distance between every two cars is kept constant, because follower cars are able to know both the distance to the preceding car and its speed. We implemented a prototype of this platooning application, which was demonstrated using emulators for the physical environment and for the wireless network. Figure 4 illustrates some of the hardware and a graphical view of the platoon in the physically emulated reality. The interested reader can refer to [12], which provides additional details about this demonstration.
Fig. 4. Platooning demonstration
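To illustrate the kind of worst-case reasoning enabled by synchronized clocks and bounded accelerations, the following C function estimates a conservative bound on the current gap to the preceding car from its last report; the formula, the deceleration bound and all names are our own illustration, not taken from the paper.

/* Conservative lower bound on the distance traveled by the preceding car
 * since its last report (position p0, speed v0) taken dt seconds ago,
 * assuming it may have braked at the maximum deceleration A_MAX ever since.
 * SI units throughout; A_MAX is an assumed bound, not a value from the paper. */
#define A_MAX 8.0                                /* m/s^2, assumed bound */

double min_travel_since_report(double v0, double dt) {
    double t_stop = v0 / A_MAX;                  /* time until it would stop */
    if (dt >= t_stop)
        return v0 * v0 / (2.0 * A_MAX);          /* stopped: braking distance */
    return v0 * dt - 0.5 * A_MAX * dt * dt;      /* still braking at time dt */
}

/* Worst-case (smallest possible) gap between own position x and the
 * preceding car, given its last report (p0, v0) taken dt seconds ago. */
double worst_case_gap(double x, double p0, double v0, double dt) {
    return (p0 + min_travel_since_report(v0, dt)) - x;
}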
7 Conclusions
The possibility of using wireless networks for car-to-car or car-to-infrastructure interactions is very appealing. The availability of multiple sources of information can be used to improve the quality of control functions and, implicitly, safety or fuel consumption. The problem that we addressed in this paper concerns the potential lack of timeliness and the unreliability of wireless networks, which make it difficult to consider their use when implementing real-time applications or safety-critical systems. We propose an approach that is based on the use of a hybrid system model and architecture. The general idea is to allow applications to be developed on a general-purpose part of the system, typically asynchronous, while providing the means to ensure that safety-critical properties are always secured. Since we focus on applications for the vehicular domain, typically control applications, we first explain why the considered hybrid approach is very reasonable in this context. Then we provide guidelines for designing asynchronous control applications, explaining in particular how the interactions between the payload and the wormhole subsystems should be programmed. From the experience we gained in the development of the platooning example application and from the observations we made while executing our demonstration system, we conclude that the proposed approach constitutes a potentially interesting alternative for the implementation of optimized safety-critical systems in wireless environments.
References
1. IEEE P802.11p/D3.0, Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications: Amendment: Wireless Access in Vehicular Environments (WAVE), Draft 3.0 (July 2007)
2. Bouroche, M., Hughes, B., Cahill, V.: Building reliable mobile applications with space-elastic adaptation. In: WOWMOM 2006: Proceedings of the 2006 International Symposium on a World of Wireless, Mobile and Multimedia Networks, Washington, DC, USA, pp. 627–632. IEEE Computer Society, Los Alamitos (2006)
3. Bouroche, M., Hughes, B., Cahill, V.: Real-time coordination of autonomous vehicles. In: Proceedings of the IEEE Intelligent Transportation Systems Conference 2006, September 2006, pp. 1232–1239 (2006)
4. Correia, M., Veríssimo, P., Neves, N.F.: The design of a COTS real-time distributed security kernel. In: Bondavalli, A., Thévenod-Fosse, P. (eds.) EDCC 2002. LNCS, vol. 2485, pp. 234–252. Springer, Heidelberg (2002)
5. Elbatt, T., Goel, S.K., Holland, G., Krishnan, H., Parikh, J.: Cooperative collision warning using dedicated short range wireless communications. In: VANET 2006: Proceedings of the 3rd International Workshop on Vehicular Ad Hoc Networks, pp. 1–9. ACM, New York (2006)
6. Ergen, M., Lee, D., Sengupta, R., Varaiya, P.: WTRP - Wireless Token Ring Protocol. IEEE Transactions on Vehicular Technology 53(6), 1863–1881 (2004)
7. Halle, S., Laumonier, J., Chaib-Draa, B.: A decentralized approach to collaborative driving coordination. In: Proceedings of the 7th International IEEE Conference on Intelligent Transportation Systems, October 2004, pp. 453–458 (2004)
8. He, T., Vicaire, P., Yan, T., Luo, L., Gu, L., Zhou, G., Stoleru, R., Cao, Q., Stankovic, J.A., Abdelzaher, T.: Achieving real-time target tracking using wireless sensor networks. In: RTAS 2006: Proceedings of the 12th IEEE Real-Time and Embedded Technology and Applications Symposium, Washington, DC, USA, pp. 37–48. IEEE Computer Society, Los Alamitos (2006)
9. HIDENETS, http://www.hidenets.aau.dk/
10. Koubâa, A., Cunha, A., Alves, M., Tovar, E.: TDBS: a time division beacon scheduling mechanism for ZigBee cluster-tree wireless sensor networks. Real-Time Systems 40(3), 321–354 (2008)
11. Lu, C., Blum, B.M., Abdelzaher, T.F., Stankovic, J.A., He, T.: RAP: A real-time communication architecture for large-scale wireless sensor networks. In: Eighth IEEE Real-Time and Embedded Technology and Applications Symposium, Washington, DC, USA, pp. 55–66. IEEE Computer Society, Los Alamitos (2002)
12. Marques, L., Casimiro, A., Calha, M.: Design and development of a proof-of-concept platooning application using the HIDENETS architecture. In: Proceedings of the 2009 IEEE/IFIP Conference on Dependable Systems and Networks, pp. 223–228. IEEE Computer Society Press, Los Alamitos (2009)
13. Michaud, F., Lepage, P., Frenette, P., Letourneau, D., Gaubert, N.: Coordinated maneuvering of automated vehicles in platoons. IEEE Transactions on Intelligent Transportation Systems 7(4), 437–447 (2006)
14. Misener, J.A., Sengupta, R.: Cooperative collision warning: Enabling crash avoidance with wireless. In: 12th World Congress on ITS, New York, NY, USA (November 2005)
15. Verissimo, P.: Travelling through wormholes: a new look at distributed systems models. SIGACT News 37(1), 66–81 (2006)
16. Veríssimo, P., Casimiro, A.: The Timely Computing Base model and architecture. IEEE Transactions on Computers - Special Issue on Asynchronous Real-Time Systems 51(8) (August 2002); a preliminary version appeared as Technical Report DI/FCUL TR 99-2, Department of Computer Science, University of Lisboa (April 1999)
17. Xu, Q., Mak, T., Ko, J., Sengupta, R.: Vehicle-to-vehicle safety messaging in DSRC. In: VANET 2004: Proceedings of the 1st ACM International Workshop on Vehicular Ad Hoc Networks, pp. 19–28. ACM Press, New York (2004)
Concurrency and Communication: Lessons from the SHIM Project Stephen A. Edwards Columbia University, New York, ny, usa
[email protected]
Abstract. Describing parallel hardware and software is difficult, especially in an embedded setting. Five years ago, we started the shim project to address this challenge by developing a programming language for hardware/software systems. The resulting language describes asynchronously running processes and has the useful property of scheduling independence: the i/o of a shim program is not affected by any scheduling choices. This paper presents a history of the shim project with a focus on the key things we have learned along the way.
1 Introduction
Shim, an acronym for “software/hardware integration medium,” started as an attempt to simplify the challenges of passing data across the hardware-software boundary. It has since turned into a language development effort centered around a scheduling-independent (i.e., race-free) concurrency model and static analysis. The purpose of this paper is to lay out the history of the shim project with a special focus on what we learned along the way. It is deliberately light on technical details (which can be found in the original publications) and instead tries to contribute intuition and insight. We begin by discussing the original motivations for the project, how it evolved into a study of concurrency models, how we chose a particular model, and how we have added language features to that model. We conclude with a section highlighting the central lessons we have learned along with open problems.
2 Embryonic shim
We started developing shim in 2004 after observing the difficulties our students were having building embedded systems [1,2] that communicated across the hardware/software boundary. The central idea was to provide variables that could be accessed equally easily by either hardware processes or software functions, both written in a C-like dialect. Figure 1 shows a simple counter in this dialect of the language. The count function resides in hardware; the other two are in software. When a software function like get_time reads the hardware register counter, the compiler would automatically insert code to fetch its value from the hardware and synthesize vhdl that could send the data
module timer {
  shared uint:32 counter;    /* Hardware register visible from software */

  hw void count() {          /* Hardware function */
    counter = counter + 1;   /* Direct access to hardware register */
  }

  out void reset_timer() {   /* Software function */
    counter = 0;             /* Accesses register through bus */
  }

  out uint get_time() {      /* Software function */
    return counter;          /* Accesses register through bus */
  }
}

Fig. 1. An early fragment of shim [1]
on a bus when requested. We wrote a few variants of an i2c bus controller in the language, starting with an all-software version and ending with one that implemented byte-level communication completely in hardware. The central lesson of this work was that the shared memory model, while simple, was a very low-level way to implement such communication. Invariably, it is necessary to layer another communication protocol (e.g., some form of handshake) over it to ensure coherence. We had not included an effective mechanism for implementing communication libraries that could hide this fussy code, so it was back to the drawing board. The kind of handshake that has to be layered over a shared register is sketched below.
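To show the fussy code the text alludes to, here is a minimal C sketch of a one-word handshake layered over shared variables; it is our illustration of the general pattern, not code from the shim project.

#include <stdatomic.h>
#include <stdint.h>

/* One-place handshake over shared state: the producer may only write when
 * 'valid' is clear; the consumer may only read when it is set.  This is
 * the extra protocol needed to make a bare shared register coherent. */
static uint32_t data;
static atomic_bool valid;

void send_word(uint32_t w) {
    while (atomic_load(&valid))      /* wait until last word was consumed */
        ;
    data = w;
    atomic_store(&valid, true);      /* publish: word is ready */
}

uint32_t recv_word(void) {
    while (!atomic_load(&valid))     /* wait for the producer */
        ;
    uint32_t w = data;
    atomic_store(&valid, false);     /* acknowledge: slot free again */
    return w;
}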
3 Kahn, Hoare, and the shim Model
We decided we wanted reliable communication, including across the hardware/software boundary, to be a centerpiece of the next version of shim. Erroneous communication is a central source of bugs in hardware designs: our embedded-systems students' favorite mistake was to generate a value in one cycle and read it in another. This rarely produces even a warning in usual hardware simulation, so it can easily go unnoticed. We also found the inherent nondeterminism of the first iteration of shim a key drawback. The speed at which software runs on processors is rarely known, let alone controlled. Since software and hardware ran in parallel and communicated using shared variables, the resulting system was nondeterministic, making it difficult to test. It also ran counter to what we had learned from Esterel [3]. Table 1 shows our wishlist. We wanted a concurrent, deterministic (i.e., independent of scheduling) model of computation and started looking around. The synchronous model [4] was unsuitable because it generally assumes either a single clock or harmonically related clocks and would not work well with software.
Table 1. The shim Wishlist

Trait                                      Motivation
Concurrent                                 Hardware/software systems fundamentally parallel
Mixes synchronous and asynchronous styles  Software slower and less predictable than hardware; need something like multirate dataflow
Only requires bounded resources            Fundamental restriction on hardware
Formal semantics                           No arguments about meaning or behavior
Scheduling-independent                     i/o should not depend on program implementation
Steve Nowick steered us toward the body of work on delay-independent circuits (e.g., van Berkel's Handshake Circuits [5]). We compared this class of processes to Kahn's networks [6] and found them to be essentially the same [7]. We also studied how to characterize such processes [8], finding that they can be described as functions that, when presented with more inputs or output opportunities, never produce less or different data. In their classic form, the unbounded buffers of Kahn networks actually make them Turing-complete [9] and difficult to schedule [10], so we decided on a model in which Kahn networks are restricted to csp-like rendezvous [11]. Others, such as Lin [12], had also proposed using such a model. In 2005, we presented our new shim model and a skeleton of the language, "Tiny-shim," together with its formal semantics [13]. It amounted to read and write operations sewn together with the usual C-like expressions and control-flow statements. We later extended this work with further examples, a single-threaded C implementation, and an outline of a hardware translation [14]. In 2006, we published our first real research result with shim: a technique for very efficient single-threaded code generation [15]. The centerpiece of this work was an algorithm that could compile arbitrary groups of processes into a single automaton whose states abstracted the control states of the processes. Our goal was to eliminate synchronization overhead, so the automaton captured which processes were waiting on which channels, but left all other details, such as variable values and the details of the program counters, out of the automaton. Figure 2 demonstrates the algorithm from Edwards and Tardieu [15]. The automaton's states are labeled with a number (e.g., S0), the state of each channel in the system (ready "-", blocked reading "R", or blocked writing "W"), and, for each process, whether it is runnable (√) or blocked on a channel (×), together with a set of possible program counters. From each state, the automaton generator (a.k.a. the scheduler) nondeterministically chooses one of the runnable processes to execute and generates a successor state by considering each possible pc value for the process. The code generated for a state with multiple pc values begins with a C switch statement that splits control depending on the pc value.
process sink(int32 B) {
  for (;;) B;
}

process buffer(int32 &B, int32 A) {
  for (;;) B = A;
}

process source(int32 &A) {
  A = 17; A = 42; A = 157; A = 8;
}

network main() {
  sink(); buffer(); source();
}

Fig. 2. An illustration of the shim language and its automaton compilation scheme from Edwards and Tardieu [15]. A source program (a) is dismantled into intermediate code (b), then simulated to produce an automaton (c) with states S0-S5. Each state is labeled with its name, the state of each channel (blocked on read, blocked on write, or idle), and the state of each process (runnable, and possible program counter values).
At this point, the language fairly closely resembled the Tiny-shim language of the Emsoft paper [13]. A system consisted of a collection of sequential processes, all assumed to start when the system began. It could also contain networks—groups of connected processes that could be instantiated hierarchically. One novel feature of this version, which we later dropped, was the ability to instantiate processes and networks without supplying explicit connections. Instead, the compiler would examine the interface of each instantiated process and make sure its environment supplied such a signal. Connections were made implicitly by name, although this could be overridden. This feature arose from observing how, in vhdl, it is often necessary to declare and mention each channel many times: once for each process, once for each instantiation of each process, and once in the environment in which it is instantiated. However, in the process of writing more elaborate test cases, such as a jpeg decoder [16], we decided that this connection-centric specification style (which we adopted from hardware description languages) was inadequate for any sort of interesting software. We wanted function calls.
4 Recursion
In 2006, we introduced function calls and recursion to shim, making it very C-like [17]. Our main goal was to make basic function calls work, allowing the usual re-use of code, but we also found that recursion, especially bounded recursion, was a useful mechanism for specifying more complex structures.

void buffer(int i, int &o) {
  for (;;) {
    recv i;
    o = i;
    send o;
  }
}

void fifo(int i, int &o, int n) {
  int c;
  int m = n - 1;
  if (m)
    buffer(i, c) par fifo(c, o, m);
  else
    buffer(i, o);
}

Fig. 3. An n-place fifo specified using recursion, from Tardieu and Edwards [17]
Figure 3 illustrates this style. The recursive fifo procedure calls itself repeatedly in parallel, effectively instantiating buffer processes as it goes. This recursion runs only once, when the program starts, to set up a chain of single-place buffers.
5 Exceptions
Next, we added exceptions [18], certainly the most technically difficult addition we have made. Inspired by Esterel [3], where exceptions are used not just for occasional error handling but as widely as, say, if-then-else, we wanted our exceptions to be widely applicable, concurrent, and scheduling-independent. For sequential code, the semantics of exceptions were clear: throwing an exception immediately sends control to the most-recently-entered handler for the given exception, terminating any functions that were called in between. For concurrently running functions, the right behavior was less obvious. We wanted to terminate everything leading up to the handler, including any concurrently running relatives, but we insisted on maintaining shim's scheduling independence, meaning we had to carefully time when the effect of an exception was felt. Simply terminating siblings when one of them threw an exception would be nondeterministic: the behavior would then depend on the relative execution rates of the processes and thus not be scheduling-independent. Our solution was to piggyback the exception mechanism on the communication system, i.e., a process would only learn of an exception when it attempted to communicate, the only point at which processes agree on the time. To accommodate exceptions, we introduced a new, "poisoned," state that represents a process that has been terminated by an exception and is waiting for its relatives to terminate. Any process that attempts to communicate with a poisoned process will itself become poisoned. In Figure 5, the first thread throws
void main() {
  int i;
  i = 0;
  try {
    i = 1;
    throw T;
    i = i * 2;     // is not executed
  } catch(T) {
    i = i * 3;
  }                // i = 3
}
(a)

void main() {
  int i;
  i = 0;
  try {            // thread 1
    throw T;
  } par {          // thread 2
    for (;;)
      i = i + 1;   // runs forever
  } catch(T) {}
}
(b)

Fig. 4. (a) Sequential exception semantics are classical. (b) Thread 2 never feels the effect of the exception because it never communicates. From Tardieu and Edwards [18].
void main() {
  chan int i = 0, j = 0;
  try {                        // thread 1
    while (i < 5)
      next i = i + 1;
    throw T;                   // poisons itself
  } par {                      // thread 2
    for (;;)
      next j = next i + 1;     // poisoned by thread 1
  } par {                      // thread 3
    for (;;)
      recv j;                  // poisoned by thread 2
  } catch (T) {}
}

Fig. 5. Transitive poisoning: throw T poisons the first process, which poisons the second when the second attempts next i. Finally, the third is poisoned when it attempts recv j and the whole group terminates.
an exception; the second thread is poisoned when it attempts to rendezvous on i, and the third is poisoned by the second when it attempts to rendezvous on j. The idea was simple enough, and the interface it presented to the programmer could certainly be used and explained without much difficulty, but implementing it turned out to be a huge challenge, despite there being a fairly simple set of structural operational semantics rules for it. The real complexity came from having to consider exception scope, which limits how far the poison propagates (it does not propagate outside the scope of the exception), and the behavior of multiple, concurrently thrown exceptions.
6 Static Analysis
Shim has always been designed for aggressive compiler analysis. We have attempted to keep its semantics simple and scheduling-independent, and to restrict it to finite-state models. Together, these choices have made it easier to analyze.
We developed a technique for removing bounded recursion from shim programs [19]. One goal was to simplify shim's translation into hardware, where general recursion would require memory for a stack and choosing a size for it, but the technique has found many other uses. In particular, if a program has only bounded recursion, it is finite-state, simplifying other analysis steps. The basic idea of our work was to unroll recursive calls by exactly tracking the behavior of the variables that control the recursion. Our insight was that for a recursive function to terminate, the recursive call must be within the scope of a conditional. Therefore, we need to track the predicate of this conditional, see what can affect it, and so forth. Figure 6 illustrates what this procedure does to a simple fifo. To produce the static version in Figure 6(b), our procedure observes that the n variable controls the predicate around fifo's recursive call of itself. It then notices that n is initially bound to 3 by fifo3 and generates three specialized versions of fifo—one with n = 3, n = 2, and n = 1—simplifies each, then inlines each function, since each is only called once. Of course, in the worst case our procedure could end up trying to track every variable in the program, which would be impractical, but in many examples we tried, recursion control only involved a few variables, making it easy to resolve. A key hypothesis of the shim project has been that scheduling independence should be a property of any practical concurrent language because it greatly simplifies reasoning about a program, both by the programmer and by automated tools. Our work on static deadlock detection reinforces this key point. Shim is not immune to deadlocks (e.g., { recv a; recv b; } par { send b; send a; } is about the simplest example), but they are simpler in shim because of its scheduling independence: deadlocks in shim cannot occur because of race conditions.
void fifo3(chan int i, chan int &o) {
  fifo(i, o, 3);
}

void fifo(chan int i, chan int &o, int n) {
  if (n > 1) {
    chan int c;
    buf(i, c); par fifo(c, o, n-1);
  } else
    buf(i, o);
}

void buf(chan int i, chan int &o) {
  for (;;)
    next o = next i;
}
(a)

void fifo3(chan int i, chan int &o) {
  chan int c1, c2;
  buf(i, c1); par buf(c1, c2); par buf(c2, o);
}

void buf(chan int i, chan int &o) {
  for (;;)
    next o = next i;
}
(b)

Fig. 6. Removing bounded recursion, controlled by the n variable, from (a) gives (b). After Edwards and Zeng [19].
For example, because shim does not have races, there are no race-induced deadlocks, such as the "grab locks in opposite order" deadlock present in many other languages. In general, shim does not need to be analyzed under an interleaved model of concurrency, since most properties, including deadlock, are the same under any schedule. So all the clever partial-order tricks used by model checkers such as spin [20] are not necessary for shim. We first used the synchronous model checker nusmv [21] to detect deadlocks in shim [22]—an interesting choice since shim's concurrency model is fundamentally asynchronous. Our approach was to abstract away data operations and choose a specific schedule in which each communication event takes a single cycle. This reduced a shim program to a set of communicating state machines suitable for the nusmv model checker. We continue to work on deadlock detection in shim. Most recently [23], we took a compositional approach in which we build an automaton for a complete system piece by piece. Our insight is that we can usually abstract away internal channels and simplify the automaton without introducing or removing deadlocks. The result is that even though we are doing explicit model checking, we can often do it much faster than a state-of-the-art symbolic model checker such as nusmv. We have also used model checking to search for situations where buffer memory can be shared [24]. In general, each communication channel needs storage for any data being communicated over it, but in certain cases it is possible to prove that two channels can never be active simultaneously. We use the nusmv model checker to identify these cases, which allows us to share potentially large buffers across multiple channels. Because this is an optimization, if the model checker becomes overloaded, we can safely analyze the system in smaller pieces.
7 Backends
We have developed a series of backends for the shim compiler; each works off a slightly different intermediate representation. First, we developed a code generator that produced single-threaded C [14] for a variant of Tiny-shim, which had only point-to-point channels. The runtime system maintained a linked list of runnable processes and, for each channel, tracked which process, if any, was blocked on it. Each process was compiled into a separate C function, which stored its state in a global integer and used a switch statement to restore it, following the pattern sketched below. This worked well, although we could improve runtimes by compiling away communication overhead through static scheduling [15]. To handle multi-way rendezvous, exceptions, and recursion on parallel hardware we needed a new technique. Our next backend [25] generated C code that made calls to the posix thread library to ask for parallelism. The challenge was to minimize overhead. Each communication action would acquire the lock on a channel, check whether every process connected to it had also blocked (i.e., whether the rendezvous could occur), and then check whether the channel was connected to a poisoned process (i.e., whether a relevant exception had been thrown). All of these checks ran quickly; actual communication and exceptions took longer.
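The following fragment is our own minimal reconstruction of the resumption pattern described for the single-threaded backend: a process function that saves its control state in a global integer and resumes via a switch. The names and the two-state structure are illustrative, not actual compiler output.

#include <stdio.h>

/* Hypothetical code a single-threaded backend might emit for one process:
 * the program counter is a global integer, and each call resumes the
 * process at the label recorded when it last blocked on a channel. */
static int proc_state = 0;            /* saved "program counter" */

void proc_resume(void) {
    switch (proc_state) {
    case 0: goto L0;
    case 1: goto L1;
    }
L0:
    printf("before blocking on channel c\n");
    proc_state = 1;                   /* block: remember resume point */
    return;                           /* yield to the scheduler */
L1:
    printf("resumed after rendezvous on c\n");
    proc_state = 0;                   /* loop back for the next round */
    return;
}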
We also developed a backend for ibm’s cell processor [26]. A direct offshoot of the pthreads backend, it allows the user to assign computationally intensive tasks to the cell’s synergistic processing units (spus); remaining tasks run on the cell’s powerpc core (ppu). Our technique replaces the offloaded functions with wrappers that communicate across the ppu-spu boundary. Cross-boundary function calls are technically challenging because of data alignment restrictions on function arguments, which we would have preferred to be stack-resident. This, and many other fussy aspects of coding for the cell, convinced us that such heterogeneous multicore processors demand languages at a higher level than C.
8 Lessons and Open Problems
8.1 Function Calls
Early versions of the language did not support classical software-like function calls. However, these are so useful, even in dataflow-centric descriptions, that they really need to be part of just about any language. We were initially deceived by the rare use of function calls in vhdl and Verilog, but we suspect this is because they do not fit easily into the register-transfer model.
8.2 Two-Way vs. Multi-way Rendezvous
Initial versions of shim used only two-way rendezvous, but after a discussion with Edward Lee, we became convinced that multi-way rendezvous was useful to provide at the language level. Debugging was one motivation: with multi-way rendezvous, it becomes easy to add a monitor that can observe data flowing through a channel; modeling the clock of a synchronous system was another. Unfortunately, implementing multi-way rendezvous is much more complicated than implementing two-way rendezvous, yet we found that most communication in shim programs is point-to-point, so we are left with a painful choice: slow down the common case to accommodate the uncommon case, or do aggressive analysis to determine when we can assume point-to-point communication. We would like to return shim to point-to-point communication only but provide multi-way rendezvous as a sort of syntactic sugar, e.g., by introducing extra processes responsible for communication on channels. How to do this correctly and elegantly remains an open question, unfortunately.
8.3 Exceptions
Exceptions have been an even more painful feature than multi-way rendezvous. They are extremely convenient from a programming standpoint (e.g., shim's rudimentary i/o library wraps each program in an exception to allow it to terminate gracefully, and virtually every compiler test case includes at least one exception), but extremely difficult both to implement and to reason about. We have backed away from exceptions for now (all our recent work addresses the exception-free version of shim); we see two possibilities for how to proceed.
One is to restrict the use of exceptions so that the complicated case of multiple, concurrent exceptions is simply prohibited. This may preclude some interesting algorithms, but should greatly simplify the implementation, and probably also the analysis, of exceptions. The other alternative is to turn exceptions into syntactic sugar layered on the exception-free shim model. We always had this in the back of our minds: an exception would just put a process into an unusual state in which it communicates its poisoned status to any process that attempts to communicate with it. The problem is that the complexity tends to grow quickly when multiple, concurrent exceptions and scopes are considered. Again, exactly how to translate exceptions into a simpler shim model remains an open question.
8.4 Semantics and Static Analysis
We feel we have proven one central hypothesis of the shim project: that simple, deterministic semantics help both programming and automated program analysis. That we have been able to devise truly effective mechanisms for clever code generation (e.g., static scheduling) and analysis (e.g., deadlock detection) that gain deep insight into the behavior of programs vindicates this view. The bottom line: if a programming language does not have simple semantics, it is really hard to analyze its programs quickly or precisely. We have also validated the utility of scheduling independence. Our test suite, which consists of many parallel programs, has reproducible results that let us sleep at night. We have found few cases where the approach has limited us. Algorithms with a large number of little, variable-sized, but independent pieces of work do not mesh well with shim's scheduling-independent philosophy as it currently stands. The obvious way to handle them is to maintain a bucket of tasks and assign a task to a processor whenever it finishes its last one. The order in which the tasks are performed therefore depends on their relative execution rates, but this does not matter if the tasks are independent. It would be possible to add scheduling-independent task distribution and scheduling to shim (i.e., provided the tasks are truly independent or, equivalently, confluent); exactly how is an open research question.
8.5 Buffers
That buffering is mandatory for high-performance parallel applications is hardly a revelation; we confirmed it anyway. The shim model has always been able to implement fifo buffers (e.g., Figure 3), but we have come to realize that they are sufficiently fundamental to deserve being a first-class type in the language. We are currently working on a variant of the language that replaces pure rendezvous communication with bounded, buffered communication. Because it will be part of the language, it will be easier to map to unusual environments, such as the dma mechanism for inter-core communication on the cell processor. A bounded buffered channel of this kind is sketched below.
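For illustration, a bounded buffered channel behaves like a fixed-capacity fifo in which send blocks when the buffer is full and recv blocks when it is empty. The single-threaded C sketch below captures just the data structure, with blocking reduced to a boolean status; it is our own example, not the language design itself.

#include <stdbool.h>

#define CAPACITY 4                    /* fixed bound, part of the channel */

/* A bounded channel as a ring buffer: send fails (would block) when full,
 * recv fails (would block) when empty.  Zero-initialize before use. */
struct channel {
    int buf[CAPACITY];
    int head, count;
};

bool chan_send(struct channel *c, int v) {
    if (c->count == CAPACITY) return false;    /* full: sender must wait */
    c->buf[(c->head + c->count) % CAPACITY] = v;
    c->count++;
    return true;
}

bool chan_recv(struct channel *c, int *v) {
    if (c->count == 0) return false;           /* empty: receiver must wait */
    *v = c->buf[c->head];
    c->head = (c->head + 1) % CAPACITY;
    c->count--;
    return true;
}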
8.6 Other Applications
The most likely future role of shim will be as inspiration for other languages. For example, Vasudevan has ported its communication model into the Haskell functional language [27] and proposed a compiler that would impose its scheduling-independent view of the world on arbitrary programs [28]. Certain shim ideas, such as scheduling analysis [29], have also been used in ibm's x10 language.
Acknowledgments Many have contributed to shim. Olivier Tardieu created the formal semantics, devised the exception mechanism, and instigated endless (constructive) arguments. Jia Zeng developed the static recursion removal algorithm. Nalini Vasudevan has pushed shim in many new directions; Baolin Shao has just started pushing. The nsf has supported the shim project under grant 0614799.
References
1. Edwards, S.A.: Experiences teaching an fpga-based embedded systems class. In: Proceedings of the Workshop on Embedded Systems Education (wese), Jersey City, New Jersey, September 2005, pp. 52–58 (2005)
2. Edwards, S.A.: Shim: A language for hardware/software integration. In: Proceedings of synchron, Schloss Dagstuhl, Germany (December 2004)
3. Berry, G., Gonthier, G.: The Esterel synchronous programming language: Design, semantics, implementation. Science of Computer Programming 19(2), 87–152 (1992)
4. Benveniste, A., Caspi, P., Edwards, S.A., Halbwachs, N., Le Guernic, P., de Simone, R.: The synchronous languages 12 years later. Proceedings of the IEEE 91(1), 64–83 (2003)
5. van Berkel, K.: Handshake Circuits: An Asynchronous Architecture for vlsi Programming. Cambridge University Press, Cambridge (1993)
6. Kahn, G.: The semantics of a simple language for parallel programming. In: Information Processing 74: Proceedings of ifip Congress 74, Stockholm, Sweden, pp. 471–475. North-Holland, Amsterdam (1974)
7. Edwards, S.A., Tardieu, O.: Deterministic receptive processes are Kahn processes. In: Proceedings of the International Conference on Formal Methods and Models for Codesign (memocode), Verona, Italy, July 2005, pp. 37–44 (2005)
8. Tardieu, O., Edwards, S.A.: Specifying confluent processes. Technical Report cucs–037–06, Columbia University, Department of Computer Science, New York, USA (September 2006)
9. Buck, J.T.: Scheduling Dynamic Dataflow Graphs with Bounded Memory using the Token Flow Model. PhD thesis, University of California, Berkeley (1993); available as ucb/erl M93/69
10. Parks, T.M.: Bounded Scheduling of Process Networks. PhD thesis, University of California, Berkeley (1995); available as ucb/erl M95/105
11. Hoare, C.A.R.: Communicating sequential processes. Communications of the ACM 21(8), 666–677 (1978)
12. Lin, B.: Software synthesis of process-based concurrent programs. In: Proceedings of the 35th Design Automation Conference, San Francisco, California, June 1998, pp. 502–505 (1998)
13. Edwards, S.A., Tardieu, O.: Shim: A deterministic model for heterogeneous embedded systems. In: Proceedings of the International Conference on Embedded Software (Emsoft), Jersey City, New Jersey, September 2005, pp. 37–44 (2005)
14. Edwards, S.A., Tardieu, O.: Shim: A deterministic model for heterogeneous embedded systems. IEEE Transactions on Very Large Scale Integration (vlsi) Systems 14(8), 854–867 (2006)
15. Edwards, S.A., Tardieu, O.: Efficient code generation from Shim models. In: Proceedings of Languages, Compilers, and Tools for Embedded Systems (lctes), Ottawa, Canada, June 2006, pp. 125–134 (2006)
16. Vasudevan, N., Edwards, S.A.: A jpeg decoder in Shim. Technical Report cucs–048–06, Columbia University, Department of Computer Science, New York, USA (December 2006)
17. Tardieu, O., Edwards, S.A.: R-shim: Deterministic concurrency with recursion and shared variables. In: Proceedings of the International Conference on Formal Methods and Models for Codesign (memocode), Napa, California, July 2006, p. 202 (2006)
18. Tardieu, O., Edwards, S.A.: Scheduling-independent threads and exceptions in Shim. In: Proceedings of the International Conference on Embedded Software (Emsoft), Seoul, Korea, October 2006, pp. 142–151 (2006)
19. Edwards, S.A., Zeng, J.: Static elaboration of recursion for concurrent software. In: Proceedings of the Workshop on Partial Evaluation and Program Manipulation (pepm), San Francisco, California, January 2008, pp. 71–80 (2008)
20. Holzmann, G.J.: The model checker spin. IEEE Transactions on Software Engineering 23(5), 279–294 (1997)
21. Cimatti, A., Clarke, E.M., Giunchiglia, E., Giunchiglia, F., Pistore, M., Roveri, M., Sebastiani, R., Tacchella, A.: NuSMV 2: An open-source tool for symbolic model checking. In: Brinksma, E., Larsen, K.G. (eds.) CAV 2002. LNCS, vol. 2404, pp. 359–364. Springer, Heidelberg (2002)
22. Vasudevan, N., Edwards, S.A.: Static deadlock detection for the shim concurrent language. In: Proceedings of the International Conference on Formal Methods and Models for Codesign (memocode), Anaheim, California, June 2008, pp. 49–58 (2008)
23. Shao, B., Vasudevan, N., Edwards, S.A.: Compositional deadlock detection for rendezvous communication. In: Proceedings of the International Conference on Embedded Software (Emsoft), Grenoble, France (October 2009)
24. Vasudevan, N., Edwards, S.A.: Buffer sharing in csp-like programs. In: Proceedings of the International Conference on Formal Methods and Models for Codesign (memocode), Cambridge, Massachusetts (July 2009)
25. Edwards, S.A., Vasudevan, N., Tardieu, O.: Programming shared memory multiprocessors with deterministic message-passing concurrency: Compiling Shim to Pthreads. In: Proceedings of Design, Automation, and Test in Europe (date), Munich, Germany, March 2008, pp. 1498–1503 (2008)
26. Vasudevan, N., Edwards, S.A.: Celling Shim: Compiling deterministic concurrency to a heterogeneous multicore. In: Proceedings of the Symposium on Applied Computing (sac), Honolulu, Hawaii, March 2009, vol. III, pp. 1626–1631 (2009)
27. Vasudevan, N., Singh, S., Edwards, S.A.: A deterministic multi-way rendezvous library for Haskell. In: Proceedings of the International Parallel and Distributed Processing Symposium (ipdps), Miami, Florida, April 2008, pp. 1–12 (2008)
28. Vasudevan, N., Edwards, S.A.: A determinizing compiler. In: Proceedings of Program Language Design and Implementation (pldi), Dublin, Ireland (June 2009)
29. Vasudevan, N., Tardieu, O., Dolby, J., Edwards, S.A.: Compile-time analysis and specialization of clocks in concurrent programs. In: de Moor, O., Schwartzbach, M. (eds.) CC 2009. LNCS, vol. 5501, pp. 48–62. Springer, Heidelberg (2009)
Location-Aware Web Service by Utilizing Web Contents Including Location Information
YongUk Kim, Chulbum Ahn, Joonwoo Lee, and Yunmook Nah
Department of Computer Science and Engineering, Dankook University, 126 Jukjeon-dong, Suji-gu, Yongin-si, Gyeonggi-do, 448-701, Korea
{yukim,ahn555,jwlee}@dblab.dankook.ac.kr,
[email protected]
Abstract. Traditional search engines are usually based on keyword-based retrieval, where location information is simply treated as text data, resulting in incorrect search results and a low degree of user satisfaction. In this paper, we propose a location-aware Web Service system, which adds location information to web contents, usually consisting of text and multimedia information. For this purpose, we describe the system architecture to enable such services, explain how to extend web browsers, and propose the corresponding web container and web search engine. The proposed methods can be implemented on top of traditional Web Service layers. Web contents that include location information can use it as a parameter during the search process and can therefore increase the degree of search correctness by using actual location information instead of simple keywords.
Keywords: Location-Based Service, Web Service, GeoRSS, search engine, GeoWEB.
1 Introduction

With the rapid development of related technologies, the Web has become an important medium for providing and sharing information. Various groups from different industries produce and use web contents. In particular, Web 2.0 technology has extended the means of providing and sharing information from large-scale Web service providers to small-scale and even individual providers [1]. The Web now carries not only traditional contents related to professional research and commercial promotion, but also detailed information closely tied to the everyday lives of ordinary people, and the volume of this information is much larger than the data volume handled by any other medium. We produce and consume web contents related to our daily lives while moving continuously through the real world, and the production and consumption of information related to specific locations is ever increasing. It is now very common to search for restaurants by typing 'the most favorite spaghetti restaurant near Kangnam subway station' into the keyword box of a search engine, which then checks user blogs and returns matching results. However, such location information is usually treated as simple text, and retrievals involving positional information are processed by text matching techniques only, resulting in incorrect search results.
Search engines often return a huge volume of irrelevant sites simply because those sites contain the same keywords as the given positional information. In this paper, we propose how to extend web contents so that their own location information is built into them. The proposed method prevents location information from being treated as simple text and ensures that only web contents with a strong relationship to the given location are retrieved. Here, web contents means the general extension of HTML/CSS contents [2]. The proposed system can collect location information from extended web contents and can search for and provide web contents related to a given query location. To support and utilize location-aware web contents, we propose how to extend web browsers and how to build web containers and web search engines. The remainder of this paper is organized as follows. The common problems of current search systems are described in Section 2. Section 3 shows the overall structure of the location-aware Web Service system and describes the detailed structures and behaviors of the proposed system. Section 4 describes some implementation details and experimental results. Finally, Section 5 concludes the paper.
2 Overview

In location-related web contents retrieval by keyword matching, the location of the content is included in the search keywords or input form, and the query is processed by simple keyword matching. Consider the query 'Kangnam station BBQ house.' The term 'Kangnam station' is positional information describing the location of the contents, and 'BBQ house' is an ordinary query term. In current search engines, both 'Kangnam station' and 'BBQ house' are handled as text, and documents including both query terms are included in the search result. The term 'Kangnam station' has a position-related meaning, but it is treated as a simple keyword without any special consideration of that meaning during search processing. In current web contents services, there is no difference between location-related retrieval and keyword-based retrieval.

The problems become more severe when multiple location-related keywords appear in keyword-based retrievals; in such cases, it is very difficult to eliminate unrelated documents. For example, documents about Subway Line 2 will contain the names of its 44 subway stations, and such documents are treated as having 44 location-related keywords even though they are not directly related to 44 specific locations. Therefore, for a query about one of the 44 stations of Subway Line 2, the search engines will return all documents containing information about the line, resulting in incorrect search results and a low degree of user satisfaction.

Local search services are services provided by content providers, which show contents related to a specific location on a map. Portal sites, such as Naver and Daum, provide such services by showing a map on one side of the screen while listing neighboring stores on the other side. The listed stores are the ones that have direct contracts with the portal sites. When users select a link, summary information containing the name, address, and telephone number of the selected store appears in a pop-up layer [3, 4].
The local search service of Naver shows the map of 'Kangnam station' on the left side of the screen and displays the store list, with telephone numbers and ratings, in alphabetical order with lettered balloon symbols on the right side of the screen. When users select a link, the review information is displayed. The search results consist of information provided by contents providers, and only short summary information is provided instead of full web contents. The contents of the search results depend on the contents providers, and, therefore, the information provided by local search services is often insufficient for users in terms of volume and quality. To relieve this problem, some providers also provide links to review information posted by users in blog services. But the main problem of this approach remains that it only provides information intentionally prepared by the contents providers, and it shows only quick summaries and reviews instead of full web pages.

Research on the Geospatial Web (or Geoweb), which combines location-based information and the Web, was started in 1994 by the U.S. Army Construction Engineering Research Laboratory [5]. This research can be divided into subareas such as geo-coding, geo-parsing, GCMS (Geospatial Contents Management Systems), and retrieval. Geo-coding is a technique that records location information within a document so that the document's location can be determined by other tools, as shown in Fig. 1. For geo-coding, the EXIF information of image files, the meta information of web sites, and meta tags on text or picture data can be used [6, 7].

GPS Latitude  : 57 deg 38' 56.83" N
GPS Longitude : 10 deg 24' 26.79" W
GPS Position  : 57 deg 38' 56.83" N, 10 deg 24' 26.79" W

Fig. 1. Example of EXIF information of a JPEG picture
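Geo-coding tools typically normalize such degree/minute/second (DMS) readings into signed decimal degrees before indexing, so that documents can be compared by coordinates. The following is a minimal sketch of this conversion in C, using the values from Fig. 1; the function name is ours, for illustration only:

#include <stdio.h>

/* Convert a degrees/minutes/seconds EXIF reading into signed decimal
 * degrees; 'ref' is 'N', 'S', 'E', or 'W' as stored in the GPS tags. */
static double dms_to_decimal(double deg, double min, double sec, char ref)
{
    double dec = deg + min / 60.0 + sec / 3600.0;
    return (ref == 'S' || ref == 'W') ? -dec : dec;
}

int main(void)
{
    /* The EXIF values from Fig. 1: 57 deg 38' 56.83" N, 10 deg 24' 26.79" W */
    double lat = dms_to_decimal(57, 38, 56.83, 'N');
    double lon = dms_to_decimal(10, 24, 26.79, 'W');
    printf("lat=%.6f lon=%.6f\n", lat, lon);  /* ~57.649119, ~-10.407442 */
    return 0;
}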
Geo-parsing is a technique that translates location information given as text into real spatial coordinates, thus enabling spatial query processing [8, 9]. For example, if a user inputs 'Hannam-dong Richensia North 50m', that location description is parsed into a real coordinate. Geospatial Contents Management Systems support location information by extending traditional CMSs. The Geospatial Web technologies support coordinate-based and region-based retrieval, and they provide location information for individual documents or domains. However, these technologies depend on specific documents or specific domains, and they focus on region-based retrieval.
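At its core, the place-name resolution step of geo-parsing is a lookup against a gazetteer of known names. The sketch below shows only this step in C; the table entries are illustrative placeholders rather than survey-grade data, and a real geo-parser would also handle tokenization and offsets such as 'North 50m':

#include <string.h>

/* Minimal gazetteer: place name -> WGS84 coordinate.
 * The entries below are illustrative placeholders only. */
struct place { const char *name; double lat; double lon; };

static const struct place gazetteer[] = {
    { "Kangnam station", 37.498, 127.028 },
    { "Hannam-dong",     37.534, 127.003 },
};

/* Return the coordinate entry for a place-name token, or NULL if the
 * token carries no known location (and should stay a plain keyword). */
static const struct place *geo_parse(const char *token)
{
    for (size_t i = 0; i < sizeof gazetteer / sizeof gazetteer[0]; i++)
        if (strcmp(gazetteer[i].name, token) == 0)
            return &gazetteer[i];
    return NULL;
}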
3 Structure of Location-Aware Web Service Systems

The overall structure of the system, which extends general web contents with location information and serves them accordingly, is shown in Figure 2. The system consists of the Client, the Web Container, and the Search Engine. The Web Container module manages and transfers web contents with location information in order to provide location-aware web contents to users. It supports both static pages and dynamic web applications. It can actively notify the Search Engine of content updates made by web applications using a pingback mechanism. The correctness of the information returned by the Search Engine is improved by this mechanism.
Fig. 2. The location-aware Web Service system structure
The Search Engine module collects information from the Web Container and allows users to retrieve location-aware web contents. It recognizes changes in dynamic web contents by using the Pingback Retriever, and it periodically collects static documents by invoking Web Crawlers using the Timer. The information is updated by the Information Updater. Location-based queries are delivered through the Query Handler, and the Information Finder finally processes these queries.

The Client is the module through which users use location-aware web contents, and it provides tools to support easy retrieval. It is an extension of a common web browser and includes modules such as the Map Generator, the Location Information Parser, the Query Generator, and the Location Detector. It can show the location of contents intuitively to users by using the Location Information Parser and the Map Generator. The Query Generator allows users more convenient location-aware retrieval. The Location Detector is a module that captures location information by utilizing external hardware or Web Services. This module can directly obtain location information by using GPS or devices supporting a Global Navigation Satellite System, like Beidou [10]. It can also indirectly obtain location information by using Web Services supporting the Geolocation API specification [13], such as Mozilla Firefox Geode [11] and Yahoo FireEagle [12]. Currently, the utility of the Location Detector is limited because general PCs do not provide location information. However, it will be possible to provide more exact location information when IP-based and GPS information are more commonly available.

3.1 Web Contents Extension to Support Location-Aware Information

In the specification of traditional web contents standards, such as HTML and XHTML [14], markups to specify location information are not included, and
namespaces for such extensions are not predetermined either. Therefore, the format of web contents needs to be extended to deal with location-aware information. One method is to add new markups to the HTML/XHTML format; another is to include links to external documents. In previous cases, the formats of web contents were usually extended by linking external documents using the <link> tag [15]. However, the standards committees have recently taken a more conservative attitude and are eliminating indiscriminately added markup tags, such as <marquee> and <embed>, in order to manage namespaces cleanly [16]. A method that adds new markup tags will therefore face difficulties in maintaining future web contents. In this paper, we instead use the <link> markup tag to reference location information in GeoRSS [17] format, as shown in Fig. 3.

<georss:where>
  <gml:Point>
    <gml:pos>45.256 -71.92</gml:pos>
  </gml:Point>
</georss:where>

Fig. 3. Location information format
The <where> markup tag, in the 'georss' namespace, represents the object containing the location information. The markup tags located within the <where> tag describe the location using geospatial languages such as GML (Geography Markup Language) [18]. Documents holding location information are referenced within HTML/XHTML documents using the <link> markup tag, and the file type of such documents is 'application/geoweb'; documents of this type carry data in GeoRSS format. We represent coordinates using the WGS84 format [19], one of the coordinate formats supported by GeoRSS. The WGS84 format is consistent with the GPS format and can thus be used effectively to support interoperability with mobile devices. It can also be easily transformed into the KATECH TM128 coordinate system developed by the National Geographic Institute.

3.2 Web Client

The Web Client provides facilities for web contents retrieval and visually shows documents stored in the Web Container together with their location information. Figure 4 shows the interaction between the Web Client and the Web Container. The Web Client module, an extension of a web browser, consists of the HTML Renderer, the Location Information Parser, and the Map Generator, as shown in Fig. 4. When a user provides the URI of contents (step 1 in Fig. 4), the HTML Renderer requests and receives the required contents from the Web Container (2, 3). The Web Client then obtains the location information (5, 6) and generates the appropriate map (7, 8, 9). Steps 5 to 9 are repeated as long as there are further external links in the web contents.
[Fig. 4. Basic operations of the Web Client — a sequence diagram between User, HTML Renderer, Location Information Parser, Map Generator, and Web Container: (1) type URI; (2, 3) request and return the web page; (4) get map info; loop while external links exist: (5, 6) request and return the location, (7, 8, 9) request, generate, and return a map; (10, 11) render and return all contents]
[Fig. 5. Information collection by the Search Engine — a sequence diagram between Web Container, Pingback Retriever, Web Crawler, and Information Updater: (1) notify update; (2) crawl it; (3, 4) request and return the web contents and location info; (5) update; timer loop: (6, 7) request and return the changed set; (8) update]
3.3 Search Engine

When some information is updated by web applications, the Web Container notifies the Search Engine of that fact (step 1 in Fig. 5). The Pingback Retriever then requests the Web Crawler to check the updated web contents. The Web Crawler visits the Web Container to fetch the updated information (2, 3, 4) and updates the information in the database (5). Driven by the timer event, the Web Crawler also continuously visits the Web Container, checks for newly updated information, and downloads such updates (6, 7, 8).
The Search Engine can handle keyword-based queries, coordinate and keyword-based queries, location and keyword-based queries, and keyword plus keyword-based queries.
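For the coordinate and keyword-based case, the location part of the query can be evaluated as a real distance test rather than as text matching. The following is a hedged sketch in C of such a filter, using the haversine great-circle distance; the function names and the radius threshold are ours, not from the paper:

#include <math.h>
#include <stdbool.h>

#define PI 3.14159265358979323846
#define EARTH_RADIUS_M 6371000.0

static double deg2rad(double d) { return d * PI / 180.0; }

/* Great-circle distance in metres between two WGS84 points (haversine). */
static double distance_m(double lat1, double lon1, double lat2, double lon2)
{
    double dlat = deg2rad(lat2 - lat1), dlon = deg2rad(lon2 - lon1);
    double a = sin(dlat / 2) * sin(dlat / 2) +
               cos(deg2rad(lat1)) * cos(deg2rad(lat2)) *
               sin(dlon / 2) * sin(dlon / 2);
    return 2.0 * EARTH_RADIUS_M * atan2(sqrt(a), sqrt(1.0 - a));
}

/* Keep a document only if it matches the keyword AND lies within
 * 'radius_m' of the query location. */
static bool location_filter(double doc_lat, double doc_lon,
                            double q_lat, double q_lon,
                            double radius_m, bool keyword_hit)
{
    return keyword_hit &&
           distance_m(doc_lat, doc_lon, q_lat, q_lon) <= radius_m;
}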
4 Implementation and Experiments

Both the Web Container and the Search Engine were implemented on the same computer, using an AMD Athlon 64 Dual Core 4000+ with 1 GB memory, Ubuntu 4.1.2-16 with Linux kernel 2.6.22-14-server, Apache 2, and MySQL 5.0.45. The Web Client was developed on a computer with a Pentium 4, 1 GB memory, Windows XP, and Firefox 3.0.8. The proposed functions were implemented using NPAPI [20] and the Firefox extension mechanism [21]. We compared the search results of location and keyword-based queries issued to general search engines with those of the proposed Search Engine. The query asks for information related to 'Subway Line 2.' We first obtained results using general search engines and then obtained improved results using the location-aware web contents search engine.

Table 1. Filtering rate and error rate

                 N search engine   D search engine   Y search engine
Filtering rate   66.67%            47.22%            53.33%
Error rate       3.33%             10.56%            15.56%
As shown in Table 1, we can eliminate 66.67% of the unnecessary search results for the N search engine. However, there still remain wrong answers (3.33%) that cannot be eliminated by location information, because they contain the given keywords without any meaningful relation to the location. The most common meaningless results were caused by keywords contained in the titles of bulletin boards included in the web pages.
5 Conclusion

In this paper, we proposed a location-aware Web Service system, which adds location information to traditional web contents. We explained the overall structure of the location-aware Web Service system, which consists of the Web Client, the Web Container, and the Search Engine, and we described the detailed structures and behaviors of the proposed system. The Web Container module manages and transfers web contents with location information in order to provide location-aware web contents to users. The Search Engine module collects information from the Web Container and allows users to retrieve location-aware web contents. The Web Client provides facilities for web contents retrieval and visually presents documents stored in the Web Container together with their location information. The proposed methods can be implemented on top of traditional Web Service layers. Web contents that include location information can use it as a parameter during the search process and can therefore increase search correctness by using actual location information instead of simple keywords. To show the usefulness of our schemes, some experimental results were briefly provided.
Acknowledgments. This research was supported by the Ministry of Knowledge Economy, Korea, under the Information Technology Research Center support program supervised by the Institute of Information Technology Advancement (grant number IITA-2008-C1090-0801-0031). This work was supported by the Korea Science and Engineering Foundation (KOSEF) grant number R01-2007-000-20958-0 funded by the Korea government (MOST). This research was also supported by Korea SW Industry Promotion Agency (KIPA) under the program of Software Engineering Technologies Development and Experts Education.
References

1. Millard, D., Ross, M.: Web 2.0: Hypertext by Any Other Name? In: Proc. ACM Conference on Hypertext and Hypermedia, pp. 22–25. ACM Press, New York (2006)
2. HTML Spec., http://www.w3.org/TR/REC-html40/
3. Daum local information, http://local.daum.net/
4. Naver local information, http://local.naver.com/
5. An Architecture for Cyberspace: Spatialization of the Internet, http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.37.4604
6. geo-extension-nonWGS84, http://microformats.org/wiki/geo-extension-strawman
7. Geographic registration of HTML documents, http://tools.ietf.org/id/draft-daviel-html-geo-tag-08.txt
8. NGA GEOnet Names Server, http://earth-info.nga.mil/gns/html/
9. U.S. Board on Geographic Names, http://geonames.usgs.gov/domestic/index.html
10. Beidou, http://www.globalsecurity.org/space/world/china/beidou.htm
11. Mozilla Firefox Geode, http://labs.mozilla.com/2008/10/introducing-geode/
12. Yahoo FireEagle, http://fireeagle.yahoo.net
13. Geolocation API Spec., http://dev.w3.org/geo/api/spec-source.html
14. XHTML Spec., http://www.w3.org/TR/xhtml11/
15. Important change to the LINK tag, http://diveintomark.org/archives/2002/06/02/important_change_to_the_link_tag
16. How to upgrade markup code in specific cases: <embed>, <marquee>, https://developer.mozilla.org/en/Using_Web_Standards_in_your_Web_Pages/
17. GeoRSS (Geographically encoded objects for RSS), http://www.georss.org/
18. GML Specification, http://portal.opengeospatial.org/modules/admin/
19. WGS 84 Implementation Manual, EUROCONTROL and IfEN (1998)
20. NPAPI, https://developer.mozilla.org/en/Gecko_Plugin_API_Reference
21. Firefox Extensions, https://developer.mozilla.org/En/Extensions
The GENESYS Architecture: A Conceptual Model for Component-Based Distributed Real-Time Systems

Roman Obermaisser and Bernhard Huber
Vienna University of Technology, Austria
Abstract. This paper proposes a conceptual model and terminology for component-based development of distributed real-time systems. Components are built on top of a platform, which offers core platform services as the basis for the implementation and integration of components. The core platform services enable emergence of global application services of the overall system out of local application services of the constituting components. Therefore, the core platform services provide elementary capabilities for the interaction of components, such as message-based communication between components or a global time base. Also, the core services are the instrument via which a component creates behavior that is externally visible at the component interface. In addition, the specification of a component's interface builds upon the concepts and operations of the core platform services. The component interface specification constrains the use of these operations and assigns contextual information (e.g., semantics in relation to the component environment) and significant properties (e.g., reliability requirements, energy constraints). Hence, the core platform services are a key aspect in the interaction between integrator and component developer.
1 Introduction

It is beyond doubt that complex embedded real-time systems can only be reasonably built by following a component-oriented approach. The overall system is divided into components that can be independently developed and serve as building blocks for the ensuing computer system. A platform is required both as a baseline for the component developers and for the integrator to combine the independently developed components into the overall system. For the latter purpose, the platform needs mechanisms for the interaction between components. However, these elementary interaction mechanisms not only serve for the coupling between components. Beyond that, they determine how component behavior comes into existence. The use of these mechanisms is the instrument via which a component generates externally visible behavior. From the point of view of a component's interface to the platform, the use of these elementary interaction mechanisms (along with their associated meaning in the application context) is what constitutes a component.

In this paper, we define the notion of a platform and its elementary interaction mechanisms, which we denote core platform services.¹ We argue that the core platform services are the result of emergence from local implementations of platform functionality at the components in conjunction with shared platform functionality (e.g., network switches, network-on-a-chip). In turn, the core platform services permit the interaction between application services and enable another form of emergence: the emergence of application services out of more elementary local application services of components.

The presented conceptual model is a result of discussions on embedded system architectures within the GENESYS project and the ARTEMIS Strategic Research Agenda. Experts from five domains of embedded systems (i.e., automotive, avionics, industrial control, mobile systems, consumer electronics) were involved in devising the underlying concepts. The model serves as the conceptual basis of the cross-domain architecture developed within the GENESYS project.

The contribution of this paper is the analysis and conceptualization of the relationship between embedded platforms and applications. In this regard, we go beyond related work on specific component-based frameworks (e.g., CORBA [17] and AUTOSAR [19], to name two widely-used frameworks). We analyze platform services and their implications on the development of application services, ranging from platform capabilities, to instruments for behavior generation, to concepts for specifying behavior.

¹ This work has been supported in part by the European research project INDEXYS under Grant Agreement ARTEMIS-2008-1-100021.
2 Basic Concepts

This section introduces fundamental concepts that will be used in the paper.

System. We use the definition of a system introduced in [2]: an entity that is capable of interacting with its environment and is sensitive to the progression of time. The environment is in principle another system. The environment takes advantage of the existence of a system by producing input for the system and acting on the output of the system. A system combines physical and logical aspects. As a consequence, behavior can be associated with a system taking into account both the value and the time domain. This definition excludes, for example, a software module without associated hardware. In general, systems are hierarchic and can in turn be recursively decomposed into sets of interacting constituting systems. The constituting elements of a system are denoted as components.

Time. The temporal awareness of systems requires a model of time. We assume a model based on Newtonian time, where the continuum of real time can be modeled by a directed timeline consisting of an infinite set of instants [20]. In a distributed computer system, the progression of time is captured with a set of physical clocks. Since any two physical clocks will employ slightly different oscillators, clock synchronization is required to bring the times of the clocks into close relation with each other. A measure for the quality of clock synchronization is the precision, which is defined as the maximum offset between any two clocks during an interval of interest.

Service. A service is what a system delivers to its environment according to the specification. Through its service, a system can support the environment, i.e., other systems that use the service. The specification for a system defines the service. Given a concrete paradigm of interaction between systems, the notion of a service can be refined. For example, in the context of message-based interaction, the service of a system can be defined as the sequence of intended messages that is produced by a system in response to the progression of time, input, and state [2, page 28]. An overview of formalisms for the definition of services in different interaction paradigms (e.g., Statecharts, the Specification and Description Language (SDL)) can be found in [4].

Behavior. In the presence of faults (e.g., a design fault in the implementation), a system can violate its specification. In this case, the system exhibits a failure [1] instead of its specified service. We use the term externally visible behavior (or behavior for short) as a generalization of the notions of service and failure: behavior = service ∪ failure. The correct behavior (as defined by the specification) is the system's service. The faulty behavior (violation of the specification) is a failure. In the absence of a specification, we can only reason about the behavior of systems.

[Fig. 1. Architectural Model — components, each offering local application services at its LIF, built on a platform providing core platform services; together they deliver the global application services of the overall system]
3 Architectural Model

The overall system consists of two types of constituting systems: a set of components and an underlying platform (cf. Figure 1). Based on this system structure, we can distinguish different types of services: global application services of the overall system, local application services of components, and platform services.

3.1 Users of the Architectural Model

When developing a system that follows this architectural model, we can differentiate two roles: component developers and the integrator. The component is the unit of delegation and the unit of integration. Using the platform, the integrator is responsible for binding the components together into an overall system with global application services. The platform offers the means to integrate the components based on the specification of the components' local application services. The component developers are concerned with the design and implementation of individual components. A component developer can delegate subtasks for the realization of a component to suppliers/subcontractors. Nevertheless, the component developer delivers an entire component with a local application service to the integrator. The integrator need not be aware of the inner structure of the component or the involvement of suppliers/subcontractors.
3.2 Components and Application Services

A component is a self-contained building block of the computer system. The borderline between a component and the platform is called the component's Linking Interface (LIF) [9]. At the LIF, the component provides its local application services to the other components. A local application service is the intended behavior of a component at the LIF. The component exchanges information with other components at the LIF, and the specification of the local application service must cover all aspects that are relevant for the integration of the component with other components:

(1) Values. The syntax of the information exchanged at the LIF needs to be defined. In addition, relationships between inputs and outputs are specified.

(2) Timing. In a real-time system, the specification of the LIF encompasses temporal constraints, e.g., for consuming inputs or producing outputs.

(3) Relationship to the (natural) environment of the computer system. For each component that interacts with the environment of the computer system, the LIF specification must capture the (semantic) relationship between the information exchange at the LIF and the interaction with the environment. While abstracting from the details of the component's local interfaces (e.g., I/O interfaces or fieldbuses), the semantics of the LIF interaction in relation to the component environment need to be described. Due to the inability to fully formalize the relationship to the environment [10], natural language or ontologies are examples of suitable specification methods. For example, information provided at the LIF to other components can resemble entities in the natural environment with a given delay and originate from a sensor. Likewise, information consumed at the LIF can cause an effect on the natural environment via an actuator. In addition to the value domain, this relationship must be specified in the temporal domain. For example, the lag between sensory information at the LIF and the state of the environment is of concern (cf. temporal accuracy [8]).

In addition, many other aspects can be relevant for the specification of the LIF, e.g., reliability, energy, and security.

3.3 Platform and Platform Services

The platform is the foundation for the development and integration of components. It consists of platform services, which we denote as core platform services (in order to discriminate them from platform-like services within components that we will call optional platform services). More precisely, a platform is essential for two reasons:

(1) Baseline for development of components. Using the platform services, the component developer establishes the local application services of a component. Component developers need a starting point for realizing components. The platform offers a foundation on top of which application-specific functionality can be established. This foundation consists of generic services, which are required to be useful in many specific components. Although most of these services could also be realized within the components, their availability in the platform simplifies component development. As an example, consider a sensor component that periodically samples the lateral acceleration in a car and produces a message on the LIF with this measurement. The local application service of this component would be 'acceleration measurement'.
An example of a platform service that can be used to construct this application service would be a time service. Such a time service can provide the periodic sampling points (e.g., with respect to a global time base). Using platform services, recurring problems are solved once and for all in the platform, without the need to redevelop them in every component. In principle, the development of components becomes easier the more functionality is offered by the platform. However, overloading the platform with a plethora of functionality is likely to lead to high overhead, because part of the functionality will be too specific to be applicable except in a few very specialized components. Furthermore, the complexity of the platform will increase, and with it the likelihood of design faults in the platform. Such a design fault in the platform is of particular severity: while a design fault of a component affects only this specific component, potentially all components can be affected by a design fault in the platform, since all components build upon the platform and depend on its correctness. The complexity of the platform and its susceptibility to design faults are of particular importance for safety-critical applications that need to be certified.

(2) Framework for integration of components. Besides serving as the baseline for the establishment of local application services of components, the platform services are an instrument for emergence. The platform services enable the emergence of global application services of the system from the local application services of the components. Therefore, the platform offers mechanisms to compose the overall system out of the independently developed components. These mechanisms include communication services enabling the exchange of information between components. In addition, other services can serve as a useful basis for integration, e.g., fault isolation services [12] that prevent component failures from propagating between components, or clock synchronization services [13] to establish a common notion of time. Following up on the previous example, the introduced component can be integrated with other in-vehicle components, thereby composing the acceleration measurement service with local application services of other components. The result is the emergence of a global application service (e.g., passive safety in the example with the lateral acceleration measurement) out of local application services (e.g., brakes, steering, and suspension in addition to the mentioned acceleration measurement service).
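To make this division of labor concrete, the following C sketch shows how the acceleration-sensor component from the example might be written against a time service and a periodic messaging service. The API names (platform_global_time, platform_wait_until, platform_send_periodic) and the constants are invented for this illustration and are not part of GENESYS:

#include <stdint.h>
#include <stddef.h>

/* Hypothetical core-service API -- names invented for this sketch. */
uint64_t platform_global_time(void);              /* global time in ticks */
void     platform_wait_until(uint64_t instant);   /* time service         */
void     platform_send_periodic(int port, const void *msg, size_t len);
int16_t  read_lateral_acceleration(void);         /* local I/O interface  */

#define PERIOD_TICKS 10000  /* sampling period, e.g. 10 ms */
#define ACCEL_PORT   3      /* LIF port carrying the state message */

void acceleration_component(void)
{
    uint64_t next = platform_global_time() + PERIOD_TICKS;
    for (;;) {
        platform_wait_until(next);          /* periodic sampling point */
        int16_t a = read_lateral_acceleration();
        /* update-in-place of the state message at the LIF */
        platform_send_periodic(ACCEL_PORT, &a, sizeof a);
        next += PERIOD_TICKS;
    }
}

The component contains no synchronization or scheduling logic of its own: the sampling instants and the message transport are entirely provided by the (hypothetical) core services.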
4 Core Platform Services

This section characterizes the constituting elements of the platform: the core platform services. The introduced concept is exemplified using the GENESYS architecture.

4.1 Three Roles of Platform Services: Platform Capabilities, Instrument for Behavior Generation, and Concepts for Service Specification

A core platform service is an elementary building block of the platform. Inversely, the set of core platform services defines the platform. Each core platform service is a capability of the platform that is offered to the components. An example of a core platform service is 'message multicasting' that enables
components to interact by the exchange of messages. In its simplest form, this service would consume messages from one port and transport these messages to a set of destination ports with defined properties (e.g., latency, reliability). This core platform service would enable a component to deposit messages at a port in order to be delivered via multicast communication to ports belonging to other components.

Secondly, the core services are the instrument by which a component generates behavior at the LIF. The use of the core services results in activities that can be perceived at the LIF from outside the component. The core services offer elementary operations, a sequence of which forms behavior at run-time. For example, in the case of the platform service 'message multicasting', the instrument would be the ability to send messages. The instrument for behavior generation is a part of the capability. The capability is, however, more than empowering the component to generate behavior: it is also concerned with the reaction to the component behavior in the platform. In particular, the capability links behavior at different components (e.g., the output of one component becomes the input of another component).

In addition, the core services provide the underlying concepts for the specification of component services. This statement about behavior relates to behavior on a meta-level. Given a set of core services, different service specification languages are possible that use these concepts.

4.2 Core Platform Services of the GENESYS Architecture

To give examples of core platform services, we give an overview of the services of the GENESYS architecture [16]:

(1) Periodic and sporadic messaging. The GENESYS architecture uses the communication paradigm of messaging. A message is an atomic structure that is formed for the purpose of inter-component communication. A message encompasses data and control information, where the control information addresses timing, acknowledgement, and addressing. In addition, the syntax of the data is defined, and the message is ascribed meaning in the application context. Messaging has been chosen as a core platform service of GENESYS because it provides an ideal abstraction level for embedded real-time systems. In particular, messages offer inherent consistency and synchronization, and the timing is explicit compared to shared-memory abstractions. Nevertheless, messaging is a universal model, and other communication paradigms can be realized on top of messages (e.g., a virtual shared memory [11]). Depending on the message timing, the core service distinguishes periodic and sporadic messages. The timing of periodic messages is defined by a period and phase. Sporadic messages possess a minimum message interarrival time. In addition to the concept of messaging (as explained above), the platform service is associated with a platform capability and an instrument for behavior generation. The capability associated with this platform service is the transport of messages from a specific sender-component to a set of receiver-components (i.e., multicast). The capability involves non-functional properties, such as reliability (e.g., the probability of a message omission failure) and temporal properties (e.g., end-to-end latency, bandwidth). The instrument for behavior generation is the ability to transmit a message, i.e., the production of a message by the sender-component. In case of periodic
communication, the sender-component performs an update-in-place of a state message. For sporadic communication, the sender-component places an event message into an outgoing message queue.

(2) Global time. The concept of a global time is a counter value that is globally synchronized across components. If the counter value is read by a component at a particular point in time, it is guaranteed that this value can differ by at most one tick from the value read at another component at the same point in time. The concept of a global time base enables service specifications with temporal constraints w.r.t. global time (e.g., adherence to the action lattice of a sparse time base [7], specification of the phase of periodic messages, etc.). In addition, activities at different components can be temporally coordinated, e.g., avoiding collisions by design or creating phase-aligned transactions [14]. The capability of the platform is the execution of a clock synchronization algorithm. For example, in the case of a distributed clock synchronization algorithm, local views regarding the global time are exchanged. Thereafter, a convergence function is applied and the local clocks at the components are adjusted (e.g., using rate correction). The instrument for behavior generation comprises the use of the global time in activities at the LIF, such as the transmission or reception of a message at specific instants of the global time base.

(3) Network diagnosis and management. The membership vector contains globally consistent information about the operational state of every component (e.g., correct or faulty) within a given membership delay. The platform capability involved in this core platform service is the construction of a membership vector. Firstly, the platform performs error detection for individual components. For example, the platform can determine whether the temporal properties of the messages are satisfied. Secondly, the platform needs to perform agreement on a globally consistent view of the operational state of the components. The membership vector offers an instrument to generate behavior by taking different actions depending on the operational state of components. For example, in a brake-by-wire car with four components associated with the four wheels, components can react differently to messages with braking actuation values when they possess knowledge about the failure of the component at another wheel.

(4) Reconfiguration. Reconfiguration is concerned with adapting the platform depending on application contexts, the occurrence of faults (e.g., permanent failure of a component), and the availability of resources (e.g., low battery power). To accomplish these goals, the other platform services are altered within given limits, e.g., by introducing a new message or altering the timing in the first core service. The capability of the platform involves the processing of triggers for reconfiguration, such as accepting requests for reconfiguration or monitoring resource availability. Secondly, the reconfiguration needs to be executed, e.g., by creating a new communication schedule for periodic messaging. The instruments for behavior generation include the operations to request reconfiguration. For example, a component can request the ability to send an additional periodic message.

4.3 Realization of Platform Services

Two structural elements can be distinguished in the realization of the platform: component-specific and shared constituents of the platform (cf. Figure 2).
[Fig. 2. Deployment of Core Platform Services: a computer system consists of networked nodes (nodes for short) that are conjoined through a shared platform constituent. Each node contains a component with its LIF on top of a component-specific platform constituent (realized node-internally), which connects via a technology-dependent interface to the shared platform constituent (realized node-externally)]
Firstly, the platform comprises a shared part that is not specific to any component in particular. This shared platform constituent includes the communication medium for the communication between the components. The communication medium can be a bus (e.g., CAN), a routed network (e.g., an IP network with routers), a star coupler (e.g., a FlexRay star coupler), or even air or vacuum in the case of a wireless connection. In addition to the shared part of the platform, each component is associated with a component-specific platform constituent. Core platform services can be realized using distributed implementations. For example, a common time service usually depends on a distributed clock synchronization capability. The common time service provides each component with access to a local clock, which is repeatedly adjusted to ensure a bounded precision in relation to all other clocks of the system. In addition to distributed implementations of core platform services, the component-specific platform constituents serve a second purpose: they map the technology-independent LIF to a technology-dependent interface. The platform maps the technology-independent abstractions provided by the core platform services (e.g., the messaging concept) to the physical world (e.g., a specific communication protocol and physical layer). The benefit of a technology-independent LIF is the abstraction over different implementation technologies, which are not fixed when the LIFs are specified. Thus, trade-offs between different implementation technologies can be performed later in the development process. For example, a general-purpose CPU implementation offers more flexibility for modifications and often involves less implementation effort. An FPGA, on the other hand, can have superior non-functional properties (e.g., energy efficiency, silicon area). In addition, as a system evolves, the component implementation for a given LIF can be changed without requiring changes at other components. A component together with its component-specific platform constituent typically forms a self-contained physical unit (called a networked node). The integrator will provide to component developers a component-specific platform constituent as a hardware/software container for the incorporation of a particular component. For this purpose, the integrator can use platform suppliers (similar to Integrated Modular Avionics (IMA) platform suppliers in the avionics domain [3]), which provide
implementations of the core services. The networked node is the result of combining the supplier's component with the component-specific platform constituent. During integration, the networked node (e.g., an ECU in the automotive domain) is combined with the shared platform constituent. For example, the networked node is connected to an Ethernet network in the case of Ethernet as a communication medium. Figure 2 depicts the layout of the platform and its structural elements graphically.

In order to illustrate the introduced concepts, a few examples are outlined in the following. Let us first consider wireless sensor devices as networked nodes. The component-specific platform constituent contains a wireless transceiver that maps the message-based abstraction to radio transmissions. The wireless sensor device will also contain a host (e.g., a general-purpose CPU or an FPGA) as the component providing the application services. The shared platform constituent can simply be the communication medium (e.g., air) or can comprise wireless base stations. A second example of a networked node is an Electronic Control Unit (ECU) in a car. The shared platform constituent can be an automotive communication system (e.g., a Controller Area Network (CAN) [6] or a FlexRay star coupler [18]). The component-specific platform constituent is a communication controller (e.g., a CAN controller or a FlexRay controller) for the respective communication network. Another example is an IP core on a Multi-Processor System-on-a-Chip (MPSoC). Here, the shared platform constituent is the interconnect for the communication between the IP cores, e.g., a network-on-a-chip. The component-specific platform constituent provides the access to the on-chip interconnect (e.g., the Trusted Interface Subsystem (TISS) [15], the Memory Flow Controller [5]).

4.4 Emergence of Core Platform Services

The core platform services are services that emerge from the services provided by the shared and component-specific platform constituents. Consider, for example, the global time service. Each component-specific platform constituent comprises a local clock synchronization capability, which is responsible for repeatedly computing correction terms to be applied via state or rate correction to the local clock. For this purpose, the difference of the local clock to other clocks is determined via synchronization messages or the implicitly known reception times of messages. Through the interplay of the local clock synchronization capabilities, a global notion of time emerges. As discussed in Section 3.3, this emerging service can be used to specify the LIF and as a basis for the realization of application services. Likewise, the periodic message transport service is an emerging service: it is more than the sum of the capabilities of the component-specific platform constituents to perform periodic transmissions and receptions of messages.
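As an illustration of the local clock synchronization capability, the sketch below shows one classical convergence function in C, the fault-tolerant midpoint. This is a textbook choice for tolerating up to k faulty clocks, not necessarily the algorithm used in GENESYS, and the function names are ours:

#include <stdlib.h>

static int cmp_long(const void *a, const void *b)
{
    long x = *(const long *)a, y = *(const long *)b;
    return (x > y) - (x < y);
}

/* Fault-tolerant midpoint convergence function. 'offsets' holds the
 * measured offsets of the local clock to the other clocks. Assuming
 * n > 2k, the k largest and k smallest readings are discarded to mask
 * up to k faulty clocks, and the midpoint of the remaining extremes is
 * returned as the correction term (applied via state or rate correction). */
long ft_midpoint_correction(long *offsets, int n, int k)
{
    qsort(offsets, (size_t)n, sizeof offsets[0], cmp_long);
    return (offsets[k] + offsets[n - 1 - k]) / 2;
}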
5 Component Model

From the point of view of the LIF, a component provides application services expressed w.r.t. the core platform services. The integration of a system out of components is facilitated by the existence of the core platform services, which introduce a uniform instrument across all components for generating component behavior.
However, component developers can favor different sets of platform services to express application behavior. Firstly, legacy applications have been developed for different platforms with different sets of platform services. In order to avoid a complete redevelopment of these legacy applications, a mapping of the legacy platform services to the uniform core platform services is desirable. In addition, different domains can have unique requirements regarding the capabilities of the platform. For example, a safety-related control subsystem can build upon platform services for active redundancy. A multimedia system, on the other hand, might have to cope with a large number of different configurations and usage scenarios. Hence, a multimedia system may need dynamic reconfiguration capabilities that go beyond the support of the core platform services. We introduce a layered component model in order to resolve this discrepancy between uniform core services and the need for application-specific platform services. For providing application services, the component can employ an intermediate form of the application services: application services expressed w.r.t. optional platform services.

5.1 Optional Platform Services

On top of the core services, the optional platform services establish higher-level capabilities for certain domains (e.g., control systems, multimedia). Moreover, the optional services within a component establish an instrument for behavior generation just as the core services do. In contrast to the core platform services, the optional platform services provide additional constructs that are not always needed or useful in all types of components. The optional services reflect the heterogeneity of a system by no longer enforcing uniformity of platform services throughout the system. A particular optional platform service can prove to be useful in one component, whereas the deployment of this service in another component might impede the component development. Consider, for example, an optional service for dynamic reconfiguration, which would complicate the certification of a safety-related component. The instrument for behavior generation of the optional platform services must have a defined mapping to the underlying instrument for behavior generation of the core platform services. Hence, the optional platform services transform the application services towards the core platform services.

5.2 Equivalence of Optional Services and Application Services

In the following, we point out that optional platform services are application services that have been internalized in a component. An application service exploits another application service by using the core platform services. In contrast, an application service exploits the optional platform services directly, i.e., without using the core platform. However, from a logical point of view it is equivalent whether a particular capability is introduced into a system via an application service or via an optional platform service. Figure 3 depicts that a certain capability α can be provided by an optional platform service or as an application service in a supporting component.
[Fig. 3. Optional Platform Service vs. Supporting Component — a capability can be implemented either as a supporting component (an application service on top of the core services) or as an optional platform service inside the component, layered between the application and the core services]
Preexisting Standard Components. A preexisting standard component is a component that provides an application service that can be used for the construction of other application components. The integrator defines which preexisting standard components exist in a system and provides their LIF specifications to the suppliers. Preexisting standard components are visible to the integrator, and their services are expressed w.r.t. the core platform services. An example of a preexisting standard component would be a voting component for a Triple Modular Redundancy (TMR) configuration. This component accepts three redundant messages and outputs a single message based on a majority decision (see the sketch below). Another example of a preexisting standard component would be an encryption component. Such a component accepts 'plaintext' messages and outputs encrypted messages by applying a suitable ciphering method (e.g., DES). Such a component would facilitate the interaction between a subsystem with trusted components and a subsystem with untrusted components (e.g., a public network or the Internet).

Optional Platform Services. The component developer is responsible for the optional platform services. For the integrator, who combines components based on their LIFs, the optional platform services are invisible. The integrator cannot discern a component that implements the LIF behavior directly from a component that employs optional platform services.
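A hedged sketch of the core of such a TMR voting component in C, assuming exact (bit-wise) agreement between correct replicas; the framing and error handling are ours:

#include <string.h>
#include <stdbool.h>

/* Majority vote over three redundant messages of equal length 'len'.
 * Writes the majority value to 'out' and returns true; returns false
 * if all three replicas disagree, in which case the component should
 * signal a failure at its LIF instead of producing an output message. */
bool tmr_vote(const void *m1, const void *m2, const void *m3,
              void *out, size_t len)
{
    if (memcmp(m1, m2, len) == 0 || memcmp(m1, m3, len) == 0) {
        memcpy(out, m1, len);   /* m1 agrees with at least one replica */
        return true;
    }
    if (memcmp(m2, m3, len) == 0) {
        memcpy(out, m2, len);   /* m2 and m3 outvote m1 */
        return true;
    }
    return false;               /* three-way disagreement: no majority */
}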
6 Conclusion

The contribution of the paper is a conceptual model for component-based distributed real-time systems. In particular, the notion of a platform in relation to components has been analyzed. The platform enables the emergence of global application services out of local application services of components. Starting from a decomposition of a platform into core platform services, we have elaborated on the implications for the design and implementation of components. The core platform services represent the concepts for the specification of interfaces, the instrument for the generation of component behavior, and the capabilities for the interaction of components. In addition, different organizational roles in a component-based development were analyzed: the integrator, component suppliers, and platform suppliers.
References

1. Avizienis, A., Laprie, J., Randell, B.: Fundamental concepts of dependability. In: Proc. of ISW 2000, 34th Information Survivability Workshop, pp. 7–12. IEEE, Los Alamitos (2000)
2. Jones, C., et al.: DSoS conceptual model. DSoS Project, IST-1999-11585 (December 2002)
3. Conmy, P., McDermid, J., Nicholson, M., Purwantoro, Y.: Safety analysis and certification of open distributed systems. In: Proc. of the Int. System Safety Conference, Denver (2002)
4. Davis, A.M.: A comparison of techniques for the specification of external system behavior. Communications of the ACM 31 (1988)
5. IBM, Sony, and Toshiba: Cell Broadband Engine architecture. Technical report (2006)
6. Int. Standardization Organisation, ISO 11898: Road vehicles – Interchange of digital information – Controller Area Network (CAN) for high-speed communication (1993)
7. Kopetz, H.: Sparse time versus dense time in distributed real-time systems. In: Proc. of the 12th Int. Conference on Distributed Computing Systems, Japan (June 1992)
8. Kopetz, H.: Real-Time Systems: Design Principles for Distributed Embedded Applications. Kluwer Academic Publishers, Boston (1997)
9. Kopetz, H., Suri, N.: Compositional design of RT systems: a conceptual basis for specification of linking interfaces. In: Proc. of the Sixth IEEE Int. Symposium on Object-Oriented Real-Time Distributed Computing, pp. 51–60. IEEE, Los Alamitos (2003)
10. Kopetz, H., Suri, N.: On the limits of the precise specification of component interfaces. In: Proc. of the Ninth IEEE Int. Workshop on Object-Oriented Real-Time Dependable Systems (WORDS 2003), pp. 26–27 (2003)
11. Kühn, E.: Virtual Shared Memory for Distributed Architectures. Nova Science Pub. Inc. (2002)
12. Lala, J.H., Harper, R.E.: Architectural principles for safety-critical real-time applications. Proc. of the IEEE 82(1), 25–40 (1994)
13. Lamport, L.: Using time instead of timeouts for fault-tolerant distributed systems. ACM Trans. on Programming Languages and Systems, 254–280 (1984)
14. Obermaisser, R., El-Salloum, C., Huber, B., Kopetz, H.: Modeling and verification of distributed real-time systems using periodic finite state machines. Journal of Computer Systems Science & Engineering 22(6) (November 2007)
15. Obermaisser, R., El Salloum, C., Huber, B., Kopetz, H.: The time-triggered system-on-a-chip architecture. In: Proc. of the IEEE Int. Symposium on Industrial Electronics (2008)
16. Obermaisser, R., El Salloum, C., Huber, B., Kopetz, H.: Fundamental design principles for embedded systems: The architectural style of the cross-domain architecture GENESYS. In: Proc. of the 8th IEEE ISORC, Tokyo, Japan, March 2009, pp. 3–11 (2009)
17. Object Management Group, Needham, MA 02494, USA: The Common Object Request Broker: Architecture and Specification (July 2002)
18. Belschner, R., et al.: FlexRay – requirements specification. Technical report, BMW AG, DaimlerChrysler AG, Robert Bosch GmbH, and General Motors/Opel AG (2002)
19. Rolina, T.: Past, present, and future of real-time embedded automotive software: A close look at basic concepts of AUTOSAR. In: Proc. of SAE World Congress (April 2006)
20. Whitrow, G.J.: The Natural Philosophy of Time, 2nd edn. Oxford University Press, Oxford (1990)
Approximate Worst-Case Execution Time Analysis for Early Stage Embedded Systems Development

Jan Gustafsson¹, Peter Altenbernd², Andreas Ermedahl¹, and Björn Lisper¹

¹ School of Innovation, Design and Engineering, Mälardalen University, Västerås, Sweden
{jan.gustafsson,andreas.ermedahl,bjorn.lisper}@mdh.se
² Department of Computer Science, University of Applied Sciences, Darmstadt, Germany
[email protected]

Abstract. A Worst-Case Execution Time (WCET) analysis finds upper bounds for the execution time of programs. Reliable WCET estimates are essential in the development of safety-critical embedded systems, where failures to meet timing deadlines can have catastrophic consequences. Traditionally, WCET analysis is applied only in the late stages of embedded system software development. This is problematic, since WCET estimates are often needed already in early stages of system development, for example as inputs to various kinds of high-level embedded system engineering tools such as modelling and component frameworks, scheduling analyses, timed automata, etc. Early WCET estimates are also useful for selecting a suitable processor configuration (CPU, memory, peripherals, etc.) for the embedded system. If early WCET estimates are missing, many of these early design decisions have to be made using experience and "gut feeling". If the final executable violates the timing bounds assumed in earlier system development stages, it may result in costly system re-design. This paper presents a novel method to derive approximate WCET estimates at early stages of the software development process. The method is currently being implemented and evaluated. The method should be applicable to a large variety of software engineering tools and hardware platforms used in embedded system development, leading to shorter development times and more reliable embedded software.
Supported by the ALL-TIMES FP7 project, grant no. 2215068, by the KK-foundation grant 2005/0271, and by the Swedish Foundation for Strategic Research (SSF), via the strategic research centre PROGRESS. Supported by Deutscher Akademischer Austauschdienst (DAAD).
1 Introduction
Embedded systems are special-purpose computer systems designed to perform one or a few dedicated functions, often with timing constraints. They are usually embedded into devices that include hardware and mechanical parts. Different types of (and even parts within) embedded systems may have different timing requirements. Some safety-critical embedded systems have hard real-time requirements, i.e., failures to meet timing deadlines can have catastrophic consequences. For these, a safe (i.e., never underestimated) WCET estimate of the software is a key measure.

1.1 The Need for Early WCET Estimates
The specific properties of real-time embedded systems put certain demands on the development of such systems. Often, hardware and software are developed in parallel (hardware-software co-design). Since embedded systems are often sold in high-volume markets, it is important to choose a processor configuration (CPU, memory, peripherals, etc.) that is just powerful enough, in order to avoid overly costly hardware. Thus, embedded systems have smaller resource margins than, e.g., desktop computers, and special attention has to be given to the timing of the software. Consequently, it is important to assess the worst-case timing (i.e., the WCET) of the software to be able to choose a suitable processor configuration. There are several software development models used for embedded real-time systems, e.g., the traditional waterfall model (Requirements – Design – Development – Testing), or the more advanced V-Model [1] shown in conceptual form in Fig. 1. In the V-model, the requirements, architecture and design in the left part of the model will be verified in the activities in the right part of the model (integration, test, verification, system verification and validation). This is also true for the timing domain of the system. For example, the timing requirements defined early in the model (on the left side) will be verified against
Fig. 1. The V-Model (graph from [1])
the system on the same level in the model on the right side. Software timing decided during module design will be tested during module testing, etc.
Early WCET estimates would be useful in the early stages of real-time embedded systems development (like the requirements, architecture and design stages in the V-model) for a number of reasons. Modern embedded system development typically includes a large variety of software engineering tools, such as modelling and component frameworks, schedulability analysis, timed automata, etc. The tools are used to, e.g., decide how tasks should be generated from larger software components or models, how to distribute tasks to computer nodes, what hardware to use on the different nodes, what priorities to assign to different tasks, etc. To be able to model, validate and verify the real-time properties of early sketches of the system, these tools need to associate some type of execution time bounds with their inherent high-level code constructs.
Existing WCET analysis methods often fail to provide early stage WCET estimates. The main reason is that the following requirements must be fulfilled, which is often not the case at the early stages of a project:
− the program to be analysed must be compiled and linked to an executable binary, and
− either a useful input data set must be available, and the actual hardware (or a timing-accurate simulator) must be available in a setup that allows for correct measurements, or
− a (safe) timing model of the actual hardware must be available.
See Section 2 for details. For modern embedded system development this is a problem since, as described above, good approximate WCET estimates are needed already during early system development. Moreover, if the WCET bounds derived on the executable program violate the timing bounds assumed in earlier development stages, the result may be costly system re-design, including code re-implementation, re-testing and re-analysis. It is a well-known fact that the cost of correcting functional errors and making larger system changes becomes higher at later stages of the development process [2,3]. There is every reason to believe that this cost curve also holds for errors in the timing domain.

1.2 Paper Contribution and Content
In this paper, we present an analysis method to provide early approximate WCET estimates. The method consists of two main steps: 1) timing model identification using a set of programs, and 2) WCET estimate calculation using flow analysis and the derived timing model.
Our method is not guaranteed to always provide safe WCET bounds. However, we believe that during early phases of system development it is often sufficient to provide reasonably accurate WCET estimates. Such estimates will allow the system designer to make a good initial system model, where the approximate WCET estimates of different code parts are taken into account. At later development stages a more precise WCET analysis tool can be used, allowing the rough WCET estimates and the initial system model to be refined. The main advantages of our method, when compared to existing WCET analysis methods, are the following:
− no timing model of the actual hardware has to be available; it will be created in the model identification step.
− the actual hardware/simulator only has to be available during an initial timing model identification step.
− model identification only has to be performed once for each compiler/hardware combination.
− the program for which a WCET estimate will be calculated does not have to be executable; it does not even have to be compiled.
The rest of this paper is organized as follows. Section 2 gives an overview of WCET analysis methods. Section 3 gives short descriptions of related work in the area. Section 4 presents the proposed method. Section 5 describes the implementation that we are currently working on. Finally, in Section 6, we discuss the approach and give ideas for evaluation and further research.
2 Worst-Case Execution Time Analysis
A Worst-Case Execution Time (WCET) analysis finds an upper bound on the worst possible execution time of a program. WCET analysis must handle the fact that a program typically has no fixed execution time. Different execution times may occur for different inputs, due to the characteristics of the software, as well as of the computer upon which the software is run. Thus, both the possible inputs as well as the properties of the software and the hardware must be considered in order to predict the WCET of a program. An overview of WCET analysis is given in [4]. Here, we give just a short introduction to the three main approaches used: measurements, static timing analysis, and hybrid analysis.
Measurements, also known as dynamic timing analysis, is the traditional way to determine the timing of a program. Basically, the program is executed many times with different inputs and the execution time is measured for each test run. For WCET analysis, there are problems connected with measurements: the method often involves labor-intensive and error-prone work, and, even worse, it cannot guarantee that the WCET has been found. This is because each measurement exercises only one path. For most programs, the number of possible execution paths is immense, and therefore too large for exhaustive testing. Also, it is in general very hard to find the worst-case input. This means that the set of inputs may not include the worst-case path, and thus the method may underestimate the WCET. WCET measurements require:
− that the program to be analysed can be compiled and linked to an executable binary. This requires that the program is "finished" in some sense (it must compile and link without errors, and execute to completion),
− that an input data set which covers all of the program paths (or as many as possible, hopefully including the WCET path) is available, and
− that the actual hardware (or a timing-accurate simulator) is available in a setup that allows for correct measurements.
Static timing analysis estimates the WCET of a program without actually running it. The analysis avoids the need to run the program by simultaneously considering the effects of all possible inputs, including possible system states, together with the program's interaction with the hardware. The analysis relies on mathematical models of the software and hardware involved. Given that the models and the calculation never make underestimations, the result is a safe timing estimate that is greater than or equal to the actual WCET.
Static WCET analysis is usually divided into three phases: a flow analysis where information about the possible program execution paths is derived, a low-level analysis where the execution time for atomic parts of the code (e.g., instructions, basic blocks or larger code sections) is decided from a model of the target architecture, and a final calculation phase where the derived flow and timing information are combined into a resulting WCET estimate.
Due to the complexity of today's software and hardware, both flow and low-level analysis may result in over-approximations, e.g., reporting too many paths as feasible or assigning too large execution times to instructions. Thus, the calculation will give a safe but potentially pessimistic WCET value. Static WCET analysis requires:
− that the analysed program can be compiled and linked to an executable binary, and
− that a (safe) timing model of the hardware is available, something that can be very hard and costly to provide.
Depending on the type of static analysis, more requirements may be added. If the static analysis is based on binary code, a binary decoder must be developed for the used binary format in order to re-generate the control flow graph of the program. This flow graph is used by the static analysis. Other tools do flow analysis on the source code or intermediate code level. For these tools, some kind of compiler support is necessary. The possibility to add manual annotations for, e.g., flow constraints is a useful complement to the flow analysis, as it may reduce the overestimation of the analysis. Some tools use annotations to describe input data limits for the same purpose.
Hybrid analysis techniques combine measurement and static analysis techniques. These tools use measurements to extract timing for smaller program parts, and static analysis to deduce the final program WCET estimate from the program-part timings. Since measurements are used, the resulting WCET estimates are not guaranteed to be safe. Hybrid WCET analysis requires:
− that the analysed program can be compiled and linked to an executable binary,
− that an input data set which covers all of the program paths (or as many as possible, hopefully including the WCET path) is available, and
− that the actual hardware (or a timing-accurate simulator) is available in a setup that allows for correct measurements.
3 Related Work
The idea of early timing analysis has been explored in a number of variants. The TimingExplorer [5] developed by AbsInt [6] is an assisting tool for finding early timing estimates in order to decide suitable hardware architectures for real-time systems. The available source code is compiled and linked for each of the cores in question. The user also has the possibility to specify different memory layouts and cache properties. Each resulting executable is then analysed with the TimingExplorer to get a WCET estimate for the chosen core and hardware architecture. TimingExplorer uses a run-time optimised analysis to quickly produce WCET estimates. It should be noted that TimingExplorer trades precision for, e.g., speed. It is also not certain that the estimates are safe. Even though the goals and the achievements of the TimingExplorer approach are similar to ours, there are some differences: TimingExplorer requires a hardware model, something which is not needed in our approach. Also, the source code for the programs for which WCET estimates are to be found does not have to be compiled and linked in our approach. This means that it probably can be applied easier and earlier than TimingExplorer. On the other hand, TimingExplorer will probably give more precise WCET estimates than our method.
Another interesting approach for early stage WCET analysis is the integration of the aiT WCET analysis tool [6] into the SCADE [7,8] and ASCET [9] software tools. The use of these tools makes it possible to generate code, and possibly WCET estimates, early from sketches of the system. Their approach requires support from different software tools for keeping a mapping between the high-level code constructs and the corresponding binary code snippets. Moreover, aiT's low-level analysis must have been ported to the used target processor. Thus, the approach should be less portable than ours. Moreover, it can only be applied rather late in the system development process, since aiT requires the full binary program to be available for its WCET analysis. No flow analysis is performed on the high-level code level.
Lisper and Santos have recently developed a new kind of regression method, based on end-to-end measurements of programs, where the resulting timing model is guaranteed not to underestimate any observed execution times [19]. This is similar to the model identification used in our method. In contrast to our method, the Lisper/Santos method works completely on the binary level.
There are some earlier efforts to do early stage WCET analysis upon Petri nets [10], Statechart models [11], and Matlab/Simulink models [12]. Compared to our approach, all these approaches are less portable and require much support from the high-level development tool.
There have been approaches to WCET analysis on Java. Persson et al. [13] assign timing costs to Java constructs using an attribute grammar. Bernat et al. [14] do analysis on the Java Byte Code (JBC) level and provide a mechanism for giving compiler/language-independent WCET annotations in the Java source code. Bate et al. [15] extended the latter approach to also include a timing model for JBC instructions. Compared to our approach, all these approaches rely on manual flow annotations and manual assignments of times to constructs.
Bartlett et al. [16] use program traces to derive parametric upper loop bounds for WCET analysis. Their work, and the measurement-based approach to WCET analysis used by Rapita Systems [17], both have similarities to our idea of deriving a timing model from program runs. Franke [18] uses measurements and regression analysis to derive a reasonably timing-accurate simulator for assembly code instruction sets, thus having some similarities to our timing model derivation approach.
4 Early Stage WCET Analysis
Our method is based on the same main phases as ordinary static WCET analysis, i.e., a flow analysis, a low-level analysis, and a final calculation phase. In particular, the method combines a worst-case oriented flow analysis with the usage of an approximate timing model for low-level analysis, both working on code constructs available during early phases of embedded system development.
The method is based on the assumption that the possible flows through a program are similar regardless of the code level at which the program is represented. For example, an upper limit on the number of times a certain code construct is taken in the high-level code is normally mimicked as a limit on the corresponding low-level instructions. However, a direct mapping between the high-level code and the low-level code is not always possible, due to compiler optimisations.
The timing behaviour of a program on different code levels is more complex. It is often not possible to assign constant timing to high-level code constructs due to the timing variability caused by different hardware features. Moreover, compiler optimisations may make the mapping between different code levels very hard. Instead, our method derives an approximate timing model for high-level code constructs by systematic measurements.
As mentioned in Section 1, our method for obtaining early stage WCET estimates includes two main steps: 1) timing model identification using a set of programs, and 2) WCET estimate calculation using flow analysis and the derived timing model.

4.1 Timing Model Identification
We propose a timing model identification using a set of programs to derive an approximate (but not guaranteed safe) timing model for different code constructs found in program code. The code level to be used for the model identification, hereafter called the timing model code level (tmcl), can be any level of choice. It can be the high-level code itself, or some code possible to generate from the high-level code in one or more steps. For example, it can be a modelling language, C, C++ or some other source code format, intermediate code (maybe emitted by a compiler or some modelling tool), or even assembler.
The idea is that by identifying a timing model for the tmcl code, the WCET analysis can be performed directly on this code, rather than analyzing the compiled binary with a precise timing model on the binary level. This eliminates the
need to compile the code for the target system. The higher the level of the code, the earlier the analysis can be done, but on the other hand the WCET estimates are likely to be less precise.
The timing model identification is made by a combination of measurements and regression analysis, as described in the following. We propose to identify a linear timing model

$t = \sum_{i=1}^{n} c_i \tau_i$
for n code constructs. These can, for instance, be arithmetic/logic operators, program variable accesses, statements altering the program flow, or possibly more complex constructs that recur often enough. The model assigns a constant execution time $\tau_i$ to each construct i, $c_i$ is the number of times the construct is executed, and t is the execution time predicted by the model. The accuracy of such a simple model will depend on how closely the selected code constructs correspond to (fixed sequences of) instructions in the corresponding binary, compiled for the target machine. The model will typically be derived for a given combination of target hardware, compiler, and possibly also compiler options (like optimisation level). The model identification selects the execution times $\tau_i$ to fit the model as well as possible to a set of given runs on the target machine.
An overview of timing model identification is shown in Fig. 2. First, a test set of programs and corresponding sets of inputs is selected, and executable binaries are generated using the compiler of interest. For each pair of input and program, two runs are made (similar to [19]):
− the binary is run, and the actual execution time is recorded, and
− the tmcl code is run, with the same inputs, by an interpreter or similar that records the number of times each code construct is executed.
For each such (double) run j we obtain an actual execution time $t_j$ and, for each code construct i, a number of executions $c_{ij}$. According to the linear timing model, we obtain an equation

$t_j = \sum_{i=1}^{n} c_{ij} \tau_i$
If we make m such runs, we obtain a linear equation system $t = C\tau$, where the coefficients in the m × n matrix C are the recorded execution counts $c_{ij}$. If the timing model happens to be exact, then the system has a unique solution. Most likely it is not. Then, if the number of linearly independent row vectors in C is at least n, we can perform some kind of regression to find a "best" solution $\tau^*$ that in some sense minimizes the deviation between the execution times predicted by the model and the actual execution times. A possible choice is the usual linear regression, or least-squares method [20]. Another choice is the max regression proposed in [19], which has the property that the model never will underestimate any observed execution times.
The resulting timing model will depend both on how the source code is compiled to executable code, e.g., on the compiler and its options, and on the translation to tmcl format, as well as on the characteristics of the different hardware features, such as caches and pipelines of the used hardware. Ideally, if we have a 1-1 mapping between the high-level code and the binary code, and if the code has basic blocks with constant execution times, we may get a very good fit. When this is not the case, the fit will be less good, but probably still good enough to give a WCET estimate that is helpful in early design stages.

Fig. 2. Timing model identification
4.2 Early Stage WCET Estimate Calculation
The approximate WCET for the program of interest will be derived using flow analysis, the approximate timing model derived in the model identification step, and a final WCET calculation. Neither the executable of the program nor the hardware/simulator is required during this step (see Fig. 3). Note that the derived WCET for the program is valid for the combination of target hardware, compiler, and compiler options that was used in the model identification step.
Fig. 3. Early stage WCET estimate calculation
First, the analysed program is translated to tmcl format. We then perform a flow analysis of the program, using the tmcl representation of the program and (optionally) constraints on input variables to the program. The flow analysis generates flow constraints on basic blocks and edges in the tmcl code.
The approximate low-level analysis uses the tmcl form of the program and the table of the best possible fit of timing costs to the different code constructs (generated in the model identification step). While scanning through the program, it generates and saves execution time estimates for basic blocks. Finally, we can derive a WCET estimate by combining the results of the previous steps using some existing WCET calculation method (see, e.g., [4]).
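As an illustration of such a calculation, the sketch below uses an IPET-style (implicit path enumeration) formulation, one common calculation approach; the control-flow graph, the flow constraints, and the per-block times are invented for this example and are not SWEET's actual calculation code.

```python
from scipy.optimize import linprog

# Invented example: three basic blocks with execution-time estimates taken
# from the approximate timing model (one entry per block).
block_time = [5.0, 3.0, 8.0]

# Flow-analysis results as linear constraints over execution counts x_b:
#   x_0 = 1         (entry block executes exactly once per run)
#   x_1 - x_2 = 0   (block 1 always flows into block 2)
#   x_2 <= 10       (loop bound on block 2)
A_eq, b_eq = [[1, 0, 0], [0, 1, -1]], [1, 0]
A_ub, b_ub = [[0, 0, 1]], [10]

# IPET maximizes sum(block_time[b] * x_b); linprog minimizes, so negate.
res = linprog(c=[-w for w in block_time], A_ub=A_ub, b_ub=b_ub,
              A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * 3)
print(-res.fun)  # approximate WCET estimate: 115.0
```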
5 Implementation
The method described in this paper is currently being implemented by the WCET group at Mälardalen University [21] in the SWEET (SWEdish Execution time Tool) WCET analysis tool. The ALF (ARTIST2 Language for WCET Flow Analysis) language [22] has been selected as the tmcl format. ALF is a language intended for flow analysis for WCET calculation. SWEET includes an ALF interpreter, which outputs statistics of the ALF constructs exercised during a program run. Thus, the timing model identification will assign timings to the different constructs found in the ALF code. To perform the measurement runs, we plan to test both some clock-cycle accurate simulators and some hardware equipped with time-measurement facilities. We will also do the flow analysis on ALF, using SWEET's powerful flow analysis methods [23,24,25]. The result will be given as constraints on the number of times different basic blocks of the ALF code can be executed and edges of the control flow graph can be followed. The final calculation of the WCET estimate will be made using one of SWEET's different calculation methods [26].
6 Discussion and Future Work
The method will be evaluated using WCET benchmarks and industrial code. This first version of the method uses linear equations which are exact only if the tmcl code constructs always have the same corresponding code in the binary representation, and if the high-level code constructs always have the same execution time. We will study, in a systematic way, the consequences of deviations from these requirements. The experiences we gain from the evaluation will hopefully give some ideas on how to extend the method. There are a number of interesting ideas to explore, and some of them have been mentioned in the paper. Other ideas include:
− More precise model generation. It may be possible to derive a context-sensitive and more precise model where ALF constructs have different costs depending on execution context, e.g., whether it is the first loop iteration or not
that is executed, since cache effects can give different execution times for the corresponding "real" instructions in the binary.
− How will compiler optimisations and other code transformations affect the accuracy? Code transformations make the mapping between the high-level code constructs and the binary program more complex.
− How may different hardware features affect the accuracy? Modern processors use many performance-enhancing features, like caches, pipelines, branch prediction, out-of-order execution, etc., all of which may cause the execution time for instructions to vary. How well will timing effects of such types of features be captured by the resulting timing model? Are some types of hardware features more troublesome than others?
− Selection of model identification sets. How many programs and input data sets are needed for getting an accurate timing model? How similar must the test programs used to generate the timing model be to the program for which WCET estimates will be derived?
− How to handle a mix of source code formats? An embedded system project may involve a variety of source code formats, including different high-level languages, C or C++ code, or even assembler. An interesting property of our implementation is that if programs can be translated to ALF, they are analysable by the method. We expect that a number of formats will be possible to translate to ALF.
References
1. Wikipedia: V-Model (2009), http://en.wikipedia.org/wiki/V-Model
2. Boehm, B.W.: Software Engineering Economics. Prentice Hall PTR, Upper Saddle River (1981)
3. Westland, C.J.: The cost of errors in software development: evidence from industry. Journal of Systems and Software 62, 1–9 (2002)
4. Wilhelm, R., Engblom, J., Ermedahl, A., Holsti, N., Thesing, S., Whalley, D., Bernat, G., Ferdinand, C., Heckmann, R., Mitra, T., Mueller, F., Puaut, I., Puschner, P., Staschulat, J., Stenström, P.: The worst-case execution time problem — overview of methods and survey of tools. ACM Transactions on Embedded Computing Systems (TECS) 7, 1–53 (2008)
5. Nenova, S., Kästner, D.: Source Level Worst Case Timing Estimation and Architecture Exploration in Early Design Phases. In: Holsti, N. (ed.) Proc. 9th International Workshop on Worst-Case Execution Time Analysis (WCET 2009), Dublin, Ireland, pp. 12–22 (2009) (Preliminary proceedings)
6. AbsInt: aiT tool homepage (2008), http://www.absint.com/ait
7. Scade: Homepage for Scade tool suite from Esterel (2009), http://www.esterel-technologies.com/products/scade-suite
8. Souyris, J., Pavec, E.L., Himbert, G., Jegu, V., Borios, G., Heckmann, R.: Computing the worst case execution time of an avionics program by abstract interpretation. In: Wilhelm, R. (ed.) Proc. 5th International Workshop on Worst-Case Execution Time Analysis, WCET 2005 (2005)
9. Ascet: Homepage of ETAS' ASCET tool chain (2009), http://www.etas.com/en/products/ascet_software_products.php
10. Stappert, F.: From Low-Level to Model-Based and Constructive Worst-Case Execution Time Analysis. PhD thesis, Faculty of Computer Science, Electrical Engineering, and Mathematics, University of Paderborn, C-LAB Publication, Vol. 17. Shaker Verlag, Aachen (2004), ISBN 3-8322-2637-0
11. Erpenbach, E., Altenbernd, P.: Worst-case execution times and schedulability analysis of statecharts models. In: Proc. 11th Euromicro Conference on Real-Time Systems, pp. 70–77 (1999)
12. Kirner, R., Lang, R., Freiberger, G., Puschner, P.: Fully automatic worst-case execution time analysis for Matlab/Simulink models. In: Proc. 14th Euromicro Conference on Real-Time Systems (ECRTS 2002), Washington, DC, USA (2002)
13. Persson, P., Hedin, G.: Interactive execution time predictions using reference attributed grammars. In: Proc. of the 2nd Workshop on Attribute Grammars and their Applications (WAGA 1999), Netherlands, pp. 173–184 (1998)
14. Bernat, G., Burns, A., Wellings, A.: Portable Worst-Case Execution Time Analysis using Java Byte Code. In: Proc. 12th Euromicro Conference on Real-Time Systems (ECRTS 2000), Stockholm, pp. 81–88 (2000)
15. Bate, I., Bernat, G., Murphy, G., Puschner, P.: Low-level Analysis of a Portable Java Byte Code WCET Analysis Framework. In: Proc. 7th International Conference on Real-Time Computing Systems and Applications (RTCSA 2000), pp. 39–48 (2000)
16. Bartlett, M., Bate, I., Kazakov, D.: Guaranteed loop bound identification from program traces for WCET. In: Proc. 15th IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS 2009), San Francisco, CA. IEEE Computer Society, Los Alamitos (2009)
17. Rapitime: Rapitime white paper (2009), http://www.rapitasystems.com/system/files/RapiTime-WhitePaper.pdf
18. Franke, B.: Fast cycle-approximate instruction set simulation. In: Falk, H. (ed.) Proc. 11th International Workshop on Software and Compilers for Embedded Systems (SCOPES 2008), Munich, pp. 69–78 (2008)
19. Lisper, B., Santos, M.: Model identification for WCET analysis. In: Proc. 15th IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS 2009), San Francisco, CA, pp. 55–64. IEEE Computer Society, Los Alamitos (2009)
20. Chatterjee, S., Hadi, A.S.: Regression Analysis by Example, 4th edn. John Wiley & Sons, Chichester (2000)
21. Mälardalen University: WCET project homepage (2009), http://www.mrtc.mdh.se/projects/wcet
22. Gustafsson, J., Ermedahl, A., Lisper, B., Sandberg, C., Källberg, L.: ALF – a language for WCET flow analysis. In: Proc. 9th International Workshop on Worst-Case Execution Time Analysis (WCET 2009), Dublin, Ireland, pp. 1–11 (2009)
23. Gustafsson, J., Ermedahl, A., Sandberg, C., Lisper, B.: Automatic derivation of loop bounds and infeasible paths for WCET analysis using abstract execution. In: Proc. 27th IEEE Real-Time Systems Symposium, RTSS 2006 (2006)
24. Sandberg, C., Ermedahl, A., Gustafsson, J., Lisper, B.: Faster WCET flow analysis by program slicing. In: Proc. ACM SIGPLAN Conference on Languages, Compilers and Tools for Embedded Systems (LCTES 2006), pp. 103–112 (2006)
25. Ermedahl, A., Sandberg, C., Gustafsson, J., Bygde, S., Lisper, B.: Loop bound analysis based on a combination of program slicing, abstract interpretation, and invariant analysis. In: Rochange, C. (ed.) Proc. 7th International Workshop on Worst-Case Execution Time Analysis (WCET 2007), Pisa, Italy (2007)
26. Ermedahl, A.: A Modular Tool Architecture for Worst-Case Execution Time Analysis. PhD thesis, Dept. of Information Technology, Uppsala University, Sweden (2003)
Using Context Awareness to Improve Quality of Information Retrieval in Pervasive Computing∗

Joseph P. Loyall and Richard E. Schantz

BBN Technologies, Cambridge, MA 02138, USA
{jloyall,schantz}@bbn.com
Abstract. Publish-subscribe-query information broker middleware offers great promise to users of pervasive computing systems requiring access to information. However, users of publish-subscribe-query information broker middleware face a challenge in requesting information. The decoupling of publishers and consumers of information means that a user requesting information is frequently not aware of what is available, where it comes from, and when it becomes available. Too specific a request might return no results, while too broad a request might overwhelm the user with useless information in which the useful results are buried. This paper investigates using context, such as a user's location, affiliation, and time, to automatically improve the quality of information brokering and delivery. Augmenting an explicit client request with contextual clauses can automatically prioritize, order, and prune information so that the most useful and highest quality among the information available is delivered first. The paper provides techniques for augmenting client requests with context, techniques for combining multiple contextual aspects, and experiments evaluating the efficacy and performance of those techniques.

Keywords: Context awareness, location based computing, quality of service, publish-subscribe-query information management.
1 Introduction

Computing devices, including PDAs, GPSs, iPods, cell phones and laptops, as well as those embedded inside automobiles, appliances, and smart buildings, permeate our daily lives. Many of these computing devices are interconnected and more are becoming so, frequently through wireless connections and increasingly in an ad hoc manner. The devices and users are often mobile and it is not always obvious what devices are present and reachable. The ubiquity of computing and connectivity leads to unprecedented data access and variety confronting all users of mobile, embedded, and internetworked systems.
Advanced information brokers have emerged to help navigate the Digital Whitewater of the expanding volume and variety of data available in ubiquitous systems. Based on publish-subscribe-query operations, these information management services offer
∗ This work was supported by the USAF Air Force Research Laboratory under ITT contract number SPO900-98-D-4000, Subcontract Number: 205344 – Modification No. 4.
active information management, enabling information providers to make information available (i.e., through publication) and information consumers to request future (i.e., subscriptions) or past (i.e., queries) information based on type, attributes, or content.
A user requesting information is frequently not aware of what is available, where it comes from, and when it becomes available. This presents new challenges in the area of traditional quality of service (QoS), typically concerned with the management of resources such as CPU, bandwidth, and memory. In addition, the limited cognitive resources of users can be overloaded with the tasks of discovering, processing, and consuming the vast amounts of information available to them.
Users of publish-subscribe-query information services face a challenge in crafting requests to discover and deliver needed information. Too specific a request might return no results, while too broad a request might overwhelm the user with information, burying the most useful results in a sea of less useful, even superfluous, information. As an example, consider a user who is planning a route through an urban area and needs maps or imagery of the possible routes he is likely to take, as illustrated in Fig. 1. For this user, information that is recent, that is from trusted sources, and that is delivered in the order of closeness to his current location is much more likely to be of greater use than information that is older, less trustworthy, or more distant. However, with publish-subscribe-query systems, the user does not necessarily know what information is available, when it was created, by whom, or which locations it covers. It is challenging for the user to craft a request to get the best information available. If he crafts a request asking for any imagery available of the city, he might get a large number of responses, with a mix of imagery that is recent and old, for intermixed locations close to and distant from his location, and from a variety of sources that are known and unknown. Conversely, if he crafts a request that asks for imagery in a specific timeframe, location, or from particular sources, he might receive no information.

Fig. 1. A user planning a route through an area requires information about the area without knowing all the potential sources of the information

This challenge affects the quality of information request and delivery in the following ways:
• Increases cognitive burden and effort on the user to examine a potentially large set of results to find the most useful ones or craft a set of precise requests that will be issued simultaneously or sequentially until useful results are received.
• Decreases resource efficiency, e.g., memory is needed on the user's device to hold a large set of results and bandwidth is used to transmit a large set of results or requests, only some of which are useful.
• Adds delay due to the time required for returning and processing a large set of results or for issuing multiple increasingly refined requests and awaiting their responses.
Furthermore, when considering multiple aspects of information, the decision of what information is the most useful in response to a request is often not obvious. Returning to the example of urban route planning, which of the following images is more useful to the requesting user?
• An image that was collected today, but miles away from the user, or another image that is of the location that the user is in, but is a year old?
• An image that is very recent, but taken by some unknown person's camera (and has gone through an unknown amount of doctoring), or an image much older but collected by official government or industry communication satellites?
Characteristics of the user and his situation form the context in which his explicit requests are made and are useful to help determine the characteristics of information that are the most useful. Contextual information about the user and his situation can affect what he needs, how he perceives a particular response, and information ordering. Contextual information can be used to supplement explicit requests, to refine broad requests or to broaden narrow requests, and to prune, prioritize, and sort information to better suit the user's situation and needs. Automatically gathering context and using it as part of information requests and dissemination is an important step towards a more proactive computing vision for embedded and ubiquitous computing environments.
Information Management Services (IMS). As a starting point frame of reference for the investigation into context aware capabilities, and to seed prototype experimentation, we use a US Air Force reference implementation of IMS [1, 2, 7] developed for tactical and enterprise net-centric operations. Information is in the form of typed Managed Information Objects (MIOs) consisting of a data payload and metadata describing the object and its payload. Requests for information, i.e., subscriptions or queries, are expressed using predicates over MIO types and metadata values, expressed in query languages such as XPath or XQuery. The core set of IMS, illustrated in Fig. 2, includes the following:
1. Submission service receives MIOs published by clients.
2. Predicate evaluation matches the metadata of MIOs to predicates registered by information subscribers.
3. Archive service inserts MIOs into an MIO data store.
4. Query service processes queries consisting of predicates over the metadata of archived MIOs.
5. Dissemination service sends query results to querying clients and MIOs that match subscription predicates to subscribing clients.
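For illustration, a subscription against these services might look like the following sketch (the metadata fields, the MIO layout, and the subscribe call are invented for exposition; they are not the actual reference-implementation API):

```python
# Hypothetical client-side registration of a subscription predicate.
# The predicate is an XPath expression over MIO metadata; the predicate
# evaluation service matches it against each newly published MIO.
predicate = ("//MIO[metadata/type = 'imagery' and "
             "metadata/collectionTime >= '2009-11-01T00:00:00Z']")

def on_match(mio):
    # Called by the dissemination service for each MIO whose metadata
    # satisfies the registered predicate.
    print("received MIO:", mio["metadata"])

# ims_client.subscribe(predicate, on_match)  # hypothetical registration call
```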
There are three places in the IMS architecture in which contextual information can be used to order and filter MIOs to improve the quality of information dissemination:
• Query service operations to affect the order and number of query results. Contextual information can be used to determine which results to return to the query client and the order in which to return them to be of most use, i.e., of highest quality, to the client. Once a sufficient number of high-quality results have been returned, the remainder can be pruned.
• Dissemination service queues to affect the order of delivery of MIOs. When the outgoing bandwidth is insufficient to keep up with the number of MIOs that need to be disseminated and the number of clients to whom they need to be disseminated, the dissemination service enqueues the MIOs. Contextual information can be used to order and filter the MIOs in these queues.
• Registered subscription predicates to prioritize and refine predicates, improving the quality of MIO brokering. Just as contextual information can be used to refine and order the results of queries, context can be used to refine and order predicates to be evaluated.

Fig. 2. The services involved in information management that can benefit from using context information
Aspects of Context. Although there are many elements of context that can be considered, the following are particularly relevant for ubiquitous computing and mobile, tactical users, and drive our initial investigation:
• Location – A user's position or the position of an area of interest, frequently expressed as latitude, longitude, and altitude. Many existing devices include the ability to detect their locations, using Global Positioning System (GPS) and Enhanced Observed Time Difference (E-OTD), which provide absolute position, and WLAN and RFID, which provide relative position.
• Time – Whereas location indicates a user's position in space, there is a corresponding position in time. More recent information is likely to be more useful in general, while a user's time zone and related aspects (night or day, winter or summer) can have an effect on QoS requirements and perception.
• Affiliation – A user's affiliation can also affect QoS requirements and perception, e.g., in terms of trust. Information from an affiliated source, or closely related source, is more likely to be trusted. For example, in a military situation, a warfighter is more likely to trust information from his own unit or command than from a coalition partner, and more likely to trust a coalition partner than an unknown or opposing source.
These three contextual attributes are easily quantified and represent core attributes upon which to build a context-aware approach for QoS. Location is increasingly being exploited in location-based services and location-aware computing [3, 4]. Time corresponds closely to the core QoS property of timeliness (i.e., latency and performance), and affiliation contributes to the core QoS property of trust.
Other contextual attributes are less easily quantifiable and less directly mapped to QoS characteristics, but are useful to consider in future work as we expand our vision into the notion of generally intelligent and anticipatory information management capabilities. These include user activity (what a user is doing), user mood (e.g., is the user distracted, fatigued, impatient, or angry), situation awareness (the things going on around the user), equipment (what the user is carrying or has), user attention (the amount of attention that the user is able to provide), user preferences (what a user prefers, wants, or needs in his or her current situation), and administrative domain (the domain that a user is in or connected to).
The challenges associated with using context awareness to improve quality of service for information management include where and how to incorporate context awareness, selecting the aspects of QoS to incorporate, establishing the appropriate granularity of contextual values, determining the added value of combinations of contextual attributes, and how (and how frequently) to automatically gather or update the contextual values themselves. The remainder of this paper touches upon some of these subjects as a first step in exploring this rich and emerging field of research.
2 Incorporating Context into Information Requests

In this section, we explore techniques for automatically converting explicit information requests originating from a user into augmented requests that prioritize results based on just a single dimension of context, e.g., time (most recent results first), location (closest results first), or affiliation (most trusted results first). We then discuss techniques for combining contextual aspects.
Request Expansion. A straightforward technique, Request Expansion, converts a single explicit user request into a set of requests, each with extra clauses that match ranges of contextual values. The full set of new requests would match all (or approximately all) the MIOs matched by the original request, but each of the new requests would return only a subset of the matches, with the first of the new set returning those MIOs that are most appropriate or desirable based on the added context, and each subsequent request in the new set returning the next set of most appropriate MIOs. For example, a single user-supplied request can be turned into multiple context-added requests that additionally match on
• Time, each predicate representing a range of time, e.g., an hour, day, week, or month.
• Location, each predicate representing a particular distance, e.g., a mile, from a point.
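A minimal sketch of how the location variant could be automated follows (Python; the metadata fields, the granularity, and the function name are our assumptions, not the actual template implementation). Each generated XPath predicate restricts the original condition to one concentric ring around the requester, so the requests can be issued innermost-first:

```python
def expand_by_location(base_cond: str, lat: float, lon: float,
                       step_min: float = 1.0, rings: int = 5) -> list:
    """Expand one XPath condition into per-ring predicates, innermost first."""
    def box(k):
        d = (step_min / 60.0) * k  # k rings of step_min arc-minutes, in degrees
        return (f"number(metadata/latitude) >= {lat - d} and "
                f"number(metadata/latitude) <= {lat + d} and "
                f"number(metadata/longitude) >= {lon - d} and "
                f"number(metadata/longitude) <= {lon + d}")
    preds = []
    for k in range(1, rings + 1):
        # Each outer ring excludes the previous box, so rings do not overlap.
        ring = box(k) if k == 1 else f"({box(k)}) and not({box(k - 1)})"
        preds.append(f"//MIO[{base_cond} and {ring}]")
    return preds

# Issue innermost-first; once enough matches arrive, the remaining
# (outer, lower-quality) requests can be skipped.
for p in expand_by_location("metadata/type = 'imagery'", 33.33, 44.38):
    print(p)
```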
This approach is easy to implement and it can be efficient. Once an expanded request has returned results, no subsequent results are going to be better from the perspective of that context, so the processing of the remaining expanded predicates can often be
avoided, especially if the results completely satisfy the user's need, if the resources can be better used for other requests or operations, or if there is a need to reduce the cognitive overload of too many results.
However, the approach comes with tradeoffs. The efficiency and effectiveness depend on the granularity chosen for the contextual attribute. This technique provides only an approximation of the optimal order of contextual value, more or less optimal based on the granularity of the contextual ranges chosen. Finer granularity results in closer to optimal order of results, but more predicates to process (resulting in lower performance). Coarser granularity takes less time to execute because there are fewer predicates to process, but the results are also more coarsely ordered.
As an example, Fig. 3 shows an expanded set of requests based on location that expands from the requesting client (represented by the "R" in the center of the image) one latitude and longitude minute at a time. Any information object that matches a predicate representing an inner bounding box will be roughly closer in proximity to the requesting client than an information object matching a subsequent predicate representing an outer bounding box, and will be considered of higher quality with regard to the location context.

Fig. 3. A subscription by a client (R) can be turned into multiple subscriptions, each matching on the original predicate and a location inside a bounding box. The closer the bounding box to R, the higher the quality of the resulting matched information.

The request expansion technique addresses several of the challenges described in Section 1, namely (1) the user provides a request only about what he needs, limiting the complexity of his request; (2) some results will be returned to the client, if any are available within the ranges specified; and (3) if a best set of results is available as determined by the added context, e.g., very recent, close, or trusted, they can be prioritized accordingly.
Ordering as a Feature of Query Languages. Another technique uses existing query languages, e.g., XQuery, to prioritize and sort information. We automatically add clauses to an explicitly made user request that order on ranges of contextual values. XQuery provides an "order by" operator which uses capabilities of the
underlying database to sort the responses.¹ This needs to be applied at a place in the overall system design where there is a set of MIOs to sort, such as the archive repository, the dissemination service queue, or the submission service queue. It is less useful in the predicate evaluation service, where only one MIO is available at a time. In contrast to the request expansion approach, this technique produces an optimal ordering among the MIOs available for sorting but is less universal.
Context Templates as an Automation Mechanism. We developed a set of templates as a proof-of-concept approach serving two purposes. First, the templates provide a means for automatically extending explicitly provided user requests with the additional contextual clauses, using either of the approaches above. Second, the templates are composable, so that we can also easily combine multiple contextual attributes in an automated fashion. The templates take arguments indicating the predicate being extended, a Boolean indicating whether the call to the template is the first or is nested in a composed set of template calls, and additional arguments indicating granularity and starting points (e.g., current location or time).
Combining Contextual Attributes. While using a single contextual attribute to focus information brokering and dissemination can improve the quality of information, the more expected case will utilize multiple aspects of context. For example, it might be useful for a user to have information that is close in location to his position, but not if it is out of date or from an untrusted source. Likewise, recent information that is distant in location from the user's position might be less useful than something a little older but much closer. What a user typically needs is information with the right combination of contextual attributes that indicates that it is of the highest value to his current goals. Supporting this, however, requires being able to key on multiple simultaneous aspects of context and to be able to compare and sort the combinations in a meaningful way.
A key tradeoff in combining contextual aspects is deciding how the aspects are weighted relative to one another, i.e., whether one or more contextual aspects are more important than others, whether they are equal in importance, whether the relative importance is unknown, or whether the relative weighting is itself "context sensitive". If there is an order of importance, then the composable templates described above can be used to sort on multiple contextual attributes sequentially. The template is called multiple times in order, sending the original user request to the first call and the result of the first call to the second, etc. This automatically generates an extended predicate that orders by the first contextual attribute followed by ordering
¹ In practice, the information management services that we use utilize XPath [12] for the predicate language, instead of its superset XQuery [13], because of the latter's ability to modify or delete archived information, a security risk that breaks the semantics of the information broker, in which subscribing and querying clients only have read access to the published and archived MIOs. If this restriction is relaxed, or enforced by other means such as examining requests for specific language clauses that indicate modification or deletion, then features of XQuery, or another expressive and Turing-complete [5] predicate language, can be used to order and filter results based on complex relationships, such as contextual information.
by the second contextual attribute, and so forth.² This works well when the contextual attributes can be totally ordered and when they have coarse granularity. If the first attribute is finer grained, with only one or a very few elements per unique value, sorting on the additional attributes will have little effect.
When contextual attributes are of equal (or unknown) importance, one can examine the intersection of the matching sets for individual contextual aspects. That is, execute multiple, interleaved requests, each incorporating a single contextual aspect, and then aggregate the order of delivery of MIOs across the sets. This approach works with either the request expansion or query language ordering approach, but requires running multiple predicate evaluations or executing multiple queries, essentially trading off more execution time for a smaller, more tailored result set. This is a suitable tradeoff when CPU at the server is plentiful, but bandwidth to the requesting client is not, a situation to be expected in mobile, tactical environments. Furthermore, evaluation of the requests can be halted once the intersection set is large enough.
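A minimal sketch of this intersection approach (Python; the normalization and data layout are our assumptions, anticipating the evaluation in Section 3) keeps the MIOs common to all per-aspect result lists and delivers them in order of their normalized summed contextual value:

```python
def combine_by_intersection(ranked: dict) -> list:
    """ranked maps each context aspect to a list of (mio_id, raw_value),
    where smaller raw values are better (distance, age, distrust)."""
    # Normalize each aspect's raw values into [0, 1].
    norm = {}
    for aspect, results in ranked.items():
        vals = [v for _, v in results]
        lo, hi = min(vals), max(vals)
        norm[aspect] = {m: (v - lo) / (hi - lo) if hi > lo else 0.0
                        for m, v in results}
    # Intersect the per-aspect result sets ...
    common = set.intersection(*(set(n) for n in norm.values()))
    # ... and order the common MIOs by their summed normalized value.
    return sorted(common, key=lambda m: sum(n[m] for n in norm.values()))

ranked = {"distance": [("a", 1.2), ("b", 0.4), ("c", 3.0)],
          "age":      [("b", 10.0), ("c", 2.0), ("d", 1.0)],
          "distrust": [("a", 0.1), ("b", 0.5), ("c", 0.2)]}
print(combine_by_intersection(ranked))  # ['c', 'b']: c has the best sum
```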
3 Example and Evaluation

To illustrate and evaluate incorporating context into requests, we created an operationally relevant experimental scenario, shown in Fig. 4, and a database of MIOs. We developed a requesting client and a set of ten publishing clients, each simulating a moving sensor (e.g., an unmanned aerial vehicle, UAV) placed randomly within the confines of Baghdad, Iraq, heading in a randomly chosen direction, "traveling"
² For example, the following pseudocode uses the templates to order by Time context to the "month" granularity, followed by ordering by Location to the finest granularity (Lat-Lon second):
   tmpredicate = ctx.extendPredicateWithTimeContext(orig_pred, "month", true);
   tmlocpredicate = ctx.extendPredicateWithLocationContext(tmpredicate, client_lon, client_lat, false);
   ...
   executeQuery(tmlocpredicate);
approximately 30 mph, and publishing imagery information of the location over which the UAV is "flying" every 12 seconds. The requesting client was considered a "US Force" and each UAV was assigned an affiliation randomly from a set of 14 ordered choices (e.g., US, UK, or Iraqi force). The requesting client and each published MIO were tagged with time, location, and affiliation contextual values. Fig. 4 shows the simulated location of the requesting client ("Requester"), the starting location of the 10 publishers, the direction each is traveling, and their affiliations.

Fig. 4. The experimental database was populated by ten publishers located roughly within the confines of Baghdad, Iraq, moving in one of twelve directions, publishing simulated imagery MIOs

Fig. 5 and Fig. 6 show the results from automatically incorporating location context into the information search request using the request expansion and query language ordering approach, respectively. The results of request expansion track the optimal results closely, but not perfectly due to the selection of granularity of the expanded requests. Likewise, the query language ordering technique tracks the optimal distance context closely, but not perfectly. In this case, the difference is because the XQuery ordering we used is based on the Taxicab distance calculation [6] rather than the optimal Euclidean distance.

Fig. 5. XPath expansion by location graphed against optimal distance

Fig. 6. The order of MIO delivery using an XQuery request with location context graphed against the Euclidean distance of each MIO from the requesting client (smaller is better)

While the inclusion of context can increase the amount of time to get all the results (between 1% and 20% longer to execute the requests in our experiments), a more insightful question to ask instead of how long it takes to get all results is how quickly a sufficient set of results useful to the requester is available (or, conversely, how many useful results can be retrieved within a specific deadline). As the graph in Fig. 7 shows, the 100 closest, most recent, or most trusted MIOs are available in less than six seconds with any of the methods.
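To illustrate the query language ordering used here, the following sketch (our reconstruction in Python; the metadata fields and the exact clause are assumptions, not the code used in the experiments) builds an XQuery whose "order by" clause computes the Taxicab distance. abs() is a standard XQuery function, whereas a Euclidean ordering would need a square root, which XQuery 1.0 does not provide; this is one plausible reason for a Taxicab-based ordering:

```python
client_lat, client_lon = 33.33, 44.38  # requester's position (illustrative)

# Hypothetical XQuery returning imagery MIOs sorted by Taxicab distance
# from the requesting client.
xquery = f"""
for $mio in //MIO[metadata/type = 'imagery']
order by abs(number($mio/metadata/latitude) - {client_lat}) +
         abs(number($mio/metadata/longitude) - {client_lon})
return $mio
"""
# query_service.execute(xquery)   # hypothetical call to the query service
print(xquery)
```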
Fig. 8 shows an approach to provide a set of results with the best overall combination of the time, location, and affiliation contextual attributes, where best is defined to be the intersection of the result sets for each individual contextual attribute, delivered in order of the normalized summed contextual value. After 325 results ordered by the individual contextual attributes, there were 34 MIOs common to each set. Fig. 8 shows these graphed in the order of their normalized summed values. The thick purple line (marked with "×") is the combined normalized value, with the other lines representing the individual normalized attribute values. These 34 MIOs represent the order of delivery of MIOs with the best combination of recentness, distance from the requesting user, and trust. The intersection technique used to create this combination is promising for combining even significantly independent contextual attributes like the ones we investigated.

Fig. 7. Cumulative time it takes for MIOs to be delivered in the context-based request expansion

Fig. 8. The results if the intersection of the first 325 results ordered by distance, recentness, and trust are returned in order of their normalized summed values
4 Related Work

Other work in context awareness has targeted pervasive computing environments and database queries. Context Query Language, CQL, is a specific query language [10] for
gathering contextual information in pervasive computing environments, and could therefore be used to collect the contextual information used to enhance the predicates in our approach. In [8], Mansour et al. fashion queries in mobile applications as relational algebra, including tags for contextual information. Context is defined as XML documents, and queries are automatically formed by middleware that inserts values from the XML context into the queries. In [11], Seher et al. build an ontology of contextual information that is used to tag queries to refine them. In [9], Norrie et al. also deal with context in terms of database queries, but define "context" as the part of the schema needed to interpret a query in a local database, which must be distributed with a query to enable it to be interpreted across federations.
5 Conclusions

This paper explores the space of using context awareness to improve the quality of information management. It provides the following major contributions:

• Identification of ways in which context can be used to automatically streamline and improve information retrieval and dissemination.
• Strategies for automatically using context awareness to improve the quality of processing information requests, and quantitative evaluations of the strategies.
• Approaches and metrics for combining context attributes.
Context awareness and the use of contextual information as part of an overall set of information management services can provide the following improvements in the quality of information discovered, brokered, and disseminated:

• Reduce the manual burden of crafting appropriate information requests.
• Reduce the cognitive load on the user to choose results that are most useful.
• Improve the speed by which useful results get to the user and ensure that the earliest results are more likely to be the most useful.
• Reduce demand on resources and increase the effectiveness of resource usage.
Our investigations illustrated that some (maybe many) context attributes are independent and have other aspects (such as granularity) that make it challenging to combine them into a single measure of context useful for improved quality. We presented a few promising techniques for combining attributes, including composable templates that support sorting on multiple keys. The intersection approach we present is more versatile (it supports any granularity and does not require ordering the context attributes) but comes with the tradeoff of requiring the processing of more requests. The next steps for this research include applying and building upon the results of this study to incorporate context awareness into emerging information management (IM) services. An overarching result of our investigation is evidence that context awareness can improve the usefulness of information brokers and reduce the burden on the users of IM services. The results of this investigation can be used to influence the design, configuration, and implementation of IM elements to make them context aware. Contextual information can be incorporated in MIOs and in MIO database indexes, so there is ready access to context information and faster retrieval and ordering based on context. Similarly, the structures underlying predicate
registration and predicate matching can be organized to support context-aware pattern matching, prioritization, and filtering. Finally, context awareness can be built into the IM brokering services, making IM operations more context aware, responsive to client needs and preferences, proactive, and effective.
References

1. Combs, V., Hillman, R., Muccio, M., McKeel, R.: Joint Battlespace Infosphere: Information Management within a C2 Enterprise. In: Tenth International Command and Control Technology Symposium, ICCRTS (2005)
2. Grant, R., Combs, V., Hanna, J., Lipa, B., Reilly, J.: Phoenix: SOA Based Information Management Services. In: SPIE Conference on Defense Transformation and Net-Centric Systems, Orlando, FL, April 13-17 (2009)
3. Das, A.: Location Based Services, A Status Report of South East Asia. Location (July-August 2008)
4. Deshmukh, S., Miller, J.: Location-Aware Computing. Deviceforge.com (December 3, 2003), http://www.deviceforge.com/
5. Kepser, S.: A Simple Proof for the Turing-Completeness of XSLT and XQuery. In: Proceedings of Extreme Markup Languages, Montreal, Quebec, Canada, August 2-6 (2004)
6. Krause, E.F.: Taxicab Geometry: An Adventure in Non-Euclidean Geometry. Dover, NY (1986)
7. Linderman, M., Siegel, B., Ouellet, D., Brichacek, J., Haines, S., Chase, G., O'May, J.: A Reference Model for Information Management to Support Coalition Information Sharing Needs. In: Tenth International Command and Control Technology Symposium, ICCRTS (2005)
8. Mansour, E., Höpfner, H.: Towards an XML-Based Query and Contextual Information Model in Context-Aware Mobile Information Systems. In: 10th Intl. Conf. on Mobile Data Management: Systems, Services and Middleware (2009)
9. Norrie, M.C., Kerr, D.: Universal Contextual Queries in Database Networks. In: 3rd Intl. Conf. on Cooperative Information Systems (1995)
10. Reichle, R., Wagner, M., Khan, M.U., Geihs, K., Valla, M., Fra, C., Paspallis, N., Papadopoulos, G.: A Context Query Language for Pervasive Computing Environments. In: Sixth Annual IEEE Intl. Conf. on Pervasive Computing and Communications (2008)
11. Seher, I., Ginige, A., Shahrestani, S.A.: A Domain Identification Algorithm for Personalized Query Expansion with Contextual Information. In: 6th IEEE/ACIS Intl. Conf. on Computer and Information Science (2007)
12. World Wide Web Consortium: XML Path Language (XPath) Version 1.0, W3C Recommendation (November 16, 1999), http://www.w3.org/TR/xpath
13. World Wide Web Consortium: XQuery 1.0: An XML Query Language, W3C Recommendation (January 23, 2007), http://www.w3.org/TR/xquery/
An Algorithm to Ensure Spatial Consistency in Collaborative Photo Collections Pieter-Jan Vandormael and Paul Couderc INRIA Rennes, Irisa http://www.irisa.fr/aces
Abstract. The ubiquity of digital capture devices and on-line sharing enables the emergence of new collaborative services, similar to Google Street View, that would leverage community-contributed pictures. A key problem is dealing with the variable quality of the spatial context metadata (geotagging) associated with these pictures. In this paper, we propose a solution to ensure spatial consistency in photo collections aggregated from distributed capture. Keywords: distributed capture, geotagging, digital pictures.
1 Introduction
A trend in modern computing is the increasing availability of computing power and connectivity in a ubiquitous and miniaturized form. This is due to several technologies that have become widely accepted: mobile phones, the universality of the Internet and web technologies, wireless communication systems providing global connectivity, and finally, sensing technologies (such as GPS) which provide environment awareness. This progress brings along new applications, such as the geotagging of photos on Flickr [1], and other ways of coupling photos with locations, such as in Google's Street View [2] or Microsoft Live Labs' Photosynth [3]. Decentralized forms are also being looked into: an example is PIX-Grid [4], an Internet-scale peer-to-peer system that enables users to share (globally or only within a group) and efficiently locate photographs based on semi-automatically provided meta-information (by the devices and the user). Context-enabled photo collections can also be used to help in-situ navigation with a mobile device, such as in [8]. An important issue in these information systems is the quality of the context data, in this case the location and the heading (or direction) information associated with the pictures: inaccurate geotagging can raise important coherency issues in the collection. While an outdoor photo collection captured by a single entity will likely not hit major inconsistencies, the problem becomes difficult when merging collections captured by different users, geotagged with different sensors and/or methods. The rest of the paper is organized as follows. In the next section, we present the photo application which we consider and the spatial coherence problem that is
addressed. In the third section, we detail the algorithm of the solution. The fourth section briefly presents a prototype application of the algorithm. The fifth section discusses related work, and finally we conclude with some perspectives.
2 Application Example
We elaborated a simple application on a mobile device where a user is asked to take pictures and to input the distance and direction between the spots where the pictures are taken. In this way a map of the environment can be built, wherein the photos are related by their distance and relative direction (angle). This information can be shared between users, and this collectively built map can then serve as a photographic guide to others. One can expect the map to grow quite big easily, so the application should scale well to large collections. It makes sense to distribute the geotagged photo collection and any processing involved in locating and relating the pictures, so resources can be shared in a way that is achievable for limited mobile devices and without a central service. Imagine a user standing at the corner of a room, facing one far end of the room, and taking a picture. The user would then walk diagonally (say 45° to the right) across the room in 20 steps, face another wall to the North, and again take a picture. The user enters this data (20 steps, 45°, facing North) along with the photos into the application. In this way the user has done a relative position estimation himself and does not need to rely on external systems such as GPS. While the user's estimate might be much less accurate, we can count on the fact that if the user now moves to the left of where he was before, the new, third picture will certainly be classified as being left of the second picture, and never the other way around, as it could have been with GPS if it was within the margin of inaccuracy. This form of relative positioning can still be given absolute coordinates if the first corner of the room is a known landmark, which has been located by means of tools such as GPS beforehand. While the user will not easily make obvious errors in subsequent photos, the user might go for a walk, make a very large tour and then arrive at his/her starting point, only to find out that the first and the last pictures are not in the same spot in the application, because he or she has built up a large cumulative error on the way. This problem can be resolved by incorporating known locations (called landmarks) as part of the tour, such that intermediate measurements can be corrected with interpolation techniques: suppose the user starts at the corner of a large room, which is a known landmark, and then traverses the room diagonally in steps. Finally he ends up in the opposite corner, which is also a known landmark. There the error might be large but can be corrected because the position is known. The steps in between can subsequently be adjusted as well. The errors we expect from users are both in distances and in angles: both systematic distance errors (the optimistic user always thinks it is not as far) and
non-systematic distance errors (it being difficult for humans to judge distances, even in clear view). We also expect some error in the angle estimates, even for moderate distances, especially when the angle is not 90° or 0°. The errors have a chance of becoming greater for larger distances, in the end so large that the user will need to work in steps. The focus of the application is, however, on consistency and transparency for the user, over accuracy, so this error is not harmful if the user does not notice it. It gets even more difficult when things such as buildings or shrubbery are in the way of the direct line of sight, and the user must deviate from the straight path. The same goes for hills and different levels of terrain, when the problem gets a height component. One would expect even greater errors here. They can be reduced if the user is willing to work in steps (e.g., to the building first, then the sides of the building, and afterwards to the endpoint), except when there is nothing to reference to. Further errors are introduced because multiple users can collaborate or work interchangeably on adding the photos to the application.
3 Solution and Algorithm
We propose an algorithm to deal with the spatial coherence issues that arise because of inaccuracies in the users' estimates in this distributed capture example application. Distributed photo capture more specifically concerns the aggregation of photos taken with a mobile device, by several users and at different but related places, into a coherent collection. The coherence of the collection is clearly the quality metric here. A photo collection becomes spatially incoherent when a photo is added that is not at the location where the user expects it. Based on the virtual world model of the collection, the user might expect to see one thing but might get to see something quite different that does not match reality. The algorithm is closely based on the relaxation algorithm described in [7]. Its original purpose is to solve the localization problem in the field of robotic node localization. It serves to maintain a coherent representation of the environment and to allow reconciliation with future estimates. However, we will be using it with different data in a different situation. Also, decentralization is an additional goal. The main goal of our solution is improving the coherence, and error diffusion is the means. Our adapted algorithm relies to a great extent on the information provided by the users, and tries to use as much of that information as possible to come up with the best possible solution. Making sure that the error is distributed, and not concentrated in one place, requires a lot of data being present; only then does the map become truly coherent. Taking in the information from each new estimate and applying it locally is the quickest way to improve the coherence, and it also makes distributed calculation possible. In [7] a Gaussian noise distribution is assumed in all directions around the positions of the nodes, so a circular probability area is obtained and a single variance measure can be used. The variance measure is typically taken to be proportional to the estimated distance, as the variance represents the uncertainty of the measurement.
The general idea behind the relaxation algorithm is based on the mesh-of-springs analogy, where coherence is attained by optimizing towards the lowest energy in the mesh. However, the algorithm does not solve a system of physics-like equations but works iteratively. The principle is to handle each node in turn, determine its position through the links from each of its topological neighbours, and move it there ("where its neighbours think it should be"). By repeating this, the positions converge to a globally optimal solution. While this original algorithm lacks some of the advanced techniques to improve coherence in comparison with other algorithms from the domains of robotics and ad-hoc sensor networks, it is one of the few that allow creating a lightweight, simple and distributed system. Its skill level can be described as plain accuracy maximization, as found in many of the sensor node algorithms, yet at a higher level because it tries to use all the information available, without approximating. Its merit is its capacity to be adapted for decentralized calculation, while remaining rather straightforward to implement. Adaptations can also be made to make it more specific to the photo capturing aspect. For instance, functionality is added to take into account the fact that not only the positions of the photos must be made coherent but also the photos themselves, since they have an orientation.

3.1 Representation
How will we relate photos to locations in our virtual world model (i.e., a map)? One way is to directly link the location of the main object with the photo itself. This is useful for contextual relating because this object-centric way allows grouping several photos of the same object, for instance photos that are rotated around the object. However, there is no easy way to determine what the main object is. Inside a building, a photo of an object will likely be a close-up, and it might be difficult to find this object by its photo. Outside, it could be a view from afar of a touristic landmark. But the landmark could look different from different sides, so its connected photos would all be different, and one photo by itself would give unclear location information. Also, all the photos showing this landmark would be inserted at one spot, and no photos would surround it, so there is no easy way to follow photos to get to the landmark. It is difficult to spatially map out the photos, unless vision techniques are used. In fact, the photos we expect are not about objects at all; they are about views from different viewpoints. The viewpoints we propose as an alternative are the positions of the camera when the photo was taken. We take the location of the camera or cameraman (the user who inputs the photos) and link this to the photo. Rather than an inward and centralized view, this gives an outward and scattered view on the environment, with photos "looking around". In interiors, photos contain walls and doors or views of corridors; outside they show the scenery from that point. With such a set of photos it is possible to create a walk-through experience for the user, much like what he himself would see with his own eyes. Grouping of photos can be done based on the camera viewpoint. So, when the camera is only rotated, photos are added to the same group. Moving to another group happens when the cameraman walks around. Disadvantages here
are of course the impossibility to insert just any photo coherently into the system. Photos of close-ups or photos not taken straight at eye-level are discouraged.

Nodes. There are some points or places (the viewpoints) whose positions we want to determine in GPS-like coordinates. We will keep referring to these points as nodes1. Several photos can share a node at their (equal) position.

Links. Some of these nodes' positions will be determined using GPS or by pointing on a map. Others will be estimated by the user from another node. Thus, there are two kinds of estimates, which we will call links, that give information on the positions of the nodes and topologically place or connect the nodes:

Relative Links. Relative links are user estimates of distance and direction from one node to the next. Actually, they connect two photos rather than two nodes, since they include information on the view direction of the photos. However, the idea of relating nodes still holds for the bigger part, when orientations do not matter, and therefore we will continue to use it below for simplicity. Each relative link has a distance and two relative angles (one for the link with respect to the "start photo", and one for the "end photo" with respect to the link). A consequence is that the position calculated by estimating depends on the orientation of the photo at the neighbouring (related) node. This is only logical: if the neighbour's position gets corrected and its photo gets turned a bit to the left, then the position of this node should not only translate with the neighbour's position, but should also slightly rotate counterclockwise (to the left) around it. Of course, multiple links can be made. The links are reciprocal.

Absolute Links. Absolute links are considered local metric information, with very little inaccuracy, about known positions. The information consists of a position in world coordinates and an absolute angle (versus North) indicating the view direction of the photo. High accuracy is needed for several reasons, matching those of anchor nodes in sensor networks. The more known nodes, the less real correcting work needs to be done (supposing that there are no conflicts). In any case a sufficient proportion is needed for the problem to be feasible. And if these known nodes have a small error on their position, then it will be virtually impossible to get a smaller error than this for nearby nodes. Also, a couple of nodes with absolute coordinates are needed to be able to transfer everything into absolute coordinates. There can be more than one absolute link involved per photo.
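One possible Java encoding of these concepts is sketched below; the class and field names are ours, chosen for illustration, and are not taken from the prototype described later.

    import java.util.*;

    class Photo {
        double orientation;            // absolute view direction of the photo
    }

    class Node {                       // a viewpoint whose position is estimated
        double x, y;                   // current position estimate
        double variance;               // uncertainty of that estimate
        List<Photo> photos = new ArrayList<>();
    }

    class RelativeLink {               // a user estimate between two photos
        Photo startPhoto, endPhoto;
        double distance;               // estimated distance between the viewpoints
        double angleFromStart;         // link direction relative to the start photo
        double angleToEnd;             // end photo's direction relative to the link
        double variance;               // e.g., proportional to the distance
    }

    class AbsoluteLink {               // a known position, e.g., a surveyed landmark
        Photo photo;
        double x, y;                   // world coordinates
        double orientation;            // view direction versus North
        double variance;               // small, but non-zero
    }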
3.2 Algorithm
The algorithm consists of two steps in each iteration. In a classical, non-distributed execution on a single device, both steps are performed for each node i in turn within one iteration. The algorithm is iterated until the solution has converged.
1 As in nodes of a graph, not sensor node or robotic node devices.
Step 1. For the current node i and each neighbouring node j, between which there is a relative link with distance d_ji and direction (absolute angle) θ_ji, a position estimate (x_ji, y_ji) for node i is calculated using the position (x_j, y_j) of the neighbouring node j:

    x_ji = x_j + d_ji cos θ_ji    (1a)
    y_ji = y_j + d_ji sin θ_ji    (1b)
The technique used to calculate the position can be seen as a combination of multilateration and multiangulation. The absolute angle θ_ji is derived from the two relative angles known as part of the relative link, and the (resulting) orientation (absolute angle) α_pj of the involved photo p_j of neighbouring node j (remember that a relative link really is between two photos rather than two nodes). In the same way, the variance v_ji on the position of the current node i in this estimate is calculated from the variance v_j of the neighbouring node j and the variance u_ji of the link between the nodes:

    v_ji = v_j + u_ji    (2)
For simplicity, but at the expense of the generality of having multiple relative links between the same neighbouring nodes, the roles of the neighbouring node and the relative link are not set apart above (and below in step 2). If the index i is kept for the current node, then the index j should be allowed to represent the same neighbouring node several times, and the index ji, replaced by the index k, represents the link. For each absolute link k concerning node i, the position estimate (x_k, y_k) (or (x_ji, y_ji)) is simply the position that the absolute link provides. Absolute links are given a variance too. But since there is no neighbouring node on the other side, the variance v_k (or v_ji) only has this one component. Apart from this difference (and apart from the fact that they will not transport any requests for neighbour updates), absolute links are incorporated into the calculations of the algorithm just like the relative links.

Step 2. In the second step, the position (x_i, y_i) of the current node i is updated with a weighted average of the position estimates (x_ji, y_ji) made from all of the neighbours j in the previous step. The purpose is to produce a new position (x_i, y_i) for node i that in the end (after looping over i and iterating) is more suitable. First, the new variance v_i for node i is accumulated from the variances in the estimates v_ji, so that the inverted variances of the estimates can be used as the weights for the average. Here, Σ_j is the sum over the neighbours j of node i (excluding j = i). In practice, this is done in a loop over j, constantly adding to a temporary value 1/v_i, and then inverting this to get v_i:

    1/v_i = Σ_j (1/v_ji)    (3)
Next, also in a loop over j for the summation, the position is calculated and updated:

    x_i = v_i Σ_j (x_ji / v_ji)    (4a)
    y_i = v_i Σ_j (y_ji / v_ji)    (4b)
The node’s position is now corrected, but it is additionally necessary to turn each photo to be as much as possible in correspondence with the information in the links. For each photo pi on node i the following is done: for each link j to that photo pi , the absolute angle (orientation) of the photo αjpi is calculated as it should be to be perfectly in line with the information within the link. Next, the angle is averaged with weights (according to the variances) for all these links and the photo pi is turned to this absolute angle αpi . 3.3
3.3 Stopping Criterion
In [7], the total change in positions falling below some threshold is given as a possible stopping criterion. However, in view of the distributed application, it is better to stop for each node individually when no more neighbouring nodes have had significant updates to their positions. If this is the case, then calculating the position for the node again would have no significant influence, since the data used has hardly changed. The position update difference for one node dropping below a certain threshold is, in a way, still the stopping criterion: if a node is recalculated but not corrected much while its neighbours are still updating significantly, and if this node never really gets corrected anymore, then it is not this node that is asking its neighbours to update, and this will eventually bring the algorithm to a halt under normal circumstances.
3.4 Locality
The algorithm can work locally. If this proposed stopping criterion is not too strict, then an update in the position of a current node will not ripple through more than a couple of neighbours (with the change in position getting smaller and smaller), given that the mesh is rigid enough. This all depends on how many accurately known locations there are and how high the connectivity is. A high connectivity (many direct neighbours) increases the accuracy of the current node's position. An absolute link makes the node's position perfectly accurate (nearly perfectly accurate, in the next version). However, a high connectivity throughout the whole mesh slows down convergence because of the more involved computations. Accuracy comes at the cost of speed. The same goes for a too strict stopping criterion, which could make changes propagate almost endlessly and would destroy the convergence speed.
3.5 Decentralized Execution
This locality property makes it possible, in theory, to implement the algorithm in a distributed fashion. Calculations can be executed per node, on the device that "owns" the node: when the calculations are performed in a decentralized way on several devices, some of the nodes i are virtually situated on different devices. The responsibility to recalculate a node and its photos, and the ownership of their information, is assigned to a single device, based on the device on which the first photo of that node was added. Each device queues the nodes that need to be recalculated and runs the algorithm in turn for each of them. If a node gets updated and the change is noteworthy, neighbouring nodes are set to be updated too in case they are on the same device, or an update request is sent out to the neighbouring nodes that are on other devices. The stopping criterion per node now decides whether the update should be propagated to the neighbours or not. The whole procedure over all the devices, spanning several algorithm runs, goes as follows (see the sketch below): first of all, the above algorithm for the node i is started on its owning device, because either the node was newly created or some data from one of its neighbouring nodes (possibly on a different device) was corrected significantly enough to also justify a correction run for this node. As in step 1, the information from the neighbours is used to estimate the position. The most recent information about the neighbours (their positions, variances and the orientations of their photos) was sent by the neighbours when it changed (or when this node was newly created, since it needed to be presented to its neighbours).
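The per-device loop could then look like the following sketch. All names are ours; relax() stands for steps 1 and 2 above, and Node is the viewpoint type sketched in Sect. 3.1.

    import java.util.*;

    class DeviceLoop {
        static final double THRESHOLD = 0.05;        // movement threshold; tuning value of ours
        final Deque<Node> recalcQueue = new ArrayDeque<>();

        void processQueue() {
            while (!recalcQueue.isEmpty()) {
                Node n = recalcQueue.poll();
                double oldX = n.x, oldY = n.y;
                relax(n);                             // steps 1 and 2 for this node
                if (Math.hypot(n.x - oldX, n.y - oldY) > THRESHOLD) {
                    for (Node nb : neighboursOf(n)) {
                        if (ownedLocally(nb)) recalcQueue.add(nb);
                        else sendUpdateRequest(nb);   // ask the owning device to recalculate
                    }
                }
            }
        }

        // Placeholders for the mechanisms described in the text:
        void relax(Node n) { /* steps 1 and 2 */ }
        List<Node> neighboursOf(Node n) { return Collections.emptyList(); }
        boolean ownedLocally(Node n) { return true; }
        void sendUpdateRequest(Node n) { /* network message to the owner */ }
    }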
3.6 Real-Time Insertion
This local impact property also makes it possible to add a node along the way without incurring changes in the whole system (but only in a small area surrounding the node). The dynamism in this algorithm allows new nodes to be added at any time (except in the middle of one running core calculation, of course). A node's presence should be announced much like update requests. The device that added the node should take responsibility for it and its calculations. This means that when the device is switched off afterwards, no non-cached information can be requested and no updates can be made to that node, as is inevitable in a distributed system. Because the effects of a change or addition are local, insertion is also fast. There is no need to recalculate the whole system of nodes. One iteration of the algorithm for the new node is sufficient for a first roughly nearby position. Still, it might take quite a bit more to stabilize the effect of the insertion.
3.7 Proof of Convergence
By referring to the mesh of springs analogy, the authors of [7] show that each update to any node will always result in a decrease in energy. Because the energy function moreover is both bounded at zero and quadratic, they prove that there is always convergence to a global optimum (minimum in energy) for their
specific version of the algorithm. However, it is not proven that the algorithm as stated above (with, for instance, the additional photo rotations) always converges. Implementation without slower convergence is also difficult if real-world coordinates are used, rather than some artificial Cartesian coordinates.
3.8 Averaging the Angles
For the last part of step 2 (Sect. 3.2), as with averaging vectors starting from one point, there is the possibility that the different angles cancel each other out in the calculation of the average. Take for example the average of an angle of 0 degrees and an angle of 180 degrees. Rather than 90 degrees or -90 degrees, as one might answer too quickly, the average is unsolvable. If it is looked upon as two vectors of equal length pointing in opposite directions, then it becomes obvious that the resultant vector of the sum of both is a vector in any direction with zero length. For this reason the average angle α_pi is calculated from the x and y components of the estimated angles α_jpi.
    αx_pi = v_i Σ_j (cos(α_jpi) / v_ji)    (5a)
    αy_pi = v_i Σ_j (sin(α_jpi) / v_ji)    (5b)
An arctangent (with determination of the correct quadrant, often called "atan2") is used to calculate the average angle from its components (converting rectangular coordinates to polar):

    α_pi = arctan(αy_pi / αx_pi)    (6)
In case αx_pi and αy_pi are very small (and the zero-length-vector problem has occurred), the angle calculation can, by choice, be skipped, and the photo's orientation is simply not corrected at all. This happens only very exceptionally, and clearly there would not be enough (or correct) information to improve the coherence anyway. Since a photo at some time had at least one link to it, namely at creation, and therefore had a valid angle at some point, the problem can be solved in this way, because the old orientation is kept.
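In Java, this weighted circular mean can be written as follows (a sketch of ours; note that the common factor v_i in (5a)/(5b) cancels in the arctangent, so weighting each component by 1/v_ji suffices):

    // Variance-weighted circular mean (eq. 5a/5b and 6). Returns the old
    // orientation unchanged when the summed vector has (near) zero length.
    static double averageAngle(double[] angles, double[] variances, double oldOrientation) {
        double x = 0, y = 0;
        for (int j = 0; j < angles.length; j++) {
            x += Math.cos(angles[j]) / variances[j];
            y += Math.sin(angles[j]) / variances[j];
        }
        if (Math.abs(x) < 1e-9 && Math.abs(y) < 1e-9) return oldOrientation;
        return Math.atan2(y, x);       // arctangent with correct quadrant
    }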
4 Prototype Application
Following up on this theoretical work, we have implemented a prototype application of a distributed geotagged photo collection, called “Geometer”, as an example and to show the realizability of such a system. The system is built on the Android platform for mobile devices. The realised prototype allows individual users to add photos and performs the algorithm to make their positions
Fig. 1. Geometer map mode screen with a full set of photos added of an indoor walkthrough: (a) the background is a satellite image; (b) on a white background
coherent. The users can then view a map or walk through the photos. In practice, distributed operation is included, following the theoretical design, but the application's distributed use has not been perfected yet.
5 Related Works
While photo sharing and aggregation is still a hot topic, to our knowledge collaborative photo capture by uncoordinated entities has not been given much attention. Google Street View [2] does not import user-geotagged pictures, and Flickr [1] does not try to ensure spatial consistency. Distributed capture is being investigated by the robotics (swarms of robots) and sensor network communities [5], but the specific problem of large-scale photo collection, with pictures contributed by anyone on the Internet with variable context quality (spatial metadata), seems new. On a smaller scale, the problem of temporal consistency in cooperative capture of multimedia streams by a group of mobile phones has been studied in [6].
6 Conclusion
This paper has investigated the spatial coherence in the novel concept of distributed photo capture, in particular the issue of global coherence in geotagged
photo collections. We illustrated this issue with an application example, where users can capture and geotag pictures, and then applied an algorithm to enforce the coherency of the spatial references. A prototype was implemented on the Android platform and was used for limited experimentation. In practice, the algorithm turns out to be capable: it succeeds in keeping things "together" and in reducing the distributed error by diffusing it. However, some issues of the algorithm need to be looked into: the ever-increasing variance, the order of recalculating the nodes, the presence of oscillations, and the zero-length-vector problem in case of many links. These issues are the object of ongoing work.
References

1. http://www.flickr.com/
2. http://maps.google.com/help/maps/streetview/
3. http://livelabs.com/photosynth/
4. Aberer, K., Cudre-Mauroux, P., Datta, A., Hauswirth, M.: PIX-Grid: A Platform for P2P Photo Exchange. In: Proceedings of Ubiquitous Mobile Information and Collaboration Systems (UMICS 2003), collocated with CAiSE 2003 (2003)
5. Bisnik, N., Abouzeid, A., Isler, V.: Stochastic event capture using mobile sensors subject to a quality metric. In: MobiCom 2006: Proceedings of the 12th Annual International Conference on Mobile Computing and Networking, pp. 98–109. ACM, New York (2006)
6. Bourdon, X.L., Couderc, P.: A protocol for distributed multimedia capture for personal communicating devices. In: Autonomics 2007: Proceedings of the 1st International Conference on Autonomic Computing and Communication Systems, pp. 1–8 (2007)
7. Duckett, T., Marsland, S., Shapiro, J.: Learning globally consistent maps by relaxation. In: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA 2000), vol. 4 (2000)
8. Pauty, J., Couderc, P., Banâtre, M.: Using context to navigate through a photo collection. In: Proceedings of the 7th International Conference on Human Computer Interaction with Mobile Devices & Services, pp. 145–152 (2005)
Real-Sense Media Representation Technology Using Multiple Devices Synchronization Jae-Kwan Yun, Jong-Hyun Jang, Kwang-Ro Park, and Dong-Won Han Meta-Verse Technology Research Team, IT Convergence Technology Research Laboratory, Electronics and Telecommunications Research Institute, 138 Gajeong-no, Yuseong-gu, Daejeon, 305-700, South Korea {jkyun,jangjh,krpark,dwhan}@etri.re.kr
Abstract. Recently, the requirements for real-sense media representation have been increasing rapidly. Until now, most people have mainly used the SMSD method on one device, but more than one device is needed to play the multiple audio/videos and multiple sensory effects of real-sense media. Multi-track media synchronization and effect device synchronization algorithms are therefore a very important part of real-sense media representation. In this paper, we present the concept of real-sense media playback, a sensory effect metadata scheme, a real-sense media playback system architecture, and synchronization algorithms for multiple audio/video devices and multiple sensory effect devices. Keywords: SMMD, SEM, Media/Device Synchronization, Real-Sense Media.
1 Introduction

Until now, most people have used the SMSD (Single Media Single Device) method, which plays one media item, including audio and video, on one playback device such as a TV or PC [1]. But with the development of media playback technology, the playback method is changing to SMMD (Single Media Multiple Devices), which plays a new type of media that includes not only the previous types of audio, video, and text but also sensory effects (metadata information) to give users real-sense effects [3]. This SMMD service is very effective when making a movie for an experience room or an exhibition center, or a real-sense broadcasting service, such as a stock, cooking, or quiz program, which uses multiple devices to give users a lot of information at the same time [2]. Therefore, in this paper, we introduce a method for making real-sense media that combines multiple audio/videos and sensory effects, and a method for playing this media with the audio/video playback devices and effect devices around the user. We also explain the algorithms for multiple audio/video synchronization and multiple effect device synchronization. This paper is organized as follows. Chapter 2 discusses the concepts of real-sense media playback, the real-sense media system architecture, and the SEM (Sensory Effect Metadata) scheme; Chapter 3 explains media/device synchronization and the synchronization algorithms; and Chapter 4 shows the evaluation test results. The conclusion of this paper is given in Chapter 5.
2 Real-Sense Media Playback

In this chapter, we explain the concepts of real-sense media playback, the real-sense media playback system architecture, and the sensory effect metadata scheme.

2.1 Concepts of Real-Sense Media Playback

Basically, in the real-sense media service system, we aim for one media item, containing multiple audio/video tracks and sensory effects, to be played with multiple audio/video devices and effect devices. Fig. 1 shows the contents of a real-sense media item for a lecture on how to cook, with 3 audio/video tracks and 4 sensory effects.
Fig. 1. Concept of the Real-Sense Media
Track number 1 is the main track media; it contains the audio/video of the front screen of the cooking lecture. Track number 2 contains the scene of the foodstuffs, and track number 3 contains the scene of the cooking tools; all tracks are stored in one real-sense media file. The sensory effects (vibration, scent, light, web link) are the real-sense effects related to each track, and they are also edited along the timeline of the main track media. The assembled real-sense media is transmitted from the contents server to the home server. The home server then analyzes the real-sense media and plays the main track media on a playback device, such as a TV, connected to the home server. The home server sends track numbers 2 and 3 to the audio/video devices around the user, such as a laptop or another TV. Each sensory effect for real-sense presentation is analyzed within the home server and translated into control information for device control, and each piece of control information activates effect devices such as a light, a scent generator, or a PC along the timeline of the main track media. As shown in Fig. 1, to activate 3 devices (vibration ON, light OFF, web link A OPEN) exactly at time t(x), the device execution time and network delay must be considered.
2.2 Real-Sense Media Playback System Architecture

To play real-sense media within the home server, we inserted each audio/video track into one MPEG-4 media file. First of all, the sensory effects that activate playback devices in each scene must be generated in the contents server with a metadata authoring tool. Based on the main track media, the multi-track media must carry synchronization information within the media; we normally used the SL (Sync Layer) information of the MPEG-4 file. The transport manager transmits the real-sense media to the home server, and the main controller sends this real-sense media to the media parser. In the media parser, the real-sense media is separated into the multi-track audio/videos and the SEM. The main track media among the separated audio/video tracks is played in the hardware decoder. The other tracks are sent individually to the audio/video playback devices, and the sync manager keeps trying to synchronize each track with the main track media. The SEM is parsed in the media parser and sent to the metadata analyzer. The metadata analyzer converts the SEM into the appropriate device information, and the device mapper makes a device control message for each sensory effect. At this time, the sync manager also synchronizes the effect devices with the timeline of the main track media. Fig. 2 describes the system architecture for real-sense media playback.
Fig. 2. System Architecture for the Real-Sense Media Playback
2.3 Sensory Effect Metadata Scheme

The SEM is the metadata for representing real-sense effects; it contains effect types and effect variables expressed in XML [4]. The SEM consists of two main parts: effect properties and effect variables. An effect property contains the definition of each sensory effect applied to the real-sense media. By analyzing the effect properties, the home server can map each sensory effect to a proper sensory device in the user's environment before processing real-sense media playback. The effect variables contain the control variables for each sensory effect, synchronized with the media stream. Fig. 3 shows the processing of the SEM.
Fig. 3. Process of the Sensory Effect Metadata
Sensory effect metadata is represented with the SEDL (Sensory Effect Description Language), which ISO/IEC MPEG is currently standardizing [5]. Fig. 4 shows the EBNF (Extended Backus-Naur Form) of the SEDL.
Fig. 4. Sensory Effect Metadata Description Language
SEM is the root element, and the DescriptionMetadata provides information about the SEM itself. The Declarations element is used to define a set of elements. The Parameter may be used to define common settings used by several sensory effects, similar to variables in programming languages. A GroupOfEffects defines at least two effects that play on several devices at once. The EffectDefinition comprises all the information pertaining to a single sensory effect. An EffectDefinition may have several optional attributes, which are defined as follows: activate describes whether the effect shall be activated; duration describes how long the effect shall be activated; fade-in and fade-out provide means for fading effects in and out; alt describes an alternative effect identified by a URI; priority describes the priority of the effect with respect to other effects in the same group of effects; intensity indicates the strength of the effect; position describes the position from which the effect is expected to be received from the user's perspective; adaptability enables the description of the preferred type of adaptation of the corresponding effect with a given upper and lower bound.
3 Media/Device Synchronization

In this paper, we synchronize several types of devices with real-sense media to give users real-sense effects. An effect device can be an active device, which has its own CPU, memory, and operating system and activates by itself. By contrast, a passive device only receives control messages from the home server, acts on them, and returns feedback messages. There are two methods to give real-sense effects to users. One method represents effects by directly controlling devices related to the scene, such as an electric fan, a scent generator, or a digital light. The other method is 3-dimensional display using left and right movies, or 360-degree dome-shaped media display, achieved by playing multiple audio/videos in a synchronized way. In this chapter, we explain the audio/video synchronization and effect device synchronization algorithms.

3.1 Workflow of the Multi-track Media Synchronization

The multi-track media can be played on several audio/video devices that the user possesses, and sub media can be played on a user's laptop or PC. The multi-track media is transmitted to the home server or other audio/video devices using RTP (Real-time Transport Protocol), and media control information is sent with the RTCP protocol. The home server becomes the master, and it continuously sends synchronization messages to, and receives them from, the slaves. Fig. 5 shows the workflow of the multi-track media synchronization.
Fig. 5. Workflow of the Multi-track Media Synchronization
To play the multi-tracks of the real-sense media, a service flow like that of Fig. 6 is required. The SEI (Server External Interface) is the interface for browsing the real-sense media list in the contents server and for controlling start, pause, and stop of media playback. The DM (Device Management) is the interface for controlling the connection between the home server and the players installed on each audio/video device; it sends the VCR control messages generated by the SEI and checks the session maintenance of each audio/video device.
Fig. 6. Audio/Video Synchronization Architecture
The PII (Player Internal Interface) plays the role of media playback control; it starts and quits the player process. DS (Device Synchronization) performs the synchronization between the players installed on the audio/video devices; it does the OCR (Object Clock Reference) mapping for synchronization. RTSP and RTP/RTCP are the protocols for real-time media transfer. MS (Media Synchronization) performs inter-media synchronization based on the OCR of the audio track in the player.

3.2 Multi-track Media Synchronization Algorithm

After the real-sense media is chosen and the players are initialized, the home server sends a 'Play' message to the session manager. The home server becomes the master, opens a UDP socket, and broadcasts the master's play time for audio/video device synchronization. The player records the master's play time in a memory-mapped file, and the session client reads this time. It takes this as the media start time M(t), registers the current system time C(t) on each timer event, and periodically broadcasts the current time every second over a UDP socket. The current media time MC(t) is calculated by formula (1); Fig. 7 shows the audio/video synchronization algorithm.
Fig. 7. Audio/Video Synchronization Algorithm
    MC(t) = M(t) + C(t) − P(t)    (1)
In this case, as in Fig. 8, there are occasions when a slave stream is not synchronized with the master stream. Number 3 of slave stream #2 was discarded because it did not arrive within the time range [t2, t3] in which the master stream had to be played. Number 6 of slave stream #1 was paused and then replayed to keep synchronization with the master stream, because the master stream had not yet arrived at its play time. In this way, the home server can keep the other devices synchronized by pausing and discarding MDUs (media data units).
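The slave-side logic implied by formula (1) and Fig. 8 can be sketched in Java as follows; the class and method names are ours, not taken from the described system.

    // A slave player tracks the master's media clock from the periodic sync
    // broadcasts and pauses or discards media data units (MDUs) accordingly.
    class SlaveClock {
        long mediaStartMs;        // M(t): master media time in the last sync message
        long lastSyncWallMs;      // P(t): system time when that message was handled

        long currentMediaTime() { // MC(t) = M(t) + C(t) - P(t)
            return mediaStartMs + (System.currentTimeMillis() - lastSyncWallMs);
        }

        // Decide what to do with an MDU stamped with media time mduTimeMs.
        String onMdu(long mduTimeMs, long toleranceMs) {
            long mc = currentMediaTime();
            if (mduTimeMs < mc - toleranceMs) return "DISCARD"; // missed its play interval
            if (mduTimeMs > mc + toleranceMs) return "PAUSE";   // wait for the master
            return "PLAY";
        }
    }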
Fig. 8. Synchronization with Pausing MDUs and Discarding MDUs
3.3 Effect Device Synchronization
To synchronize effect devices efficiently, two time tables are needed in the real-sense media service system. First, the sensory effect time table is used to maintain the SEM when the real-sense media is transmitted to the home server. In the sensory effect time table, each effect is classified by effect property, start time, duration, and effect variables. Fig. 9 (a) shows an SEM example defining 9 effects during the time [t0, t6] for the effect types Light, Wind, Scent, and Heat. With an authoring tool that produces an expression of the SEM, the author may edit the effects for a scene regardless of which effect devices will be activated. The home server maintains the types and status of the effect devices connected to it. After the SEM is analyzed in the parser, a mapping process chooses the best device to represent each effect. In other words, H1, which represents a heat effect, can be mapped to the effect device "heater", and the effect variable 30% is converted to a control message the heater understands, heater level 1 (if the heater has 3 levels, the effect variable 100% would become heater level 3). Some devices cannot use a duration, so they can only be controlled by start and stop control messages. Fig. 9 (b) shows that when there are only such devices, the 9 sensory effects are mapped to 16 control commands stored in the effect device controller time table. When the real-sense media is played, the control commands are transmitted to the effect devices according to the media play time.
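The mapping step can be illustrated with the following Java sketch (ours, under the assumptions of a percentage intensity and a device with discrete levels); it also shows how a duration is expanded into an ON/OFF command pair for devices that only understand start and stop messages.

    import java.util.*;

    class DeviceCommand {
        final String device, action;   // e.g., "heater", "ON"/"OFF"
        final int level;               // device-specific level, 0 when unused
        final long mediaTimeMs;        // when to take effect, on the media timeline
        DeviceCommand(String device, String action, int level, long mediaTimeMs) {
            this.device = device; this.action = action;
            this.level = level; this.mediaTimeMs = mediaTimeMs;
        }
    }

    class EffectMapper {
        // E.g., a 30% heat effect on a 3-level heater becomes "heater level 1".
        static List<DeviceCommand> map(String device, int maxLevel,
                                       double intensityPercent,
                                       long startMs, long durationMs) {
            int level = (int) Math.ceil(intensityPercent / 100.0 * maxLevel);
            List<DeviceCommand> cmds = new ArrayList<>();
            cmds.add(new DeviceCommand(device, "ON", level, startMs));
            cmds.add(new DeviceCommand(device, "OFF", 0, startMs + durationMs));
            return cmds;
        }
    }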
Fig. 9. Sensory Effect Time Table and Effect Device Controller Time Table
3.4 Effect Device Synchronization Algorithm
To give real-sense effects to users, as with audio/video synchronization, an effect device must be activated at the exact time of the scene being played. Therefore, we need an algorithm that calculates the device activation time D(t). In the home server, events occur according to the timeline of the main track media:

    D(t) = MC(t) − δ(t) − N(t),  where  δ(t) = (1/n) Σ_{i=1..n} δ_i(t)  and  N(t) = (1/k) Σ_{i=1..k} N_i(t)    (2)
Fig. 10. Effect Device Synchronization Algorithm
According to formula (2), the sync manager sends the control command to the device controller at time D(t); the command is then analyzed and decomposed into control command type, interface, control value, and start time.
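A minimal Java sketch of this scheduling step, following formula (2); the names and the timer mechanism are ours, for illustration only.

    import java.util.Timer;
    import java.util.TimerTask;

    class ActivationScheduler {
        private final Timer timer = new Timer(true);

        // Send the command at D(t) = MC(t) - delta(t) - N(t), i.e., early enough
        // that the average execution delay and network delay are absorbed
        // before the scene plays at media time sceneMediaTimeMs.
        void schedule(Runnable sendCommand, long sceneMediaTimeMs,
                      long nowMediaTimeMs, long avgExecDelayMs, long avgNetDelayMs) {
            long sendAt = sceneMediaTimeMs - avgExecDelayMs - avgNetDelayMs; // D(t)
            long delay = Math.max(0, sendAt - nowMediaTimeMs);
            timer.schedule(new TimerTask() {
                @Override public void run() { sendCommand.run(); }
            }, delay);
        }
    }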
4 Evaluation Test

The synchronization algorithms we propose were tested mainly in an experience room and an exhibition room. In these kinds of rooms we used multi-track media to display 3-dimensional media, and there were various kinds of effect devices capable of real-sense representation. The currently presented media uses 2 main tracks, 4 sub-tracks, and 14 kinds of effect devices. The 2 main tracks use beam projectors to display the right and left images, 2 of the 4 sub-tracks are displayed on UMPCs (Ultra Mobile PCs) while the users are moving, and the other 2 sub-tracks are displayed on PCs embedded in the motion chairs. Table 1 shows the information of the tested video and audio.

Table 1. Video and Audio Information of the Test Media

Video: Codec MPEG-4 (H.264); Width × Height 720 × 480; Frame rate 29.97; Average bitrate 2.54 Mbit/s
Audio: Codec AAC; Bits per sample 16; Channels 2; Average bitrate 1411 kbit/s
The test method for the multi-track media was as follows. First, we displayed each track without running the synchronization algorithm. We then compared the time gap between the home server and each client whenever the home server broadcast the synchronization time, every 10 seconds. As shown in Fig. 11, the error can be reduced to 10% of that observed without the synchronization algorithm.
Fig. 11. Evaluation Results for the multi-track media synchronization
The SMP 8635 hardware video/audio decoder (Revision C, production chip; MRUA library version mrua_SMP8634_2.8.2.0) currently used in the home server is based on a MIPS chip running embedded Linux. The decoder's performance is not very good because it was developed only for the purpose of playing
audio/video. As shown in Fig. 11 (a), there was a 400-millisecond time difference between the home server and slaves #1 and #2 when the media had played for over 200 seconds, because the slave media was played on a normal PC. But when the synchronization algorithm was applied, the time difference was about 40 ms, compared with the case of not using the synchronization algorithm.

Table 2. Devices for the Evaluation Test

Effect Property    Device              Interface  δD(t)   δN(t)  Number
Heat Effect        Heater              RS-232     550 ms  5 ms   3
Wind Effect        Fan                 RS-232     500 ms  10 ms  2
Scent Effect       Scent Generator     RS-232     1 s     12 ms  2
Light Effect       Dimmer              RS-485     800 ms  7 ms   1
                   Flash               USB        70 ms   28 ms  2
                   Color Light (LED)   RS-485     120 ms  7 ms   4
Vibration Effect   Vibration           RS-485     150 ms  7 ms   4
                   Motion              RS-485     100 ms  7 ms   1
Shade Effect       Curtain             RS-485     5.3 s   10 ms  1
Other Effect       Water Sprayer       RS-485     200 ms  7 ms   2
                   Air Zet             RS-485     200 ms  7 ms   2
                   Tickler             RS-485     200 ms  7 ms   1
                   Phone               Ethernet   7 s     1 ms   1
                   Web Browser         Ethernet   1.2 s   1 ms   2
We used 7 kinds of effects and 14 effect devices to represent real-sense effects, as shown in Table 2. We used electronic appliances that have an MCU (Micro Controller Unit) to represent real-sense effects like "Heat", "Wind", and "Shade". The "Light" effect devices are categorized into "Dimmer", supporting fade-in and fade-out; "Flash", to show effects such as lightning and sparks; and "LED", supporting RGB color to express real colors. To give a "Vibration" effect, there are "Vibration" chairs through which users feel oscillation from the seat and backrest; such a chair is also capable of giving a "Motion" effect like swing, up/down, shaking, forward/backward, and left/right turns. There are also "Other" effects like "Water Sprayer", "Air Zet", and "Tickler". To interact with a PC, there is a "Web Browser" effect that opens related web pages for a specific scene of a movie, and a "Phone" effect that rings a real-world phone or a cell phone when there is a phone-call scene. We recorded δD(t), the time from when a control signal was delivered to a device until the executed effect reached a person two meters away. And we measured δN(t), the time between sending a control signal to the MCU of each device and receiving its response message. Fig. 12 shows the results with and without the synchronization algorithm. We measured the maximum time difference between the media time and the device execution time, which also includes the re-transmit time when a control signal did not work correctly. With the synchronization algorithm, the devices can be controlled within a time error of 25 ms.
Fig. 12. Evaluation Results for the Effect Device Synchronization Algorithm
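One plausible way the measured latencies feed into effect scheduling is to send each control signal ahead of its media time by the device's δD(t), keeping δN(t) as slack for one command/response retry. The C++ sketch below illustrates this; the function sendTimeMs and the treatment of δN(t) as retry slack are assumptions, not the paper's stated algorithm.

    #include <cstdint>
    #include <iostream>
    #include <map>
    #include <string>

    // Per-device latencies from Table 2, in milliseconds.
    struct DeviceLatency {
        int64_t deltaD_ms;   // control signal to perceived effect (two meters away)
        int64_t deltaN_ms;   // MCU command/response round trip
    };

    // To have an effect hit the user at media time T, the control signal
    // must leave the server deltaD earlier; deltaN is reserved as slack.
    int64_t sendTimeMs(int64_t effect_at_media_ms, const DeviceLatency& d) {
        return effect_at_media_ms - d.deltaD_ms - d.deltaN_ms;
    }

    int main() {
        std::map<std::string, DeviceLatency> table = {
            {"Heater", {550, 5}}, {"Fan", {500, 10}}, {"Curtain", {5300, 10}}
        };
        // A wind effect to be felt at media time 60 000 ms must be
        // triggered at 60 000 - 500 - 10 = 59 490 ms.
        std::cout << sendTimeMs(60000, table["Fan"]) << "\n";
    }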
5 Conclusion
Recently, the demand for media that gives users real-sense effects, such as real-sense movies, real-sense broadcasts, real-sense news, and real-sense education, has been increasing. However, a previous playback method such as SMSD cannot provide users with real-sense effects, because a single device cannot have all the capabilities needed to represent them. Therefore, in this paper, we explained the concepts and architecture of the real-sense media service system, and suggested playback algorithms for operating multiple audio/video devices together with multiple effect devices. As shown in the evaluation and field tests, people feel awkward when the synchronization algorithms do not work. Therefore, these synchronization algorithms will play an important role in the future, when real-sense broadcasting services such as IPTV become widely spread. Future research issues include the QoS of multi-track media and effect device control.

Acknowledgments. This work was supported by the IT R&D program of MKE/KEIT [2007-S010-03, Development of Ubiquitous Home Media Service System based on SMMD].
References
[1] Choi, B.S., Joo, S.H., Lee, H.R., Park, K.R.: Metadata Structure for Media and Device Interlocking, and the Method for Mapping It in the Media Format. In: Advances in Information Sciences and Services, pp. 186–190 (2007)
[2] Segui, F.B., Cebollada, J.C.G., Mauri, J.L.: Multiple Group and Inter-stream Synchronization Techniques: A Comparative Study. Information Systems, 108–131 (2009)
[3] Yun, J.K., Shin, H.S., Lee, H.R., Park, K.R.: Development of the Multiple Devices Control Technology for Ubiquitous Home Media Service System. In: International Conference on Ubiquitous Information Technology & Applications, pp. 404–408 (2007)
[4] Timmerer, C., Hasegawa, S. (eds.): Working Draft of ISO/IEC 23005 Sensory Information, ISO/IEC JTC 1/SC 29/WG 11/N10475, Lausanne, Switzerland (February 2009)
[5] Waltl, M., Timmerer, C., Hellwagner, H.: A Test-Bed for Quality of Multimedia Experience Evaluation of Sensory Effects. In: Proceedings of the First International Workshop on Quality of Multimedia Experience (QoMEX 2009), July 29-31 (2009)
Overview of Multicore Requirements towards Real-Time Communication Ina Podolski and Achim Rettberg Carl von Ossietzky University Oldenburg
[email protected],
[email protected]

Abstract. Multicores are becoming more and more important for embedded systems; their flexibility is the main reason for this increasing adoption. Since embedded systems are often applied to real-time tasks, the use of multicores raises several problems. These are caused by the architecture of multicore chips: usually such chips consist of two or more cores, a communication bus, I/Os, and memory, and it is precisely the accesses to these resources from the cores that make it hard to ensure real-time behavior. Well-known mechanisms for resource access must therefore be used, but by themselves they are not sufficient, because many design decisions depend on the applications. As a result, it is mandatory to analyze in detail the requirements both of the applications and of the targeted multicore system. The aim of this analysis process is to derive a method for an optimal system design with respect to real-time support. This paper gives an overview of the requirement analysis for multicores and RT scheduling algorithms. Additionally, existing scheduling strategies are reviewed and proposals for new schedulers are made.
1 Introduction
With the trend of applying multicore systems in many embedded applications, the necessity of real-time support for such systems is becoming more and more important. Typical multicore architectures are shown in Figure 1.
Fig. 1. Three exemplary types of multicore architectures
The authors of [1] argue that the transactional memory concept within multicore systems has attracted much interest from both academia [3][4] and industry [5], as it eases programming and avoids the problems of lock-based methods. Furthermore, they discuss that, by supporting the ACI (Atomicity, Consistency and Isolation) properties of
transactions, transactional memory relieves the programmer from dealing with locks to access resources; see [1]. Besides the locking problems, it is necessary to find solutions for deadlock avoidance and to take care of the priority inversion problem. As argued in [1], in the case of multicore systems, lock-based synchronization can reduce the data bandwidth by blocking several processes that try to access critical sections, thus reducing processor utilization. Resource access is one of the major problems. Shared resources are controlled by lock-based methods, with the well-known disadvantages of serial access. Parallel access can be realized with transactional memory; see also [1]. That means a transaction is either aborted when a conflict is detected, or committed in case of successful completion. An exemplary mapping of functional modules to tasks is depicted in Figure 2. The tasks themselves are mapped to cores of a multicore architecture. Some tasks require operating systems with a scheduler. As the literature and the authors of [1] argue, real-time scheduling of transactions, which is needed for many real-time applications, is an open problem in multicore systems. Looking at existing solutions, real-time scheduling of tasks in multiprocessor systems, or of transactions, turns out not to be suitable for multicore systems. The literature shows that real-time scheduling of tasks in multiprocessor systems does not consider important features of multicore systems, such as the presence of on-chip shared caches; see [1]. Caches are a big problem: if each core has its own cache, the problem is the shared access from the caches to the main memory; in the case of shared caches, locking mechanisms have again been used, with all their disadvantages. Again, as argued in [1], real-time scheduling of transactions has been around since the 1980s, but it assumes either centralized or distributed systems, and both solutions are not suitable for multicore systems either. In this paper, we give an overview of related work on existing scheduling methods with resource access for multicore systems. We briefly present a simple resource access strategy to demonstrate the problems, and we discuss the main challenges for real-time scheduling in multicore systems.
Fig. 2. Exemplary mapping of functions to tasks and multicores
2 Multiprocessor Scheduling Approaches with Resource Access Protocols
In this section we demonstrate, with a small example, the problems of shared resource access for multicore systems. We make the following assumptions. A set of jobs is associated with each task. At any time, each processor executes at most one job. Each task has a period and an execution requirement. When a job is released, it executes for the execution requirement of the task, and once the period has elapsed, another job of the task is released. On multiprocessors, EDF is not optimal under either the partitioned or the global approach [10], called P-EDF and G-EDF respectively; see also [1]. There exist further classes of scheduling algorithms that differ from the previous ones. A typical example is the Pfair algorithm [11]. Pfair is based on the idea of proportionate fairness and ensures that each task is executed at a uniform rate. All tasks are broken into so-called quantum-length subtasks, and time is subdivided into a sequence of subintervals of equal length called windows; see [1]. This means a subtask has to be executed within its associated window. Additionally, migration is allowed for each subtask. As described in [1], an optimal Pfair variant is the one from [12]. In Figure 3, an example multicore architecture is shown with cores A and B. On core A, two tasks t1 and t2 are executed, and on core B, t3 and t4 are running. Additionally, the tasks access resources R1 and R2; for example, task t2 has access to R1. As described in [1], the protocols managing resources in real-time systems are usually used in a hard real-time context, such as M-PCP and FMLP [13] under EDF. For Pfair scheduling, a lock-free algorithm has been proposed [12] to ensure that some task is always making progress; indeed, classical lock-based algorithms cannot satisfy this property. Obviously, resource access conflicts can occur: if task t1 on core A and task t4 on core B run at the same time and try to access R2, there is a resource conflict. Those conflicts exist due to the parallelism implied by multicore architectures.
Fig. 3. Example multicore architecture with 2 cores A and B, 2 resources R1 and R2, and tasks t1 to t4
Fig. 4. Task graph for t1 to t4

Table 1. Task characteristics

Task   ci   di   ai
t1     6    9    0
t2     6    16   0.5
t3     8    20   0
t4     5    22   9
Figure 4 shows a task graph consisting of four tasks t1 to t4. Data dependencies between the tasks are represented by edges. The arrival time ai, computation time ci, and deadline di of each task are given in Table 1. On uni-core architectures there exist different resource access protocols. The most widely used one is the priority ceiling protocol (PCP); see [9]. The advantages of PCP are that it prevents chained blocking and deadlocks. Each resource has a semaphore. A semaphore sk is assigned a priority ceiling C(sk) equal to the priority of the highest-priority job that can lock it. Note that C(sk) is a static value that can be computed off-line. Let ti be the task with the highest priority among all tasks ready to run; thus, ti is assigned the processor. Furthermore, let s* be the semaphore with the highest ceiling among all the semaphores currently locked by tasks other than ti, and let C(s*) be its ceiling. To enter a critical section guarded by a semaphore sk, task ti must have a priority higher than C(s*). If the priority Pi of task ti is not higher than C(s*), the lock on sk is denied, and ti is said to be blocked on semaphore s* by the job that holds the lock on s*. When a task ti is blocked on a semaphore, it transmits its priority to the task, say tk, that holds that semaphore. Hence, tk resumes and executes the rest of its critical section with the priority of ti; task tk is said to inherit the priority of ti. In general, a task inherits the highest priority of the jobs blocked by it. When tk exits a critical section, it unlocks the semaphore and the highest-priority job, if any, blocked on the semaphore is awakened. Moreover, the active priority of tk is updated as follows: if no other jobs are blocked by tk, pk is set to the nominal priority Pk; otherwise, it is set to the highest priority of the jobs blocked by tk. The schedule with the PCP protocol on a uni-core processor is shown in Figure 5; on a uni-core, 26 time units are needed for the schedule.
Fig. 5. Uni-core scheduling with PCP
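The off-line ceiling computation and the run-time admission test just described translate into a few lines of code. The following C++ sketch is illustrative only: the names are ours, and the priority-inheritance bookkeeping (priority transmission and restoration) is omitted for brevity.

    #include <algorithm>
    #include <map>
    #include <set>
    #include <utility>
    #include <vector>

    // Higher number = higher priority.
    struct Semaphore {
        int ceiling = 0;    // C(sk): highest priority of any task that may lock sk
        int held_by = -1;   // task id currently holding sk, -1 if free
    };

    // Off-line step: C(sk) is the maximum priority over all tasks that use sk.
    void computeCeilings(std::map<int, Semaphore>& sems,
                         const std::vector<std::pair<int, std::set<int>>>& tasks) {
        for (const auto& [prio, used] : tasks)
            for (int s : used)
                sems[s].ceiling = std::max(sems[s].ceiling, prio);
    }

    // Run-time step: task ti (priority pi) may enter a critical section only if
    // pi is strictly higher than the highest ceiling C(s*) among semaphores
    // currently locked by other tasks; otherwise ti blocks on s*.
    bool pcpLockAllowed(int pi, int ti, const std::map<int, Semaphore>& sems) {
        int systemCeiling = 0;
        for (const auto& [id, s] : sems)
            if (s.held_by != -1 && s.held_by != ti)
                systemCeiling = std::max(systemCeiling, s.ceiling);
        return pi > systemCeiling;
    }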
The question is whether PCP is adaptable for multicores. PCP solves the resource access within one core, but between cores it has to be modified; see the example in Figure 6, where we show an optimal but unrealistic schedule. One conflict is the parallel access of t1 and t3 to resource R2. Therefore, it is necessary to adapt PCP to avoid conflicts caused by parallel access.
Fig. 6. Unrealistic schedule on a multicore with resource conflicts
A solution could be a globally controlled semaphore queue. With this queue we are able to avoid parallel access to resources. Say a task t2 on core A accesses resource R1; then the semaphore s0 for R1 is entered into the queue. Other tasks trying to access s0 first have to check whether the global semaphore is already set within the global queue. Figure 7 shows the PCP solution with the global queue. The ceiling blocking is now visible between the cores; see the resource accesses of t2 and t3 to s0. In Figure 7 we need 20 time units for the schedule, in comparison to the 26 for the uni-core schedule. To reduce the long blocking, we suggest another modification of PCP. Let us assume we have another resource R3 in the system that is used by t2 on core A; see Figure 8. This is a sub-optimal solution, because now we again have a blocking time. This blocking time may be very long and lead to deadline misses. Another problem is the reduction of concurrency within the system: we see that core B has idle times caused by the blocking of t3 on its access to s0.
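A globally controlled semaphore queue could be as simple as one shared structure recording which semaphores are set anywhere in the system; each per-core PCP instance consults it before granting a local lock. The sketch below is illustrative only, as the paper does not prescribe an implementation.

    #include <mutex>
    #include <set>

    // One instance shared by all cores. A task must claim a semaphore here
    // before locking it locally; this serializes access across cores.
    class GlobalSemaphoreQueue {
        std::mutex m_;          // protects the queue itself
        std::set<int> locked_;  // ids of semaphores currently set on some core
    public:
        // Returns true if semaphore sk was free and is now marked as locked.
        bool tryClaim(int sk) {
            std::lock_guard<std::mutex> g(m_);
            return locked_.insert(sk).second;
        }
        void release(int sk) {
            std::lock_guard<std::mutex> g(m_);
            locked_.erase(sk);
        }
    };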
Fig. 7. Resource conflicts of PCP within multicore architecture
The main idea of this modification is the following. A resource is released whenever a task has finished using it. Furthermore, if a task is pre-empted by another task on the same core, the resources locked by the pre-empted task are released. This enables tasks running on other cores to lock these resources and start executing their critical sections. We have to ensure the fairness of our approach: first of all, on each core we use PCP, so for all tasks running on the same core we have a fair situation.
Fig. 8. Modified multicore example with additional resource R3
The inter-core PCP, respectively our modified version, has to ensure real-time requirements on all cores. Let ti be a task on core A that is pre-empted by a task with a higher priority. In this case, the semaphores locked by ti are released. Let us assume ti holds only one semaphore sk. The time period for which ti is pre-empted lasts as long as the higher-priority task is running; let us call this time period bi. Semaphores are released only for the time period bi. A task tj on core B that needs sk for time bj can now lock the semaphore sk, but only if bi ≥ bj. If ti and tj try to access sk at the same time, the task with the earliest deadline gets the allowance to lock semaphore sk. Let di be the deadline of ti and dj that of tj. If di < dj, task ti will have access to sk first. It is then necessary to calculate the new deadline for tj as follows: dj = dj + li, where li is the time for which sk is needed by ti. This is the calculation for only one blocking. Figure 9 shows the schedule with the modified PCP.
Fig. 9. Optimal schedule with the modified PCP
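The contention-handling rule just described can be summarized in a few lines. In the following C++ sketch, the names (Req, mayBorrow, arbitrate) are ours, and the symmetric branch in arbitrate (extending di by lj when tj wins) is an assumption by analogy, since the text only states the case di < dj.

    #include <cstdint>

    struct Req {
        int64_t need_ms;    // how long the task needs sk (bi or bj)
        int64_t deadline;   // di or dj
    };

    // tj may take sk during ti's pre-emption window of length bi only if the
    // requested hold time fits into that window (bi >= bj).
    bool mayBorrow(int64_t bi, const Req& tj) { return bi >= tj.need_ms; }

    // Simultaneous access by ti and tj: the earlier deadline wins, and the
    // loser's deadline is pushed back by the winner's hold time.
    void arbitrate(Req& ti, Req& tj, int64_t li, int64_t lj) {
        if (ti.deadline < tj.deadline) tj.deadline += li;  // dj = dj + li
        else                           ti.deadline += lj;  // assumed symmetric case
    }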
For evaluation, the utilization factor of a schedule under our approach is as follows:

U = Σ_{i=1}^{n} (Ci + Bi) / Ti

where Ci is the computation time of task ti and Ti is the period of the task. The blocking time Bi for all semaphores of the task is added to Ci. Additionally, we have to update the deadline of ti due to the blocking: if ti is not blocked, di is not modified; otherwise, the sum of all blocking times of ti is added to the new deadline.
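As a quick bookkeeping aid, the utilization test above translates directly into code; any concrete task values one would feed in are hypothetical here, since Table 1 does not list the periods Ti.

    #include <vector>

    struct Task { double C, B, T; };  // computation time, total blocking, period

    // U = sum over all tasks of (Ci + Bi) / Ti
    double utilization(const std::vector<Task>& ts) {
        double u = 0.0;
        for (const auto& t : ts) u += (t.C + t.B) / t.T;
        return u;   // schedulability requires u <= the bound of the chosen test
    }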
This short example demonstrates the shared resource access problem. It is well known that for uni-processor systems with dynamic priority assignment of tasks, Earliest Deadline First (EDF) is optimal [9].
3 State-of-the-Art
In this chapter we give a short overview of the related work in this research field and enrich the discussion started in [1]. The authors of [2] also claim that the shared memory bus becomes a major performance bottleneck for many numerical applications on multicore chips; understanding how the increased parallelism on chip strains the memory bandwidth, and hence affects the efficiency of parallel codes, thus becomes a critical issue. They introduce the notion of memory access intensity to facilitate quantitative analysis of a program's memory behavior on multicores, which employ state-of-the-art prefetching hardware. The paper in [10] deals with the scalability, on multicore platforms, of the scheduling algorithms presented above. One main conclusion of the authors is that on multicore platforms, bus bandwidth has a negative impact on algorithms that allow migrations. Under the global approach, the scheduling overheads greatly depend on the way the run queues are implemented. On the other hand, without resource sharing, P-EDF performs well in this study; see also [1]. In [6] the authors present cache-efficient chip multiprocessor (CMP) algorithms with good speed-up for some widely used dynamic programming algorithms. They
consider three types of caching systems for CMPs: D-CMP with a private cache for each core, S-CMP with a single cache shared by all cores, and Multicore, which has private L1 caches and a shared L2 cache. Furthermore, they derive results for three classes of problems: local dependency dynamic programming (LDDP), the Gaussian Elimination Paradigm (GEP), and the parenthesis problem. For each of these classes of problems, they propose a generic CMP algorithm with an associated tiling sequence. A fundamentally new approach to increasing the timing predictability of multicore architectures, aimed at task migration in embedded environments, is described in [7]. A task migration between two cores imposes cache warm-up overheads on the migration target, which can lead to missed deadlines in tight real-time schedules. The authors propose novel micro-architectural support to migrate cache lines. The developed scheme shows dramatically increased predictability in the presence of cross-core migration. Another scheduling method for real-time systems implemented on multicore platforms, which encourages certain groups of tasks to be scheduled together while ensuring real-time constraints, is proposed in [8]. This method can be applied to encourage tasks that share a common working set to be executed in parallel, which makes more effective use of shared caches. Another good overview of resource access protocols can be found in [14] and reads as follows: "Rajkumar et al. [15] were the first to propose locking protocols for real-time multiprocessor systems. They presented two multiprocessor variants of the priority-ceiling protocol (PCP) [16] for systems where partitioned, static-priority scheduling is used. In later work, several protocols were presented for systems scheduled by P-EDF. The first such protocol was presented by Chen and Tripathi [17], but it is limited to periodic (not sporadic) task systems. In later work, Lopez et al. [18] and Gai et al. [19] presented protocols that remove such limitations, at the expense of imposing certain restrictions on critical sections (such as, in [19], requiring all global critical sections to be non-nested). A scheme for G-EDF that is also restricted was presented by Devi et al. [20]. More recently, Block et al. [21] presented the flexible multiprocessor locking protocol (FMLP), which does not restrict the kinds of critical sections that can be supported and can be used under either G-EDF or P-EDF. In the FMLP, resources are protected by either spin-based or suspension-based locks. The FMLP is the only scheme known to us that is capable of supporting arbitrary critical sections under G-EDF. Furthermore, the schemes in [20, 19, 18] are special cases of it. Thus, given our focus on G-EDF and P-EDF, it suffices to consider only the FMLP when considering lock-based synchronization." In [1] it is discussed that pure global algorithms will not scale, and thus real-time global policies need to be revisited for many-core architectures. More particularly, the scheduler should be able to control more precisely the sharing of a processor's internal resources (i.e., cache levels) by real-time tasks; with on-chip shared caches, both the small size of the caches and the memory are concerns. In [2], the overview of memory access reads as follows: "Memory bandwidth has been a fundamental issue for decades and has now become a major limitation on multicore systems ([22], [23], [24]). In S. Carr's paper ([25]), methods are proposed to balance computation and memory accesses to reduce the memory and pipeline delays for sequential code on uniprocessor machines. The authors statically estimate the
ratio of memory operations to floating point operations for each loop and use it to guide loop transformations (e.g., unroll-and-jam). The methodology from [2], which is based on the new notion of memory access intensity, targets parallel programs on multicore systems which employ sophisticated prefetching hardware. Several existing papers investigate the scalability problem on multicores ([26], [27], [28]). They make the observation that the memory bandwidth constraint can hamper program performance. However, no quantitative analyses are performed in those studies." The authors of [29] present cache-efficient chip multiprocessor (CMP) algorithms with good speed-up for some widely used dynamic programming algorithms. They consider three types of caching systems for CMPs: D-CMP with a private cache for each core, S-CMP with a single cache shared by all cores, and Multicore, which has private L1 caches and a shared L2 cache. They derive results for three classes of problems: local dependency dynamic programming (LDDP), the Gaussian Elimination Paradigm (GEP), and the parenthesis problem. For each class of problems, they develop a generic CMP algorithm with an associated tiling sequence. They then tailor this tiling sequence to each caching model and provide a parallel schedule that results in a cache-efficient parallel execution, up to the critical path length of the underlying dynamic programming algorithm.
4 Summary
With this paper we start the discussion of real-time scheduling approaches and resource access for multicore systems. Existing scheduling mechanisms have been adapted for multicore systems; obviously, new rules and policies are required, which leads to new requirements depending on the features of the multicore architecture. With this position paper we want to give an overview based on the given literature.
References
[1] Sarni, T., Queudet, A., Valduriez, P.: Real-time scheduling of transactions in multicore systems. In: Proc. of the Workshop on Massively Multiprocessor and Multicore Computers (2009)
[2] Liu, L., Li, Z., Sameh, A.H.: Analyzing memory access intensity in parallel programs on multicore. In: Proceedings of the 22nd Annual International Conference on Supercomputing, ICS 2008, Island of Kos, Greece, June 7-12, pp. 359–367. ACM, New York (2008), http://doi.acm.org/10.1145/1375527.1375579
[3] Herlihy, M., Moss, J.E.B.: Transactional memory: Architectural support for lock-free data structures. In: Proc. of the 20th Annual International Symposium on Computer Architecture, May 1993, pp. 289–300 (1993)
[4] Shavit, N., Touitou, D.: Software transactional memory. In: Proc. of the 12th Annual ACM Symposium on Principles of Distributed Computing (PODC), pp. 204–213 (1995)
[5] Tremblay, M., Chaudhry, S.: A third-generation 65nm 16-core 32-thread plus 32-scout-thread CMT SPARC processor. In: IEEE International Solid-State Circuits Conference (February 2008)
[6] Chowdhury, R.A., Ramachandran, V.: Cache-efficient dynamic programming algorithms for multicores. In: Proceedings of the Twentieth Annual Symposium on Parallelism in Algorithms and Architectures, SPAA 2008, Munich, Germany, June 14-16, pp. 207–216. ACM, New York (2008), http://doi.acm.org/10.1145/1378533.1378574
[7] Sarkar, A., Mueller, F., Ramaprasad, H., Mohan, S.: Push-assisted migration of real-time tasks in multi-core processors. SIGPLAN Not. 44(7), 80–89 (2009), http://doi.acm.org/10.1145/1543136.1542464
[8] Anderson, J.H., Calandrino, J.M.: Parallel task scheduling on multicore platforms. SIGBED Rev. 3(1), 1–6 (2006), http://doi.acm.org/10.1145/1279711.1279713
[9] Buttazzo, G.C.: Hard Real-Time Computing Systems. Kluwer Academic Publishers, Dordrecht (2000)
[10] Brandenburg, B.B., Calandrino, J.M., Anderson, J.H.: On the scalability of real-time scheduling algorithms on multicore platforms: A case study. In: Proc. of the 29th IEEE Real-Time Systems Symposium (December 2008)
[11] Baruah, S.K., Cohen, N.K., Plaxton, C.G., Varvel, D.A.: Proportionate progress: A notion of fairness in resource allocation. Algorithmica 15, 600–625 (1996)
[12] Leung, J.: Handbook of Scheduling: Algorithms, Models, and Performance Analysis. Chapman & Hall/CRC, Boca Raton (2004)
[13] Brandenburg, B.B., Calandrino, J.M., Block, A., Leontyev, H., Anderson, J.H.: Real-time synchronization on multiprocessors: To block or not to block, to suspend or spin? In: IEEE Real-Time and Embedded Technology and Applications Symposium, pp. 342–353. IEEE Computer Society, Los Alamitos (2008)
[14] Brandenburg, B.B., Calandrino, J.M., Block, A., Leontyev, H., Anderson, J.H.: Real-Time Synchronization on Multiprocessors: To Block or Not to Block, to Suspend or Spin? In: Proceedings of the 2008 IEEE Real-Time and Embedded Technology and Applications Symposium, RTAS, April 22-24, pp. 342–353. IEEE Computer Society, Washington (2008), http://dx.doi.org/10.1109/RTAS.2008.27
[15] Rajkumar, R.: Synchronization in Real-Time Systems: A Priority Inheritance Approach. Kluwer Academic Publishers, Dordrecht (1991)
[16] Sha, L., Rajkumar, R., Lehoczky, J.: Priority inheritance protocols: An approach to real-time system synchronization. IEEE Transactions on Computers 39(9), 1175–1185 (1990)
[17] Chen, C., Tripathi, S.: Multiprocessor priority ceiling based protocols. Technical Report CS-TR-3252, Univ. of Maryland (1994)
[18] Lopez, J., Diaz, J., Garcia, D.: Utilization bounds for EDF scheduling on real-time multiprocessor systems. Real-Time Systems 28(1), 39–68 (2004)
[19] Gai, P., di Natale, M., Lipari, G., Ferrari, A., Gabellini, C., Marceca, P.: A comparison of MPCP and MSRP when sharing resources in the Janus multiple processor on a chip platform. In: Proceedings of the 9th IEEE Real-Time and Embedded Technology and Applications Symposium, pp. 189–198 (2003)
[20] Devi, U., Leontyev, H., Anderson, J.: Efficient synchronization under global EDF scheduling on multiprocessors. In: Proceedings of the 18th Euromicro Conference on Real-Time Systems, pp. 75–84 (2006)
[21] Block, A., Leontyev, H., Brandenburg, B., Anderson, J.: A flexible real-time locking protocol for multiprocessors. In: Proceedings of the 13th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications, pp. 71–80 (2007)
[22] Smith, A.J.: Cache Memories. Computing Surveys 14(3), 473–530 (1982)
[23] Asanovic, K., et al.: The Landscape of Parallel Computing Research: A View from Berkeley. EECS Department, University of California, Berkeley, Technical Report No. UCB/EECS-2006-183 (December 18, 2006)
[24] Hennessy, J.L., Patterson, D.A.: Computer Architecture: A Quantitative Approach, 4th edn. (2007)
[25] Carr, S., Kennedy, K.: Improving the Ratio of Memory Operations to Floating-Point Operations in Loops. ACM Transactions on Programming Languages and Systems 16, 1768–1810 (1994)
[26] Zhang, Q., et al.: Parallelization and Performance Analysis of Video Feature Extractions on Multi-Core Based Systems. In: Proceedings of the International Conference on Parallel Processing, ICPP (2007)
[27] Alam, S.R., et al.: Characterization of Scientific Workloads on Systems with Multi-Core Processors. In: International Symposium on Workload Characterization (2006)
[28] Chai, L., et al.: Understanding the Impact of Multi-Core Architecture in Cluster Computing: A Case Study with Intel Dual-Core System. In: Cluster Computing and the Grid (2007)
[29] Chowdhury, R.A., Ramachandran, V.: Cache-efficient dynamic programming algorithms for multicores. In: Proceedings of the Twentieth Annual Symposium on Parallelism in Algorithms and Architectures, SPAA 2008, Munich, Germany, June 14-16, pp. 207–216. ACM, New York (2008), http://doi.acm.org/10.1145/1378533.1378574
Lifting the Level of Abstraction Dealt with in Programming of Networked Embedded Computing Systems (Keynote Speech) K.H. Kim* EECS Dept. & DREAM Lab, University of California, Irvine Irvine, CA 92697, USA
[email protected]

Abstract. The scale and complexity of advanced networked embedded computing (NEC) application systems are steadily growing. The need has become increasingly acute for a programming method and style that imposes a much smaller amount of detail-handling on the real-time (RT) distributed computing (DC) programmer than the currently widely used method does. With this motivation, a number of different attempts have been made toward establishing high-level RT DC objects. The TMO scheme, developed over the past 18 years by the author and his collaborators, is one of those attempts. In terms of lifting the level of abstraction of main program building-blocks, the TMO scheme has been about the most daring attempt. However, none of the attempts has yet reached a sufficient level of maturity, in that the ease of guaranteeing the timeliness of critical output actions with a high degree of precision has not been demonstrated much. Some basic principles and techniques learned from past research on TMO are briefly reviewed. Then major remaining research issues are discussed. Keywords: real time, distributed computing, networked embedded computing, TMO, time-trigger, message-trigger, object, timeliness, guarantee.
1 Motivation
In recent years the scale and complexity of advanced networked embedded computing (NEC) application systems have been steadily growing. To cope with this growing complexity, the programming styles, methods, and tools need to be significantly upgraded. The predominant style of programming used today in the young field of NEC application programming is the low-level programming that can be called the thread - UDP - thread-priority (TUP) programming.
* The author is also an adjunct faculty member of Konkuk University.
For the past two decades I have belonged to the small community which strongly believes in the need for upgrading the level of abstraction of the program building-blocks available to real-time (RT) distributed computing (DC) programmers. The fundamental problem with TUP programming is that understandable specification, design, and timeliness guaranteeing are very difficult. To be more specific, the following limitations exist:

(Pr1) The quality of advanced NEC software produced by use of the TUP approach tends to be low due to the difficulty of obtaining understandable designs.
(Pr2) The productivity of NEC software engineers is low due to the need for them to deal with tedious details, which could be avoided with a higher-level programming approach, as well as the difficulty of obtaining understandable designs.
(Pr3) The performance of advanced NEC software tends to be low due to the difficulty of understanding the performance impacts of detailed design choices and achieving optimizations. The relationship between the application requirements, especially timing requirements, and the designs of TUP programs is very difficult to trace. As a result, resources in NEC systems are used ineffectively.

Therefore, the need has become increasingly acute for a programming method and style that imposes a much smaller amount of detail-handling on the RT DC programmer. This means specifically the following:

(Up1) A higher-level RT DC program model is desired, consisting of program structures and vocabularies that are more abstract and yet do not force compromises in the degree of control of various important action timings.
(Up2) A desirable program model should exhibit more directly and clearly the relationship between itself and the application requirements that it is intended to meet, including the timing and other performance requirements.
(Up3) As the core part of the desirable program model being sought, the program construct which can play the role of the main building-block in the development of RT DC application programs was singled out from the beginning phase of the research on high-level RT DC programming.

In addition, given the insufficient understanding that had existed in the field regarding program constructs, and the desire to lift the level of abstraction which RT DC programmers would need to deal with without removing any important kind of programming power from them, the small community quickly reached the conclusion that the main building-block should be some version of the abstract data type object. Therefore, abstract RT DC object models have been at the top of the list of research targets in this area. Among a number of different attempts toward establishing such high-level RT DC objects, the efforts that have been pursued with the greatest persistence are the following three: RT CORBA [OMG05], RT Java [Bru09], and TMO (Time-triggered Message-triggered Object) [Kim97, Kim00, Liu09]. In terms of lifting the level of abstraction of main program building-blocks, the TMO scheme can be viewed as the most daring and advanced attempt. However, all three attempts have not yet reached a sufficient level of maturity, in that the ease of guaranteeing the timeliness of critical output actions with a high degree of precision, e.g., sub-millisecond-level guaranteeing of a result return from a remote object-method invocation, has not been sufficiently demonstrated. Much further research remains to be done.
In the next section, some basic principles and techniques learned from past research on establishing the high-level RT DC object, TMO, are briefly reviewed. Then major remaining research issues are discussed in Section 3.
2 Principles and Approaches Newly Explored in TMO Research
In 1992 the author decided to adopt a simple skeleton model with a concrete syntactic structure and semantics, which was initially named the RTO.k and later renamed the TMO (Time-triggered Message-triggered Object) [Kim97, Kim00, Liu09]. In the past 18 years, the progress in fully developing and maturing the TMO programming technology has been slow but steady; in fact, all efforts geared toward establishing high-level RT DC objects have been advancing at similar rates. Some parts of the TMO model were concrete mechanizations of the basic principle of global-Time based Coordination of Distributed computing Actions (TCoDA), which had first been advocated by Hermann Kopetz in Austria [Kop87, Kop97].
[Figure: internal structure of a TMO, showing the object data store (ODS) partitioned into ODS segments (ODSSs); the EAC section with capabilities for accessing other TMOs, the network environment including logical multicast channels, and I/O devices; the AAC reservation queue and service request queues; the time-triggered spontaneous methods (SpMs), with deadlines, operating in the "absolute time domain"; and the message-triggered service methods (SvMs), invoked by client TMOs under concurrency control, operating in the "relative time domain"]
Fig. 1. Structure of the TMO (adapted from [Kim97])
The basic TMO structure is depicted in Figure 1. Calling the TMO scheme a high-level DC programming scheme is justified by the following characteristics of the scheme: (1) No manipulation of processes and threads. Concurrency is specified in an abstract form at the level of object-methods. Since processes and threads are transparent to TMO programmers, the priorities assigned to them, if any, are not visible either.
(2) No manipulation of hardware-dependent features in programming interactions among objects. TMO programmers are not burdened with any direct use of low-level network protocols or any direct manipulation of physical channels and physical node addresses.
(3) Timing requirements need to be specified only in the most natural form of a time-window for every time-triggered method execution and a completion deadline for every client-requested method execution. This high-level expression matches most closely the designer's intuitive understanding of the application's timing requirements.

The program structuring principles and approaches newly explored in TMO research are summarized in this section. These principles and approaches are considered to be of a fundamental nature and useful in practical NEC programming and software engineering.

2.1 Pervasive Use of a Global Time Base
Practically all time references in a TMO are references to global time [Kop97], in that their meaning and correctness are unaffected by the location of the TMO. If GPS receivers are incorporated into the TMO execution engine, then a global time base of microsecond-level precision can be established easily. Within a cluster computer or a LAN-based DC system, a master-slave scheme, which involves time announcements by the master and exploitation of the knowledge of the message delay between the master and the slave, can be used to establish a global time base of sub-millisecond-level precision. A TMO instantiation instruction may contain a parameter which explicitly indicates the required precision of the global time base to be established by the TMO execution engine. This pervasive use of a global time base in networks of RT DC objects was first explored in the TMO scheme and is regarded as a fundamental approach of high usefulness in advanced NEC software engineering.

2.2 Deadline for Result Arrival (DRA)
TMO is a natural, syntactically minor, and semantically powerful extension of the conventional object. TMO is a DC component, and thus TMOs distributed over multiple nodes may interact via remote method calls and another mechanism discussed below. Object-methods in a TMO that can be called from other TMOs are called message-triggered service methods (SvMs). It takes two simple statements to construct a remote method call:

(Stmt1) SvMGateClass Gate1 (_T("TMO3"), _T("SvM7"), tm4_DCS_age(7*1000*1000) )
(Stmt2) Gate1.BlockingSR ( void* ParamPtr, int ParamSize, MicroSec DRA, tms ORT )

The first statement, Stmt1, instantiates a proxy object, called a TMO-method gate, for the remote TMO-method (TMO3.SvM7) to be invoked. This statement is
placed in the environment access capability (EAC) section, which is one of the four major parts of the TMO structure. The other statement, Stmt2, is a call to a built-in method of the TMO-method gate, BlockingSR(), which in turn performs a blocking type of service request to the remote TMO-method. The TMO-method gate possesses built-in methods for both blocking and non-blocking types of service requests. Note that the third parameter in Stmt2 is a deadline for return-result arrival (DRA). The delay from the instant at which this statement is executed to the instant at which the result returned from the invoked remote SvM becomes available in the node hosting the client must not exceed the specified DRA; otherwise, the TMO execution engine, which is spread over the network of computing nodes forming the DC network, raises an error signal. In order to specify the DRA appropriately, the TMO programmer needs to have a good understanding of at least the worst-case service time of the remote SvM or a tight upper bound on the service time. In fact, each SvM includes an explicit specification of a guaranteed execution duration bound (GEDB) and a maximum invocation rate (MIR) supported by it. The TMO was the first RT DC object / component which incorporated the DRA as an integral mechanized part.

2.3 Object-Data-Store Segment (ODSS) and Concurrency Control
One of the four major parts of the TMO structure is the object-data-store (ODS) section, which contains the data-container variables shared among methods of a TMO. Variables are grouped into ODS segments (ODSSs), which are the units that can be locked for exclusive use by a TMO-method in execution. Access rights of TMO-methods to ODSSs are explicitly specified and registered with the execution engine, which in turn analyzes them to exploit maximal concurrency. An execution of a TMO-method is launched only when all the ODSSs for which the TMO-method has access rights have been locked for use in that method-execution. Conversely, multiple TMO-method-executions may progress concurrently if, for each of those executions, the TMO execution engine has locked all the ODSSs needed for that method-execution. This can be viewed as an approach for "maximal exploitation of concurrency among object-methods". TMO is among the earliest RT DC objects which incorporated such a concurrency control approach.

2.4 Spontaneous Method (SpM)
TMO is an autonomous, active DC component. Its autonomous action capability stems from one of its unique parts, the time-triggered (TT) methods or spontaneous methods (SpMs), which are clearly separated from the conventional service methods (SvMs). SpM executions are triggered when the global time reaches specific values determined at design time, whereas SvM executions are triggered by service request messages from clients. For example, the triggering times may be specified as "for t = from 10am to 10:50am every 30min start-during (t, t+5min) finish-by
t+10min". All time references here are global-time references. By using SpMs, the TCoDA principle can easily be designed and realized. If an SpM has the triggering-time specification "for t = from system-startup to system-shutdown every eternity start-during (t, t + minimum-activation-delay) finish-by system-shutdown", the function body is executed from the system-startup instant until the system-shutdown instant or the completion of the function body, whichever comes first. Therefore, such an SpM can be viewed as a "thread" activated upon system startup to run the function body expressed in a high-level language form. The logic in such a function body may be designed to check the global time and sleep until the global time reaches a certain time-point. It may also access a certain pointer and invoke a local function pointed to by that pointer. The SpM as an object-method was first explored in the TMO scheme. The SpM is regarded as a fundamental approach of high usefulness in advanced NEC software engineering.

2.5 Basic Concurrency Constraint (BCC)
A major execution rule intended to reduce the designer's efforts in guaranteeing timely service capabilities of TMOs is the basic concurrency constraint (BCC), which prevents potential conflicts between SpMs and SvMs. Basically, activation of an SvM triggered by a message from an external client is allowed only when potentially conflicting SpM executions are not in place. Thus an SvM is allowed to execute only if no SpM that accesses the same ODSSs to be accessed by this SvM has an execution time-window that will overlap with the execution time-window of this SvM. The BCC does not reduce the programming power of TMO in any way. Under BCC, the timing behavior of SpMs is not impacted in complicated ways by SvM executions, especially if the CPU sharing among independent SpM executions and SvM executions is handled such that the share of CPU time each SpM execution receives can be easily calculated. Therefore, the analysis of the timing behavior of a TMO can proceed largely in two steps: first analyze the execution time bounds, which can also be viewed as the service time bounds (STBs), of SpMs, and then analyze the STBs of SvMs. The STBs of SvMs tend to be less tight than those of SpMs. BCC was first explored in the TMO scheme. The author believes that BCC, too, is a fundamental approach of high usefulness in advanced NEC software engineering.

2.6 Ordered Isolation (OI) Rule
The difficulty of analyzing competitions among method-executions depends much on the way ODSSs are locked and released. To reduce that difficulty further after incorporating BCC, the TMO scheme adopted the ordered isolation (OI) rule [Kim07b]. The OI rule can be stated by using the term initiation timestamp (I-timestamp), defined as follows:
- In the case of an SvM execution, the I-timestamp is the record of the time instant at which the execution engine initiated the SvM execution after receiving
the client request and ensuring that the SvM execution could be initiated without violating BCC.
- In the case of an SpM execution, the I-timestamp is the record of the time instant at which the SpM execution was initiated according to the AAC specification of the SpM.

The OI rule has the following two parts:
(OI-1) A method-execution with an older I-timestamp must not be waiting for the release of an ODSS held by a method-execution with a younger I-timestamp.
(OI-2) A method-execution must not be rolled back due to an ODSS conflict.

If these rules, or other rules restricting the possible types of waiting and rollback situations, are not followed, then the validation of GEDBs can be drastically more complicated. The price paid for reducing the complexity of deriving tight execution duration bounds by adopting the OI rule is the loss of some concurrency. The OI rule was first explored in the TMO scheme. A rule that allows a greater amount of concurrency than the OI rule does, and yet does not make the derivation of tight execution duration bounds of TMO-methods more difficult, is an open research topic.

2.7 Real-Time Multicast and Memory Replication Channel (RMMC)
TMOs can use another interaction mode in which messages are exchanged over logical multicast channels whose access gates are explicitly specified as data members of the involved TMOs. The channel facility is called the Real-time Multicast and Memory-replication Channel (RMMC) [Kim00, Kim05]. The RMMC scheme facilitates RT publish-subscribe channels in a powerful form. It supports not only conventional event messages but also state messages based on distributed replicated memory semantics [Kop97]. The access gates are called RMMC gates and are treated as special types of ODSSs. Access rights of TMO-methods to RMMC gates are thus explicitly specified and registered with the execution engine. RMMC gates are declared in the EAC section. The declaration of an RMMC gate includes a parameter specifying a bound on the message transmission delay over the channel. When a message is multicast over an RMMC by calling a built-in method of the corresponding RMMC gate, an official release time (ORT) is tagged onto the message. When the message reaches a computing node hosting a subscriber TMO, it cannot be opened until the ORT arrives. This ORT mechanism is useful in synchronizing message-pickups by multiple subscribers, or in ordering the message-pickups by one subscriber of messages coming over multiple RMMCs to the same TMO. The ORT is also incorporated into remote method call mechanisms. The part of the RMMC that carries event messages is an extension of the data field scheme initiated by Hitachi, Ltd. in Japan [Kim95, Mor93], and the part that carries state messages is an extension of the messaging scheme initiated by Hermann Kopetz. The ORT idea was initiated by Toshiba Corp., Japan. The following four concepts or approaches were first explored in the TMO scheme.
(1) The use of an RT logical multicast channel as a mechanism for interconnecting RT DC objects.
(2) A single RT logical multicast channel that carries both event messages and state messages.
(3) The approach of using an RMMC gate to connect an RT DC object to an RT logical multicast channel.
(4) The incorporation of the ORT into multicasts of RT event messages and state messages.

Through the TMO programming experiences over the years, the author has become convinced that the RMMC is a fundamental mechanism of high usefulness in advanced NEC software engineering.

2.8 Underground Non-Blocking Buffer (NBB) with a Pair of Access Gates
In the basic TMO scheme, there is no way for any two concurrent method-executions to exchange any data. This is because an ODSS cannot be shared when at least one method-execution needs to access it in read-write mode. In some application cases, however, it is desirable to let two long-running SpM executions exchange data streams. The TMO was recently extended by incorporating a new mechanism that enables multiple rounds of data passing from one method-execution to another and yet does not damage the nature of the TMO scheme, which makes STB validation relatively easy. The mechanism is based on the Non-Blocking Buffer (NBB) developed in recent years by several teams [Var01; Kim06; Kim07a]. An NBB used between a producer thread and a consumer thread allows the producer to insert a new data item into its internal circular buffer at any time without experiencing any blocking. If the internal circular buffer is saturated, the producer attempting to insert a new item detects this immediately and can choose to do other things for a while and then check the NBB again. Similarly, the NBB allows the consumer to retrieve a data item from the internal circular buffer at any time without experiencing any blocking. The version of the NBB that is appropriate for use between two methods in a TMO is depicted in Figure 2. Two NBBs are used. Each NBB consists of an internal circular buffer, a producer gate, and a consumer gate. The two gates are ODSSs and are registered with the execution engine. In a sense, the internal circular buffer is treated as an invisible data structure: the producer method puts a read-write lock on the producer gate, the consumer method puts a read-write lock on the consumer gate, and the two are then treated by the execution engine as two independent methods not sharing any data structure. Only the TMO designer knows that the internal circular buffer is a shared data structure; the execution engine pretends not to know this and allows the producer method-execution and the consumer method-execution to proceed concurrently. Therefore, this version of the NBB is called the underground NBB. The author believes that the NBB is an important mechanism for use in RT concurrent programs, and TMOs containing underground NBBs may be the first instance of using NBBs in RT DC programs.
Fig. 2. Underground NBBs in a TMO (adapted from [Kim07b])
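To make the insert/retrieve semantics concrete, the following C++ template is a generic single-producer/single-consumer non-blocking ring buffer in the spirit of the NBB described above: neither side ever blocks, and each call reports saturation or emptiness immediately so the caller can do other work and retry. This is a standard SPSC sketch, not the published NBB implementation of [Kim06].

    #include <atomic>
    #include <cstddef>

    template <typename T, std::size_t N>
    class NonBlockingBuffer {
        T buf_[N];
        std::atomic<std::size_t> head_{0};  // next slot to write (producer side)
        std::atomic<std::size_t> tail_{0};  // next slot to read (consumer side)
    public:
        bool tryInsert(const T& item) {     // producer: never blocks
            std::size_t h = head_.load(std::memory_order_relaxed);
            std::size_t next = (h + 1) % N;
            if (next == tail_.load(std::memory_order_acquire))
                return false;               // saturated; caller retries later
            buf_[h] = item;
            head_.store(next, std::memory_order_release);
            return true;
        }
        bool tryRetrieve(T& out) {          // consumer: never blocks
            std::size_t t = tail_.load(std::memory_order_relaxed);
            if (t == head_.load(std::memory_order_acquire))
                return false;               // empty
            out = buf_[t];
            tail_.store((t + 1) % N, std::memory_order_release);
            return true;
        }
    };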
3 Major Remaining Research Issues
Over the past 15 years, TMO execution facilities have been continuously enhanced along with the related APIs (application programming interfaces). A middleware model called the TMOSM (TMO Support Middleware) provides execution support mechanisms and can easily be adapted to a variety of commercial kernel-hardware platforms in wide use in industry [Kim99; Jen07; Kim08]. A TMO execution engine therefore consists of a group of networked computing node platforms (hardware nodes plus OS kernels) and instantiations of the TMO Support Middleware (TMOSM) running on the node platforms. TMOSM uses well-established services of commercial OS kernels, e.g., process and thread support services, short-term scheduling services, and low-level communication protocols, in a manner transparent to the application programmer. Prototype implementations of TMOSM currently exist for Windows XP / Vista, Windows CE, and Linux 2.6 platforms. Along with TMOSM, the TMO Support Library (TMOSL) has been developed [Kim99; Kim00; Kim05]. It provides a set of friendly application programming interfaces (APIs) that wrap the execution support services of TMOSM. TMOSL defines a number of C++ classes and enables convenient high-level programming by approximating a programming language that directly supports the TMO as a basic building-block. Other research teams have also developed TMO execution engines based on different kernel platforms [KimH02; KimJ05]. TMOSM, TMOSL, and other tools are available for download from the web, http://dream.eng.uci.edu/tmodownload/. A number of RT DC applications have also been developed by use of the TMO scheme [Hen08, Liu09]. Some demo applications can be seen at http://dream.eng.uci.edu/demo.
However, there are a number of issues that need to be resolved through further research before the technology can reach the level of full maturity. Some major issues are listed below.

3.1 Intermediate-Level RT DC Programming Scheme
For implementing NEC applications on small computing platforms, a DC object approach such as the TMO scheme could appear overhead-heavy. Also, transitioning from TUP programming to a high-level programming approach such as TMO programming may be too big an adjustment in many industrial environments. Therefore, there seems to be a good justification for establishing a complementary RT DC programming scheme based on a somewhat lower level of abstraction of program building-blocks. I believe that one promising direction is to take the object framework off the TMO and let SpMs become independent time-triggered functions (TTFs) and SvMs become independently threaded service functions (ITSFs). The resulting programming approach is then to build RT DC programs with TTFs, ITSFs, NBBs, logical multicast channels, and other data-sharing mechanisms; a minimal sketch of such a TTF is given at the end of this section. This intermediate-level programming approach still avoids the direct use of threads and thread-priorities. In order to avoid creating overhead-heavy environments, it will be necessary to develop kernel-level support for the building-blocks of such intermediate-level RT DC programs.

3.2 Service Time Bound (STB)
Although the TMO scheme was initiated with the goal of enabling timeliness guaranteeing, only very limited experimental research on deriving STBs of TMO-methods has been performed [Col09]. This is the primary reason why the TMO scheme, and for that matter any other RT DC programming scheme, cannot be said to have reached a sufficiently mature level. The extent of research conducted all over the world in that direction, i.e., toward establishing a sound technical foundation for the derivation of tight STBs of RT DC programs, looks quite unimpressive at present. Considering the critical needs for such research, the author hopes that the situation will change soon. To establish a useful technical foundation, much better understanding needs to be obtained on issues such as the allocation of both computing and communication resources to concurrent method-executions, allocation of resources driven by quality-of-service specifications (e.g., timing specifications in TMOs), utilization of multi-core CPUs, timeliness-guaranteed handling of I/O activities, etc.
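As a concrete illustration of the TTF notion proposed in Section 3.1, the following C++ sketch re-activates a function body at fixed time points. It is only a stand-in: the plain thread and sleep shown here are exactly the kind of detail that the envisioned kernel-level support would hide from the programmer, and the name spawnTTF is ours.

    #include <chrono>
    #include <functional>
    #include <thread>

    // Activate 'body' at 'start' and then once per 'period', for a fixed
    // number of activations. The caller joins the returned thread.
    std::thread spawnTTF(std::function<void()> body,
                         std::chrono::steady_clock::time_point start,
                         std::chrono::milliseconds period,
                         int activations) {
        return std::thread([=] {
            auto t = start;
            for (int i = 0; i < activations; ++i) {
                std::this_thread::sleep_until(t);  // wait for the trigger time
                body();                            // one activation of the TTF
                t += period;                       // next time-window
            }
        });
    }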
Acknowledgements
The research reported here has been supported in part by NSF under Grant Numbers 03-26606 (ITR) and 05-24050 (CNS), by ETRI, and by the Konkuk Univ. WCU project, No. R33-2008-000-10068-0, sponsored by MEST, Korea. No part of this paper represents the views and opinions of the sponsors mentioned above.
References
1. Bruno, E.J., Bollella, G.: Real-Time Java Programming: With Java RTS. Sun Microsystems (2009)
2. Colmenares, J.A., Kim, K.H., Kim, D.H.: Experimental Evaluation of a Hybrid Approach for Deriving Service-time Bounds of Methods in Real-time Distributed Computing Objects. In: Proc. IESS 2009 (Int'l Embedded Systems Symp.), Langenargen, Germany (2009)
3. Henrich, E., et al.: Realization of an Adaptive Distributed Sound System Based on Global-time-based Coordination and Listener Localization. In: Proc. ISORC 2008 (11th IEEE CS Int'l Symp. on Object/Component/Service-Oriented Real-time distributed Computing), Orlando, FL, May 2008, pp. 91–99 (2008)
4. Jenks, S.F., Kim, K.H., et al.: A Middleware Model Supporting Time-Triggered Message-Triggered Objects for Standard Linux Systems. Real-Time Systems - The International Journal of Time-Critical Computing Systems 36(1), 75–99 (2007)
5. Kim, K.H., Mori, K., Nakanishi, H.: Realization of Autonomous Decentralized Computing with the RTO.k Object Structuring Scheme and the HU-DF Inter-Process-Group Communication Scheme. In: Proc. IEEE 2nd Int'l Symp. on Autonomous Decentralized Systems (ISADS 1995), Phoenix, AZ, April 1995, pp. 305–312 (1995)
6. Kim, K.H.: Object Structures for Real-Time Systems and Simulators. IEEE Computer 30(8), 62–70 (1997)
7. Kim, K.H., Ishida, M., Liu, J.: An Efficient Middleware Architecture Supporting Time-Triggered Message-Triggered Objects and an NT-based Implementation. In: Proc. 2nd IEEE Int'l Symp. on Object-oriented Real-time distributed Computing (ISORC 1999), May 1999, pp. 54–63 (1999)
8. Kim, K.H.: APIs for Real-Time Distributed Object Programming. IEEE Computer 33(6), 72–80 (2000)
9. Kim, K.H., Li, Y.Q., Liu, S., et al.: RMMC Programming Model and Support Execution Engine in the TMO Programming Scheme. In: Proc. ISORC 2005 (8th IEEE CS Int'l Symp. on Object-Oriented Real-Time Distributed Computing), May 2005, pp. 34–43 (2005)
10. Kim, K.H.: A Non-Blocking Buffer Mechanism for Real-Time Event Message Communication. Real-Time Systems 32(3), 197–211 (2006)
11. Kim, K.H., Colmenares, J.A., Rim, K.W.: Efficient Adaptations of the Non-blocking Buffer for Event Message Communication between Real-Time Threads. In: Proc. ISORC 2007 (10th IEEE CS Int'l Symp. on Object & Component Oriented Real-Time Distributed Computing), Santorini, Greece, May 2007, pp. 29–40 (2007)
12. Kim, K.H., Colmenares, J.: Maximizing Concurrency and Analyzable Timing Behavior in Component-Oriented Real-Time Distributed Computing Application Systems. KIISE Journal of Computing Science and Engineering (JCSE) 1(1), 56–73 (2007), http://jcse.kiise.org/
13. Kim, K.H., Li, Y.Q., Rim, K.W., Shokri, E.: A Hierarchical Resource Management Scheme Enabled by the TMO Programming Scheme. In: Proc. ISORC 2008 (11th IEEE CS Int'l Symp. on Object/Component/Service-Oriented Real-time distributed Computing), Orlando, FL, May 2008, pp. 370–376 (2008)
14. Kim, H.J., Park, S.H., Kim, J.G., Kim, M.H., Rim, K.W.: TMO-Linux: A Linux-based Real-time Operating System Supporting Execution of TMOs. In: Proc. 5th IEEE Int'l Symp. on Object-Oriented Real-Time Distributed Computing (ISORC 2002), pp. 288–294 (2002)
15. Kim, J.G., Kim, M., Kim, K., Heu, S.: TMO-eCos: An eCos-Based Real-Time Micro-Operating System Supporting Execution of a TMO Structured Program. In: Proc. 8th IEEE Int'l Symp. on Object-Oriented Real-Time Distributed Computing (ISORC 2005), pp. 182–189 (2005)
16. Kopetz, H., Ochsenreiter, W.: Clock Synchronisation in Distributed Real-Time Systems. IEEE Trans. Computers, 933–940 (1987)
17. Kopetz, H.: Real-Time Systems: Design Principles for Distributed Embedded Applications. Kluwer Academic Publishers, Boston (1997)
18. Liu, S., Kim, K.H., Zhang, Z., Lee, S.P., Rim, K.W.: Achieving High-Level QoS in Multi-Party Video-Conferencing Systems via Exploitation of Global Time. In: Proc. ISORC 2009 (12th IEEE CS Int'l Symp. on Object/Component/Service-Oriented Real-time distributed Computing), Tokyo, Japan (March 2009)
19. Mori, K.: Autonomous Decentralized Systems: Concept, Data Field Architecture, and Future Trends. In: Proc. IEEE CS Int'l Symp. on Autonomous Decentralized Systems (ISADS 1993), March 1993, pp. 28–34 (1993)
20. Object Management Group: Real-time CORBA Specification, Version 1.2 (formal/05-01-04) (January 2005), http://www.omg.org/cgi-bin/apps/doc?formal/05-01-04.pdf
21. Varma, P.: Two Lock-Free, Constant-Space, Multiple-(Impure)-Reader, Single-Writer Structures. US Patent No. 6304924 B1 (2001)
Author Index

Ahn, Chulbum 288
Aikebaier, Ailixier 12
Altenbernd, Peter 308
Bagci, Faruk 58
Baldoni, Roberto 91, 144
Bondavalli, Andrea 69
Bonomi, Silvia 91
Brancati, Francesco 69
Brinkschulte, Uwe 82
Calha, Mario 264
Casimiro, Antonio 264
Ceccarelli, Andrea 69
Cerocchi, Adriano 144
Chang, Chun-Hyon 1
Chen, Chong-Jing 252
Chinnapongse, Vivien 203
Cho, Jong-Sik 121
Chou, Pai H. 252
Coppolino, Luigi 192
Couderc, Paul 332
Creţu, Vladimir 227
D'Antonio, Salvatore 192
Dougherty, Brian 36
Edwards, Stephen A. 276
Elespuru, Peter R. 168
Elia, Ivano Alessandro 192
Enokido, Tomoya 12
Ermedahl, Andreas 308
Falai, Lorenzo 69
Fetzer, Christof 215
Gustafsson, Jan 308
Han, Dong-Won 343
Han, Ingu 103, 114
Heu, Shin 1
Huber, Benedikt 180
Huber, Bernhard 296
Hughes, Danny 156
Huygens, Christophe 156
Jang, Jong-Hyun 343
Jenks, Stephen F. 252
Jin, Hyun-Wook 240
Jones, Paul L. 203
Joo, Su-Chong 121
Joosen, Wouter 156
Kim, Doo-Hyun 1
Kim, JungGuk 1
Kim, K.H. 365
Kim, Se-Gi 1
Kim, Sung-Jin 252
Kim, YongUk 288
Kirner, Raimund 47
Kluge, Florian 58
Lee, Insup 203
Lee, Joonwoo 288
Lee, Jung-Hyun 103, 114
Lee, Meong-Hun 121
Lee, Sunggu 24
Lee, Yong-Woong 121
Lisper, Björn 308
Lodi, Giorgia 144
Lohn, Daniel 82
Loyall, Joseph P. 320
Macariu, Georgiana 227
Marques, Luis 264
Matthys, Nelson 156
Michiels, Sam 156
Mishra, Shivakant 168
Montanari, Luca 144
Mutschelknaus, Florian 58
Nah, Yunmook 288
Obermaisser, Roman 296
Pacher, Mathias 82
Park, Chanik 24
Park, Jinha 24
Park, Kwang-Ro 343
Podolski, Ina 354
Puffitsch, Wolfgang 180
Puschner, Peter 47
Querzoni, Leonardo 144
Rammig, Franz J. 131
Raynal, Michel 91
Rettberg, Achim 354
Rim, Kee-Wook 103, 114
Romano, Luigi 192
Rufino, Jose 264
Samara, Sufyan 58
Satzger, Benjamin 58
Schantz, Richard E. 320
Schiffel, Ute 215
Schmidt, Douglas C. 36
Schmitt, André 215
Schoeberl, Martin 47, 180
Shakya, Sagun 168
Shin, Chang-Sun 121
Sokolsky, Oleg 203
Song, Seung-Hwa 1
Süßkraut, Martin 215
Takizawa, Makoto 12
Thompson, Chris 36
Ungerer, Theo 58
Vandormael, Pieter-Jan 332
Verissimo, Paulo 264
Wang, Shaohui 203
Weigert, Stefan 215
White, Jules 36
Yoe, Hyun 121
Yoo, Junbeom 240
Yoo, Sungjoo 24
Yun, Jae-Kwan 343
Zhao, Yuhong 131