Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board
David Hutchison, Lancaster University, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Alfred Kobsa, University of California, Irvine, CA, USA
Friedemann Mattern, ETH Zurich, Switzerland
John C. Mitchell, Stanford University, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz, University of Bern, Switzerland
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, TU Dortmund University, Germany
Madhu Sudan, Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Gerhard Weikum, Max Planck Institute for Informatics, Saarbruecken, Germany
7201
Jens B. Schmitt (Ed.)
Measurement, Modelling, and Evaluation of Computing Systems and Dependability and Fault Tolerance 16th International GI/ITG Conference MMB & DFT 2012 Kaiserslautern, Germany, March 19-21, 2012 Proceedings
Volume Editor

Jens B. Schmitt
University of Kaiserslautern
disco - Distributed Computer Systems Lab
Computer Science Department
Building 36, P.O. Box 3049
67663 Kaiserslautern, Germany
E-mail: [email protected]

ISSN 0302-9743 e-ISSN 1611-3349
ISBN 978-3-642-28539-4 e-ISBN 978-3-642-28540-0
DOI 10.1007/978-3-642-28540-0
Springer Heidelberg Dordrecht London New York

Library of Congress Control Number: 2012932064
CR Subject Classification (1998): C.2, C.4, C.1, D.2.8, D.2, D.4.8, D.4
LNCS Sublibrary: SL 2 – Programming and Software Engineering

© Springer-Verlag Berlin Heidelberg 2012
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
Preface
This volume contains a selection of the papers presented at the 16th International GI/ITG Conference on Measurement, Modelling and Evaluation of Computing Systems and Dependability and Fault Tolerance (MMB & DFT 2012), held during March 19–21, 2012 in Kaiserslautern and hosted by the University of Kaiserslautern. MMB & DFT 2012 covered diverse aspects of performance and dependability evaluation of systems including networks, computer architectures, distributed systems, software, and fault-tolerant and secure systems. This biennial conference has a long tradition, starting as early as 1981.

Besides its main scientific program, MMB & DFT 2012 comprised two keynotes, two tutorials from academic and industrial experts, several tool presentations, as well as three workshops. Specifically, we were very happy to have the keynote talks by Anja Feldmann (TU Berlin / Deutsche Telekom Laboratories) on "Internet Architecture Trends" and by Lothar Thiele (ETH Zürich) on "Modeling and Evaluation of Thermal System Properties."

A specialty of this edition of MMB & DFT were the integrated workshops featuring certain topics (with their own calls for papers):

– Workshop on Network Calculus (WoNeCa), organized by Anne Bouillard (ENS, France), Markus Fidler (Leibniz University of Hannover), and Florin Ciucu (TU Berlin / Deutsche Telekom Laboratories)
– Workshop on Modeling and Analysis of Complex Reaction Networks (MACoRN), organized by Werner Sandmann (TU Clausthal) and Verena Wolf (Saarland University)
– Workshop on Physically Augmented Security for Wireless Networks (PILATES), organized by Matthias Hollick (TU Darmstadt), Ivan Martinovic (University of Oxford), and Dirk Westhoff (HAW Hamburg)

Overall we received 54 submissions, 36 to the main conference (including 6 tool papers) and 18 to the workshops, by authors from 17 different countries. Each submission was reviewed by at least 3, and on average 3.9, Program Committee members. In a physical TPC meeting with further technical discussions, 26 of these submissions were selected for inclusion in this volume.

On behalf of the TPC, we would like to thank all authors who submitted their work to MMB & DFT 2012. We hope that all authors appreciate the hard work of the TPC members and found their feedback and suggestions valuable. We would like to express our debt and gratitude to all the members of the TPC, and the external reviewers, for being so responsive and for their timely and valuable reviews.
We are grateful to everyone involved in the organization of the MMB & DFT 2012 conference, as well as to the speakers and the attendees of the conference. We also appreciate the excellent support of EasyChair in managing the processes of submission, reviewing, and preparing the final version of the proceedings. January 2012
Jens B. Schmitt
Organization
MMB & DFT 2012 was organized by the Distributed Computer Systems Lab, University of Kaiserslautern, Germany.
Organizing Committee

General and Program Chair: Jens Schmitt
Local Organization Chairs: Steffen Bondorf, Steffen Reithermann, Carolin Reffert-Schmitt
Tools Chair: Hao Wang
Submission Chair: Matthias Wilhelm
Publication Chair: Michael Beck
Web Chair: Adam Bachorek
Publicity Chair: Wint Yi Poe
Program Committee

Lothar Breuer, University of Kent, UK
Peter Buchholz, TU Dortmund, Germany
Joachim Charzinski, Hochschule der Medien Stuttgart, Germany
Hans Daduna, Universität Hamburg, Germany
Klaus Echtle, Universität Duisburg-Essen, Germany
Bernhard Fechner, Universität Augsburg, Germany
Markus Fidler, Leibniz Universität Hannover, Germany
Reinhard German, Universität Erlangen-Nürnberg, Germany
Boudewijn Haverkort, University of Twente, The Netherlands
Gerhard Haßlinger, Deutsche Telekom, Germany
Holger Hermanns, Universität des Saarlandes, Germany
Joost-Pieter Katoen, RWTH Aachen, Germany
Jörg Keller, FernUniversität in Hagen, Germany
Peter Kemper, The College of William and Mary, USA
Udo Krieger, Otto-Friedrich-Universität Bamberg, Germany
Wolfram Lautenschläger, Alcatel-Lucent, USA
Axel Lehmann, Universität der Bundeswehr München, Germany
Ralf Lehnert, TU Dresden, Germany
Erik Maehle, Universität zu Lübeck, Germany
Michael Menth, Universität Tübingen, Germany
Bruno Müller-Clostermann, Universität Duisburg-Essen, Germany
Peter Reichl, Forschungszentrum Telekommunikation Wien, Austria
Anne Remke, University of Twente, The Netherlands
Johannes Riedl, Siemens AG, Germany
Francesca Saglietti, Universität Erlangen-Nürnberg, Germany
Werner Sandmann, TU Clausthal, Germany
Jens Schmitt, TU Kaiserslautern, Germany
Markus Siegle, Universität der Bundeswehr München, Germany
Helena Szczerbicka, Leibniz Universität Hannover, Germany
Aad Van Moorsel, Newcastle University, UK
Oliver Waldhorst, Karlsruher Institut für Technologie, Germany
Max Walter, TU München, Germany
Verena Wolf, Universität des Saarlandes, Germany
Bernd Wolfinger, Universität Hamburg, Germany
Katinka Wolter, FU Berlin, Germany
Armin Zimmermann, TU Ilmenau, Germany
Additional Reviewers

Hernán Baró Graf, Universität des Saarlandes, Germany
Matthias Becker, Leibniz Universität Hannover, Germany
Martin Drozda, Leibniz Universität Hannover, Germany
Christian Eisentraut, Universität des Saarlandes, Germany
Philipp Eittenberger, Otto-Friedrich-Universität Bamberg, Germany
Luis María Ferrer Fioriti, Universität des Saarlandes, Germany
Klaus-Dieter Heidtmann, Universität Hamburg, Germany
Michael Hoefling, Universität Tübingen, Germany
Oliver Hohlfeld, TU Berlin, Germany
Andrey Kolesnikov, Universität Hamburg, Germany
Minh Lê, TU München, Germany
Alfons Martin, Universität Tübingen, Germany
Linar Mikeev, Universität des Saarlandes, Germany
Jorge Perez-Hidalgo, TU Dresden, Germany
Martin Riedl, Universität der Bundeswehr München, Germany
Johann Schuster, Universität der Bundeswehr München, Germany
Falak Sher, RWTH Aachen, Germany
David Spieler, Universität des Saarlandes, Germany
Mark Timmer, University of Twente, The Netherlands
Sebastian Vastag, TU Dortmund, Germany
Hannes Weisgrab, Forschungszentrum Telekommunikation Wien, Austria
Keynote Talks at MMB & DFT 2012
Modeling and Evaluation of Thermal System Properties Lothar Thiele, ETH Zurich
[email protected]

Power density has been continuously increasing in modern processors, leading to high on-chip temperatures. A system could fail if the operating temperature exceeds a certain threshold, leading to low reliability and even chip burnout. There have been many results in recent years on thermal management, including (1) thermal-constrained scheduling to maximize performance or determine the schedulability of real-time systems under given temperature constraints, (2) peak temperature reduction to meet performance constraints, and (3) thermal control by applying control theory for system adaptation. The presentation will cover challenges, problems and approaches to real-time scheduling under temperature constraints for single- as well as multi-processor systems.
Internet Architecture Trends Anja Feldmann, TU Berlin / T-Labs
[email protected]

The ever growing demand for information of Internet users is putting a significant burden on the current Internet infrastructure, whose architecture has been more or less unchanged over the last 30 years. Indeed, rather than adjusting the architecture, small fixes, e.g., MPLS, have been deployed within the core network. Today, new technical abilities enable us to rethink the Internet architecture. In this talk we first highlight how Internet usage has changed in the era of user-generated content. Then we explore two technology trends: cloud networks and open hardware/software interfaces. Virtualization, a main motor for innovation, decouples services from the underlying infrastructure and allows for resource sharing while ensuring performance guarantees. Server virtualization is widely used, e.g., in clouds. However, cloud virtualization alone is meaningless without taking into account the network needed to access the cloud resources and data: cloud networks. Current infrastructures are limited to using the tools provided by the hardware vendors, as there are hardly any open software stacks available for network devices in the core. This hurts innovation. However, novel programming interfaces for network devices, e.g., OpenFlow, provide open hardware/software interfaces and may enable us to build a network OS with novel features. We outline initial work in this area.
Table of Contents
Full Papers

Availability in Large Networks: Global Characteristics from Local Unreliability Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hans Daduna and Lars Peter Saul
1
Stochastic Analysis of a Finite Source Retrial Queue with Spares and Orbit Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Feng Zhang and Jinting Wang
16
Bounds for Two-Terminal Network Reliability with Dependent Basic Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Minh Lê and Max Walter
31
Software Reliability Testing Covering Subsystem Interactions . . . . . . . . . . Matthias Meitner and Francesca Saglietti
46
Failure-Dependent Timing Analysis - A New Methodology for Probabilistic Worst-Case Execution Time Analysis . . . . . . . . . . . . . . . . . . . Kai Höfig
61
A Calculus for SLA Delay Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sebastian Vastag
76
Verifying Worst Case Delays in Controller Area Network . . . . . . . . . . . . . . Nikola Ivkovic, Dario Kresic, Kai-Steffen Hielscher, and Reinhard German
91
Lifetime Improvement by Battery Scheduling . . . . . . . . . . . . . . . . . . . . . . . . Marijn R. Jongerden and Boudewijn R. Haverkort
106
Weighted Probabilistic Equivalence Preserves ω-Regular Properties . . . . . Arpit Sharma
121
Probabilistic CSP: Preserving the Laws via Restricted Schedulers . . . . . . Sonja Georgievska and Suzana Andova
136
Heuristics for Probabilistic Timed Automata with Abstraction Refinement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Luis María Ferrer Fioriti and Holger Hermanns
151
Simulative and Analytical Evaluation for ASD-Based Embedded Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ramin Sadre, Anne Remke, Sjors Hettinga, and Boudewijn Haverkort
166
Reducing Channel Zapping Delay in WiMAX-Based IPTV Systems . . . . Alireza Abdollahpouri and Bernd E. Wolfinger
182
Performance Evaluation of 10GE NICs with SR-IOV Support: I/O Virtualization and Network Stack Optimizations . . . . . . . . . . . . . . . . . Shu Huang and Ilia Baldine
197
Business Driven BCM SLA Translation for Service Oriented Systems . . . Ulrich Winkler, Wasif Gilani, and Alan Marshall
206
Boosting Design Space Explorations with Existing or Automatically Learned Knowledge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ralf Jahr, Horia Calborean, Lucian Vintan, and Theo Ungerer
221
Tool Papers

IBPM: An Open-Source-Based Framework for InfiniBand Performance Monitoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Michael Hoefling, Michael Menth, Christian Kniep, and Marcus Camen
236
A Workbench for Internet Traffic Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . Philipp M. Eittenberger and Udo R. Krieger
240
A Modelling and Analysis Environment for LARES . . . . . . . . . . . . . . . . . . Alexander Gouberman, Martin Riedl, Johann Schuster, and Markus Siegle
244
Simulation and Statistical Model Checking for Modestly Nondeterministic Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jonathan Bogdoll, Arnd Hartmanns, and Holger Hermanns
249
UniLoG: A Unified Load Generation Tool . . . . . . . . . . . . . . . . . . . . . . . . . . . Andrey Kolesnikov
253
Selected Workshop Papers

Non Preemptive Static Priority with Network Calculus: Enhancement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . William Mangoua Sofack and Marc Boyer
258
A Demand-Response Calculus with Perfect Batteries . . . . . . . . . . . . . . . . . Jean-Yves Le Boudec and Dan-Cristian Tomozei
273
A Formal Definition and a New Security Mechanism of Physical Unclonable Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Rainer Plaga and Frank Koob
288
Modeling and Analysis of a P2P-VoD System Based on Stochastic Network Calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Kai Wang, Yuming Jiang, and Chuang Lin
302
Using NFC Phones for Proving Credentials . . . . . . . . . . . . . . . . . . . . . . . . . . Gergely Alpár, Lejla Batina, and Roel Verdult
317
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
331
Availability in Large Networks: Global Characteristics from Local Unreliability Properties Hans Daduna and Lars Peter Saul University of Hamburg, Department of Mathematics, Mathematical Statistics and Stochastic Processes, Bundesstrasse 55, 20146 Hamburg, Germany
Abstract. We apply mean-field analysis to compute global availability in large networks of generalized SIS and voter models. The main results provide comparison and bounding techniques of the global availability depending on the local degree structure of the networks. Keywords: Reliability, SIS model, voter models, mean field analysis, stochastic ordering, convex order, bounding global availability.
1 Introduction
The research described in this paper is on the reliability theory of large networks of interacting components. We are interested in the characterization of the availability of network resources, which will be described as the availability of a typical node under a global averaging process. The technical tool we use to quantify global availability is mean field analysis. This averaging principle is a standard technique in statistical mechanics and has recently found interest in other fields where large systems of interacting simple entities and their quantitative behavior are described. Seemingly, this growing interest in mean field analysis parallels the emergence of network science in different fields of applications. Examples are the theory of social networks; for a review see [Lam09].

Our starting point are the epidemic models used by Jackson and Rogers [JR07], which build on the mathematical theory of a generalized SIS model describing epidemics spreading in populations (SIS ≡ Susceptible-Infected-Susceptible). Similar SIS models have recently been used to describe and investigate peer-to-peer networks, the propagation of computer viruses in the Internet, the diffusion of innovation in economic communities, and networks of mobile computers, where interconnection of the computers is organized by (push- or pull-like) protocols which determine a regime for continuous exchange of local information about the status of the network's members ("gossiping"). Applications of mean field analysis to gossip-based models for diffusion of information in large networks can be found, e.g., in [BCFH11], [BCFH09], and furthermore in [BGFv09] and [BMM07].
We set available ≡ susceptible and unavailable ≡ infected in our availability analysis, and our aim is to determine the portion of available (functioning) nodes over a large network. The formulas obtained allow a parametric analysis of global network availability depending on the local characteristics of the nodes. A similar investigation was undertaken by Jackson and Rogers [JR07] with respect to the diffusion of innovation in economic communities. Some of our research directions are influenced by that paper, although the application we have in mind requires more complicated behavior of the individual nodes than the SIS models provide.

Fundamental for SIS models is that susceptible individuals become infected at a rate that is strongly dependent on the number of infected neighbors, while the recovery of infected individuals follows a rule which is independent of the status of their neighbors, see (1) and (3) below. The latter assumption is unrealistic in the context of unreliable networks of queues. A more appropriate assumption is that the breakdown structure and the repair structure should be of a comparable level of complexity. This is the case, e.g., in the well-established principle of reduced workload or capacity: in a queueing network with unreliable nodes, for each node the individual availability (portion of time the node is up) is computed, and then its service capacity is individually reduced to that portion. For more details and more elaborate versions of this principle see [CM96].

Voter models [Lig85] can be considered as a symmetrization of the classical SIS model, where changing state (e.g., political preferences) in both directions follows the same mechanism, defined by the infection process. In a similar spirit we will symmetrize the transition mechanisms of Jackson and Rogers [JR07] and will therefore arrive at a network model where the individual nodes' behavior follows the rules of a generalized voter model and a generalized SIS model.
1.1 Connections to Network Science and Epidemic Models
Queueing networks and their structure are closely related to the networks of the recently emerging network science, but usually are of a higher order of complexity with respect to the individual nodes' behavior. In network science, in almost any case, we can think of the network as consisting of individuals and interconnections between them; the individuals are represented as vertices and the connections between individuals as edges of a graph. Therefore our models are from the realm of Graph Theory and, because of the emergence of random effects and influences, of Random Graph Theory. The emergence of "network science" over the last decade relies on different predecessor fields where large-scale structures, described by graphs, constitute an important aspect of real-world phenomena, like social, biological, physical, and informational networks. The rapid development of the still very diverse field has resulted in several recent books and surveys, e.g., [DM03], [DM10], [Jac08], [BBV08]. The diffusion mechanisms which describe, e.g., the propagation of information in networks are borrowed from models for the spreading of epidemics. The modeling of epidemics is described and surveyed in, e.g., the classical book [Bai75] and in [AB00], [DG01]. Epidemics are described as contact processes on regular lattices
in the theory of Markovian interacting particle systems [Lig85][Chapter VI]. The models in the present paper generalize the models from [JR07], [LP08]. The research presented here is part of our ongoing investigation of queueing networks of the Jackson or Gordon-Newell network types under the condition that the nodes (servers, stations) are unreliable and can break down. Working periods and subsequent repair phases of the servers are random. For integrated models which encompass performance analysis and availability in a closed model (performability analysis) see [SD03], [Sau06], [HMRT01]. Another field to which our results are related is the modeling and investigation of disruptive (delay-tolerant) wireless (possibly mobile) sensor networks, see [WDW07].
2 Availability Modeling

2.1 Finite Networks
In this section we describe the behavior of the networks on the micro level in finite systems. This will make the mean field model easier to understand. To describe the availability of interacting nodes we use a finite undirected graph G = (V, E) with vertices V = {1, 2, . . . , J} and edges E ⊆ V² \ diag(V²), without multiple edges between vertices. The vertices represent stations which are either up ≡ functioning ≡ susceptible (= state 0 for the node) or down ≡ under repair ≡ infected (= state 1 for the node). Nodes interact with one another if they are neighbors, i.e., if they are connected by an edge. N_i := {j ∈ V : (i, j) ∈ E} is the neighborhood of node i and d_i := |N_i| is the degree of node i. An important characteristic of the network is the degree distribution p = (p(d) : d ∈ ℕ), which is defined as a discrete probability with finite support:

p(d) = |{i ∈ V : d_i = d}| / J,   d ∈ ℕ.
Although for a given network it is a deterministic quantity, it will be considered in the following as a statistical descriptor of the network. Unless otherwise specified, we will always require that p(0) < 1 holds. The states of the network are vectors n = (n₁, . . . , n_J) ∈ {0, 1}^J =: S which describe the states of all nodes (up or down). We assume that the development of the system can be described by a continuous-time (time-homogeneous) Markov process X = (X(t) : t ≥ 0) with state space {0, 1}^J. This requires prescribing the non-zero transition intensities (rates) Q = (q(n, m) : n, m ∈ S) for X. These rates are characterized by parameters ν, δ > 0 and are proportional to the respective status of the node's neighborhood and a constant drift x, y ≥ 0.

Definition 1 (Neighborhood-dependent local breakdown rates and repair rates). Assume that the state of the network at time t ≥ 0 is n ∈ S.
4
H. Daduna and L.P. Saul
If node i ∈ {1, . . . , J} is up, i.e. n_i = 0, its breakdown rate is

q((n₁, . . . , n_{i−1}, 0, n_{i+1}, . . . , n_J), (n₁, . . . , n_{i−1}, 1, n_{i+1}, . . . , n_J)) = ν(∑_{j∈N_i} n_j + x),   (1)

where ν > 0 is the spreading rate, i.e., a parameter describing the amount of breakdown rate similar to an infection transmission, and x ≥ 0 is a constant rate with which a functioning node breaks down independently of the status of its neighborhood.
If node i ∈ {1, . . . , J} is down, i.e. n_i = 1, then its repair rate is

q((n₁, . . . , n_{i−1}, 1, n_{i+1}, . . . , n_J), (n₁, . . . , n_{i−1}, 0, n_{i+1}, . . . , n_J)) = δ(∑_{j∈N_i} (1 − n_j) + y),   (2)

where δ > 0 is, in the epidemic context, the spreading rate of recovery, and y ≥ 0 is a constant rate with which an infected individual becomes susceptible independently of the states of its neighborhood.

A repair rate which depends on the behavior of the nodes in the neighborhood in a similar way as the breakdown rate is a reasonable property in modeling unreliable queueing networks. The breakdown-repair models arising then are variants of the well-known voter model [BBV08][Section 10.3] in the context of Markov interacting particle systems, see e.g. [Lig85][Chapter 5]. The local "repair rates" in the model from [JR07] are: If at time t ≥ 0 in state n ∈ S node i ∈ {1, . . . , J} is down, i.e. n_i = 1, then its repair rate is
(3)
The process with transition mechanism (1), (3) is known as the generalized SIS model in the epidemics literature. The effective spreading rate in the network with neighborhood-dependent breakdown and repair rates is λ := ν/δ > 0, which is assumed throughout to be strictly positive. The process with transition mechanism (1), (2) is known as the (generalized) voter model. The distinction from the work in [LP06], [JY07] is that the rates do depend on the actual status (opinion) of the voter himself. As Lopez-Pintado [LP08][p. 576] remarked, an explicit analysis of X is extremely complicated. The way out of this is to consider approximate models with averaging over a large population, i.e., mean field analysis [JR07], [LP08].

Example 1. Assume that the network graph is complete, i.e., for any i ∈ V we have N_i = V \ {i}: any two nodes are connected by an edge. In this situation, under the above assumptions, the total number of infected individuals Z(t) at time t ≥ 0 is a Markov process Z = (Z(t) : t ≥ 0) with discrete state space {0, 1, 2, . . . , J}. In fact, it is a finite birth-death process with transition rates

q(n, n + 1) = (J − n)ν(n + x),   n < J,
q(n, n − 1) = nδ(J − n + y),   n > 0.
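For x, y > 0 the process Z of Example 1 is an ergodic finite birth-death chain (cf. the Discussion in Section 6), so its stationary distribution has the usual product form. The following sketch, with illustrative parameter values that are not taken from the paper, evaluates it:

```python
import numpy as np

# Sketch: stationary distribution of the birth-death process Z in Example 1.
# All parameter values below are illustrative assumptions.
def stationary_Z(J, nu, delta, x, y):
    birth = np.array([(J - n) * nu * (n + x) for n in range(J)])          # q(n, n+1)
    death = np.array([n * delta * (J - n + y) for n in range(1, J + 1)])  # q(n, n-1)
    # product form of birth-death chains: pi(n) ~ prod_{k<n} birth[k] / death[k]
    weights = np.concatenate(([1.0], np.cumprod(birth / death)))
    return weights / weights.sum()

pi = stationary_Z(J=20, nu=0.3, delta=1.0, x=0.5, y=0.5)
print("mean number of down nodes:", pi @ np.arange(21))
```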
2.2 Infinite Networks: Mean Field Description for Neighborhood Dependent Breakdown and Repair Rates
We now assume that the graph of the network has an infinite number of vertices V := ℕ and is locally finite, i.e., any node i ∈ V has only a finite number d_i := |N_i| < ∞ of neighbors. The set of degree numbers d_i, i ∈ ℕ, need not be bounded. We assume that the degree distribution of the network is a well-defined discrete probability density p = (p(d) : d ∈ ℕ), and consider d : ℕ → ℕ as a random variable with distribution p. We interpret p as the degree distribution of a typical node in the network, or p(d) as the probability that a randomly chosen node has degree d. We always require that the average degree is finite: E_p(d) := ∑_{d∈ℕ} p(d)d < ∞. Unless stated otherwise (some remarks on totally disconnected networks will be given) we always require p(0) < 1. Note that under p(0) < 1 we have E_p(d) > 0.

Definition 2. The probability that a randomly selected link originates from a node with degree n (see [JR07][p. 3]) is q = (q(n) : n ∈ ℕ), computed as

q(n) := p(n)n / ∑_{d∈ℕ} p(d)d,   n ∈ ℕ.
The interplay of the following time-dependent quantities determines the analysis; for more details and a convenient interpretation see [JR07][p. 3] for rates (1), (3); a similar interpretation is appropriate in case of rates (1), (2):

– ρ_t(d) ∈ [0, 1]: the average down rate among nodes with degree d at time t, defined as the portion of broken down nodes at time t among nodes with degree d,
– ρ_t = ∑_{d∈ℕ} p(d)ρ_t(d) ∈ [0, 1]: the average down rate in the network at time t, defined as the portion of broken down nodes at time t in the network,
– θ_t = (∑_{d∈ℕ} p(d)dρ_t(d)) (∑_{d∈ℕ} p(d)d)⁻¹: the average neighbor down rate, defined as the portion of broken down nodes at time t in the neighborhood of a randomly selected node. θ_t can be interpreted as the probability that at time t some node at the end of a randomly selected link is broken down.

A mean-field approximation for the development of the system is determined by a system of differential equations which have an appealing intuitive interpretation. For all d ∈ ℕ

∂ρ_t(d)/∂t = (1 − ρ_t(d))ν(θ_t d + x) − ρ_t(d)δ((1 − θ_t)d + y),   t ≥ 0.   (4)
The most important analysis of the model is the search for steady states for ρ_t(d) and θ_t. This requires the left-hand side of (4) to be constant zero and consequently the right-hand side to be independent of t. This makes the steady-state quantities ρ(d) and θ amenable to an equilibrium analysis, which yields

ρ(d) = λ(θd + x) / (λ(θd + x) + (1 − θ)d + y)   (5)
and

θ = ∑_{d∈ℕ} [p(d)d / ∑_{d'∈ℕ} p(d')d'] · [λ(θd + x) / (λ(θd + x) + (1 − θ)d + y)].   (6)
Solving (6) for θ is equivalent to solving a fixed-point problem for the function

H_p : [0, 1] → ℝ,   θ ↦ H_p(θ) = ∑_{d∈ℕ} [p(d)d / ∑_{d'∈ℕ} p(d')d'] · [λ(θd + x) / (λ(θd + x) + (1 − θ)d + y)],
and the fixed points of H_p are by definition the stationary points (or stationary states) of the mean-field equations.
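Numerically, a stationary average neighbor down rate can be found by fixed-point iteration of H_p: since H_p is increasing (Lemma 1 below), iterating from θ = 1 converges monotonically to the largest fixed point. The sketch below uses a truncated Poisson degree distribution; the distribution and all parameter values are illustrative assumptions, not taken from the paper:

```python
import numpy as np
from scipy.stats import poisson

def H(theta, p, lam, x, y):
    # H_p(theta) for a degree distribution given as an array p[d], d = 0..dmax;
    # we assume x > 0 or y > 0, so the denominator never vanishes.
    d = np.arange(len(p))
    q = p * d / np.sum(p * d)                       # weights q(d) of Definition 2
    r = lam * (theta * d + x) / (lam * (theta * d + x) + (1 - theta) * d + y)
    return np.sum(q * r)

def largest_fixed_point(p, lam, x, y, tol=1e-12, max_iter=100_000):
    theta = 1.0                                     # monotone iteration from above
    for _ in range(max_iter):
        new = H(theta, p, lam, x, y)
        if abs(new - theta) < tol:
            break
        theta = new
    return theta

dmax = 80
p = poisson.pmf(np.arange(dmax + 1), mu=5.0)        # illustrative degree distribution
p /= p.sum()
theta_bar = largest_fixed_point(p, lam=0.5, x=0.2, y=0.3)
```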
3 Infinite Networks: Fixed Point Analysis
For the case of neighborhood-independent local repair rates, the fixed points of the function θ ↦ H_p(θ) are determined and classified in [JR07]. We classify the fixed points associated with (6) similarly, but the stability pattern is more complicated. On the other hand, we can prove that the patterns now obey some nice symmetry structures. We consider degree distributions p = (p(d) : d ∈ ℕ) with p(0) < 1. Recall E_p(d) := ∑_{d∈ℕ} p(d)d, and the effective spreading rate λ := ν/δ > 0.

Lemma 1. The function H_p on [0, 1] has the following properties:
(1) 0 ≤ H_p(θ) ≤ 1.
(2) H_p(0) = 0 ⇔ x = 0.
(3) H_p(1) = 1 ⇔ y = 0.
(4) H_p is continuously differentiable.
(5) H_p is strictly increasing on [0, 1].

Furthermore,

H_p is strictly concave ⇔ λ > 1 ⇔ ν > δ,
H_p is strictly convex ⇔ λ < 1 ⇔ ν < δ,
H_p is linear ⇔ λ = 1 ⇔ ν = δ.
The proof of the lemma is by direct computation. A surprising consequence is

Theorem 1. For a degree distribution p with p(0) < 1, λ > 0, x, y ≥ 0, and for λ* := λ⁻¹, it holds that H_p(θ) = 1 − H*_p(1 − θ) with

H*_p(θ) = ∑_{d∈ℕ} [p(d)d / E_p(d)] · [λ*(θd + y) / (λ*(θd + y) + (1 − θ)d + x)].

The proof of the theorem is by direct computation. An important consequence of the theorem is that for a fixed point θ̄ of H_p, additionally θ̄* = 1 − θ̄ is a stationary point (of H*_p) as well, because θ̄ = H_p(θ̄) = 1 − H*_p(1 − θ̄) ⇔ H*_p(1 − θ̄) = 1 − θ̄.
We shall prove that under certain parameter settings there may exist more than one fixed point, which indicates that the limiting distributions will depend on initial conditions. We will not investigate this in detail here and consider only the possible limiting pictures.
3.1 Infinite Networks: Stationary States for Mean Field Models
Existence of solutions of the fixed-point equation (6) depends on the parameter setting of the model. Whenever such a stationary point exists, we denote by θ̄ the stationary average neighbor down rate, by ρ̄(d) the stationary average down rate among nodes with degree d, and by ρ̄ the stationary average down rate. From (5) follows

ρ̄(d) = λ(θ̄d + x) / (λ(θ̄d + x) + (1 − θ̄)d + y),   (7)

and in the case of stationarity we can compute

ρ̄ = ∑_{d∈ℕ} p(d)ρ̄(d).   (8)

We remark en passant that from (7) and (8) it follows that for a totally disconnected network with x > 0 or y > 0,

ρ̄ = λx / (λx + y).   (9)
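Continuing the numerical sketch from Section 2.2 (all concrete values remain illustrative), equations (7) and (8) turn a computed fixed point θ̄ into the stationary down rates:

```python
def down_rates(p, lam, x, y, theta_bar):
    # rho_bar(d) from (7) and the network average rho_bar from (8)
    d = np.arange(len(p))
    rho_d = lam * (theta_bar * d + x) / (
        lam * (theta_bar * d + x) + (1 - theta_bar) * d + y)
    return rho_d, np.sum(p * rho_d)

rho_d, rho_bar = down_rates(p, lam=0.5, x=0.2, y=0.3, theta_bar=theta_bar)
# Consistency check of (9): with p(0) = 1 (totally disconnected network),
# rho_d[0] = lam*x/(lam*x + y), so rho_bar = lam*x/(lam*x + y).
```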
In the following, p(0) = 1 will be excluded.

Theorem 2. For p(0) < 1 there exists in any case a stationary state. Denote

a := E_p(d) · (∑_{d∈ℕ} p(d) d²/(d + y))⁻¹ ≥ 1   and   b := (∑_{d∈ℕ} p(d) d²/(d + x)) · (E_p(d))⁻¹ ≤ 1.
For x = y = 0, θ̄₁ = 0 and θ̄₂ = 1 are stationary. We can characterize the existence of additional stationary states as a function of the effective spreading rate λ, see Table 1.

Table 1. Stationary states as function of λ

              λ < 1                                  λ = 1             λ > 1
x = y = 0     θ̄₁ = 0, θ̄₂ = 1                        all θ̄ ∈ [0, 1]    θ̄₁ = 0, θ̄₂ = 1
x = 0 < y     θ̄₁ = 0                                 θ̄₁ = 0            θ̄₁ = 0; for λ > a also θ̄₂ ∈ (0, 1)
x > y = 0     θ̄₁ = 1; for λ < b also θ̄₂ ∈ (0, 1)    θ̄₁ = 1            θ̄₁ = 1
x, y > 0      θ̄₁ ∈ (0, 1)                            θ̄₁ ∈ (0, 1)       θ̄₁ ∈ (0, 1)

Proof. We distinguish the cases λ > 1, λ < 1 and λ = 1.
For λ > 1: H_p is strictly concave (Lemma 1). If x > 0, then H_p(0) > 0 and therefore only one stationary point can exist. For y = 0 this is, by Lemma 1, θ̄ = 1; for y > 0 we have θ̄ ∈ (0, 1).
Table 2. Stationary states as function of x and y

          y = 0                                           y > 0
x = 0     θ̄ = 0 and θ̄ = 1 (for λ = 1: all θ̄ ∈ [0, 1])    θ̄ = 0
x > 0     θ̄ = 1                                           no θ̄ ∈ {0, 1}

If x = 0, then θ̄₁ = 0 is a fixed point. If ∂H_p(0)/∂θ > 1 holds, then H_p has another fixed point, because H_p is strictly concave. We have

∂H_p(0)/∂θ = ∑_{d∈ℕ} [p(d)d / E_p(d)] · λd / (d + y).
So H_p has a second fixed point θ̄₂ if and only if λ > a. For y = 0, by Lemma 1, this is θ̄₂ = 1; for y > 0 the second fixed point lies in θ̄₂ ∈ (0, 1).
For λ < 1: H_p is strictly convex (Lemma 1). If y > 0, then H_p(1) < 1 and therefore only one stationary point can exist. If x = 0 this is θ̄₁ = 0; otherwise, for x > 0, it lies in θ̄₁ ∈ (0, 1).
If y = 0, then θ̄₁ = 1 is a fixed point. If ∂H_p(1)/∂θ > 1 holds, then a second fixed point θ̄₂ exists, because H_p is strictly convex. We have

∂H_p(1)/∂θ = ∑_{d∈ℕ} [p(d)d / E_p(d)] · d / (λ(d + x)).

So a second fixed point exists if and only if λ < b. For x = 0 this is (Lemma 1) θ̄₂ = 0, while for x > 0 it lies in θ̄₂ ∈ (0, 1). (Note that this result also follows from Theorem 1.)
For λ = 1: H_p is linear (Lemma 1), and for all θ ∈ [0, 1],

∂H_p(θ)/∂θ = ∑_{d∈ℕ} [p(d)d / E_p(d)] · d / (d + x + y) ≤ ∑_{d∈ℕ} p(d)d / E_p(d) = 1.   (∗)
If x = y = 0, equality holds in (∗) and we have H_p(θ) = θ, so all θ̄ ∈ [0, 1] are fixed points. In all other cases only one fixed point exists. For x = 0 and y > 0 it is θ̄₁ = 0, for x > 0 and y = 0 it is θ̄₁ = 1, and for x, y > 0 it lies in θ̄₁ ∈ (0, 1).

From Tables 1 and 2 we see all cases in which θ̄ ∈ {0, 1} can (and will) occur; the proof is in the Appendix. For θ̄ ∈ (0, 1) we can provide more information.

Theorem 3. Let θ̄ ∈ (0, 1) be a stationary state of a network with degree distribution p with p(0) ≠ 1, and let x > 0 or y > 0. Then
(i) λ = 1 ⇔ θ̄ = x/(x + y),
(ii) λ > 1 ⇔ θ̄ > x/(x + y),
(iii) λ < 1 ⇔ θ̄ < x/(x + y).
Proof. We can assume that x > 0 or y > 0 holds, because otherwise from Table 1 we see that for λ = 1 and x = y = 0 no stationary state θ̄ ∈ (0, 1) exists, and for λ ≠ 1 we have x, y > 0 by assumption. So x/(x + y) is well defined, and H_p(x/(x + y)) = λx/(λx + y).

(i) For x, y > 0, λ = 1 is obviously equivalent to θ̄ = x/(x + y).

(ii) For λ > 1 we see from Table 1 that for y = 0 only θ̄₁ = 0 and θ̄₂ = 1 can be stationary. So we have y > 0. If θ̄ ≤ x/(x + y) would hold, then H_p(x/(x + y)) ≤ x/(x + y), because θ̄ < 1 is the maximal value with H_p(θ) = θ and H_p is strictly concave. But H_p(x/(x + y)) = λx/(λx + y) > x/(x + y), which is a contradiction.
Conversely, if θ̄ ∈ (x/(x + y), 1), we must have y > 0 (otherwise x/(x + y) = 1 or x/(x + y) is not well defined). It follows that H_p(x/(x + y)) > x/(x + y). If λ ≤ 1, then H_p(x/(x + y)) = λx/(λx + y) ≤ x/(x + y), which is a contradiction.

(iii) For λ < 1 we see from Table 1 that for x = 0 only θ̄₁ = 0 and θ̄₂ = 1 can be stationary. So we have x > 0. We consider H*_p from Theorem 1, which is strictly increasing, and obtain for λ* = λ⁻¹ > 1, using (ii) for the starred system,

H_p(x/(x + y)) = 1 − H*_p(y/(x + y)) > 1 − H*_p(θ̄*) = H_p(1 − θ̄*) = H_p(θ̄).

Because H_p is strictly increasing in θ, it follows that θ̄ < x/(x + y). Conversely, if θ̄ ∈ (0, x/(x + y)), then from parts (i) and (ii) we conclude λ < 1.
So for λ = 1, i.e., if breakdown and repair rates equalize, we see that the average neighbor down rate is uniquely determined by the state-independent breakdown and repair rates. In the unbalanced situation, x/(x + y) provides bounds for the proportion of down nodes in the neighborhood of a typical node, and y/(x + y) provides bounds for the global availability; the proof is in the Appendix.
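Theorems 2 and 3 can be checked numerically with the earlier sketch. The following lines (again with purely illustrative parameters) evaluate the thresholds a and b of Theorem 2 and locate the fixed points of H_p on a grid:

```python
def thresholds(p, x, y):
    # a and b from Theorem 2; the terms with d = 0 contribute nothing
    d = np.arange(1, len(p))
    pd = p[1:]
    Ed = np.sum(pd * d)
    a = Ed / np.sum(pd * d**2 / (d + y))
    b = np.sum(pd * d**2 / (d + x)) / Ed
    return a, b

lam, x, y = 1.5, 0.0, 0.3                      # the case x = 0 < y with lam > 1
a, b = thresholds(p, x, y)
grid = np.linspace(0.0, 1.0, 4001)
vals = np.array([H(t, p, lam, x, y) for t in grid]) - grid
fixed_points = grid[np.where(np.diff(np.sign(vals)) != 0)[0]]
# Theorem 2: besides theta = 0, a second fixed point in (0, 1) appears
# exactly when lam > a; Theorem 3 places it above x/(x + y) when lam > 1.
```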
4 Infinite Networks: Stochastic Orderings
In this section we provide a parametric analysis of the networks' global stationary availability states under variation of the degree distributions. Our procedure is: we compare two networks which are identical in all of their defining fundamental characteristics other than the degree distributions, which we denote by p_i = (p_i(n) : n ∈ ℕ) for i = 1, 2. To be more precise: given a first network with degree distribution p₁, replace p₁ by p₂ to obtain the second network.

4.1 Stochastic Order for Balanced Breakdown and Repair Rates
In this case we have λ = 1 and obtain by direct computations from (8) and (7), and using (i) from Theorem 3:

Proposition 1. The stationary average neighbor down rate under λ = 1 is independent of the degree distribution, and for x = y = 0 all θ̄ ∈ [0, 1] are stationary. In all other cases the solution sets are discrete; see Table 3. In any case, θ̄ = ρ̄ holds.
Table 3. Values of θ̄ and ρ̄ under λ = 1

            θ̄                 ρ̄
x = y = 0   all θ ∈ [0, 1]    θ
y > x = 0   0                 0
x > y = 0   1                 1
x, y > 0    x/(x + y)         x/(x + y)
Corollary 1. Consider two networks with different degree distributions p_i = (p_i(n) : n ∈ ℕ), i = 1, 2, which are identical otherwise. Let θ̄₁, resp. θ̄₂, denote the largest average neighbor breakdown rates and ρ̄₁, resp. ρ̄₂, the largest steady-state overall breakdown rates, and suppose θ̄ᵢ ∈ (0, 1). If λ = ν/δ = 1 then θ̄₂ = θ̄₁ = ρ̄₂ = ρ̄₁.

Compared with the result of Theorem 4 below, the result of the corollary is somewhat surprising. The interpretation is: in case λ = ν/δ = 1 the effects of the neighborhood-induced breakdowns and repairs compensate perfectly under any degree distribution.

4.2 Stochastic Order for Unbalanced Breakdown and Repair Rates
For λ ≠ 1 we can prove a more detailed comparison of degree distributions.

Definition 3. p₁ is stochastically greater than p₂ (write p₁ ≥st p₂ or p₂ ≤st p₁) if

∑_{d=ℓ}^∞ p₁(d) ≥ ∑_{d=ℓ}^∞ p₂(d)   ∀ ℓ ∈ ℕ.   (10)

p₁ is greater than p₂ in the convex (stochastic) order (write p₁ ≥cx p₂ or p₂ ≤cx p₁) if for all convex functions f : ℕ → ℝ,

∑_{d=0}^∞ p₁(d)f(d) ≥ ∑_{d=0}^∞ p₂(d)f(d),   if both sums exist.   (11)
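For distributions with common finite support {0, . . . , N}, both orders in Definition 3 can be verified directly. The following helper functions are our own sketch, not from the paper; the convex-order check uses the standard characterization via equal means and the convex "wedge" functions d ↦ (d − k)⁺, where for integer-valued d it suffices to test integer k:

```python
def st_greater(p1, p2, tol=1e-12):
    # p1 >=_st p2 iff every tail sum of p1 dominates that of p2, cf. (10)
    t1 = np.cumsum(p1[::-1])[::-1]
    t2 = np.cumsum(p2[::-1])[::-1]
    return bool(np.all(t1 >= t2 - tol))

def cx_greater(p1, p2, tol=1e-12):
    # p1 >=_cx p2 iff the means agree and E[(d - k)^+] dominates for all k
    d = np.arange(len(p1))
    if abs(np.sum(p1 * d) - np.sum(p2 * d)) > tol:
        return False
    for k in d:
        w = np.clip(d - k, 0, None)
        if np.sum(p1 * w) < np.sum(p2 * w) - tol:
            return False
    return True
```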
Recall from Definition 2 q_i(n), the probability that a randomly selected link originates from a node with degree n. We generalize Theorem 1 of [JR07].

Theorem 4. Consider two networks with degree distributions p_i = (p_i(n) : n ∈ ℕ), i = 1, 2, which are identical otherwise. Let θ̄₁, resp. θ̄₂, denote the largest average neighbor breakdown rates and ρ̄₁, resp. ρ̄₂, the largest steady-state overall breakdown rates, and suppose θ̄ᵢ ∈ (0, 1). If λ = ν/δ ≠ 1 and if p₁ ≥st p₂ and q₁ ≥st q₂, then

(i) λ > 1 ⟹ θ̄₁ ≥ θ̄₂ and ρ̄₁ ≥ ρ̄₂,
(ii) λ < 1 ⟹ θ̄₁ ≤ θ̄₂ and ρ̄₁ ≤ ρ̄₂.
Proof. (i) We can assume θ̄₁ ≠ θ̄₂. From Table 2 we conclude for θ̄₁, θ̄₂ ∈ (0, 1) that H_p₁(1) < 1 holds for

θ ↦ H_p₁(θ) = (1/E_p₁(d)) ∑_{d∈ℕ} p₁(d)d · λ(θd + x) / (λ(θd + x) + (1 − θ)d + y).
θ̄₁ is by assumption the largest θ ∈ (0, 1) with H_p₁(θ) = θ, so H_p₁(θ) < θ for all θ ∈ (θ̄₁, 1]. So, if θ̄₁ < θ̄₂ would hold, we would have H_p₁(θ̄₂) < θ̄₂. For any θ ∈ (0, 1) we can read (5) as a definition of the function ρ : ℕ → ℝ. We can formally extend this function to g : ℝ₀⁺ → ℝ with g(d) := λ(θd + x) / (λ(θd + x) + (1 − θ)d + y), which is differentiable with

∂g(d)/∂d = λ(θ(x + y) − x) / (λ(θd + x) + (1 − θ)d + y)² > 0   ∀ θ > x/(x + y).

So, for all θ ∈ (x/(x + y), 1), ρ(d) is strictly increasing in d. From q₁ ≥st q₂ follows E_q₁(f(d)) ≥ E_q₂(f(d)) for all increasing functions f, and we conclude

H_p₁(θ) = ∑_{d∈ℕ} q₁(d)ρ(d) ≥ ∑_{d∈ℕ} q₂(d)ρ(d) = H_p₂(θ),   ∀ θ ∈ (x/(x + y), 1],
which implies θ̄₂ > H_p₁(θ̄₂) ≥ H_p₂(θ̄₂). This contradicts the fact that θ̄₂ is a stationary average neighbor down rate. We must have θ̄₁ > θ̄₂.
It remains to prove the statement on the average down rates ρ(d), which for fixed d are strictly increasing in θ for all d > 0. By (4) in Lemma 1,

∂ρ(d)/∂θ = λd(d + x + y) / (λ(dθ + x) + (1 − θ)d + y)² > 0   ∀ θ ∈ [0, 1].

For θ̄₁ = θ̄₂ we have ρ̄₁ = ρ̄₂. Assume θ̄₁ > θ̄₂; then for all d > 0 the stationary average down rates of nodes with degree d satisfy ρ̄₁(d) > ρ̄₂(d), which is used in the first inequality below. Using p₂ ≤st p₁ in the second inequality yields

ρ̄₁ = ∑_{d∈ℕ} p₁(d)ρ̄₁(d) > ∑_{d∈ℕ} p₁(d)ρ̄₂(d) ≥ ∑_{d∈ℕ} p₂(d)ρ̄₂(d) = ρ̄₂.
(ii) For the case λ < 1, Theorem 3 says θ̄ᵢ ∈ [0, x/(x + y)). Analogously to (i), the function −g(d) is strictly increasing in d, because

∂g(d)/∂d = λ(θ(x + y) − x) / (λ(θd + x) + (1 − θ)d + y)² < 0   ∀ θ ∈ [0, x/(x + y)).

So for these θ the function −ρ(d) is strictly increasing in d, and from q₁ ≥st q₂:

H_p₁(θ) = −∑_{d∈ℕ} q₁(d)(−ρ(d)) ≤ −∑_{d∈ℕ} q₂(d)(−ρ(d)) = H_p₂(θ).
If θ̄₁ ≤ θ̄₂ holds, the proof is completed. Assume θ̄₁ > θ̄₂: because θ̄₂ is the greatest θ ∈ (0, 1) with H_p₂(θ) = θ, it follows that H_p₂(θ) < θ for all θ ∈ (θ̄₂, 1]. Summarizing, we have θ̄₁ > H_p₂(θ̄₁) ≥ H_p₁(θ̄₁). This contradicts the fact that θ̄₁ is a stationary average neighbor down rate. So θ̄₁ < θ̄₂.
Analogously to (i) we have ρ̄₁(d) < ρ̄₂(d) for all d > 0, because ρ(d) is increasing in θ and θ̄₁ < θ̄₂. Utilizing p₁ ≥st p₂ for the second inequality, we finally obtain

ρ̄₁ = ∑_{d∈ℕ} p₁(d)ρ̄₁(d) < ∑_{d∈ℕ} p₁(d)ρ̄₂(d) ≤ ∑_{d∈ℕ} p₂(d)ρ̄₂(d) = ρ̄₂.

Theorem 5. Consider two networks with degree distributions p_i = (p_i(n) : n ∈ ℕ), i = 1, 2, which are identical otherwise. Let θ̄₁, resp. θ̄₂, denote the largest average neighbor breakdown rates, and suppose θ̄ᵢ ∈ (0, 1). If λ = ν/δ ≠ 1 and if p₁ ≥cx p₂, then

(i) λ > 1 ⟹ θ̄₁ ≥ θ̄₂, and
(ii) λ < 1 ⟹ θ̄₁ ≤ θ̄₂.
Proof. We can assume θ̄₁ ≠ θ̄₂. Define, for a degree distribution p with p(0) < 1,

f : ℝ₀⁺ → ℝ,   d ↦ f(d) := (d / E_p(d)) · λ(θd + x) / (λ(θd + x) + (1 − θ)d + y).
f(d) is well defined, because E_p(d) ≠ 0 and x and y do not vanish concurrently, since otherwise no stationary θ ∈ (0, 1) would exist. f(d) is twice differentiable in d.

(i) If λ > 1, from Theorem 3 the stationary values satisfy θ̄ᵢ > x/(x + y), and for all θ > x/(x + y) we have ∂²f(d)/∂d² > 0, so f(d) is strictly convex in this case. From p₁ ≥cx p₂ we therefore conclude

H_p₁(θ) = ∑_{d∈ℕ} p₁(d)f(d) ≥ ∑_{d∈ℕ} p₂(d)f(d) = H_p₂(θ).

Assume now that θ̄₁ < θ̄₂ holds. Then θ̄₂ > H_p₁(θ̄₂), because θ̄₁ is the greatest θ ∈ (0, 1) with H_p₁(θ) = θ and because H_p₁(θ) is concave. So θ̄₂ > H_p₁(θ̄₂) ≥ H_p₂(θ̄₂), which contradicts that θ̄₂ is a stationary state for p₂. It follows θ̄₁ > θ̄₂.
(ii) For λ < 1, θ̄ᵢ < x/(x + y) holds and ∂²f(d)/∂d² < 0. So in this case −f(d) is strictly convex, and therefore

H_p₁(θ) = ∑_{d∈ℕ} p₁(d)f(d) = −∑_{d∈ℕ} p₁(d)(−f(d)) ≤ −∑_{d∈ℕ} p₂(d)(−f(d)) = ∑_{d∈ℕ} p₂(d)f(d) = H_p₂(θ).

Assume now θ̄₁ > θ̄₂: then θ̄₁ > H_p₂(θ̄₁) ≥ H_p₁(θ̄₁), which contradicts the fact that θ̄₁ is a stationary state for p₁. It follows θ̄₁ < θ̄₂.
5 Comparison of Global Availability and Bounds
Theorems 4 and 5 describe the consequences for the availability when degree distributions increase or become more variable. Note that, for simpler notation, we denoted by θ_t the portion of broken down nodes at time t in the neighborhood of a randomly selected node (≡ average neighbor down rate). So, Av_t := 1 − θ_t is the portion of available nodes at time t in the neighborhood of a randomly selected node; denote Av = 1 − θ, which can be interpreted as the "global availability". The interesting case in applications to the reliability of service networks is λ < 1, i.e., the breakdown rate parameter ν is less than the repair rate parameter δ. We consider two networks with degree distributions p_i, i = 1, 2, which are structurally identical otherwise.

(1) If p₁ ≥st p₂ and q₁ ≥st q₂ holds, from Theorem 4 (ii) we conclude for the respective largest average neighbor down rates θ̄ᵢ: Av₁ ≥ Av₂, i.e., the global availability in network 1 is greater than in network 2, which under λ < 1 is intuitive: more connectivity in the network increases the force of the neighborhood-dependent characteristics, which enforces the impact of repair against breakdown because ν < δ.

(2) If p₁ ≥cx p₂ holds, equality of the average node degrees E_{p_i}(d) := m, i = 1, 2, follows, see [MS02][Theorem 1.5.3]. From Theorem 5 (ii) we conclude for the respective largest average neighbor down rates θ̄ᵢ: Av₁ ≥ Av₂, i.e., the global availability in network 1 is greater than in network 2, which under λ < 1 seems less intuitive than the conclusion under (1): more variability generates more availability.

We conclude furthermore: for all degree distributions p with fixed mean E_p(d) := m ∈ ℕ, we have a guaranteed (minimal) global availability Av_min, which is given by the network with fixed degree m = constant: Av_min = 1 − x/(x + y) = y/(x + y). This is the bound which we obtained already in Theorem 3 (iii). The reason behind this observation is that in the class of all distributions on ℝ with fixed mean m, the one-point distribution in m is the minimum under the convex order ≥cx, see [MS02][Example 1.10.5].
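A small numerical experiment in the spirit of (2), reusing the earlier sketches (all distributions and parameters are made-up illustrations): a one-point degree distribution p₂ at m and a mean-preserving spread p₁ of it satisfy p₁ ≥cx p₂, and for λ < 1 the computed availabilities should satisfy Av₁ ≥ Av₂ > Av_min = y/(x + y):

```python
m, lam, x, y = 5, 0.5, 0.2, 0.3
N = 11
p2 = np.zeros(N); p2[m] = 1.0                        # fixed degree m
p1 = np.zeros(N); p1[m - 2] = 0.5; p1[m + 2] = 0.5   # mean-preserving spread
assert cx_greater(p1, p2)

Av1 = 1 - largest_fixed_point(p1, lam, x, y)
Av2 = 1 - largest_fixed_point(p2, lam, x, y)
Av_min = y / (x + y)
print(Av1, Av2, Av_min)                              # expect Av1 >= Av2 > Av_min
```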
6 Discussion
We did not prove rigorously the validity of the mean field approximation in Section 2.2. This would be a difficult task. On the other hand, the approach is acknowledged as natural in many fields, e.g., for physical systems or chemical reactions. An example in the field of communication systems is in [PM84], where mean field analysis is used to obtain macro-level performance metrics for an "Interconnection Network" of completely reliable nodes. Nevertheless, there are fundamental differences between finite state space models (see, e.g., Example 1) and an associated mean field model. These are well
known in epidemic models, where a deterministic description of an epidemic usually reveals at least one steady state, while the associated micro-level models are transient with a unique absorbing state 0. In Example 1 we can classify this completely: if x = y = 0 then (even under complete connectivity) states 0 and J are both absorbing, while under x = 0 and y > 0 only 0 is absorbing, and under y = 0 and x > 0 only J is absorbing. Only for y > 0 and x > 0 is the process Z ergodic with a unique proper steady state.

The averaging principle behind the definition of, e.g., the portion of broken down nodes over the network resembles flow approximations (Law of Large Numbers limits) in queueing networks. Another connection is developed in [DPR08], where an increasing sequence of finite cycles of exponential queues is investigated. The point of interest there, for unboundedly growing networks (in the number of nodes and the number of customers), is the average throughput, taken as the network's throughput per node. The main result gives conditions on the asymptotic profile of the network sequence which guarantee the existence of a proper limit of the sequence of average throughputs.

As indicated in the Introduction, the results in this paper are motivated by the investigation of large networks of unreliable services. Our aim is to combine the approach of [DPR08] with the development presented here and to attack problems beyond limiting average throughput. The investigation and classification of performability in the mean field limit seems to be an open problem.

Acknowledgement. We thank the referees for careful reading of the manuscript and for their helpful suggestions.
References

[AB00] Anderson, H., Britton, T.: Stochastic Epidemic Models and Their Statistical Analysis. Lecture Notes in Statistics. Springer, New York (2000)
[Bai75] Bailey, N.J.: The Mathematical Theory of Infectious Diseases and Its Applications. Hafner Press, New York (1975)
[BBV08] Barrat, A., Barthelemy, M., Vespignani, A.: Dynamical Processes on Complex Networks. Cambridge University Press, Cambridge (2008)
[BCFH09] Bakhshi, R., Cloth, L., Fokkink, W., Haverkort, B.R.: Mean-field analysis for the evaluation of gossip protocols. In: Proceedings of the Sixth International Conference on Quantitative Evaluation of Systems, pp. 247–256. IEEE Computer Society (2009)
[BCFH11] Bakhshi, R., Cloth, L., Fokkink, W., Haverkort, B.R.: Mean-field framework for performance evaluation of push-pull gossip protocols. Performance Evaluation 68, 157–179 (2011)
[BGFv09] Bakhshi, R., Gavidia, D., Fokkink, W., van Steen, M.: An analytical model of information dissemination for a gossip-based protocol. Computer Networks 53, 2288–2303 (2009)
[BMM07] Le Boudec, J.-Y., McDonald, D., Mundinger, J.: A generic mean field convergence result for systems of interacting objects. In: Proceedings of the Fourth International Conference on Quantitative Evaluation of Systems, pp. 3–15. IEEE Computer Society (2007)
[CM96]
Chakka, R., Mitrani, I.: Approximate solutions for open networks with breakdowns and repairs. In: Kelly, F.P., Zachary, S., Ziedins, I. (eds.) Stochastic Networks, Theory and Applications. Royal Statistical Society Lecture Notes Series, vol. 4, ch. 16, pp. 267–280. Clarendon Press, Oxford (1996)
[DG01] Daley, D.J., Gani, J.: Epidemic Modelling: An Introduction. Cambridge University Press, Cambridge (2001)
[DM03] Dorogovtsev, S.N., Mendes, J.F.F.: Evolution of Networks. Oxford University Press, Oxford (2003) (reprint 2004)
[DM10] Draief, M., Massoulie, L.: Epidemics and Rumours in Complex Networks. London Mathematical Society Lecture Note Series, vol. 369. Cambridge University Press, Cambridge (2010)
[DPR08] Daduna, H., Pestien, V., Ramakrishnan, S.: Throughput limits from the asymptotic profile of cyclic networks with state-dependent service rates. Queueing Systems and Their Applications 58, 191–219 (2008)
[HMRT01] Haverkort, B.R., Marie, R., Rubino, G., Trivedi, K.: Performability Modeling, Technique and Tools. Wiley, New York (2001)
[Jac08] Jackson, M.O.: Social and Economic Networks. Princeton University Press, Princeton (2008)
[JR07] Jackson, M.O., Rogers, B.W.: Relating network structure to diffusion properties through stochastic dominance. The B.E. Journal of Theoretical Economics 7(1), 1–13 (2007)
[JY07] Jackson, M.O., Yariv, L.: Diffusion of behavior and equilibrium properties in network games. American Economic Review (Papers and Proceedings) 97, 92–98 (2007)
[Lam09] Lamberson, P.J.: Linking network structure and diffusion through stochastic dominance. In: Complex Adaptive Systems and the Threshold Effects: Views from the Natural and Social Sciences, pp. 76–82. Association for the Advancement of Artificial Intelligence (2009); Papers from the AAAI Fall Symposium: FS-09-03
[Lig85] Liggett, T.M.: Interacting Particle Systems. Grundlehren der mathematischen Wissenschaften, vol. 276. Springer, Berlin (1985)
[LP06] Lopez-Pintado, D.: Contagion and coordination in random networks. International Journal of Game Theory 34, 371–381 (2006)
[LP08] Lopez-Pintado, D.: Diffusion in complex social networks. Games and Economic Behavior 62, 573–590 (2008)
[MS02] Müller, A., Stoyan, D.: Comparison Methods for Stochastic Models and Risks. Wiley, Chichester (2002)
[PM84] Pinski, E., Yemini, Y.: A statistical mechanics of some interconnection networks. In: Gelenbe, E. (ed.) Performance 1984, pp. 147–158. North-Holland, Amsterdam (1984)
[Sau06] Sauer, C.: Stochastic product form networks with unreliable nodes: Analysis of performance and availability. PhD thesis, University of Hamburg, Department of Mathematics (2006)
[SD03] Sauer, C., Daduna, H.: Availability formulas and performance measures for separable degradable networks. Economic Quality Control 18, 165–194 (2003)
[WDW07] Wang, Y., Dang, H., Wu, H.: A survey on analytic studies of Delay-Tolerant Mobile Sensor Networks. Wireless Communications and Mobile Computing 7, 1197–1208 (2007)
Stochastic Analysis of a Finite Source Retrial Queue with Spares and Orbit Search Feng Zhang and Jinting Wang Department of Mathematics, Beijing Jiaotong University, Beijing, 100044, China {zhangfeng,jtwang}@bjtu.edu.cn
Abstract. This paper aims at presenting an analytic approach for investigating a single server finite-source retrial queue with spares and constant retrial rate. We assume that there is a single repair facility (server) and K independent parts (customers) in the system. The customers’ life times are assumed to be exponentially distributed random variables. Once a customer breaks down, it is sent for repair immediately. If the server is idle upon the failed customer’s arrival, the customer receives repair immediately. The failed customer that finds the server busy upon arrival enters into the retrial orbit. Upon completion of a repair, the server searches for a customer from orbit if any. However, when a new primary customer arrives during the seeking process, the server interrupts the seeking process and serves the new customer. There are some spares for substitution of failed machines and the system is maintained by replacing failed part by spares and by repairing failed parts so that they may become spares when they are repaired. We carry out the steady-state analysis of the model and obtain various steady-state performance measures. Keywords: Quasi-random input, orbital search, spares, busy period, waiting time.
1 Introduction
Retrial queues have been widely used to model many problems arising in telephone switching systems, telecommunication networks, computer networks, etc. The main characteristic of a retrial queue is that a customer who finds the service facility busy upon arrival is obliged to leave the service area, but some time later he comes back to re-initiate his demand. Between trials a customer is said to be in "orbit". The literature on retrial queueing systems is very extensive. For a recent account, readers may refer to the books of Falin and Templeton [8] and Artalejo and Gómez-Corral [5] that summarize the main models and methods. Most studies in the literature assume the source size of primary customers to be infinite, and then the flow of primary calls can be modeled by a Poisson process. However, when the customer population is of moderate size, it seems more
appropriate that retrial queueing systems should be studied as systems with a finite source of customers. In the queueing literature this is called the machine interference problem (MIP) or machine repairman problem, which can be used to model a wide variety of real systems, see [10]. In these situations, it is important to take into account the fact that the rate of generation of new primary calls decreases as the number of customers in the system increases. Such a finite-source queue is also known as a queue with quasi-random input. Retrial queues with quasi-random input are of recent interest in modeling many practical systems such as magnetic disk memory systems, cellular mobile networks, computer networks, and local-area networks; see [4, 9] for detailed descriptions. Since [11], there has been a rapid growth in the literature on this topic [1–3, 6, 7, 15].

Recently, retrial queueing systems with a finite source of customers where the servers search for customers after service have been investigated in several papers [16–18]. The authors used the software MOSEL (Modeling, Specification and Evaluation Language) to formulate and solve their problems. Such retrial models are typically used in the study of computer networks, and they differ from the majority of articles in the retrial queueing literature, where each blocked customer joins the retrial orbit and becomes a source of repeated requests for service at rate ν, independently of the other customers. Under this classical retrial policy the total retrial rate is jν when there are j customers in the orbit. In contrast to this, for some applications in computer and communication networks, one is interested in designing finite-source retrial queues with search for orbital customers immediately after a service completion, where the time between two successive repeated attempts is controlled by the server. Consequently, the total retrial rate is a constant θ, independently of the number j of customers in orbit. This constant retrial policy is also used to model systems in which the blocked customers leave their contact details when they find the server busy. The server then seeks a customer at a constant rate θ after a service completion among those who have left their contact details. Some related studies on the constant retrial policy can be found in [3, 6, 13], among others.

In this paper we study a single-server finite-source retrial queue with spares and orbit search. Such a model arises in the maintenance of various practical systems. We assume that there exists a single repair facility (server) and K independent machines (customers) in the system. The customers have exponential lifetimes, and once a customer breaks down, it is sent to the server for repair immediately. If the server is idle upon arrival, the customer receives repair immediately. Otherwise, the failed customer enters a pool of retrial customers, called the orbit, for later repair. There are some spares for substitution of failed machines, and the system is maintained by replacing failed parts by spares and by repairing failed parts so that they may become spares when they are repaired. An examination of the literature shows that there is no work on an analytic approach for investigating a finite-source retrial queue taking into account the constant retrial rate and spares. This motivates us to investigate such queueing systems in this work.
18
F. Zhang and J. Wang
The organization of this paper is as follows. The model under consideration is described and several main performance characteristics are obtained in Section 2. Section 3 investigates the busy period of the server. In Section 4, the waiting time distribution is discussed. In Section 5 we show some numerical examples to illustrate the impact of the parameters on the system performance. Finally, some conclusions are given in Section 6.
2
Model Description
We consider a single-server retrial queueing system with no waiting space in which the primary calls are generated by K, 1 < K < ∞, homogeneous sources. The server can be in two states: idle and busy. If the server is idle, it can serve the calls of the sources. The service times of the customers are assumed to be exponentially distributed random variables at rate μ. If a source is free at time t it can generate a primary call during interval (t, t + dt) with probability αdt. If the server is free at the time of arrival of a call then the call starts to be served immediately, the source moves into the under service state and the server moves into busy state. The customers that find the server busy upon arrival abandon the system but leave their contact details; hence, we can think that they join a “virtual” retrial orbit or that they are registered in a server’s waiting list, i.e., the service order in the retrial orbit is first-come first-served. After finishing service, a customer leaves the system and the server seeks to serve a customer from the retrial orbit. The time required to find a customer from the retrial orbit is assumed exponentially distributed with rate θ. However, when a new primary call arrives during the seeking process, the server interrupts the seeking process and serves the new call. We assume that the input stream of primary sources, service times, seeking times are mutually independent. The finite-source model can be generalized to include the use of spares. We assume now that there are K machines in operation plus an additional M spares. When a machine in operation fails, a spare is immediately substituted for it (if available). Once a machine is repaired, it becomes a spare, unless the system is short, in which case the repaired machine goes immediately into service. At any given time, there are at most K machines in operation, so the rate of failures is at most Kα (i.e., spares that are not in operation do not contribute to the failure rate). From the description of the model, we represent the state of the system at time t by a pair (C(t), N (t)), where C(t) denotes the state of the server (0: idle, 1: busy) and N (t) records the number of customers in the orbit, i.e., a (virtual) queue of retrial customers with FIFO scheduling is maintained to record the state of the orbit. It should be noted that the situation C(t) = 0, N (t) = K + M is impossible and thus the state space of the process (C(t), N (t)) is the set {0, 1} × {0, 1, . . . , K + M − 1}. We define the probabilities as follows: pij (t) = P {C(t) = i, N (t) = j},
i = 0, 1,
0 ≤ j ≤ K + M − 1.
A Finite Source Retrial Queue with Spares and Orbit Search
19
Since the state space of the process (C(t), N (t)) is both finite and irreducible for all values of the generation rate of primary calls α, i.e., all states are pairwise reachable from each other via a finite number of transitions. Hence, the CTMC is positive recurrent which implies ergodicity. From now on, the system will be assumed to be in the steady state. Then we let pij be the limiting probabilities as t → ∞ and the balance equations for the stationary distribution are Kαp00 = μp10 , (Kα + θ)p0j = μp1j , ((K + M − j)α + θ)p0j = μp1j ,
(1) j = 1, 2, . . . , M − 1, j = M, M + 1, . . . , K + M − 1,
(Kα + μ)p1j = Kαp0j + θp0,j+1 + Kαp1,j−1 , j = 0, 1, . . . , M − 1,
(2) (3) (4)
((K + M − 1 − j)α + μ)p1j = (K + M − j)αp0j + θp0,j+1 +(K + M − j)αp1,j−1 , j = M, M + 1, . . . , K + M − 1, (5) where p0,−1 = p0,K+M = 0. From (1) and (4), we have Kαp10 = θp01 ,
(6)
together with (2) and (4) we obtain j = 0, 1, . . . , M − 1.
Kαp1j = θp0,j+1 ,
(7)
Putting j = M − 1 in (7) and substituting (3) into (5), we get (K + M − 1 − j)αp1j = θp0,j+1 ,
j = M, M + 1, . . . , K + M − 2.
(8)
From (1) and (6), we have p01 =
K 2 α2 p00 . μθ
(9)
Substituting (2) into (7) and (3) into (8) yields Kα(Kα + θ) p0j , j = 1, 2, . . . , M − 1, μθ (K + M − 1 − j)((K + M − j)α + θ)α p0j , = μθ j = M, M + 1, . . . , K + M − 2.
p0,j+1 = p0,j+1
(10)
(11)
With the help of Eqs. (5), (7)-(11), all probabilities pij can be expressed in terms of p00 : p0j =
(Kα)j+1 (Kα + θ)j−1 p00 , (μθ)j
j = 1, 2, . . . , M,
(12)
20
F. Zhang and J. Wang j
p0j =
(K + M − n)
n=M+1
j−1
((K + M − n)α + θ)
n=M
αj+1 K M+1 (Kα + θ)M−1 p00 , (μθ)j j = M + 1, M + 2, . . . , K + M − 1, (13) Kα Kα(Kα + θ) j ( ) p00 , j = 0, 1, . . . , M, = (14) μ μθ j αj+1 K M+1 (Kα + θ)M (K + M − n)((K + M − n)α + θ) n=M+1 = p00 , μ(μθ)j j = M + 1, M + 2, . . . , K + M − 1. (15) ×
p1j
p1j
where p00 is determined by the normalizing equation
K+M−1
(p0j + p1j ) = 1:
j=0
p00 =
p00 K+M−2
(p0,j+1 + p1j ) + p1,K+M−1 +
M−1
=
(p0,j+1 + p1j ) + p00
j=0
j=M
j K+M−1 αj+1 (K(Kα + θ))M (K + M − n)((K + M − 1 − n)α + θ) n=M
(μθ)j+1
j=M M−1
Kα(Kα + θ) j+1 ) + ( +1 μθ j=0
−1 .
(16)
In the following, we give main performance characteristics of the system which are expressed in terms of probabilities pij : 1. The probability that the server is idle p0 =
K+M−1
p0j .
(17)
p1j = 1 − p0 .
(18)
j=0
2. The probability that the server is busy p1 =
K+M−1 j=0
3. The mean number of customers in the orbit E[N ] =
K+M−1 j=0
j(p0j + p1j ).
(19)
A Finite Source Retrial Queue with Spares and Orbit Search
21
4. The mean rate of generation of primary calls λ = Kα
M−1
(p0j + p1j )
j=0
+α
K+M−1
((K + M − j)p0j + (K + M − 1 − j)p1j ).
(20)
j=M
5. The mean waiting time in the orbit can be easily obtained by using Little’s formula E[W ] = (λ)−1 E[N ].
(21)
Remark 1. When θ → ∞ and M = 0, our model becomes the finite source queue without retrial orbit or spares. In this case, Eqs. (12)-(16) reduce to α K! p0 , n = 1, 2, . . . , K, pn = ( )n μ (K − n)! 1 p0 = , K α n K! 1+ ( μ ) (K−n)!
(22) (23)
n=1
where pn is defined as the probability that the there are n customers in the system (including the one being served). We can see Eqs. (22) and (23) agree with equations (17) and (19) in [12] for r = 1. Remark 2. If M is very large, we essentially have an infinite calling population with mean arrival rate Kα, i.e., our model becomes the constant retrial queue with an infinite population in which customers arrive according to a Poisson process at rate Kα. In this case, we let M → ∞ in Eqs. (12), (14) and (16) and obtain Kα (1 − ρ)ρj , j = 0, 1, 2, . . . Kα + θ(1 − δj0 ) Kα (1 − ρ)ρj , j = 0, 1, 2, . . . = μ
p0j =
(24)
p1j
(25)
where δji is Kronecker’s delta being 1 if j = i and 0 otherwise, and ρ = Kα(Kα+θ) . Eqs. (24) and (25) agree with equations (3.10) and (3.11) in [6] for μθ r = 1 and λ = Kα.
3
The Busy Period
Assume that all sources are free at time t0 = 0, i.e., C(0) = 0, N (0) = 0, and one of them just generates a request for service which initiates a busy period. The busy period ends at the first service completion epoch at which (C(t), N (t))
22
F. Zhang and J. Wang
returns to the state (0, 0). The busy period consists of service periods and seeking periods during which the server is free and there are sources of repeated calls in the system. The length of the busy period will be denoted by L, its distribution function P (L ≤ t) by Π(t) and its Laplace-Stieltjes transform by π(s) (see [9]). Let a busy period start at time t0 = 0. Define: P0j (t) = P {L > t, C(t) = 0, N (t) = j} ,
1 ≤ j ≤ K + M − 1,
(26)
P1j (t) = P {L > t, C(t) = 1, N (t) = j} ,
0 ≤ j ≤ K + M − 1.
(27)
Kolmogorov differential equations that govern the dynamics of these taboo probabilities are given by:
P0j (t) = −(Kα + θ)P0j (t) + μP1j (t),
1 ≤ j ≤ M,
(28)
P0j (t) = −((K + M − j)α + θ)P0j (t) + μP1j (t), M + 1 ≤ j ≤ K + M − 1, (29)
Π (t) = μP10 (t),
(30)
P1j (t) = −(Kα + μ)P1j (t) + KαP1,j−1 (t) + KαP0j (t) + θP0,j+1 (t), 0 ≤ j ≤ M − 1,
(31)
P1j (t) = −((K + M − 1 − j)α + μ)P1j (t) + (K + M − j)αP1,j−1 (t) +(K + M − j)αP0j (t) + θP0,j+1 (t), M ≤ j ≤ K + M − 1, (32) where P00 (t) = P0,K+M (t) = P1,−1 (t) = 0. In addition, the initial conditions are P0j (0) = 0 and P1j (0) = δj0 , where δj0 is Kronecker’s delta. Define: ∞ ∞ −st e P0j (t)dt, ϕ1j (s) = e−st P1j (t)dt. ϕ0j (s) = 0
0
So we get (s + Kα + θ)ϕ0j (s) = μϕ1j (s), (s + (K + M − j)α + θ)ϕ0j (s) = μϕ1j (s),
1 ≤ j ≤ M,
M + 1 ≤ j ≤ K + M − 1, π(s) = μϕ10 (s),
(33) (34) (35)
(s + Kα + μ)ϕ1j (s) = Kαϕ1,j−1 (s) + Kαϕ0j (s) +θϕ0,j+1 (s) + δj0 , 0 ≤ j ≤ M − 1, (s + (K + M − 1 − j)α + μ)ϕ1j (s) = (K + M − j)αϕ1,j−1 (s)
(36)
+(K + M − j)αϕ0j (s) + θϕ0,j+1 (s), M ≤ j ≤ K + M − 1. (37) We observe that ϕ00 (s), ϕ0,K+M (s) and ϕ1,−1 (s) are equal to 0.
A Finite Source Retrial Queue with Spares and Orbit Search
23
Eliminating ϕ0j (s) in (36) and (37) with the help of (33) and (34) respectively, we have
θμ Kαμ ϕ1,j+1 (s) + − (s + Kα + μ) ϕ1j (s) s + Kα + θ s + Kα + θ +Kαϕ1,j−1 (s) = 0, 1 ≤ j ≤ M − 1, (38) (K + M − j)αμ θμ ϕ1,j+1 (s) + ( s + (K + M − 1 − j)α + θ s + (K + M − j)α + θ −(s + (K + M − 1 − j)α + μ))ϕ1j (s) + (K + M − j)αϕ1,j−1 (s) = 0, M ≤ j ≤ K + M − 1, (39) where ϕ1,K+M (s) = 0. From (33), (35) and (36), we obtain π(s) , μ (s + Kα + μ)(s + Kα + θ) s + Kα + θ ϕ11 (s) = . π(s) − θμ2 θμ ϕ10 (s) =
(40) (41)
Similar to [9], we can express all functions ϕ1j (s), 0 ≤ j ≤ K + M − 1, with the help of (38)-(41), in terms of π(s) as follows: ϕ1j (s) = Aj (s)π(s) + Bj (s),
0 ≤ j ≤ K + M − 1.
(42)
The coefficients Aj (s) and Bj (s) can be found with the help of the following recursive relations: 1 , B0 (s) = 0, μ (s + Kα + μ)(s + Kα + θ) s + Kα + θ A1 (s) = , , B1 (s) = − θμ2 θμ θμ Kαμ Aj+1 (s) + ( − (s + Kα + μ))Aj (s) + KαAj−1 (s) = 0, s + Kα + θ s + Kα + θ 1 ≤ j ≤ M − 1, θμ Kαμ Bj+1 (s) + ( − (s + Kα + μ))Bj (s) + KαBj−1 (s) = 0, s + Kα + θ s + Kα + θ 1 ≤ j ≤ M − 1, θμ (K + M − j)αμ Aj+1 (s) + ( (43) s + (K + M − 1 − j)α + θ s + (K + M − j)α + θ −(s + (K + M − 1 − j)α + μ))Aj (s) + (K + M − j)αAj−1 (s) = 0, A0 (s) =
M ≤ j ≤ K + M − 2, θμ (K + M − j)αμ Bj+1 (s) + ( s + (K + M − 1 − j)α + θ s + (K + M − j)α + θ −(s + (K + M − 1 − j)α + μ))Bj (s) + (K + M − j)αBj−1 (s) = 0, M ≤ j ≤ K + M − 2.
24
F. Zhang and J. Wang
Letting j = K + M − 1 in (39), it follows that (
αμ − (s + μ))(AK+M−1 (s)π(s) + BK+M−1 (s)) s+α+θ +α(AK+M−2 (s)π(s) + BK+M−2 (s)) = 0. (44)
Therefore, we can calculate the Laplace-Stieltjes transform of the length of the busy period as follows: π(s) = −
αμ − (s + μ))BK+M−1 (s) + αBK+M−2 (s) ( s+α+θ . αμ ( s+α+θ − (s + μ))AK+M−1 (s) + αAK+M−2 (s)
(45)
Upon suitable differentiation we obtain the mean length of the busy period:
E[L] = −π (0) θμ 2 ) (AK+M−1 (0)BK+M−1 (0) − AK+M−1 (0)BK+M−1 (0)) = (( α+θ αθμ (A + (0)BK+M−2 (0) + AK+M−2 (0)BK+M−1 (0) α + θ K+M−1 −AK+M−1 (0)BK+M−2 (0) − AK+M−2 (0)BK+M−1 (0)) αμ )(AK+M−1 (0)BK+M−2 (0) − AK+M−2 (0)BK+M−1 (0)) +α(1 + (α + θ)2
+α2 (AK+M−2 (0)BK+M−2 (0) − AK+M−2 (0)BK+M−2 (0))) θμ ×(− AK+M−1 (0) + αAK+M−2 (0))−2 . α+θ
4
(46)
Waiting Time
The analysis of the waiting time process for retrial queues is usually far more difficult than the analysis of the number in the system. To study the waiting time, first we need to obtain the arriving customer’s distribution of the server state and the queue length denoted by qij , where qij is the state probability that the given source finds the system in the state (i, j), i.e., the server is at state i and there are j customers in the system, when a primary arrival occurs. Here, we have qij = pij . Therefore, we have to relate the stationary probability pij to the probability qij that a primary arrival finds the system in the state (i, j). We follow Theorem 2.10.6 in Walrand [14] and obtain qij as follows: q0j = (λ)−1 Kαp0j ,
0 ≤ j ≤ M,
(47)
−1
q0j = (λ) (K + M − j)αp0j , M + 1 ≤ j ≤ K + M − 1, q1j = (λ)−1 Kαp1j , 0 ≤ j ≤ M − 1,
(48) (49)
q1j = (λ)−1 (K + M − 1 − j)αp1j ,
(50)
M ≤ j ≤ K + M − 2.
Assume that at time t = 0 there are j sources in the orbit and i customers in service, 1 ≤ j ≤ K + M − 1, i = 0, 1. We mark the kth customer in the queue
A Finite Source Retrial Queue with Spares and Orbit Search
25
in orbit, 1 ≤ k ≤ j, and denote by fijk (t) the probability that by the time t this customer is not served yet, i.e., the residual waiting time of the tagged customer, τijk , is greater than t fijk (t) = P {τijk > t}. In terms of these probabilities the complementary waiting time distribution function of a new primary call F (t) can be expressed as follows: F (t) =
K+M−1
q1,j−1 f1jj (t).
(51)
j=1
Using (49)-(50) and (7)-(8), we can rewrite (51) as M K+M−1
F (t) = (λ)−1 α Kp1,j−1 + (K + M − j)p1,j−1 f1jj (t) j=1
= (λ)−1 θ
K+M−1
j=M+1
p0j f1jj (t).
(52)
n=1
We introduce an auxiliary Markov processζ(t) with the state space {(i, j, k) | i = 0, 1; j = 1, 2, . . . , K + M − 1; 1 ≤ k ≤ j} {(1, j, 0) | j = 0, 1, . . . , K + M − 2}. State (i, j, k) can be thought of as the presence in the system of i customers in service, j customers in orbit, and the tagged customer is at the kth position in the queue in orbit. The special states {(1, j, 0) | j = 0, 1, . . . , K+M −2} are absorbing states, and transition into one of these states means that the tagged customer starts to be served. Thus the residual waiting time of the tagged customer, τijk , is simply the time until absorption. From the Kolmogorov backward equations for the Markov chain ζ(t) we get:
f0jk (t) = −(Kα + θ)f0jk (t) + θf1,j−1,k−1 (t) + Kαf1jk (t), 1 ≤ j ≤ M − 1, 1 ≤ k ≤ j,
(53)
f0jk (t) = −((K + M − j)α + θ)f0jk (t) + θf1,j−1,k−1 (t) +(K + M − j)αf1jk (t),
M ≤ j ≤ K + M − 1, 1 ≤ k ≤ j,
(54)
f1jk (t) = −(Kα + μ)f1jk (t) + μf0jk (t) + Kαf1,j+1,k (t), 1 ≤ j ≤ M − 1, 1 ≤ k ≤ j,
(55)
f1jk (t) = −((K + M − 1 − j)α + μ)f1jk (t) + μf0jk (t) +(K + M − 1 − j)αf1,j+1,k (t), M ≤ j ≤ K + M − 1, 1 ≤ k ≤ j. (56) ∞ −st For Laplace transform φijk (s) = 0 e fijk (t)dt, introducing the LaplaceStieltjes transform of the waiting time W , ∞ W (s) = 1 − s e−st F (t)dt, (57) 0
26
F. Zhang and J. Wang
and Laplace-Stieltjes transform of the conditional waiting times τijk , τijk (s) = 1 − sφijk (s).
(58)
and combining (52) and (57)-(58), we get W (s) = 1 − (λ)−1 θ
K+M−1
p0j (1 − τ1jj (s)).
(59)
j=1
Differentiating this relation with respect to s at the point s = 0 we get the following formula for the nth moment of the waiting time W : E[W n ] = (λ)−1 θ
K+M−1
n p0j E[τ1jj ],
n ≥ 1.
(60)
j=1
Thus, to calculate the nth moment of W we need to know the nth moment of the conditional waiting times τ1jj , 1 ≤ j ≤ K + M − 1. Next, we will show how to compute recursively the moments of τijk . (n) (n) n n First, we denote E[τ0jk ] by ajk and E[τ1jk ] by bjk . Multiplying Eqs. (53)(56) by tn−1 and integrating with respect to t, from t = 0 to t = ∞, we get the following set of equations for n ≥ 1: (n)
(n)
(n)
(n−1)
−(Kα + θ)ajk + θbj−1,k−1 + Kαbjk = −najk
,
1 ≤ j ≤ M − 1, 1 ≤ k ≤ j, −((K + M − j)α + (n)
(n) θ)ajk
+
(n) θbj−1,k−1
(n)
(n) j)αbjk
+ (K + M − = M ≤ j ≤ K + M − 1, 1 ≤ k ≤ j, (62)
(n)
(n−1)
−(Kα + μ)bjk + μajk + Kαbj+1,k = −nbjk −((K + M − 1 − j)α +
(n) μ)bjk
(n) μajk
+
(61)
(n−1) −najk ,
, 1 ≤ j ≤ M − 1, 1 ≤ k ≤ j, (n) j)αbj+1,k
(63)
(n−1) −nbjk ,
+ (K + M − 1 − = M ≤ j ≤ K + M − 1, 1 ≤ k ≤ j. (64)
(n)
Eliminating ajk from these relations we find that: (n)
(n)
(n)
Kα(Kα + θ)bj+1,k + (Kαμ − (Kα + μ)(Kα + θ))bjk + θμbj−1,k−1 (n−1)
= −n(μajk
(n−1)
+ (Kα + θ)bjk
),
(K + M − 1 − j)((K + M − j)α +
1 ≤ j ≤ M − 1, 1 ≤ k ≤ j, (n) θ)αbj+1,k
(65)
+ ((K + M − j)αμ (n)
(n)
−((K + M − 1 − j)α + μ)((K + M − j)α + θ))bjk + θμbj−1,k−1 (n−1)
= −n(μajk
(n−1)
+ ((K + M − j)α + θ)bjk ), M ≤ j ≤ K + M − 1, 1 ≤ k ≤ j. (66) (n)
(n)
(0)
(0)
It is easy to see that bj−1,0 = bK+M,k = 0 and ajk = bjk = 1 for n ≥ 1,1 ≤ j ≤ K + M − 1 and 1 ≤ k ≤ j.
A Finite Source Retrial Queue with Spares and Orbit Search
27
This set of equations can be solved with the help of the following algorithm: Step 1: Put n = 1 in (61)-(66). Step 2: Putting j = K + M − 1, K + M − 2, . . . , M in (66) and j = M − 1, M − (n) 2, . . . , 1 in (65) when k = 1, we can obtain the values of bj1 , 1 ≤ j ≤ K + M − 1. Step 3: Repeat Step 2 by putting k = 2, 3, . . . , j sequentially, 2 ≤ j ≤ K +M −1. (n) Now we have calculated bjk , 1 ≤ j ≤ K + M − 1, 1 ≤ k ≤ j. (n)
(n)
Step 4: Substituting all bjk in (62)-(63), we obtain ajk for 1 ≤ j ≤ K + M − 1, 1 ≤ k ≤ j. Step 5: Repeat Step 2-4 by putting n = 2, 3, . . .. Then we can calculate all n moments of the conditional waiting times E[τijk ].
5
Numerical Examples
In this section we investigate the effect of the parameters on the main performance characteristics of the system. To this end, three curves which correspond to M = 1, 5, 10 are presented in Fig. 1-3 where the figures depict the rate of generation of primary calls α versus the mean number of customers in orbit E[N ], the mean waiting time in orbit E[W ], and the mean length of the busy period E[L]. The model is considered with K = 10 sources, service rate μ = 1 and retrial rate θ = 0.1. From Fig. 1 we can get some conclusions. It is evident that E[N ] is a monotonically increasing function of both α and M . This is due to the fact that there will be more primary calls arriving to the system with the greater values of α
20 18 M=10
Mean queue length E[N]
16 14 M=5
12 10
M=1
8 6 4 2 0
0
0.1
0.2
0.3 0.4 0.5 Source arrival rate α
0.6
Fig. 1. Mean queue length vs. α
0.7
0.8
28
F. Zhang and J. Wang
60
Mean waiting time E[W]
50
M=10
40 M=5 30 M=1 20
10
0
0
0.1
0.2
0.3 0.4 0.5 Source arrival rate α
0.6
0.7
0.8
Fig. 2. Mean waiting time vs. α 180
Mean length of the busy period E[L]
160 140 120 100 80 M=10 60 40 M=5 20 M=1 0 0.02
0.022
0.024 0.026 Source arrival rate α
0.028
0.03
Fig. 3. Mean length of the busy period vs. α
and M . When the server is busy, the more primary calls arrive, the more the number of sources in the orbit will be. Fig. 2 describes the the influence of the parameters α and M on the mean waiting time in orbit E[W ]. We observe that E[W ] increases with increasing value of α form 0 to some value, i.e., it has a maximum, but then becomes a decreasing function of α. On the other hand, with the increasing of the value M , there are more primary calls having to move into the orbit when they find the server is busy upon their arrival. Thus, E[W ] is an monotonically increasing function of M . Fig. 3 depicts the behavior of the mean value of the busy period E[L] against α and M . As intuition tells us, with the increase of the arrival rate α, there are
A Finite Source Retrial Queue with Spares and Orbit Search
29
more sources going to the retrial orbit so that the length of the busy period is to be increased. Meanwhile, more spares that are considered may increase the number of customers waiting in the orbit, which leads to a longer length of the busy period. Therefore, as M increases, E[L] also increases.
6
Conclusions
In this paper we present an exhaustive study of the queueing measures of a finite source retrial queueing system with orbit search, in which the system is maintained by replacing failed part by spares. We model our system as a Markov chain and derive some important queueing measures in steady-state. This research presents an extension of the finite source retrial queueing theory and the analysis of the model will provide a useful performance evaluation tool for more general situations arising in practical applications, such as production systems, flexible manufacturing systems, computer and communication systems, and many other related systems. Acknowledgments. This work was sponsored by the National Natural Science Foundation of China (Grant No. 11171019) and the Fundamental Research Funds for the Central Universities (Nos. 2011JBZ012 and 2011YJS281).
References 1. Alfa, A.S., Sapna, I.K.P.: An M/P H/k retrial queue with finite number of sources. Computers & Operations Research 31, 1455–1464 (2004) 2. Alm´ asi, B., Bolch, G., Sztrik, J.: Heterogeneous finite-source retrial queues. Journal of Mathematical Science 121, 2590–2596 (2004) 3. Artalejo, J.R., G´ omez-Corral, A.: Steady state solution of a single-server queue with linear repeated requests. Journal of Applied Probability 34, 223–233 (1997) 4. Artalejo, J.R.: Retrial queues with a finite number of sources. Journal of the Korean Mathematical Society 35, 503–525 (1998) 5. Artalejo, J.R., G´ omez-Corral, A.: Retrial queueing systems: a computational approach. Springer, Heidelberg (2008) 6. Economou, A., Kanta, S.: Equilibrium customer strategies and social-profit maximization in the single-server constant retrial queue. Naval Research Logistics 58, 107–122 (2011) 7. Efrosinin, D., Sztrik, J.: Stochastic analysis of a controlled queue with heterogeneous servers and constant retrial rate. Information Processes 11, 114–139 (2011) 8. Falin, G.I., Templeton, J.G.C.: Retrial queues. Chapman & Hall, London (1997) 9. Falin, G.I., Artalejo, J.R.: A finite source retrial queue. European Journal of Operational Research 108, 409–424 (1998) 10. Haque, L., Armstrong, M.J.: A survey of the machine interference problem. European Journal of Operational Research 179, 469–482 (2007) 11. Kornyshev, Y.N.: Design of a fully accessible switching system with repeated calls. Telecommunications 23, 46–52 (1969) 12. Naor, P.: On machine interference. Journal of the Royal Statistical Society, Series B (Methodological) 18, 280–287 (1956)
30
F. Zhang and J. Wang
13. Neuts, M.F., Ramalhoto, M.F.: A service model in which the server is required to search for customers. Journal of Applied Probability 21, 157–166 (1984) 14. Walrand, J.: An Introduction to Queueing Networks. Prentice Hall, Englewood Cliffs (1988) 15. Wang, J., Zhao, L., Zhang, F.: Analysis of the finite source retrial queues with server breakdowns and repairs. Journal of Industrial and Management Optimization 7, 655–676 (2011) 16. W¨ uchner, P., Sztrik, J., de Meer, H.: Structured Markov chains arising from homogeneous finite-source retrial queues with orbit search. In: Dagstuhl Seminar Proceedings 07461, Numerical Methods for Structured Markov Chains. Dagstuhl. Germany (2008) 17. W¨ uchner, P., Sztrik, J., de Meer, H.: Homogeneous finite-source retrial queues with search of customers from the orbit. In: Proceedings of 14th GI/ITG Conference MMB - Measurements, Modelling and Evaluation of Computer and Communication Systems, Dortmund, Germany, pp. 109–123 (2008) 18. W¨ uchner, P., Sztrik, J., de Meer, H.: Finite-source M/M/s retrial queue with search for balking and impatient customers from the orbit. Computer Networks 53, 1264–1273 (2009)
Bounds for Two-Terminal Network Reliability with Dependent Basic Events Minh Lˆe1 and Max Walter2 1
Lehrstuhl f¨ ur Rechnertechnik und Rechnerorganisation Technische Universit¨ at M¨ unchen Munich, Germany 2 Siemens AG N¨ urnberg, Germany
Abstract. The two-terminal reliability problem has been a widely studied issue since 1970s. Therefore many efficient algorithms were proposed. Nevertheless, all these algorithms imply that all system components must be independent. With regard to nowadays applications it is not sufficient to assume independent component failures because in fault tolerant systems components may fail due to common cause failures or fault propagation. We therefore propose an algorithm which deals with upcoming dependencies. In addition to that, lower and upper bounds can be obtained in case the algorithm cannot be conducted until the end. The performance and accuracy of the algorithm is demonstrated on a certain network obeying a recursive structure where the exact result is given by a polynomial.
1
Introduction
For determining the reliability or availability of a fault tolerant system, the system redundancy structure can be modelled by a Reliability Block Diagram (RBD) [20]. Therein the edges represent the system components with binary state. Two nodes are specified to be the terminal nodes. Under the assumption of independent components, the probability is seeked that there exist at least a path with working edges that connects the terminal nodes (Fig. 1). This problem is known to be NP complete and many different algorithms have been conceived to solve it in an efficient way. In the literature one can find the following methods: State enumeration and sum of disjoint products (SDP [23]), Factoring theorem with series and parallel reductions [1] and Edge Expansion Diagrams (EED) using Ordered Binary Decision Diagram (OBDD) based method [4]. The methods using SDP and state enumeration require that minimal paths or cuts have to be enumerated in advance. However, the vital drawback of those methods is that the computational effort in disjointing of the minimal path or cut sets grows rapidly with the network size. Instead, it is more recommended to apply Factoring or EED using OBDDs [4]. The efficiency of the BDD based methods depends largely on BDD variable ordering which itself is known to be NP hard [4]. Furthermore, both mentioned methods lack the ability of providing bounds in case J.B. Schmitt (Ed.): MMB & DFT 2012, LNCS 7201, pp. 31–45, 2012. c Springer-Verlag Berlin Heidelberg 2012
32
M. Lˆe and M. Walter
of non termination. Thus, another method was proposed by Gobien and Dotson which is based on set theoretical partition of the sample space into disjoint sets and can yield at least lower and upper bounds in case the computation cannot be led to the end [22]. This method has been extended with series and parallel reductions for increasing efficiency [17]. As we already stressed, the appearance of dependencies in component failures have become an integral part for the appropriate reliability assessment of e.g. communication, water supply or power networks because the simplification of independent failures would lead to overoptimistic results. For instance, certain close-by components in a large water supply network may fail dependently due to natural influences like local earthquakes. Servers may fail dependently due to power spikes. Hence, those interdependencies must be taken into account. This can be done by introducing disjoint sets of interdependent components (SICs). More precisely, interdependent components are enclosed in one SIC whereby a system can contain several different SICs. Furthermore, any component of an arbitrary SIC is independent from all other components beyond this SIC. With regard to dynamic fault trees (DFT) [9] [8] where dependencies can also be considered, our approach can handle the arbitrary arrangement of SICs in the system structure, whereas a DFT can only be decomposed in independent subtrees if the leafs which have a common AND/OR-gate are in one SIC. So the exploitation of independent subtrees is only possible for a certain configuration of SICs. The same issue holds when applying GSPNs (Generalized stochastic petri nets [20]) to the SICs of a DFT. For each SIC a stochastic based model (SBM) is then generated wherefrom the probability of dependent failure combinations can be obtained. A SBM can be represented by Petri nets [11], Copulas [16][19], Stochastic Process Algebras [7][12] or stochastic simulation models [6]. In our previous works [14] and [13] we already considered the arbitrary arrangement of SICs: The first work [14] is based on EEDs using OBDD and uses the Shannon expansion for each SIC. This becomes quite complex for growing SIC sizes. So the latter work [13] proposes a method based on Factoring and reductions where only relevant dependent basic events are considered instead of all possible combinations. Nevertheless, the efficiency of method [13] strongly depends on a good variable ordering. Because the limitation of those two methods can soon be reached by larger graph sizes, it is important to obtain at least bounds in case a computation cannot be finished. On these grounds we want to extend the Gobien Dotson (GD) algorithm with series and parallel reductions in order to handle component interdependencies. In view of the LARES framework [15], this algorithm shall be integrated into the LARES toolchain using CASPA [12] as a solver for the SBM. After giving the formal statement of the described problem and a brief introduction of the GD algorithm in section 2, we will present the dependent version of the GD algorithm in section 3.1 and apply it to an example network Fig. 1. To show that the algorithm can deal with large networks without any reducible structures, we will demonstrate in section 4 the performance of the algorithm on a recursive network structure named K4-ladder. Finally, an outlook will be given in section 5.
Bounds for Two-Terminal Network Reliability
33
Fig. 1. Initial graph
2 2.1
Preliminaries Formal Description
We have a multigraph G := (V, E) with no loops and where V stands for a set of vertices or nodes and E a multiset of unordered pairs of vertices, called edges. Given the redundancy structure of a system modelled by a network graph G := (V, E), we specify two nodes s and t which characterize the terminal nodes (In Fig.1 those nodes are colored in black). We define two not necessarily injective maps f and g, where f :E→C assigns the edges to the system components and g : E → V 2 assigns each edge to a pair of nodes. The finite set of system components is defined by C = S ∪ T where T is the set of independent components - each component in T can be regarded as a SIC on its own - and S = SIC1 ∪SIC2 ∪. . .∪SICn , n ∈ N represents the disjoint union of SICs. The mapping of several edges to one component c implies that c is a multiple component whereas component c can be from any SIC or from set T . The dependency relation between two components infers that they must belong to the same SIC. Because the dependency property is transitive, a SIC can be regarded as a transitive closure where each element depends on the others whether directly or indirectly. So two components which are dependent must belong to the same SIC and if they are from different SICs they are independent. In other words, for all i, j with i = j it holds that SICi ∩ SICj = ∅. Moreover it holds that |SICk | ≥ 2 ∀k ∈ I. So if there are two components x1 , x2 from different SICs, then their conjoint probability of failure is the product of their respective failure probabilities. If they would be in the same SIC, we have to establish a separate SBM for this SIC to obtain the conjoint probability. Because each SIC can be regarded as a random variable X : ω ∈ SIC → 2|SIC| (states), the effort for generating the state space probabilities in terms of the SBM (the respective probability density function for X) grows exponentially by the quantity of components in one SIC. In our example graph of Fig. 1 there are two SICs: SIC1 contains three components and SIC2 two. All remaining independent components are assigned to T . Because f is not necessarily injective we can assign several edges to one component. This does not infer that f is surjective because there might be components which are not represented by any edge. For example two components x, y fail due to a common cause failure coming from component z, but z plays no role in the system’s redundancy structure. This means that we allow the multiple
34
M. Lˆe and M. Walter
occurrence of one component in any system’s redundancy structure. This fact is implied by saying that there exist a multiple component in the redundancy structure. The example network from Fig. 1 contains three multiple components 1, 2 and 3 which represent a two out of three redundancy substructure. After introducing the notations we can now formulate the problem as follows: Statement of the problem. Given a network graph G := (V, E), its terminal nodes s, t, a set of system components C = S ∪ T and two not necessarily injective maps f : E → C and g : E → V 2 . Each component c ∈ C represents a random variable with two states: failed or working. The reliability for each c ∈ C is given by pc . For each SIC ⊆ S there exist a corresponding SBM. The system’s terminal pair reliability R is the probability that the two specified terminal nodes can be connected by at least one path consisting of only edges associated with working components. Rules for series and parallel reductions. For speeding up the calculation, reduction techniques should be applied to the graph whenever it is possible. With negligible expense certain substructures can be identified and simplified so that the graph size decreases under preservation of the probability for the reliability of the original graph. Typical well known reduction methods are those of series and parallel reductions. Under the consideration of dependencies among certain system components we want to sum up the rules and heuristics for series and parallel reductions proposed in our last work [13]: – Reductions with multiple components are not allowed except multiple components become unique in the course of the algorithm. – Reductions can only be performed among independent components and among components which are from the same SIC, i.e. e1 , e2 ∈ T or e1 , e2 ∈ SICi , i ∈ N. For the algorithm in section 3 we use a map named EdgeProbMap (epm) where the failure probability for each edge is stored. In addition, each edge is initially associated to a component in order to distinguish edges which are multiple, dependent or independent. By doing so one can take account of reductions and hence this map can only change in case a reduction has taken place. In case of an independent reduction, the probability of the edge resulting from the reduction will be re-adjusted according to the rules for a series or parallel reduction. So for edges e1 , e2 ∈ T the probability for the new edge - labeled with ri - would be pri = pe1 pe2 for a series and pri = pe1 + pe2 − pe1 pe2 for a parallel reduction whereby i stands for the i-th independent reduction. In case of a dependent reduction, we would relabel one of the two affected edges with a capital letter Rj whereby j stands for the j-th dependent reduction. W.l.o.g. we label e1 with Rj and delete e2 . Rj comprises the concatenated expression of the two affected edges. Here we introduce the labeling function l : E → B where B stands for a boolean expression in disjunctive normal form - initially l equals f where all components are literals set to true. To be more precise, for edges e1 , e2 ∈
Bounds for Two-Terminal Network Reliability
35
SICk , k ∈ N , Rj = l(e1 ) ∧ l(e2 ) for a series reduction and Rj = l(e1 ) ∨ l(e2 ) for a parallel reduction. Hereafter e2 will be deleted and l(e1 ) = Rj . For details of the algorithmic approach we refer to [13]. There are also other established reduction methods such as the polygon-to-chain [10] or triangle reduction [21] which have not yet been considered due to their high complexity. 2.2
Basics of the Independent GD Algorithm
Before starting with the dependent extension for the Gobien Dotson algorithm, we want to recapitulate the independent version. According to [3], a network consisting of n links, an elementary event E is a binary specification of n links in a n-dimensional sample space. E can be represented by a vector where each entry bears a link label which can be negated or not. A full event is recursively defined as either an elementary event or the union of two events differing only in one entry of the event vector. For example, if m = 2 then full event [1] is the union of [1, 2] and [1, ¯ 2]. A path is a sequence of links l1 , l2 , . . . , lk such that the terminal node of li coincides with the initial node of li+1 , 1 ≤ i ≤ k − 1. A path is the full event that is the union of all elementary events that include the links in the path. A success event is defined as a full event such that each of its elementary events contains an s-t-path where s and t are the terminal nodes. For a failure event it holds the same only that each of its elementary event contains an s-t-cut. Assume there exist a path P = [1, 2, . . . , r] in our network graph G. After [17] the reliability expression of G can be obtained by recursively applying the factoring theorem for each of the r edges in path P . Starting with the first edge e1 gives us: Rel(G) = p1 · Rel(G ∗ e1 ) + q1 · Rel(G − e1 ), where ∗/− stands for a contraction/deletion of the edge and qi = 1 − pi , 1 ≤ i ≤ r is the probability of the edge ei ’s failure. Then the last term Rel(G − e1 ) will again be expanded by factoring on edge e2 . Overall it follows: Rel(G) = + +
q1 · Rel(G − e1 ) p1 q2 · Rel(G ∗ e1 − e2 ) ...
+ p1 p2 · · · pr−1 qr · Rel(G ∗ e1 ∗ e2 ∗ . . . ∗ er−1 − er ) r + k=1 pk So we have r subproblems respectively subgraphs emanating from path P. Again, for each subproblem this equation can be recursively applied. Thus, for each subproblem we are looking for the topologically shortest path to keep the number of subproblems low. In each subgraph series and parallel reductions can be performed if possible. Suppose S to be a disjoint exhaustive success collection of success events Si , 1 ≤ i ≤ N in G. Then after [22] the terminal pair reliability of G is represented by N Rel(G) = P(Si ). i=1
36
M. Lˆe and M. Walter
Where in our example P(Si ) = rk=1 pk if Si = P . Analogously it holds for the exhaustive failure collection F := {Fi , 1 ≤ i ≤ M } Rel(G) = 1 −
M
P(Fi ).
i=1
For u < |S|, v < |F | and u, v ∈ N the lower and upper bounds for the reliability are u v P(Si ) ≤ Rel(G) ≤ 1 − P(Fi ). i=1
i=1
The cut and path terms contributing to the bounds will be exemplified by an example in section 3.2.
3
The Dependent GD with Reductions
This part of our work will give a brief overview of the whole dependent GD algorithm. First we will give the appropriate explanations for the listed procedures underneath. Then in section 3.2 we demonstrate the algorithm by means of an example network. 3.1
The Algorithm
Starting with procedure 1 the relevant structure of the input network is required in the form of a RBD. All edge probabilities of the RBD are contained in the map epm. The map componentSICIndex assigns each component to its appropriate SIC index and componentEdges to the respective edge. Therefrom we can deduce the M ultipleEdges map which contains all multiple edges in the current network graph. Afterwards we initialize the global lists which stores maps of the accumulated graph operations (M reconstr), the accumulated path information (M acc2) and the current edge probabilities (M epm) to be processed. We start the computation by calling procedure 2. The recursive algorithm generates a call tree wherein the edges are labeled with the edges to be contracted or deleted and the nodes contain the subproblem derived from the parent node (e.g. Fig.3). As it can be seen, the level parameters are set in such a way that the nodes of the recursion tree are processed obeying the breadth first search (BFS). Proceeding this way causes a high memory consumption as the breadth of the tree can grow exponentially in relation to its depths whereas the bounds would converge faster after proceeding each depth level because the most probable paths or cuts can be found in the upper levels of the tree. We could also proceed the nodes by depth first search (DFS). On the one hand this would be less memory consuming but on the other hand the computation of bounds turns out to be obsolete due to the extremely slow convergence. In order to retain a good bound convergence behavior and at the same time keeping the memory consumption as low as possible, we have to put up with longer runtimes. This can be done
Bounds for Two-Terminal Network Reliability
37
by only storing the changes made in the graph in map M reconstr. Instead of storing the subgraphs, we reconstruct them using function reconstructGraph() and map M reconstr (procedure 4, line 7). In procedure 2 we check whether the current graph is connected otherwise there is a cut which will be processed by the function classif yCutT erms: All edges belonging to the cut will be classified according to their SICs. For each SIC the extracted Boolean term will be stored in the cutDep Map whereas the probabilities of the independent edges are multiplied and the result stored in the cutIndep. Both entries obtain the same index i for the i-th cut - so that later on when the probabilities for the dependent Boolean terms are returned from the SBM, the whole probability for the cut can be reassembled. In line 9 of procedure 2 the graph will be reduced if possible. In case a reduction has taken place, the epm would change accordingly. Additionally M reconstr must be updated with the changes from the reduction. Because the graph is connected, we are seeking for the topologically shortest path (sp) by applying BFS. Then the sp will be classified together with the accumulated path terms in function classif yP athT erms as just described. In procedure 3 line 6 multiple edges are treated for the case that we have at least two edges assigned to the same component in our sp. Hereafter all edges from the sp to be contracted will be collected by the temporary list Lcollect and finally be delivered to the Lcontract List. All edge contractions and deletions are stored in the map M acc (line 10-23) in order to accumulate them to map M acc2 at the end (line 24). Finally all relevant maps will be parsed to the next level in order to be processed (line 28-29). After finishing procedure 3, procedure 4 will be called in line 10 of procedure 1. There the nodes of the respective levels will be first reconstructed by reconstructGraph() and processed so that one can obtain the current values for the lower (lb) and upper bounds (ub) of the unreliability (C and 1 − R) after finalizing all nodes of each level.
Procedure 1. GobienDotsonDependent Input: RBD InitGraph, EdgeProbMap epm 1: //Initialize mappings 2: Map componentSICIndex, componentEdges, M ultipleEdges ⇐ InitM appings(); 3: //Initialize global lists 4: List lastLevel = new List; 5: List nextLevel = new List; 6: //Start computation 7: computeRel(InitGraph,new Map<Edge,Bool>,new Map<Edge,Bool>,epm); 8: lastLevel = nextLevel; 9: nextLevel =new List; 10: bf sLevel();
38
M. Lˆe and M. Walter
Procedure 2. ComputeRel Input: RBD Graph,Map M acc,Map M reconstracc,EdgeProbMap epm 1: bool b = Graph.f indP ath() 2: if b == f alse then 3: //Cut found resp. Graph is not connected. 4: classif yCutT erms(M acc, epm); 5: return 6: end if 7: //Preprocessing: Reduce graph. 8: SPRed red = new SPRed(Graph, epm, IsM ultiple, CompSICIndex); 9: Graph = red.Reduce(); 10: //Update edge probabilities and accumulate changes due to reductions 11: epm = red.getEdgeP robM ap(); 12: M reconstracc.add(red.getreconstr()); 13: //Find shortest path by breadth first search 14: List sp = BF S.shortestP ath(Graph); 15: //Group terms according to their SICs and add them to pathIndep and pathDep 16: classif yP athT erms(M acc, sp, epm); 17: processShortestP ath(sp, M acc, M reconstracc, epm);
3.2
A Case Study
Now we will describe the workings of the algorithm on the example in Fig.1. The example graph G0 comprises two SICs: The first SIC contains three components and the second two. All other components which do not belong to any SIC are regarded to be independent. There is an 2-out-of-3 edge modeled by multiple edges assigned to components 1, 2 and 3. The two terminal nodes s and t are marked in black. Because there are no possible reductions, we start by looking for a shortest path. The algorithm delivers for example path 1 ∧ 2 (Fig.2a). In the next step we would obtain two subgraphs: G1 by deleting edges labeled with component 1. G2 by contracting edges assigned to component 1 and deleting edges assigned to component 2. Again, for each of those subgraphs we try to reduce. We notice that a series reduction can be made between the edges labeled with 2 and 3, because 2 and 3 are in the same SIC. One of the edges will be labeled as R1 and the other deleted. We store the reduction made in our edge-probability-map for graph G1 which is a complete graph containing four nodes. In the literature it is also known as a K4 graph. Now we continue to search for the shortest path which is obviously R1 . After deleting R1 (Fig.2b), we would obtain G3. Normally the algorithm would proceed with subgraph G2 which has the same structure as G1 . Hence we omit the sketch of processing G2, nevertheless it can be reconstructed by the help of Fig.3. Continuing with G3 we would look for a shortest path since no parallel or series reduction are possible. Partitioning the graph on the base of shortest path 4 ∧ 8 we arrive at G5 and G6. Though G5 is a series structure, we are not allowed to reduce because components 7 and 6 are not from the same SIC. Proceeding on the basis of the shortest path 7 ∧ 6 we obtain two cuts since the terminal nodes are
Bounds for Two-Terminal Network Reliability
Procedure 3. processShortestPath Input: List sp,Map M acc,Map M reconstracc,EdgeProbMap epm 1: //Initialization of auxiliary maps and lists 2: Map M reconstr, M acc2, M epm, M newacc, ⇐ InitEmptyM aps(); 3: List Lcollect, Lcontract = new List<Edge>; 4: int i = 0; 5: for each edge e ∈ sp do 6: avoid a redundant deletion/contraction of multiple edges; 7: i = i + 1; 8: Lcollect.add(e); 9: if i == 1 then 10: M reconstr.put(i, M reconstracc); 11: M reconstr.get(i).put(e, f alse); 12: M newacc.put(e, f alse); 13: else 14: M reconstr.put(i, M reconstracc); 15: add all edges ee ∈ Lcollect to Lcontract, ee = e; 16: M reconstr.get(i).putAll(Lcontract, true); 17: M reconstr.get(i).put(e, f alse); 18: M newacc.putAll(Lcontract, true); 19: M newacc.put(e, f alse); 20: Lcontract.clear; 21: end if 22: //Setting parameters for lower recursive levels. 23: M acc.putAll(M newacc); 24: M acc2.put(i, M acc); 25: M epm.put(i, epm); 26: end for 27: List level = Level(M reconstr, M acc2, M epm, i) 28: nextLevel.add(level)
Procedure 4. bfsLevel 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15:
if lastLevel.size()==0 then return ; end if for each level ∈ lastLevel do Map M reconstr, M acc2, M epm ⇐ level.getM aps; for i = 1 to level.Length() do RBD rbd = reconstructGraph(M reconstruct.get(i)); computeRel(rbd, M acc2.get(i), M reconstr.get(i), M epm.get(i)); end for end for lastLevel = nextLevel; nextLevel = new List; print ”upper bound for Unreliability = 1 − computeRelbyP aths()”; print ”lower bound for Unreliability = computeRelbyCuts()”; bf sLevel();
39
40
M. Lˆe and M. Walter
disconnected. The cuts are highlighted in the dotted boxes of Fig.3. We climb up the recursion tree to go on with G6. There we can do a dependent parallel reduction since components 5 and 7 are from the same SIC. The reduction is captured in a separate map as R2 = 5 ∨ 7. Again, we end up with two cuts on the base of the shortest path R2 ∧ 6. The recursion tree in Fig.3 illustrates all possible paths and cuts obtained in each depth/level of the recursion. Every time just before a path/cut is added to the path/cut list, it was analyzed by the procedure classif yP ath/CutT erms. Therein the expressions within the Boolean term, respectively path/cut are rearranged and grouped according to their SIC filiation. The grouped terms are stored separately in the lists pathDep/cutDep to be handed over to the SBM later on. The probabilities of the independent expressions are simply multiplied and then added to the lists pathIndep/cutIndep. After the probabilities of the independent terms were returned from the SBM, the whole probability for any path/cut term can be reassembled by multiplying, since the SICs are independent among each other. For instance, the relevant probabilities of the first cut term !1 ∧ R1 ∧!4∧!7 would be classified as follows: The value of p!4 would be added to the list cutIndep at the first index whereas !1 ∧ R1 (belonging to SIC 1) and !7 (belonging to SIC 2) would be added to the first position of the list cutDep. Analogously the second cut would be stored at the second position of the respective list. When all values for the dependent basic events are returned from the stochastic model, the probability for the first cut is computed by p!4 · p!1∧R1 · p!7 .
(a) First level
(b) Second and third level
Fig. 2. Algorithm
Bounds for Two-Terminal Network Reliability
41
Fig. 3. Recursion Tree
4
A Recursive Example
In this section we provide some experimental results of the algorithm performed on a recursive structure - the K4 ladder (Fig.4). By knowing the exact result which is given by a polynomial for each ladder dimension (refer to [2]), we can validate on the one hand the correctness of the algorithm for the independent case and on the other hand we can assess the accuracy of bounds obtained from large network graphs. We assume that each edge is assigned to a different component. All components have the same failure probability of 0.1. The experiments were conducted on a 2.1 GHz machine with 2 GB RAM. It can be taken from Table 1 that up to ladder size of 9 the algorithm terminates. As we can see, the computation time increases exponentially with the number of components. All result values are exactly the same as those obtained from the polynomial in [2]. For the ladder size of 10 - corresponding to 51 components - we obtain lower and upper bounds for the unreliability (3.1557 · 10−3 and 3.1560 · 10−3 ) after 18913 milliseconds(ms) runtime. Following this, the tightness of those bounds can be justified by knowing the exact result of 3, 1558 · 10−3 . For the dependent case we have set up two SIC configurations, one where there are two SICs each containing the three components emerging from the terminal nodes and another where the numbers of SICs (each having two components) equals the ladder size (Fig.4). We impose that each component in one SIC is correlated to all others of this respective SIC with correlation factor ρ > 0 (see [16]). ρ has an
Fig. 4. K4-ladders with 2 SICs (upper) and N SICs (lower)
42
M. Lˆe and M. Walter
increasing effect on the probabilities of failure combinations between SIC components . Those probabilities become larger with rising ρ. In our example ρ takes three values: 0.01, 0.05 and 0.1. For the case of two SICs we manage to determine the unreliability until ladder size 7 whereas for the case of N SICs the algorithm can cope until ladder size N = 6. This is justified by the fact that the effort grows by the number of SICs which corresponds to the number of SBMs to be evaluated. The runtimes for the dependent cases are about the same for N = 2. From size three on (16 components), more computation time is needed for the N SIC case. The unreliability values for the two cases and their respective ladder sizes are depicted in Fig.5 and Fig.6. It can be seen that the unreliability increases with ρ. Furthermore, the position of the SICs plays a significant role: The reliabilities for the case of two SICs are lower than for the N SICs case. This is due to the fact that a dependent failure of the three components at the terminal nodes leads to a system failure event respectively a disconnection of the terminal nodes whereas dependent failures of two components of the N SICs case can be tolerated by the redundancy structure. For each of the two cases the bounds are computed as accurately as possible for the next higher ladder size (Table 2) - the tightness of the bounds is shown by δ. To show that the algorithm still delivers acceptable bounds for large graph sizes, we have computed bounds for the independent case with 71 components and for the N SICs case with 66 components. As a matter of fact, the tightness of the bounds deteriorates with the rise of the ladder size. Parts of the results are listed in Table 3 which corresponds to Fig.7 showing the fast convergence of the bounds towards the exact value. We have also taken into consideration to estimate the effect of series and parallel reductions by omitting them in the algorithm. The runtimes show that the overhead for the reductions pays off for all ladder sizes. This can be observed for the independent and dependent case. For instance, it would take 30ms instead of 19ms for ladder size 2 and even 2615ms instead of 415ms for ladder size 5 to compute the unreliabilities (independent case). It would not be possible to compute the unreliability for ladder sizes larger than 5 due to the enormous number of subproblems. Similar observations can be made for the dependent cases. For the lack of space we omit the listing of the runtime tables without reductions. Table 1. K4-ladder independent case Size
2
3
4
5
6
7
8
9
Unrel·10−3 2.2013687 2.3206358 2.4399907 2.5593314 2.6786578 2.7979700 2.9172679 3.0365515 Time (ms)
19
71
216
415
647
1644
7946
47109
#Comp.
11
16
21
26
31
36
41
46
Bounds for Two-Terminal Network Reliability
43
Table 2. Bounds for N & 2 SICs, (lb/ub/δ) ·10−3 , time in ms Case
2 SICs
(size 8)
ub
δ
lb
N SICs (size 7) time
lb
ub
δ
time
Corr 0.01 3.0751421 3.0771007 0.00195 14911 2.8156037 2.8209134 0.00530 11433 Corr 0.05 3.8989072 3.9018633 0.00295 14643 2.8853496 2.8865489 0.00119 10345 Corr 0.10 4.9122819 4.9157460 0.00346 15066 2.9775923 2.9808147 0.00322 12009
Table 3. Bounds for high ladder sizes - independent & N SICs case (lb/ub)·10−3 indep.
(size 14)
time (ms)
lb
ub
time (ms)
lb
ub
891
1.3734
71.3803
3395
0
188.2954
1515
2.5433
25.0081
4822
1.2460
73.1824
2905
3.1343
9.4892
7030
2.5394
26.7525
7703
3.3891
4.9957
11038
3.2936
10.6765
23857
3.4781
3.8455
28099
3.6580
5.7223
(size 13 Corr 0.1)
31
36
independent corr = 0.01 corr = 0.05 corr = 0.1
6 Unreliability ·10−3
N SICs
5.5 5 4.5 4 3.5 3 2.5 2
11
16
21
26
# Components
Unreliability ·10−3
Fig. 5. K4-ladder with 2 SICs at terminal nodes
4 independent corr = 0.01 corr = 0.05 corr = 0.1
3.5 3 2.5 2
11
16
21
26
# Components
Fig. 6. K4-ladder with N different SICs
31
M. Lˆe and M. Walter
Unreliability ·10−3
44
200 independent corr = 0.01 corr = 0.05 corr = 0.1
150 100 50 0
0
5
10
15
20
25
30
Time in seconds
Fig. 7. Convergence of bounds for independent case (size 14) & N SICs case (size 13)
5
Conclusion
By this work we have shown that the dependent GD algorithm with reductions is an effective method to assess the reliability of systems incoorporating dependencies. The algorithm can cope with large network sizes depending on their redundancy structure. Meaning that the memory saving technique allows us to proceed further in the recursion tree and hence obtain more accurate bounds. We manage to determine good bounds for the K4 ladder example consisting of 71 components for the independent case and 66 components for the dependent case (Table 3). The correctness of the algorithm has been reaffirmed for the independent case by providing exact results in advance [2]. Depending on certain SIC settings, the results in Fig.5 and Fig.6 indicate that the algorithm delivers reasonable unreliability values. Further on, the measurements show that the number of subproblems can be dramatically reduced by the help of reductions. In future, we want to extend our algorithm to be able to deal with the K-terminal problem [5] [10]. Also node failures shall be considered because in reality nodes representing servers or routers might fail as well [18][3]. As this work is part of the main framework LARES [15], the conceived algorithms will prospectively be integrated into the LARES tool-chain for allowing to assess the availability of systems modelled by LARES. Acknowledgements. We would like to thank M.Siegle, A. Gouberman, M. Riedl and J. Schuster from Universit¨ at der Bundeswehr for their insightful discussions. A special thank is dedicated to C. Tanguy from Orange FTGroup for his cordial support. We also thank the four anonymous reviewers for their helpful comments. This work is partly funded by the Deutsche Forschungsgemeinschaft within the project BO 818/8-1: ”Modellbasierte Analyse der Verl¨ asslichkeit komplexer fehlertoleranter Systeme”.
References 1. Satyanarayana, A., Chang, M.K.: Network reliability and the factoring theorem. Networks 13(1), 107–120 (1983) 2. Tanguy, C.: Asymptotic mean time to failure and higher moments for large, recursive networks. In: CoRR (2008)
Bounds for Two-Terminal Network Reliability
45
3. Torrieri, D.: Calculation of node-pair reliability in large networks with unreliable nodes. IEEE Trans. Reliability 43(3), 375–377 (1994) 4. Yeh, F.M., Lu, S.K., Kuo, S.Y.: Determining terminal-pair reliability based on edge expansion diagrams using obdd. IEEE Trans. Reliability 48(3), 234–246 (1999) 5. Hardy, G., Lucet, C., Limnios, N.: K terminal network reliability measures with binary decision diagrams. IEEE Trans. Reliability 56(3), 506–515 (2007) 6. Gillespie, D.T.: Exact stochastic simulation of coupled chemical reactions. Journal of Physical Chemistry 81(25), 2340–2361 (1977) 7. Hermanns, H., Herzog, U., Katoen, J.P.: Process algebra for performance evaluation. Theoretical Computer Science Archive 274(1-2), 43–87 (2002) 8. Dugan, J.B., Venkataraman, B., Gulati, R.: Diftree: a software package for the analysis of dynamic fault tree models. In: RAMS (1997) 9. Sullivan, K.J., Coppit, D.: Galileo: A tool built from mass-market applications. In: ICSE (2000) 10. Wood, K.: A factoring algorithm using polygon-to-chain reductions for computing k-terminal network reliability. Networks 15(2), 173–190 (1985) 11. Marsan, M.A., Balbo, G., Conte, G., Donatelli, S., Franceschinis, G.: Modelling with generalized stochastic Petri nets. John Wiley & Sons (1995) 12. Kuntz, M., Siegle, M., Werner, E.: Caspa - a tool for symbolic performance and dependability evaluation. In: EPEW (FORTE Co-located Workshop), pp. 293–307 (2004) 13. Lˆe, M., Walter, M.: Considering dependent components in the terminal pair reliability problem. In: DYADEM-FTS 2011, pp. 415–422 (2011) 14. Pock, M., Walter, M.: Efficient extraction of the structure formula from reliability block diagrams with dependent basic events. Journal of Risk and Reliability 222(3), 393–402 (2008) 15. Walter, M., Gouberman, A., Riedl, M., Schuster, J., Siegle, M.: LARES - A Novel Approach for Describing System Reconfigurability in Dependability Models of Fault-Tolerant Systems. In: ESREL (2009) 16. Walter, M., Esch, S., Limbourg, P.: A copula-based approach for dependability analyses of fault-tolerant systems with interdependent basic events. In: ESREL, pp. 1705–1714 (2008) 17. Deo, N., Medidi, M.: Parallel algorithms for terminal pair reliability. IEEE Trans. Reliability 41(2), 201–209 (1992) 18. Theologou, O.R., Carlier, J.G.: Factoring and reductions for networks with imperfect vertices. IEEE Trans. Reliability 40(2), 210–217 (1991) 19. Nelsen, R.B.: An Introduction to Copulas. Springer, Heidelberg (1999) 20. Sahner, R., Trivedi, K., Puliafito, A.: Performance and Reliability Analysis of Computer Systems. Kluwer Academic Publishers (1996) 21. Hsu, S.J., Yuang, M.C.: Efficient computation of terminal-pair reliability using triangle reduction in network management. ICC on Communications 1, 281–285 (1998) 22. Dotson, W.P., Gobein, J.: A new analysis technique for probabilistic graphs. IEEE Trans. Circuit & Systems 26(10), 855–865 (1979) 23. Chen, Y.G., Yuang, M.C.: A cut-based method for terminal-pair reliability. IEEE Trans. Reliability 45(3), 413–416 (1996)
Software Reliability Testing Covering Subsystem Interactions

Matthias Meitner and Francesca Saglietti

Chair of Software Engineering, University of Erlangen-Nuremberg, 91058 Erlangen, Germany
{matthias.meitner,saglietti}@informatik.uni-erlangen.de
Abstract. This article proposes a novel approach to quantitative software reliability assessment ensuring high interplay coverage for software components and decentralized (sub-)systems. The generation of adequate test cases is based on the measurement of their operational representativeness, stochastic independence and interaction coverage. The underlying multi-objective optimization problem is solved by genetic algorithms. The resulting automatic test case generation supports the derivation of conservative reliability measures as well as high interaction coverage. The practicability of the approach developed is finally demonstrated in the light of an interaction-intensive example. Keywords: Software reliability, interaction coverage, component-based system, system of systems, emergent behavior, statistical sampling theory, testing profile, multi-objective optimization, genetic algorithm.
1 Introduction
The systematic re-use of tested and proven-in-use software components evidently contributes to a significant reduction in software development effort. By the principles of abstraction and partition the component-based paradigm supports the transparency of complex logic both from a constructive and an analytical point of view. Nonetheless, a number of spectacular incidents [5] proved that various risks may still be hidden behind inappropriate component interaction, even in case of inherently correct components. For such reasons, novel approaches were recently developed, aimed at systematic, measurable and reproducible integration testing for component-based software (e.g. [1, 4, 13, 14]). Meanwhile, this issue is becoming particularly crucial in case of decentralized, autonomous systems interacting for coordination purposes, so-called systems-of-systems: in fact, while so far classical software engineering has been mainly concerned with the implementation of well-structured, monolithic or component-based code to be designed in the context of a common project, modern applications increasingly involve the independent development of autonomous software systems, merely
communicating with each other for the purpose of a super-ordinate cooperation. Due to the inherent autonomy of such systems, often enough the multiplicity of their potential interactions cannot be systematically tested during integration. As the functional scope of single subsystems may evolve with time at a high degree of autonomy, the multiplicity of their interplay may increase at rapid pace, possibly resulting in unforeseeable interplay effects, generally known as emergent behavior [8]. Therefore, especially when dealing with safety-critical applications, a preliminary software reliability assessment must take into account potential emergent behavior by accurately identifying the variety of potential scenarios involving the interaction of autonomous parts and by assessing their adequacy via operationally representative test cases. In other words, a rigorous reliability evaluation must be based on behavioral observations
• reflecting operative conditions,
• at the same time capturing high amounts of potential interactions between components resp. subsystems.
Concerning the first requirement, a well-known and technically sound approach to quantitative software reliability estimation is provided by statistical sampling theory (introduced in section 2) on the basis of operationally representative observations. Admittedly, in general this technique may not be easily practicable; nonetheless, it could be successfully applied to a real-world software-based gearbox controller for trucks within an industrial cooperation [15, 16]. Though successful in evaluating operational experience, statistical sampling does not address interplay coverage, which may be measured according to different criteria (see section 3 and [1, 4, 9, 13, 14, 17]). An integration test exclusively targeted to the detection of emergent behavior, on the other hand, is not necessarily representative for the application-specific operational profile and thus does not support the sound derivation of probabilistic software reliability estimates. The novelty of the approach presented in this article consists in combining the above mentioned, diverse perspectives into a common procedure capable of generating test cases supporting both sound reliability estimation and high interaction coverage for highly reliable and interaction-intensive software. The article is organized as follows:
• section 2 provides a brief introduction into software reliability evaluation by statistical sampling theory;
• section 3 presents a number of metrics addressing coverage of component resp. (sub-)system interaction;
• section 4 illustrates the potential shortcomings of a statistical sample exclusively based on the operational profile;
• section 5 proposes a novel approach targeting the combined optimization of three different objectives (namely, operational representativeness, stochastic independence and interaction coverage);
• section 6 reports on the application of the approach to a highly-interactive component-based software system;
• finally, section 7 summarizes the investigations reported and the conclusions drawn.
2 Reliability Assessment by Statistical Sampling Theory
Statistical sampling theory is a well-established approach for deriving a reliability estimate for a software system [3, 7, 10, 11]. It allows one to derive
• at any given confidence level β and
• for a sufficiently high number n of correctly performing test cases (n > 100)
• an upper bound p̃ of the unknown failure probability p,
i.e. with:
P(p ≤ p̃) = β    (1)
The theory requires the fulfillment of a number of conditions; a part of them concerns the testing process and must be ensured by appropriate test bed resp. quality assurance measures:
• Test run independence: the execution of a test case must not influence the execution of other test cases. If required, this may be enforced by resetting mechanisms.
• Failure identification: in order to exclude optimistic reliability estimates, failure occurrence must be properly identified by dependable test oracles, typically plausibility checks based on domain-specific expert judgment.
• Correct test behavior: no failure occurrence is observed during testing. In principle, the theory allows for a low number of failure observations, at the cost of correspondingly lower derivable reliability estimates.
Other conditions concern the selection of test data, which is central in this article:
• Test data independence: as the application of statistical sampling theory is based on a number of independent experiments, the selection of one test case must not influence the selection of the other test cases.
• Operationally representative test profile: as reliability measures must refer to a given operational profile, test cases must be selected with the same probability of occurrence.
If all the conditions mentioned above are met, the following quantitative relation between the number n of test cases, the failure probability upper bound p̃ and the confidence level β can be derived [3, 18]:
p̃ = 1 − (1 − β)^(1/n)    (2)

Table 1 shows some examples for this relation.
Table 1. Examples for the relation between n, p̃ and β

      n      p̃      β
  4 603   10⁻³   0.99
 46 050   10⁻⁴   0.99
 69 074   10⁻⁴   0.999
690 773   10⁻⁵   0.999
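Relation (2) translates directly into code. The following sketch (illustrative only, not part of the original article; the function names are ours) reproduces the entries of Table 1 and inverts the relation to obtain the number of test cases required for a desired bound:

```python
import math

def failure_prob_upper_bound(n: int, beta: float) -> float:
    """Upper bound p~ on the unknown failure probability after n correct,
    independently selected, operationally representative test runs (eq. (2))."""
    return 1.0 - (1.0 - beta) ** (1.0 / n)

def required_test_cases(p_bound: float, beta: float) -> int:
    """Smallest n for which the derivable bound is at most p_bound."""
    return math.ceil(math.log(1.0 - beta) / math.log(1.0 - p_bound))

# reproduces Table 1, e.g. n = 4603, beta = 0.99 yields p~ of about 1e-3
for n, beta in [(4603, 0.99), (46050, 0.99), (69074, 0.999), (690773, 0.999)]:
    print(n, beta, failure_prob_upper_bound(n, beta))
```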
Because the costs for applying this approach during a preliminary testing phase may be considerable, posterior evidence collected during operation may also be exploited to lower the costs significantly. This was successfully carried out for a software-based gearbox controller within an industrial research cooperation [15, 16]. In order to derive conservative reliability estimates at pre-defined confidence levels, operational data collected during road testing was analyzed. In this particular case, test validation simply consisted of checking that the gear shifts commanded were actually carried out within a pre-defined time frame.
3 Measures of Interaction Coverage
Several measures of interaction coverage were introduced in the past, among others in [1, 4, 13, 14, 17]; some of them are based on models arising during the early or late design phases (like state diagrams and sequence diagrams), while others directly relate to component resp. system invocations captured at code level. As the perspective taken in this article is focused on the assessment of highly reliable software by statistical testing, the amount of interactions covered by test cases is also measured in the light of executed code instructions. Inspired by classical data flow coverage [12], interaction testing criteria transfer structural concepts from code to interfaces. Among them, coupling-based testing [4] addresses the following coupling categories:
• parameter coupling, where one method calls another method and passes parameters to it;
• shared data coupling, where two methods use the same global variable;
• external device coupling, where two methods use the same external device (e.g. a database or a file).
Coupling-based testing examines the interactions between components or systems, where one method (the caller) calls another method (the callee). The node in the control flow graph containing the invocation is called a call site. A node containing the definition of a variable that can reach a use in another component on some execution path is called a coupling-def. [4] distinguishes three types of coupling-defs:
50
M. Meitner and F. Saglietti
• last-def-before-call: last definition of a formal parameter before a call;
• last-def-before-return: last definition of a formal parameter before a return statement;
• shared-data-def: definition of a global variable.

A coupling-use is a node containing the use of a variable that has been defined in another component and that can be reached on some execution path. There are three different kinds of coupling-uses [4]:
• first-use-after-call: first use of a formal parameter in the caller after the return statement;
• first-use-in-callee: first use of a formal parameter in the callee;
• shared-data-use: use of a global variable.
A path is called a def-clear path with respect to a certain variable if there is no definition of that variable along that path. A coupling path between two components is a def-clear path from a coupling-def of a variable to a coupling-use of the same variable in another component. Figure 1 illustrates the concepts for coupling-based testing by means of an example.
Fig. 1. Coupling-based testing
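For illustration, the following fragment (a hypothetical caller/callee pair of our own, not the example from Figure 1; Python has no out-parameters, so the return value stands in for a formal parameter returned to the caller) exhibits the coupling notions just defined:

```python
shared = 0  # global variable: subject of shared data coupling

def callee(p):
    q = p + 1        # first-use-in-callee of the formal parameter p
    global shared
    shared = q       # shared-data-def of the global variable
    return q         # the assignment to q above acts as a last-def-before-return

def caller():
    x = 41           # last-def-before-call: this definition reaches callee
    y = callee(x)    # call site (parameter coupling between caller and callee)
    print(y)         # first-use-after-call of the value returned by callee
    print(shared)    # shared-data-use: ends a coupling path starting in callee
```

The path from the definition of x to the use of p in callee is def-clear with respect to x and therefore a coupling path; the all-coupling-uses criterion defined next would require a test case exercising it.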
In [4] the following coupling-based coverage criteria were defined:
• call coupling: all call sites must be covered;
• all-coupling-defs: for each variable at least one coupling-path from each definition to at least one of its coupling-uses must be covered;
• all-coupling-uses: for each variable at least one coupling-path from each definition to all reachable coupling-uses must be covered;
• all-coupling-paths: for each variable all coupling-paths from each definition to all reachable coupling-uses must be covered. As this definition would require an unbounded number of test cases in case of loops, the criterion was weakened, so that each loop body has to be skipped and executed at least once.
In the particular case of object-oriented programming supporting reusability and maintainability by inheritance and polymorphism, interaction coverage has to be strengthened to take account of potential side effects. For example, faults may arise by incorrect dynamic binding in case of a property being fulfilled in some inheritance contexts, but violated in others. In order to increase the chances of detecting such anomalies, the above mentioned coupling-based coverage concepts were extended to the object-oriented paradigm [1] by additionally requiring also context coverage.
4 Potential Bias of Operationally Representative Samples
4.1 Shortcoming of Reliability Testing
Though well-founded, reliability testing by statistical sampling theory has a fundamental shortcoming: it depends on one single random experiment. This experiment consists of generating a number of test cases according to a given distribution intended to reflect the expected operational profile. As this random generation is carried out only once, it cannot guarantee to cover all required testing scenarios. In fact, even assuming an accurate knowledge of the operational profile, the resulting randomly generated test case sample may still deviate from statistical expectation. In particular, a considerable number of relevant scenarios involving crucial interactions may be fully neglected, as will be illustrated by the following example. For this reason, it is felt that, especially in case of safety-critical software-based applications, reliability testing should be enhanced by extending the sample demands to include – beyond operational representativeness and stochastic independence – also interaction coverage. The novel approach developed for this purpose will be introduced in section 5.

4.2 Application Example
This section introduces an example for a software-based system involving a high degree of component interactions. It consists of 4 parts:
• one of them represents a central controller (the so-called Control Component),
• while the remaining 3 components represent cooperating tasks (Service 1, Service 2 and Service 3) to be invoked and parameterized by the Control Component.
The application processes the following 4 input parameters:
• parameter 1 of type Integer;
• parameter 2 of type Integer;
• parameter 3 of type Double;
• parameter 4 of type Double.
Depending on these inputs, the Control Component invokes one or more of the Service Components, providing each of them with 8 parameters:
• 4 of them (the so-called data parameters) are common to all components,
• while the remaining 4 control parameters are component-specific control elements.
Figure 2 offers a graphical representation of the invocation hierarchy.
Fig. 2. Interacting components of software application
In terms of the all-coupling-uses criterion the system includes 155 def-use pairs. The operational profile of the application is provided by defining for each of the 4 independent inputs its corresponding probability density function, as shown in Table 2.

4.3 Evaluation
For each n ∈ {1 000, 3 000, 10 000, 20 000, 50 000}, 10 experiments were carried out, each consisting of generating n test cases according to the given operational profile. The number of def-use pairs covered by each of the 50 resulting experiments was successively determined, as shown in Table 3.

Table 2. Parameters with corresponding probability density functions

software inputs | distribution         | probability density function                            | distribution parameters
parameter1      | uniform distribution | f(x) = 1/(b − a)                                        | a = −10 000, b = 20 000
parameter2      | uniform distribution | f(x) = 1/(b − a)                                        | a = 0, b = 1 000 000
parameter3      | Weibull distribution | f(x) = (a/b) · ((x − c)/b)^(a−1) · e^(−((x − c)/b)^a)   | a = 2, b = c = 10 000
parameter4      | normal distribution  | f(x) = (1/√(2πσ²)) · e^(−(x − μ)²/(2σ²))                | μ = 0, σ = 1 000
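A random test case generator for this profile is a direct transcription of Table 2; the sketch below is our reading of the table (using NumPy; the Weibull location parameter c is treated as a shift, and sampling the two Integer parameters uniformly over whole numbers is an assumption):

```python
import numpy as np

rng = np.random.default_rng()

def generate_test_case():
    """One test case (4 inputs) drawn according to Table 2."""
    p1 = int(rng.integers(-10_000, 20_000, endpoint=True))  # uniform Integer
    p2 = int(rng.integers(0, 1_000_000, endpoint=True))     # uniform Integer
    p3 = 10_000 + 10_000 * rng.weibull(2.0)  # Weibull: a = 2, b = 10 000, shift c = 10 000
    p4 = rng.normal(0.0, 1_000.0)            # normal: mu = 0, sigma = 1 000
    return p1, p2, p3, p4

test_set = [generate_test_case() for _ in range(10_000)]
```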
For example, it can be noticed that the 10 experiments devoted to the random generation of 10 000 test cases only covered between 124 and 132 def-use pairs; in other words, they failed to generate test cases triggering between 23 and 31 component interactions. This means that potential faults affecting the uncovered interactions (about 15% – 20% of all interactions) would remain undetected by any of these test samples. Although these effects tend to diminish with increasing test size, the concern remains valid even for the highest test size. In order to address potential emergent behavior during reliability testing, therefore, the original statistical sampling approach was extended to capture also interaction coverage among components, subsystems or systems.

Table 3. Minimum and maximum numbers of def-use pairs covered for each test size n

test size n   | 1 000 | 3 000 | 10 000 | 20 000 | 50 000
min coverage  |    76 |    97 |    124 |    136 |    144
max coverage  |    89 |   109 |    132 |    141 |    146
5 Multi-criteria Test Case Optimization
Since the approach presented in this article focuses on the generation of optimal test case sets, only the conditions concerning the selection of test cases are considered. Further conditions mentioned in section 2 and concerning quality assurance of product and test bed (like resetting mechanisms, test oracles, restart of the whole process after fault removal) are outside the scope of this article. Statistical sampling theory requires independently selected test cases; in other words, the input values must not be correlated. On the other hand, input parameters may be functionally or semantically dependent due to the nature of the application under test, e.g.
• by physical laws, like wavelength, speed and frequency, or
• by logical patterns, like the coefficient values of an invertible matrix.
Evidently, correlations due to physical laws or logical patterns cannot be removed; these application-inherent dependencies must be captured by the operational profile. Further correlations arising by instantiation of functionally independent parameters (e.g. numerical dependencies), however, have to be avoided or removed by filters.

5.1 Objective 1: Operational Representativeness of Test Cases
The operational profile of the system under test is assumed to be available on the basis of a preliminary estimation concerning the frequency of occurrence of the input
parameters. While functionally independent parameters can be randomly generated according to this operational profile, the functionally correlated parameters must be defined such as to fulfill their application-specific dependencies. The degree of operational representativeness of test cases can be measured by different goodness-of-fit tests, like the χ² test [6], the Kolmogorov-Smirnov test [6] or the Anderson-Darling test [6]. They quantify the confidence in the validity of a null hypothesis, in our case H0: "the observed distribution is consistent with the pre-specified distribution"
in the light of the data observed. Depending on the goodness-of-fit test, a so-called test statistic S1 is first determined; for example, for the χ² test, the statistic is defined as follows:

χ² = Σ_{i=1}^{k} (O_i − E_i)² / E_i    (3)
where k denotes the number of bins that contain the data, O_i the observed and E_i the expected frequency of bin i (1 ≤ i ≤ k). The validity of the null hypothesis is successively verified by a significance test determining a critical threshold T1. If the test statistic S1 is higher than the critical value T1, then the null hypothesis is rejected, otherwise accepted; in this case the test data can be taken as sufficiently representative for the distribution specified. Figure 3 shows an exemplifying goodness-of-fit test for a specified Gamma distribution.
Fig. 3. Distribution fitting
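As an illustration of this first objective, the χ² statistic (3) and the threshold comparison can be sketched as follows (SciPy supplies the critical value T1; the choice of bin edges is an assumption of ours):

```python
import numpy as np
from scipy import stats

def chi_square_gof(sample, specified_cdf, bin_edges, significance=0.1):
    """Chi-square goodness-of-fit test of H0: 'the observed distribution
    is consistent with the pre-specified distribution' (equation (3))."""
    observed, _ = np.histogram(sample, bins=bin_edges)
    expected = len(sample) * np.diff([specified_cdf(e) for e in bin_edges])
    s1 = float(np.sum((observed - expected) ** 2 / expected))      # statistic S1
    t1 = stats.chi2.ppf(1.0 - significance, df=len(observed) - 1)  # threshold T1
    return s1, t1, s1 <= t1  # H0 accepted iff S1 does not exceed T1
```

For the normal input parameter of Table 2, for instance, specified_cdf could be stats.norm(0, 1000).cdf.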
5.2 Objective 2: Test Case Independence
Statistical sampling theory further requires the independent selection of test cases; in other words, the values of parameters previously identified as independent must not be correlated. For this purpose, both auto- and cross-correlation measures are considered, each yielding a statistical correlation coefficient S2. Auto-correlation describes the dependence of a specific parameter instance within a test case on other instances of the same parameter in further test cases. Hereby, the
auto-correlation metrics consider the so-called lag between test cases, i.e. the distance between two test cases w.r.t. the sequence order of their generation; in particular, the lag between test cases generated in direct succession is 1. Cross-correlation, on the other hand, describes the dependencies between different parameters within the same test case. There are several metrics to measure cross-correlation, mainly differing in terms of computational complexity and dependency type addressed. Among the metrics selected for the approach illustrated in this article are Pearson's product moment correlation coefficient [2], Spearman's rank correlation coefficient and Cramer's V [19]. For example, for two different random variables X and Y, Pearson's product moment correlation coefficient r_xy is defined as follows:
r_xy = ( (1/n) Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) ) / ( √( (1/n) Σ_{i=1}^{n} (x_i − x̄)² ) · √( (1/n) Σ_{i=1}^{n} (y_i − ȳ)² ) )    (4)
where n denotes the number of test cases and x_i resp. y_i (1 ≤ i ≤ n) denote values of X and Y with corresponding average values x̄ and ȳ. To determine whether the parameters are correlated or not, the correlation coefficient S2 is compared with a maximum threshold value T2. Similarly to the goodness-of-fit tests, parameters with correlation S2 higher than T2 cannot be taken as sufficiently independent.
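Equation (4) and the threshold check translate into a few lines; the same routine can also serve the auto-correlation measure when a parameter sequence is paired with a lag-shifted copy of itself (the lag handling below is our reading of the definition above):

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's product moment correlation coefficient (equation (4))."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2)))

def sufficiently_independent(x, y, t2):
    """The correlation coefficient S2 must not exceed the threshold T2."""
    return abs(pearson_r(x, y)) <= t2

def autocorrelation(x, lag=1):
    """Auto-correlation of one parameter across test cases at a given lag."""
    return pearson_r(x[:-lag], x[lag:])
```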
5.3 Objective 3: Interaction Coverage
The coupling-based testing criteria presented in section 3 can be arranged in the subsumption hierarchy shown in Figure 4 [4].
Fig. 4. Subsumption hierarchy of coupling-based testing
While call coupling only considers the invocation of methods and is therefore too weak to measure interaction coverage, the other criteria are appropriate for being used in the optimization procedure. The example presented in section 6 is based on the all-coupling-uses criterion, whose coverage measure will be denoted by S3.
5.4 Combination of Objectives 1, 2 and 3
The new approach presented in this article combines the three above mentioned criteria. Since the main objective is the generation of a dependable software reliability estimate, objectives 1 and 2 must be fulfilled in order to enable the application of statistical sampling theory. In other words, both these criteria are knock-out criteria dominating over objective 3, which should be maximized without violating them. The high complexity of this problem makes the application of systematic and analytical techniques inadequate. Therefore, a heuristic approach to this multi-objective optimization problem is applied, making use of genetic algorithms. In general, these proceed iteratively by evaluating the fitness of single individuals and by generating new populations based on the best individuals found so far and on genetic manipulations of past populations. In this specific case single individuals are sets of test cases, where each single test case consists of values to be assigned to the input variables. Cross-over operations may be used at two different levels: at the higher level, test case sets exchange test cases, while at the lower level only values of single input variables are swapped. Test case sets are mutated by deleting individual test cases and by generating an identical number of new ones, or by random mutations of individual input parameters. In addition to the genetic manipulations described, the elitism operator is applied to maintain unaltered the best test case sets generated so far. In order to determine the fitness of a candidate test case set with respect to its fulfillment of objectives 1 and 2, the values Si (i ∈ {1, 2}) determined by goodness-of-fit tests (as introduced in section 5.1) and by auto- resp. cross-correlation metrics (as introduced in section 5.2) are first normalized to lie within the interval [0; 1] by the following normalization procedure N:
• for Si ∈ [0; Ti] let N(Si) = 1;
• for Si ∈ ]Ti; maxi] let N(Si) be defined as shown in Figure 5, where maxi denotes the highest value taken by Si within a test case population.
Fig. 5. Normalization function for objectives 1 and 2
Interaction coverage measures S3 do not require any normalization, as they already lie within the interval [0; 1] by definition. For each test case set, the fitness function is defined as the following weighted sum of its three normalized measures:

fitness value (test case set) = 1.0 · N(S1) + 1.0 · N(S2) + 0.1 · S3
The coefficient 0.1 is chosen such as to maximize the weight of S3, while preventing the violation of either of the two knock-out criteria, even in case of full interaction coverage, because:
• a test case set that violates a knock-out criterion has a fitness value < 2,
• a test case set that fulfills both knock-out criteria has a fitness value ≥ 2.
The genetic algorithm successively selects individuals with higher fitness values at higher probability, such that the result of the optimization is a test case set that does not violate any knock-out criterion.
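For illustration, the normalization N and the weighted fitness can be sketched as follows. The exact shape of N between Ti and maxi is given only graphically (Figure 5); here a linear decay capped strictly below 0.9 is assumed, so that the knock-out property stated above holds even at full interaction coverage:

```python
def normalize(s, t, s_max):
    """Normalization N for objectives 1 and 2 (cf. Figure 5): values up
    to the threshold t map to 1; values in ]t; s_max] map strictly below
    0.9 (assumes s_max > t), so that a violated knock-out criterion
    always keeps the fitness below 2."""
    if s <= t:
        return 1.0
    return 0.9 * (s_max - s) / (s_max - t)

def fitness(s1, s2, s3, t1, t2, max1, max2):
    """Weighted sum 1.0*N(S1) + 1.0*N(S2) + 0.1*S3: the coverage term
    (at most 0.1) can never mask a knock-out violation."""
    return 1.0 * normalize(s1, t1, max1) + 1.0 * normalize(s2, t2, max2) + 0.1 * s3
```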
6 Example
The new approach was applied to the software system presented in section 4.2. The interaction coverage is measured with respect to the all-coupling-uses criterion introduced in section 3. Goodness-of-fit tests are carried out at a significance level of 0.1. The genetic algorithm proceeds as follows:
• initially, it generates 10 random test case sets according to the operational profile,
• successively, it evaluates their fitness and
• starts the optimization process.
This process involves 10 optimization runs where the genetic operators (selection, cross-over and mutation) are applied to test case sets consisting of a fixed number of test cases. This predefined number is chosen in view of the reliability estimate to be derived after optimization; therefore, it does not change during the optimization procedure. Figure 6 shows the evolution of the coverage achieved for test set sizes of 10 000 resp. 50 000 test cases. As already mentioned in section 4.2, the application contains 155 feasible def-use pairs in terms of the all-coupling-uses criterion. The best of the initial 10 test case sets consisting of 10 000 test cases managed to cover 132 def-use pairs. After genetic optimization the resulting test set improved to 152 covered def-use pairs, or nearly 98% of all feasible pairs. The multi-objective approach applied to sets containing 50 000 test cases managed to reach 100% coverage after optimization. After validation of the test results, such a test allows one to derive a conservative software reliability estimate p̃ < 9.21 · 10⁻⁵ at confidence level β = 0.99. Figure 7 shows the output of such an optimization run.
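The optimization loop itself can be outlined as follows (a simplified sketch of the procedure described above; operator details such as the crossover point and the mutation rate are our assumptions):

```python
import random

def optimize(sample_profile, evaluate_fitness, set_size, pop_size=10, runs=10):
    """Genetic optimization of test case sets of fixed size, using
    selection, cross-over, mutation and elitism as described above."""
    population = [[sample_profile() for _ in range(set_size)]
                  for _ in range(pop_size)]
    for _ in range(runs):
        ranked = sorted(population, key=evaluate_fitness, reverse=True)
        offspring = [ranked[0]]                # elitism: keep the best set unaltered
        while len(offspring) < pop_size:
            a, b = random.sample(ranked[:pop_size // 2], 2)  # selection
            cut = random.randrange(1, set_size)
            child = a[:cut] + b[cut:]          # cross-over: exchange test cases
            if random.random() < 0.2:          # mutation: regenerate one test case
                child[random.randrange(set_size)] = sample_profile()
            offspring.append(child)
        population = offspring
    return max(population, key=evaluate_fitness)
```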
Fig. 6. Covered def-use pairs for test case sets with n = 10 000 and n = 50 000
Fig. 7. Optimization result
7 Conclusion
In this article a new approach to software reliability assessment combined with high interaction coverage was presented. Optimal test case sets are generated by use of genetic algorithms. An adequate fitness function considers the objectives of operational representativeness, test case selection independence and interaction coverage. The approach was tested on a software system showing that interaction coverage can be significantly increased while guaranteeing the conditions required for statistical testing, such that a well-founded, conservative reliability estimate can be derived. Acknowledgment. The authors gratefully acknowledge that the work presented was partly funded by Siemens Corporate Technology.
References

1. Alexander, R.T., Offutt, A.J.: Coupling-based Testing of O-O Programs. Journal of Universal Computer Science 10(4) (2004)
2. Hartung, J.: Statistik. Oldenbourg (1995)
3. Ehrenberger, W.: Software-Verifikation. Hanser Verlag (2002)
4. Jin, Z., Offutt, A.J.: Coupling-based Criteria for Integration Testing. Software Testing, Verification & Reliability 8(3), 133–154 (1998)
5. Jung, M., Saglietti, F.: Supporting Component and Architectural Re-usage by Detection and Tolerance of Integration Faults. In: 9th IEEE International Symposium on High Assurance Systems Engineering (HASE 2005). IEEE Computer Society (2005)
6. Law, A.M., Kelton, W.D.: Simulation, Modeling and Analysis. McGraw-Hill (2000)
7. Littlewood, B., Wright, D.: Stopping Rules for Operational Testing of Safety Critical Software. In: 25th International Symposium on Fault Tolerant Computing, FTCS-25 (1995)
8. Maier, M.W.: Architecting Principles for Systems-of-Systems. Systems Engineering 1(4), 267–284 (1998)
9. Oster, N., Saglietti, F.: Automatic Test Data Generation by Multi-objective Optimisation. In: Górski, J. (ed.) SAFECOMP 2006. LNCS, vol. 4166, pp. 426–438. Springer, Heidelberg (2006)
10. Parnas, D., van Schouwen, J., Kwan, S.: Evaluation of Safety-critical Software. Communications of the ACM 33(6) (1990)
11. Quirk, W.J. (ed.): Verification and Validation of Real-time Software. Springer, Heidelberg (1985)
12. Rapps, S., Weyuker, E.J.: Data Flow Analysis Techniques for Test Data Selection. In: 6th International Conference on Software Engineering, ICSE 1982 (1982)
13. Rehman, M., Jabeen, F., Bertolino, A., Polini, A.: Software Component Integration Testing: A Survey. Journal of Software Testing, Verification, and Reliability, STVR (2006)
14. Saglietti, F., Oster, N., Pinte, F.: Interface Coverage Criteria Supporting Model-Based Integration Testing. In: Workshop Proceedings of the 20th International Conference on Architecture of Computing Systems (ARCS 2007), VDE (2007)
15. Söhnlein, S., Saglietti, F., Bitzer, F., Meitner, M., Baryschew, S.: Software Reliability Assessment based on the Evaluation of Operational Experience. In: Müller-Clostermann, B., Echtle, K., Rathgeb, E.P. (eds.) MMB&DFT 2010. LNCS, vol. 5987, pp. 24–38. Springer, Heidelberg (2010)
16. Söhnlein, S., Saglietti, F., Meitner, M., Bitzer, F.: Bewertung der Zuverlässigkeit von Software. Automatisierungstechnische Praxis, 52. Jahrgang, 6/2010, 32–39, Oldenbourg Industrieverlag (2010)
17. Spillner, A.: Test Criteria and Coverage Measures for Software Integration Testing. Software Quality Journal 4, 275–286 (1995)
18. Störmer, H.: Mathematische Theorie der Zuverlässigkeit. R. Oldenbourg (1970)
19. Storm, R.: Wahrscheinlichkeitsrechnung, mathematische Statistik und Qualitätskontrolle. Hanser Verlag (2007)
Failure-Dependent Timing Analysis
A New Methodology for Probabilistic Worst-Case Execution Time Analysis

Kai Höfig

AG Software Engineering: Dependability, University of Kaiserslautern, Kaiserslautern, Germany
[email protected] http://agde.cs.uni-kl.de/
Abstract. Embedded real-time systems are growing in complexity, which goes far beyond simplistic closed-loop functionality. Current approaches for worst-case execution time (WCET) analysis are used to verify the deadlines of such systems. These approaches calculate or measure the WCET as a single value that is expected as an upper bound for a system's execution time. Overestimations are taken into account to make this upper bound a safe bound, but modern processor architectures expand those overestimations into unrealistic areas. Therefore, we present in this paper how safety analysis model probabilities can be combined with elements of system development models to calculate a probabilistic WCET. This approach can be applied to systems that use mechanisms belonging to the area of fault tolerance, since such mechanisms are usually quantified using safety analyses to certify the system as being highly reliable or safe. A tool prototype implementing this approach is also presented, which provides reliable safe upper bounds by performing a static WCET analysis and which overcomes the frequently encountered problem of dependence structures by using a fault injection approach. Keywords: fault tolerance, software safety, static analysis, tool, WCET, fault tree.
1 Introduction
Embedded real-time systems are growing in complexity, which goes far beyond simplistic closed-loop functionality. Modern systems also execute on complex input data types or implement rich communication protocols. The underlying hardware has to provide more and more resources and is therefore extended by caches or multi-core processors, for instance. To assure the quality of such systems, e.g., their reliability or safety, analyses have to cope with this extended complexity. We present in this paper a new approach for timing analysis that
reduces overestimations, which are often based on the system’s increased complexity. Since many embedded systems are real-time systems, which are frequently safety critical, the probability of a timing failure can violate reliability requirements. Current approaches for worst-case execution time (WCET) analysis are used to verify the execution time of a system under worst-case conditions. These approaches calculate or measure the WCET as a single value that is expected as an upper bound for a system’s execution time. Overestimations are taken into account to make this upper bound a safe bound that guarantees termination within a given deadline. Modern processor architectures with caches, multi-threading, and instruction pipelines often expand those overestimations for safe upper bounds into unrealistic areas, making them useless in an industrial context [1]. The former assumption that a missed deadline is, in the worst case, equivalent to always missing the deadline is too stringent for systems that require only a probabilistic guarantee that a task’s deadline miss ratio is below a given threshold [2]. Some approaches try to solve this problem by calculating multiple upper bounds and argue that each single upper bound will hold for a certain probability (probabilistic worst-case execution time). As summarized in Section 2 of this paper, many of those approaches require either probabilities as input or make assumptions, which have to be verified for each system to be fulfilled, in order to apply statistical methods. In contrast to these approaches, we show in this paper how safety analysis model probabilities can be combined with elements of system development models to calculate a probabilistic worst-case execution time. Safety analysis models are used here as a source of probabilities. Since safety analysis models typically reflect the occurrence of failures and their propagation through the system, our approach aims at mechanisms in systems that are executed in addition to a failure. Such mechanisms usually belong to the area of fault tolerance and detect or process an error. Examples of safety analysis models that reflect fault tolerance mechanisms can, e.g., be found in [3–9]. The remainder of the paper is organized as follows: Section 2 discusses related approaches. In Section 3, the methodology of failure-dependent timing analysis is formalized. This analysis is applied in an example in Section 4 using a tool prototype. The results of the tool provide multiple worst-case execution times under certain failure conditions. Section 5 concludes this paper and provides a perspective for future work.
2 Related Work
This section describes related work regarding worst-case execution time analysis with particular attention being given to probabilistic approaches. As far as we know, there exists no approach that uses safety analysis models as probabilistic input for WCET analysis. Current approaches in WCET analysis can be divided into deterministic and probabilistic analysis approaches. The difference between typical deterministic WCET approaches and probabilistic approaches is that deterministic WCET
approaches calculate a single execution time for a program, whereas probabilistic WCET approaches calculate multiple execution times for a program, each valid with a certain probability. The approach presented here can be assigned to the category of probabilistic WCET approaches. Approaches in both groups can be further classified into measurement-based approaches and static analysis approaches. In static timing analysis, the execution times of individual static blocks are computed for a given program or a part of it [10–12]. These approaches are able to provide safe upper bounds for the WCET by making pessimistic assumptions at the expense of overestimating the WCET in order to guarantee deadlines for the analyzed program [13]. Advanced (deterministic) approaches, such as those presented in [14, 15], encompass precise hardware models to reduce overestimation of the WCET as much as possible. On the other hand, measurement-based approaches do not need to perform any complex analysis [16–18]. They measure the execution time of a program on real hardware or processor simulators. These approaches are generally unable to provide a safe upper bound for the WCET, since neither an initial state nor a given input sequence can be proven to be the one that produces the WCET [13]. Static analysis approaches thus have important benefits over measurement-based approaches when safe upper bound guarantees are required. For this reason, the approach presented here also uses static analysis. Since deterministic WCET approaches do not distinguish between different execution times for certain probabilities, even a tight upper bound with minor overestimation can be so improbable that it has no real significance. Some probabilistic WCET approaches have emerged in recent years that combine execution times with probabilities in order to focus on significant execution times by incorporating probabilities for different execution times of a program. In [19], the authors use measured execution time samples of a task and apply the central limit theorem to approximate the probability that a task finishes its execution within a given threshold. The authors assume that the inputs for each sample are equally distributed in the real-world application, but this has to be proven for each specific application. For example, when the probability of one input leading to high execution time is more likely than other inputs, the approach cannot be applied. This approach is extended to Extreme Value Theory in [20]. Measurements from random sample data are used to approximate a Gumbel distribution of the execution times. This methodology requires the stochastic independence and identical distribution (i.i.d.) of the input data. This is generally a problematic assumption as mentioned before, especially since stochastic independence is not given for programs that change the world. This problem is considered in [21], where the authors propose resetting the system after every execution or proving that the i.i.d. assumption can be applied for a specific program. To derive safe probabilities (not safe execution time bounds) without proof or reset, they shift the calculated distribution into safe bounds. Furthermore, since failures that influence systems are typically not distributed equally and since failures can also be self-energizing, this approach cannot be
applied to solve the problem of analyzing different execution times under certain failure conditions. A different approach presented in [22] measures the execution times of atomic units of execution, so-called basic blocks, to obtain a probabilistic distribution. These distributions for basic blocks are then combined by applying different rules for each control structure in the syntax tree in a bottom-up process. This results in a distribution for different execution times of an entire program. In ongoing work, the authors present such a set of rules for the sequential, conditional, and iterative execution of basic blocks [23]. Starting with a simple timing schema, where A and B are basic blocks of a program, e.g., leaves of a syntax tree, they formulate the problem that a probability distribution Z of the worst-case execution times of the sequential execution WCET(A, B) is hard to determine because of the dependencies of the probability distributions of A and B. In [1], these dependency structures are estimated by upper and lower bounds using copulas. However, this approach does not provide input for the calculation of distribution functions of basic blocks, and deriving the dependence structure for every control structure of a program is quite complex. In contrast to the measurement-based approaches discussed above, the approach presented in this paper uses safety analysis models as inputs and provides a way to combine them with model elements of embedded systems to perform probabilistic worst-case execution time analysis. Furthermore, the dependency structures of the failure probabilities are handled here by using widely accepted and proven-in-use fault trees with Boolean logic (see Section 4). Complex manual analysis of dependency structures in the code is also not necessary, since automated fault injection can overcome this problem, which is also described in Section 4. Similar to the aforementioned measurement-based approaches, the authors present in [24] an annotation scheme that allows enriching a program's source code with conditional statements. For the top-down decomposition of a source code into a syntax tree, the composition of statements is used bottom-up to calculate different execution times along with their probability of occurrence. The timing schema presented in [24] is quite simplistic and thus not suitable for modern processor architectures. Furthermore, in order to analyze an entire program, the probability for each conditional branch has to be known as input for this methodology. Besides these approaches, which can be clearly assigned to probabilistic WCET analysis, there exist also approaches that deal with probabilistic response times of tasks. In [25], the authors use a so-called Null-Code Service, which immediately terminates when executed, to measure task execution times from response times. Distributions are measured for both the Null-Code Service and the task to be analyzed. The different response times are then subtracted to obtain execution times. Another approach for response times is described in [26]. The authors simulate the execution of a task in a typical embedded environment with state variables and scheduling. Extreme Value Theory is applied to the measurements of task response times to estimate the upper bound of the task's response time for a given probabilistic threshold. Since in this approach,
only response times that are close to a worst-case response time are relevant, the authors use a method called block maxima to eliminate less important measurements and to optimize the number of measuring points. Applicability to entire systems is, as the authors conclude, questionable and part of their future work. Also, some approaches dealing with probabilistic execution times or response times can be found that clearly aim at scheduling, e.g. [2], where the backlog of different scheduling strategies is modeled using Markov chains. The previously discussed approach presented in [20] also aims at scheduling by using a probabilistic WCET. It can be concluded that current probabilistic approaches in WCET analysis mainly derive probabilistically distributed execution times from measurements. As already stated in the introduction to this section, measured execution times can be problematic when upper bound guarantees are required. In general, it cannot be assured that a measured execution time is the WCET or close to it. In contrast to that, the approach presented in this paper is based on static analysis and thereby can provide safe upper bound guarantees. Probabilities and execution times are not obtained via measurements that require certain statistical assumptions to be true, but rather probabilities are extracted from safety analysis models and execution times are proven safe upper bounds taken from static analysis. Dependency structures are here handled using proven-in-use static analysis. Dependencies in failure probabilities can be modeled using Boolean fault tree logic, which is also approved in industry. Since we use failure probabilities as probabilistic input, the approach presented here is limited to systems that act differently in the case of failures. For other systems, the approach presented in this paper is at least as good as a static non-probabilistic analysis. In the next section, such a system is initially described as an example. After that, the Failure-Dependent Timing Analysis approach is presented.
3 Failure-Dependent Timing Analysis
In this section, an example system is first described in paragraph 3.1. This system is used to illustrate the Failure-Dependent Timing Analysis (FDTA) approach and is later analyzed in Section 4 using a tool chain that implements this approach. Paragraph 3.2 describes the combination of development model elements and safety analysis model elements for automated Failure-Dependent Timing Analysis.

3.1 Example System
The example system in this paper is the fault-tolerant subsystem of a Fault-Tolerant Fuel Control System [27]. The Simulink model of this system is depicted in figure 1.

Fig. 1. Simulink model of the subsystem SensorFaultCorrection

The subsystem is fed with four sensor values: throttle delivers the throttle bias, speed delivers the engine speed, map is short for manifold pressure and delivers the intake air pressure, and ego is short for exhaust gas oxygen and
delivers the measured amount of oxygen in the exhaust gas. These sensor values are used to calculate the gasoline intake for the lowest engine emissions. The sensors throttle, speed, and map can be estimated if they are not measurable due to failures. To estimate one of them by doing a table lookup, the other two corresponding sensor values are required. The subsystems Throttle Estimate, Speed Estimate, and EstimateMAP are technically equivalent subsystems that estimate the values for detected sensor failures. The main system detects sensor failures by performing a range check, where all incoming sensor values are checked against a given range. Sensor values that are outside of this range are assumed to be incorrect and the value of this sensor is estimated instead. If one or more of the sensor readings is erroneous, the system switches from low emission to rich mixture mode. In this case, the engine operates with a non-optimal mixture. Since the estimation of such a sensor value requires an additional calculation compared to simply routing the signal through the subsystem SensorFaultCorrection, the execution time of this subsystem depends on the occurrence of failures in the sensors (failure-dependent execution time). The probabilities of such sensor failures are typically part of safety analysis models. Since these models themselves are not new, and since presenting a safety analysis model showing the safety behavior of an entire system would clearly exceed the limitations of this paper, only a simplistic fault tree model for the MAP sensor is provided in figure 2. The fault tree models for the other two sensors are equivalent.

Fig. 2. Component Fault Tree for SensorFaultCorrection

The failure mode detected failure (leftmost rectangle on the top) occurs when either the sensor value is outside a given range or the range check erroneously judges the sensor data to be out of range. The failure mode erroneous sensor data propagated occurs if the sensor data is out of range and the range check erroneously judges the data to be valid. In this case, erroneous sensor data remains undetected and is used erroneously for further processing. Both failure modes are typical for a safety measure, such as the range check applied here, since they model the effectiveness of the measure. The failure mode no TLookup
additionally models that no table lookup is performed. This is the case if either the range check erroneously judges a result as valid or if no failure is detected. This failure must also be modeled for the methodology presented here in order to reflect the required behavior for the failure-dependent execution path. In paragraph 3.2, we describe the combination of development model elements and safety analysis model elements for automated Failure-Dependent Timing Analysis. To illustrate the methodology, the above example is used.

3.2 Analysis
The above description of the system's functionality shows that sensor failures cause additional execution time for a table lookup compared to a failure-free execution. This relation between safety analysis model and system development model can be combined to calculate different execution times for a system along with their probabilities of occurrence. Some of the connections depicted in figure 1 are signals that can be related to failures. They are labeled as throttle-, speed-, and pressure sensor failure and carry specific data indicating failures. They are therefore called failure-dependent. Let all such connections be in a set C, with C = {c_1, ..., c_n}. In the example, each sensor failure signal is either carrying a 1 to indicate that the corresponding sensor signal is fault-free, or carrying a 0 to indicate that a
sensor signal is erroneous. Since such signals may also carry data types other than 1 and 0, e.g., true and false or more complex data types, we call the set of possible configurations for a failure-dependent connection c_i execution scenarios S(c_i), with S(c_i) = {s_1, ..., s_m}. In the example, S(c_i) = {0, 1} for all c_i ∈ C. Each execution scenario s_i of a failure-dependent connection c_j has an associated failure mode fm_i of a fault tree as depicted in figure 2, with FM(s_i) = fm_i. For example, FM(1) = no_TLookup and FM(0) = detected_failure for S(pressure sensor failure) = {0, 1}. Since all combinations of failures are possible in general, e.g., in the example system all three sensor values may be erroneous or none may be so, all possible combinations of execution scenarios have to be considered. Such a combination is called a mode here and all possible combinations for a system are modeled by the set M, with M = {{s_1, ..., s_n} | s_1 ∈ S(c_1), ..., s_n ∈ S(c_n), C = (c_1, ..., c_n), n ∈ ℕ}. Since every execution scenario s_i has an associated failure mode fm_i of a fault tree, the overall probability for each mode m ∈ M can be extracted from Boolean logic by combining them using the Boolean and (in the formula represented by ∧), with
FM(m) = { ⋀_{i=1}^{n} FM(s_i) | m = {s_1, ..., s_n}, m ∈ M, n ∈ ℕ }.
FDTA Tool Chain
In this section, we present a tool chain for Failure-Dependent Timing Analysis (FDTA). First, we present in Paragraph 4.1 the tool chain’s architecture. After that, we describe in Paragraph 4.2 how the previously introduced example system is analyzed. A Failure-Dependent Timing Analysis is performed by associating the failure modes of a safety analysis model with the connections of a system development model. The methodology of modes presented in Section 3 is implemented in this tool chain to automate the process of obtaining failure-dependent execution times. 4.1
Architecture
The architecture of the Failure-Dependent Timing Analysis tool chain is depicted in figure 3. It automates the entire analysis process and is based on Enterprise Architect (EA) (see figure 3:2) [28]. Systems are modeled in Simulink and then imported into EA as SysML models (see figure 3:1 and 3:3) [29]. The process of safety engineering is performed in EA using so-called Component Fault Trees (CFTs) (see figure 3:4). CFTs are a special kind of the widely accepted Fault Trees. They allow modeling the behavior of the system under failure conditions and are used to model quantitative as well as qualitative dependability-related statements for assessing the safety of a system [30]. Failure modes of the CFTs can be associated with connections of the imported Simulink model in a new diagram, the Failure-Dependent Timing Analysis Diagram. The diagram is defined using UML profile mechanisms [31]. In this diagram, the execution scenarios as described in Section 3 can also be set for every connection. The diagram is then evaluated by the tool to calculate all modes as presented in Section 3. For each particular mode, the connections of the Simulink model are then changed to the value provided by the execution scenarios. This results in multiple versions of a system (one for each mode), each with different failure conditions injected (see figure 3:5). Those remodeled systems are used to generate C code, which is compiled for the ARM7 Processor Family [32] using the YAGARTO tool chain [33] (see figures 3:6 and 3:7). The different compiled versions of the former Simulink models are afterwards analyzed regarding their worst-case execution time (WCET) using aiT (see figure 3:8) [34]. The results of the WCET analysis are then related to the previously extracted modes of the FDTA Diagram. For each mode, the probability is calculated using the associated failure modes. To obtain the probabilities, the widely accepted fault tree analysis tool FaultTree+ is used [35]. The resulting execution times along with their probabilities are then depicted in combination with their probabilities of occurring extracted from the safety analysis model. Paragraph 4.2 demonstrates the Failure-Dependent Timing Analysis of the example system.
70
K. H¨ ofig
Matlab Simulink
6
1
7 Real Time Workshop & Compiler Path-Specific Sources & Executables
5 .mdl File
.mdl Files For Specific Execution Paths
8 WCET Analyzer
2 Failure Dependent Timing Analyzer
3
SysML Architecture Model
4 Safety Analysis CFT Model
Fig. 3. Architecture of the Failure-Dependent Timing Analysis (FDTA) Tool Chain
4.2
Analysis Example
To perform a Failure-Dependent Timing Analysis of the system described in Section 4, the system is imported from Simulink into EA and the fault trees for the sensors are modeled as depicted in figure 2 for the MAP sensor. After that, the failure-dependent connections of the imported model are identified and associated with values that reflect a certain execution scenario. Each scenario is then related to a certain failure mode of the fault tree. This is modeled in a additional view, the Failure-Dependent Timing Analysis Diagram, which is depicted for the example system in figure 4. On the leftmost side of this figure, the failure-dependent connections are depicted as double-arrows. These are the connections from the failure demux to the blocks that perform a table lookup as depicted in figure 1. Each connection can either be 1 for the scenario of a
Failure-Dependent Timing Analysis
71
correct sensor value or 0 for the scenario of an erroneous sensor value. The circles in the middle of this picture represent these execution scenarios. Each execution scenario has an associated failure mode of the fault tree, e.g. the execution scenario with the value 0 of the connection between failure demux and MAP table lookup is connected to the failure mode detected failure, since this failure mode corresponds to this value.
Value 1 Speed: no_TLookup Failure_Demux/2-> Corrected_Speed/2
Value 0 Speed: detected_failure Value 1 MAP: no_TLookup
Failure_Demux/3-> Corrected_MAP/2
Value 0 MAP: detected_failure Value 1 Throttle: no_TLookup
Failure_Demux/1-> Corrected_Throttle/2
Value 0 Throttle: detected_failure
Fig. 4. Failure-Dependent Timing Analysis Diagram for the system SensorFaultCorrection
From this diagram, all possible modes can be derived as described in Section 3. For each mode, a new Simulink model is generated, with the corresponding values for all connections injected. These different generated models are then used to generate code, which is compiled and analyzed regarding its WCET using the external static timing analysis tool aiT. After that, the associated failure modes are quantified as described in Section 3. For the probabilistic quantification, we assumed the range check to be 100% reliable and set the failure probability for all false-positive and false-negative failure modes, such as range check false negative and range check false positive as depicted in figure 2, to zero (as is sometimes done for testable software routines, e.g., in [36]). For the failure probabilities of the out-of-range failure modes of the sensors, we decided to use quantifications from standards, since the system analyzed here is an academic example that demonstrates the methodology of FDTA. We set the mean time to failure (MTTF) to two million hours for the MAP sensor, according to the 20PC SMT Honeywell pressure sensor; the MTTF for the throttle sensor to 3767 years and the MTTF for the speed sensor to 114155 hours, both according to DIN EN ISO 13849-1.
Table 1. Analysis results. The first data row indicates the execution time without manipulations.

Speed  MAP  Throttle   Exec. Time (µs)   Probability
  –     –      –         593.633333          –
  0     0      0         600.133333       0.00006
  1     1      1          32.033333       0.44316
  0     1      1         221.400000       0.51145
  1     0      1         221.400000       0.01984
  1     1      0         221.400000       0.00118
  0     0      1         410.766667       0.02290
  0     1      0         410.766667       0.00136
  1     0      0         410.766667       0.00005
The results of the FDT Analysis are depicted in Table 1. The first row of Table 1 shows the WCET for the original system in addition to the results from the FDT Analysis. The other rows show the WCETs for the systems that have values injected; e.g., the system represented by the second data row has all failure-dependent connections set to 0 (all sensor data range checks indicate errors) and a WCET of approx. 600 µs with a probability of 6·10⁻⁵. The difference in WCET between the first and the second row results from the injection of values into the Simulink model: additional code constructs are required that slightly extend the WCET estimate, e.g., to set the connection from failure demux to MAP to 1. In our experiments we measured a small overestimation of about 3 µs for every value that is injected into the model. These experiments are not part of this paper. The results show a strong dependency between execution times and the occurrence of failures. Execution times vary widely over the different modes of the system. The probability that the upper bound guarantee of about 600 µs is actually needed is comparatively low. With a probability of 0.99994, the system will execute within an upper bound of about 410 µs, nearly a one-third reduction compared to the overall WCET value.
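The probabilistic statement above can be reproduced directly from the (WCET, probability) pairs of Table 1. The helper below is a sketch with a hypothetical name; it simply sums the probabilities of the mutually exclusive modes whose WCET lies within the bound.

```python
def prob_within(bound_us, modes):
    """Probability that execution finishes within bound_us, given
    (wcet_us, probability) pairs of mutually exclusive modes."""
    return sum(p for wcet, p in modes if wcet <= bound_us)

modes = [(600.133333, 0.00006), (32.033333, 0.44316),
         (221.400000, 0.51145), (221.400000, 0.01984),
         (221.400000, 0.00118), (410.766667, 0.02290),
         (410.766667, 0.00136), (410.766667, 0.00005)]
print(prob_within(411.0, modes))  # 0.99994, as stated in the text
```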
5
Conclusion and Future Work
In this paper, the methodology of Failure-Dependent Timing Analysis was presented. The methodology allows analyzing a system’s (worst-case) execution times under certain failure conditions. The timely termination of a system can be analyzed to provide a probabilistic guarantee that a task’s deadline miss ratio is below a given threshold.
A tool chain for FDTA was presented that uses elements of safety analysis models and elements of system development models to derive failure-dependent execution times. The tool allows importing Simulink models into Enterprise Architect as SysML models. Using the UML profiling mechanism, connections carrying failure-dependent data can be identified, as can elements of Component Fault Trees that are related to those failure-indicating data. Both types of elements are related in an additional view, the Failure-Dependent Timing Analysis Diagram. Supported by this relation, the tool can automatically evaluate the different execution times and relate them to failure probabilities. Since the worst-case execution times are calculated using static analysis, they represent safe upper bound guarantees for the different failure scenarios. Injecting faults into system development models implies small overestimations compared to the unmodified systems, but the example shows that these are negligible. The use of fault trees as a source of probabilities supports arguing for a reliable probabilistic behavior, since such safety analysis models are already accepted by authorities for quantifying quality attributes like reliability and safety. In future work, the methodology and the tool will be evaluated further in an industrial environment, with special attention being paid to the use of the results for certification purposes. Additionally, the analysis results will be processed to allow deeper-level analyses and to support better graphical evaluation of the results. Since the UML stereotype mechanisms can be applied to many model elements, different safety analysis models, such as Markov chains or Petri nets, can easily be included in the methodology. This is beneficial for analyzing fault tolerance mechanisms with more complex failure behavior.
References 1. Bernat, G., Burns, A., Newby, M.: Probabilistic timing analysis: An approach using copulas. J. Embedded Comput. 1, 179–194 (2005) 2. Diaz, J.L., Garcia, D.F., Kim, K., Lee, C.-G., Lo Bello, L., Lopez, J.M., Min, S.L., Mirabella, O.: Stochastic analysis of periodic real-time systems. In: 23rd IEEE Real-Time Systems Symposium, RTSS 2002, pp. 289–300 (2002) 3. Laprie, J.-C., Arlat, J., Beounes, C., Kanoun, K.: Definition and analysis of hardware- and software-fault-tolerant architectures. Computer 23(7), 39–51 (1990) 4. Arlat, J., Kanoun, K., Laprie, J.-C.: Dependability modeling and evaluation of software fault-tolerant systems. IEEE Transactions on Computers 39(4), 504–513 (1990) 5. Belli, F., Jedrzejowicz, P.: Fault-tolerant programs and their reliability. IEEE Transactions on Reliability 39(2), 184–192 (1990) 6. Pucci, G.: A new approach to the modeling of recovery block structures. IEEE Transactions on Software Engineering 18(2), 159–167 (1992) 7. Dugan, J.B., Doyle, S.A., Patterson-Hine, F.A.: Simple models of hardware and software fault tolerance. In: Proceedings of the Annual Reliability and Maintainability Symposium, January 24-27, pp. 124–129 (1994) 8. Doyle, S.A., Mackey, J.L.: Comparative analysis of two architectural alternatives for the n-version programming (nvp) system. In: Proceedings of the Annual Reliability and Maintainability Symposium, pp. 275–282 (January 1995)
9. Tyrrell, A.M.: Recovery blocks and algorithm-based fault tolerance. In: Proceedings of the 22nd EUROMICRO Conference EUROMICRO 1996. Beyond 2000: Hardware and Software Design Strategies, pp. 292–299, 2-5 (1996) 10. Mok, A., Amerasinghe, P., Chen, M., Tantisirivat, K.: Evaluating tight execution time bounds of programs by annotations. IEEE Real-Time Syst. Newsl. 5(2-3), 81–86 (1989) 11. Lindgren, M., Hansson, H., Thane, H.: Using measurements to derive the worstcase execution time. In: Proceedings of the Seventh International Conference on Real-Time Computing Systems and Applications, pp. 15–22 (2000) 12. Gustafsson, J., Ermedahl, A., Lisper, B.: Towards a flow analysis for embedded system C programs. In: 10th IEEE International Workshop on Object-Oriented Real-Time Dependable Systems, WORDS 2005, pp. 287–297, 2-4 (2005) 13. Wilhelm, R., Engblom, J., Ermedahl, A., Holsti, N., Thesing, S., Whalley, D., Bernat, G., Ferdinand, C., Heckmann, R., Mitra, T., Mueller, F., Puaut, I., Puschner, P., Staschulat, J., Stenstr¨ om, P.: The worst-case execution-time problem—overview of methods and survey of tools. ACM Trans. Embed. Comput. Syst. 7(3), 1–53 (2008) 14. Ferdinand, C.: Worst case execution time prediction by static program analysis. In: Proceedings of the 18th International Parallel and Distributed Processing Symposium, p. 125 (April 2004) 15. Ferdinand, C., Heckmann, R.: aiT: Worst-Case Execution Time Prediction by Static Program Analysis. Building the Information Society 156, 377–383 (2004) 16. Puschner, P., Nossal, R.: Testing the results of static worst-case execution-time analysis. In: Proceedings of the 19th IEEE Real-Time Systems Symposium, pp. 134–143, 2-4 (1998) 17. Wolf, F., Staschulat, J., Ernst, R.: Hybrid cache analysis in running time verification of embedded software. Design Automation for Embedded Systems 7(3), 271–295 (2002) 18. Li, X., Mitra, T., Roychoudhury, A.: Modeling control speculation for timing analysis. Real-Time Syst. 29(1), 27–58 (2005) 19. Burns, A., Edgar, S.: Predicting computation time for advanced processor architectures. In: 12th Euromicro Conference on Real-Time Systems, Euromicro RTS 2000, pp. 89–96 (2000) 20. Burns, A., Edgar, S.: Statistical analysis of WCET for scheduling. In: Proceedings of the 22nd IEEE Real-Time Systems Symposium, pp. 215–224 (December 2001) 21. Griffin, D., Burns, A.: Realism in Statistical Analysis of Worst Case Execution Times. In: Lisper, B. (ed.) 10th International Workshop on Worst-Case Execution Time Analysis (WCET 2010). OpenAccess Series in Informatics (OASIcs), vol. 15, pp. 44–53. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, Dagstuhl, Germany (2010); The printed version of the WCET 2010 proceedings are published by OCG (www.ocg.at) - ISBN 978-3-85403-268-7 22. Bernat, G., Colin, A., Petters, S.M.: WCET Analysis of Probabilistic Hard RealTime Systems. In: Proceedings of the 23rd Real-Time Systems Symposium, RTSS 2002, pp. 279–288 (2002) 23. Bernat, G., Colin, A., Petters, S.: pWCET: A tool for probabilistic worst-case execution time analysis of real-time systems. Technical report, University of York. England UK (2003) 24. David, L., Puaut, I.: Static determination of probabilistic execution times. In: Proceedings of the 16th Euromicro Conference on Real-Time Systems, ECRTS 2004, June-2 July, pp. 223–230 (2004)
25. Perrone, R., Macedo, R., Lima, G., Lima, V.: An approach for estimating execution time probability distributions of component-based real-time systems. Journal of Universal Computer Science 15(11), 2142–2165 (2009), http://www.jucs.org/jucs_15_11/an_approach_for_estimating 26. Lu, Y., Nolte, T., Kraft, J., Norstrom, C.: Statistical-based response-time analysis of systems with execution dependencies between tasks. In: 15th IEEE International Conference on Engineering of Complex Computer Systems (ICECCS), pp. 169–179 (March 2010) 27. Simulink, © 1994–2011 The MathWorks Inc., 3 Apple Hill Drive, Natick, MA 01760-2098, United States of America, http://www.mathworks.de/products/simulink 28. Enterprise Architect, © 2000–2011 Sparx Systems Pty Ltd., Creswick, Victoria, 3363, Australia, http://www.sparxsystems.com.au 29. OMG Systems Modeling Language, © 1997–2011 Object Management Group Inc., 140 Kendrick Street, Building A, Suite 300, Needham, MA 02494, United States of America, http://www.omgsysml.org 30. Kaiser, B., Liggesmeyer, P., Mäckel, O.: A new component concept for fault trees. In: SCS 2003: Proceedings of the 8th Australian Workshop on Safety Critical Systems and Software, pp. 37–46. Australian Computer Society, Inc., Darlinghurst (2003) 31. OMG: A UML Profile for MARTE: Modeling and Analysis of Real-Time Embedded Systems, Beta 2. Object Management Group (July 2009), http://omgmarte.org, OMG Document Number: ptc/2008-06-09 32. ARM7, © 2011 ARM Ltd., Equiniti Aspect House, Spencer Road, Lancing BN99 6DA, United Kingdom, http://www.arm.com/products/processors/classic/arm7 33. YAGARTO, Yet another GNU ARM toolchain, Michael Fischer, Faustmuehlenweg 11, 34253 Lohfelden, Germany, http://www.yagarto.de/imprint.html 34. aiT Worst-Case Execution Time Analyzers, © 1998–2011 AbsInt Angewandte Informatik GmbH, Science Park 1, 66123 Saarbruecken, Germany, http://www.absint.com/ait 35. FaultTree+, © 1986–2011 Isograph Ltd., 2020 Main Street, Suite 1180, Irvine, CA 92614, United States of America, http://www.isograph-software.com/ftpover.htm 36. DO-178B: Software Considerations in Airborne Systems and Equipment Certification. Radio Technical Commission for Aeronautics (1991)
A Calculus for SLA Delay Properties

Sebastian Vastag

Technische Universität Dortmund, Informatik IV, D-44221 Dortmund, Germany
[email protected]

Abstract. Service providers in Service-Oriented Architectures (SOA) often specify system performance values with the help of Service Level Agreements (SLAs) that do not reveal details of how the system realizes its services. Analytic modeling of SOA to estimate performance values is thus made difficult by the lack of knowledge of service rates: service components are characterized only by the quantitative requirements in SLAs, which most modeling methods do not support. We propose a calculus to model and evaluate SOA with quantitative properties described in SLAs. Instead of defining a system by its service capacity, we use flexible constraints on delays as found in SLAs. From these delay constraints, approximate service rates sufficient to fulfill them are derived. Keywords: Network Calculus, SOA, SLA.
1
Introduction
Service-Oriented Architectures (SOAs) are based on the idea that processing functions of software systems can be offered as services accessible over the net. Services can be composed of other services and may form hierarchies [1]. A common implementation of SOA are Web services [1]. With cloud computing as an emerging system structure, users can no longer distinguish between local, remote, and composed services [2]. Service performance is unknown to users and not measurable unless one is a contracted customer. This can result in situations where users of a service do not obtain the required performance and availability of service components. To avoid such shortages, requirements for quantitative measures like reliability and response times are laid down in Service Level Agreements (SLAs) [3, 4]. SLAs are a contract between user and service provider: when the former agrees to limit the workload submitted to the system, the latter is able to guarantee a certain level of service. SLAs can be issued by service providers as an offer or by customers as a requirement. Challenging for system modeling including SLAs is that the performance of the service, or more specifically, the processing rate for requests, is unknown. The only available model parameters are upper bounds for the workload and response times as defined in the SLAs. For analytical models, which are often used for capacity planning and validation, the missing service rate leaves a gap.
This research was supported by the Deutsche Forschungsgemeinschaft (DFG).
Example 1. A university plans to offer a central literature database using a Web service. The service can answer queries by author or year so researchers can include a list of their own papers on their homepage. The service shall be hosted by an external provider. The university formulates an SLA. Service description, query and output format as well as contract duration and pricing are functional SLA properties. Of course, fast service response times are desirable. The SLA requires the Web service to respond within 2 seconds; this is a quantitative (nonfunctional) SLA property.

In addition to unknown processing rates, modeling and validation of SLAs in SOAs has several characteristics differing from other modeling domains. Ideally SLAs define boundaries for performance numbers that should not be undercut or exceeded. However, a system communicating over a network does not operate in a completely controlled environment and can be influenced by external network traffic and other factors out of the control of service user and provider. Hence quantitative requirements in SLAs often have to include tolerances. Especially for delays, a hard deadline as found in embedded systems would often be violated. Nondeterministic methods like queueing theory allow tolerances in models by choosing appropriate distributions and can be used to compute mean performance values in steady state [5]. In our opinion, the use of mean values to validate SLAs in SOA introduces several problems: mean values do not indicate whether SLA limits are violated. Furthermore, there is no option to take into account startup phases or short service request bursts.

Example 2. All literature database providers offer their services over the Internet. Even if the maximum response time of 2 seconds could be delivered, no provider agrees on a hard deadline. Reliable transmission rates over the public network cannot be guaranteed, so no one will include a fixed number in a contract. Therefore the university has to relax its nonfunctional requirements on delay times.

As a consequence, modeling for SOA with SLAs has to consider on the one hand limits for quantitative requirements that a system should meet, and on the other hand has to allow flexibility to transform hard deadlines into soft ones. Modeling nonfunctional properties in SLAs thus raises several open questions: How to build analytical models for systems when only SLAs for components are known? How to model quantitative requirements in SLAs? How to determine performance bounds of a SOA composed of services that feature SLAs? And, from the viewpoint of service providers: can an SLA be fulfilled? In this paper we extend our approach [6] to model requirements on delays in SLAs by introducing the recently developed SLA Calculus. – SLA Calculus is a deterministic calculus for quantitative requirements in SLAs under worst-case assumptions. It uses a subset of (min,+) system theory [7–10] to form delay properties for services. This reflects the situation in SOA modeling: service performance is often unknown, and only SLAs specifying bounds on request arrival rates and delays are available for modeling.
– Service demand to a system, with arrival rates and short-term bursts, is described by arrival curves as common in Network Calculus [9, 10]. The delay occurring at a service provider is likewise captured by curves limiting a long-term delay rate while allowing short phases of lower system performance and longer delay. – Arrivals, delay and service in systems are related. Network Calculus is used unidirectionally to derive delays from arrivals and service. We provide a method to derive the required service rates to fulfill an SLA containing bounds on demand and delay. The output of our method are service curves as found in Network Calculus, closing the circle. – Our model includes elements for abstract service providers and quantitative requirements in SLAs. The service provider model contains the structure of a SOA and the arising interdependency of SLAs. This allows us to reason about performance values of composed systems. To define delay curves we will use the time difference between input and output of a system. These values can be derived by measurement or, for worst-case analysis, with Network Calculus. This paper extends previous results published in [6] in several directions. In our previous approach [6], the derivation of a service curve was based on an optimization problem. By giving an analytic method based on (min,+) system theory, optimization algorithms become redundant and faster implementations are possible. Further contributions of this work are an improved model for SOA with quantitative requirements in SLAs. SLA Delay Properties are formulated and their concatenation in workflows is discussed. 1.1
Related Work
For analyzing models of Service-Oriented Architectures, simulations can be used. In [11] a simulation framework for SOA was proposed. Model analysis gives performance numbers but does not include the description of SLAs. Based on a process chain modeling language, SOA models are analyzed in [4, 12]. Timeouts are included as hard deadlines for service calls and quantitative requirements are considered. Serious disadvantages of simulation are the high effort for model generation and parametrization and the computational effort of simulation runs. The classical approach to analytical system analysis uses (extended) queueing theory. It has been shown to be suitable for efficient analysis of computer systems [5] or SOA [13]. [14] also includes functional properties of SLAs for system modeling with queueing systems. SLAs include limits on the system load and the system performance to deliver. The way boundaries are chosen in SLAs is similar to descriptions used in computer networks or real-time systems. A preferred tool here is Network Calculus [9, 10] to obtain deterministic bounds in queueing systems. It uses (min,+)-algebra [9, 10, 15] to set up a filtering theory for flows in networks and to derive worst-case performance bounds. Network Calculus was successfully used for computer network analysis [10, 16] and software tools implementing the approach are available [17, 18]. Many extensions [19, 20] as well as different
fields of application apart from data networks have been developed [21–23]. One of the most advanced derivatives is the Real-Time Calculus [24, 25], bringing the ideas of Network Calculus to the analysis of real-time systems. Although queueing theory gives mean values and Network Calculus works with worst-case scenarios, both can describe the same systems [26] from different viewpoints. Not much is known about how to combine both techniques. The Stochastic Network Calculus [27], for example, combines distributions for rate modeling with (min,+) algebra. Although SLAs can be represented with Network Calculus, it is rarely used to analyze SOA. Attempts were made in [28] and in our own work [6]. 1.2
Outline
The next section gives a short introduction to (min,+) algebra. We extend the calculus with delay curves in section 3. A model for systems with SLAs featuring quantitative requirements is presented in section 4. It allows us to derive new SLAs when system nodes are combined into more complex systems. We consider the serial concatenation of services in this paper. The inverse application of delay curves to derive a service curve is discussed in section 5. Finally, an outlook on future work is given in section 6.
2
(min,+) Calculus
This work is based on Minplus or (min,+) algebra. (min,+) uses the minimum function as addition and replaces multiplication with addition. In this notation, (+, ·) becomes (min, +). The operators min() and + form a dioid [10] with ∞ as the neutral element of addition and 0 as the neutral element of multiplication. An extensive overview of the theory can be found in the book by Baccelli et al. [15]. Key elements for the linear time-invariant filtering theory in Network Calculus are the (min,+) equivalents of convolution and deconvolution of functions (cf. [10]).

Definition 1 (Wide-sense increasing functions). A function is wide-sense increasing if and only if f(a) ≤ f(b) for all a ≤ b. F is the set of wide-sense increasing functions with f(t) ≥ 0 ∀t, f ∈ F. F0 is the subset of F with functions that are zero for t < 0.

Definition 2 ((min,+) (de-)convolution). Let f and g be two functions or sequences in F0. The (min,+) convolution of f and g (notation f ⊗ g) is the function

(f ⊗ g)(t) = inf_{0 ≤ s ≤ t} {f(t − s) + g(s)}

If t < 0: (f ⊗ g)(t) = 0. The dual operation to ⊗ in (min,+) is deconvolution (notation f ⊘ g):

(f ⊘ g)(t) = sup_{s ≥ 0} {f(t + s) − g(s)}
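For experimentation, both operators are easy to state on sampled functions. The following sketch assumes unit time steps and a finite horizon (the inf/sup are truncated to the sampled range); it is an illustration, not part of any of the cited tools.

```python
def min_plus_conv(f, g):
    """(f ⊗ g)(t) = inf_{0<=s<=t} f(t-s) + g(s) on lists sampled
    at t = 0, 1, 2, ..."""
    T = min(len(f), len(g))
    return [min(f[t - s] + g[s] for s in range(t + 1)) for t in range(T)]

def min_plus_deconv(f, g):
    """(f ⊘ g)(t) = sup_{s>=0} f(t+s) - g(s), with the supremum
    truncated to the available horizon."""
    T = min(len(f), len(g))
    return [max(f[t + s] - g[s] for s in range(T - t)) for t in range(T)]

R = [0, 3, 5, 6, 7, 8]         # a cumulative arrival function
beta = [0, 0, 2, 4, 6, 8]      # a rate-latency style service curve
print(min_plus_conv(R, beta))  # lower bound on the output flow
```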
Fig. 1. Arrival flow r(t), arrival function R(t) and arrival curve α(t)
Fig. 2. Horizontal and vertical deviation of R(t) and R∗ (t)
(min,+)-convolution is associative, commutative, distributive with respect to min(), and closed in F0. Again, (F, min, ⊗) is a dioid [9, 19]. For additional properties we refer to chapter 3 in [10]. 2.1
Arrival and Service Curves
In Network Calculus there are two main sets of functions to represent arrivals to a system over time and the service available to process them. Arrivals to a system are measured in bits, packets or any other quantitative unit used to describe data transmissions. Let r(t) be the number of arrivals of an arrival process at time t.

Definition 3 (Arrival Function). Arrival function R(t) is the cumulative sum of arrivals in the time interval [0, t], thus R(t) = ∫_0^t r(x) dx.

R(t) is continuous, wide-sense increasing and R(t) = 0 for t ≤ 0, thus R(t) ∈ F0. To characterize arrival flows and to set bounds on arrival functions, Network Calculus abstracts arrival functions with functions called arrival curves conforming to the arrival curve property [24].

Definition 4 (Arrival Curve). A function αU ∈ F is an upper arrival curve for arrival function R(t) iff R ≤ R ⊗ αU. A lower curve αL is given by R ≥ R ⊗ αL.

Arrival curve αU limits R from above while αL is a lower boundary. When an arrival flow R is processed in a system it will leave as outgoing flow R∗. In general R ≥ R∗ holds and R∗ fulfills the arrival curve property; hence it is often referred to as the outgoing arrival flow. Figure 2 shows both flows. R∗ can be obtained by measurements or derived via a second system property: analogous to arrivals, the processing resources a system can offer to its arrival flow at time t are given by a function b(t). It should be noted that b(t) is the service the system is able to offer, although r(t) < b(t) will usually hold.

Definition 5 (Service Function). Service function C(t) is the cumulative sum of service a system can deliver in the time interval [0, t], thus C(t) = ∫_0^t b(x) dx.
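On sampled cumulative flows, the arrival curve property of Definition 4 can be checked directly, since R ≤ R ⊗ αU is equivalent to R(t) − R(s) ≤ αU(t − s) for all 0 ≤ s ≤ t. A minimal sketch with invented numbers:

```python
def is_upper_arrival_curve(alpha, R):
    """Def. 4 on sampled, cumulative data: every window of length
    t - s may carry at most alpha(t - s) arrivals."""
    return all(R[t] - R[s] <= alpha[t - s]
               for t in range(len(R)) for s in range(t + 1))

R = [0, 3, 4, 6, 8, 9]        # cumulative arrivals per time slot
alpha = [0, 5, 7, 9, 11, 13]  # leaky-bucket shape: rate 2, burst 3
print(is_upper_arrival_curve(alpha, R))  # True
```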
Abstraction from arbitrary service functions is done with minimum and maximum functions satisfying the service curve property: Definition 6 (Service Curve). An upper service curve β U or a lower service curve β L for a service function C(t) is given by the relation: R∗ ≥ R ⊗ β L and R∗ ≤ R ⊗ β U
(1)
Definition 7 (Horizontal Deviation). Let f, g ∈ F be two functions. The horizontal distance between both in t is δ(f, g)(t) = inf {τ ≥ 0 : f (t) ≤ g(t + τ )}
(2)
Network Calculus uses the horizontal deviation (Fig. 2) to find the maximum system latency [9, 10]. A flow with arrival curve α processed in a system offering service curve β has a maximum delay of h = sup_t {δ(α, β)(t)}. The vertical deviation α(t) − β(t) gives the backlog (buffer content) of the flow (Fig. 2).
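A direct transcription of Definition 7 and of the latency bound h on sampled curves might look as follows; truncating the search to the sampled horizon is an assumption of this sketch.

```python
def horizontal_deviation(f, g, t):
    """delta(f, g)(t) = inf{tau >= 0 : f(t) <= g(t + tau)} (Def. 7);
    None if g never catches up within the horizon."""
    for tau in range(len(g) - t):
        if f[t] <= g[t + tau]:
            return tau
    return None

def max_delay(alpha, beta):
    """h = sup_t delta(alpha, beta)(t): worst-case latency of a flow
    with arrival curve alpha served with service curve beta."""
    devs = [horizontal_deviation(alpha, beta, t) for t in range(len(alpha))]
    return None if any(d is None for d in devs) else max(devs)

alpha = [0, 5, 7, 9, 11, 13]
beta = [0, 0, 4, 8, 12, 16]
print(max_delay(alpha, beta))  # 2 time slots for these sample curves
```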
3
Delay Curves
Next to arrival and service curves, a third curve variant was introduced in [6]. Delay curves limit the delay of a flow traversing a system. In the same way as arrival functions are based on the cumulated number of arrivals, delay functions are an expression of the delay in a time interval. Definition 8 (System Delay). The delay between input flow R and output flow R∗ at time t is the horizontal deviation (Def. 7) between both functions: d(R, R∗)(t) = δ(R, R∗)(t) = inf {τ ≥ 0 : R(t) ≤ R∗(t + τ)}
(3)
The unit of delay is a unit of time. Delays in a time interval can be seen as a delay flow similar to arrival flows. To describe delay flows in a time interval we use delay functions.

Definition 9 (Delay Function). Let d(R, R∗)(t) be the delay between an arrival flow and a departure flow. Delay function D(t) is the cumulative sum of delays in the time interval [0, t]:

D(t) = ∫_0^t d(R, R∗)(x) dx    (4)
D(t) ∈ F since D(t) = 0 for t ≤ 0 and D(t) is wide-sense increasing. Thus D(t) features the same properties as arrival functions and can be described with similar algebraic methods. Example 3. Time passes between the arrival of a request at the exemplary literature Web service and the moment a reply is delivered. The time each request waits is measured and written to the system log. The log file can be seen as a flow of delay, analogous to traces of arrivals and departures. When we sum up the delay flow in a time interval we obtain a bound on the cumulated waiting times in this interval.
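On a logged trace, the delay function of Definition 9 is simply the running sum of the per-slot delays; the log values below are invented for illustration.

```python
from itertools import accumulate

delay_flow = [0.4, 1.1, 0.0, 2.3, 0.8, 0.2]  # hypothetical d(t) per slot (s)

# D(t): cumulative delay in [0, t], the discrete counterpart of
# integrating the delay flow (Def. 9).
D = list(accumulate(delay_flow))
print(D)  # approximately [0.4, 1.5, 1.5, 3.8, 4.6, 4.8]
```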
Fig. 3. Service provider with SLA Delay Property
Fig. 4. Concatenation of service providers with resulting SLA Delay Property
The last step is to define delay curves with a delay curve property. Definition 10 (Delay Curve). A lower delay curve Ψ L and an upper delay curve Ψ U for a delay function D(t) satisfy the relations D ≥ D ⊗ Ψ L and D ≤ D ⊗ Ψ U
(5)
This is equivalent to Ψ L(t − s) ≤ D(t) − D(s) ≤ Ψ U(t − s) ∀ 0 ≤ s ≤ t. Ψ L(t) and Ψ U(t) are lower and upper bounds on the delays occurring in the interval [0, t].
4
SLA Calculus System Model
This section introduces an abstract system model for SOA systems with quantitative requirements specified in SLAs. It includes descriptions of task arrivals to service providers and of delay constraints that still allow flexibility. SLA Delay Properties providing the scope for valid agreements will be defined, and we will discuss combinations of several delay properties. 4.1
Requests to Systems
The basic system model in Network Calculus and Real-Time Calculus is similar to queueing systems. Workload arrives at a system and awaits service; after processing it leaves (figure 3). Workload can be computational tasks, customers, data packets or anything else the model specifies. For SLA Calculus we use a similar interpretation of workload: jobs or tasks sent to a service, as they occur in SOA, are pooled under the term requests. A service request is the invocation of a service including transmission of input data, processing, and sending the result. Triggering a Web service, an implementation of the SOA paradigm, is an example. Requests are discrete arrivals to systems, weighted by their request size or processing complexity [4]. We will use a fluid model of requests with a continuous time domain and abstract discrete jobs into a request flow. A request flow is the workload a customer requests from a service. Function R(t) is the cumulated number of service requests that arrived in the interval [0, t].
Definition 11 (Arrival Curve for Request Flow). A request flow to a system is limited by function αU(t) if αU(t) is an upper arrival curve for the request function R(t) (see Def. 4). Arrival curves for request flows are a limitation on the usage of services. They provide a bound from above on the workload that is sent to service providers. From the perspective of SLAs it is the customer's part of the contract. 4.2
SLA Delay Properties
SLAs for SOA can contain various aspects in different definition formats. In this work nonfunctional properties are considered, especially delay and timing. Description languages and system management [29] are out of the scope of this paper, so we focus on delay properties as a fraction of an SLA description. Delay curves limit the cumulated delay within a time interval. Such a curve has no expressiveness without a limit on arrivals: on the one hand, a node might easily fulfill a delay curve Ψ U if there is only one arrival; on the other hand, no node can fulfill a delay curve if the arrival rate exceeds the processing rate for a long time. In consequence, SLA Delay Properties include boundaries for the customer side and for the provider part. They are valid under the condition that customers use the service according to the agreement. As SLA Delay Properties only virtually regulate request flows, service users are free to produce workloads above the limits, but in those cases they cannot claim any guarantees on delays.

Definition 12 (SLA Delay Property). An SLA Delay Property is a set SLA = {αL, αU, Ψ L, Ψ U} with αL, αU ∈ F0 and Ψ U, Ψ L ∈ F0, αL ≤ αU, Ψ L ≤ Ψ U and αU(t), Ψ U(t) > 0 ∀ t > 0.

The definition implies αL = 0 as the default value when lower envelopes of arrival flows are not known or unnecessary. The same works for Ψ L = 0, since the majority of service demands do not require a minimum processing time. However, knowledge of a lower envelope for processing times can avoid excessive reservation of resources in system capacity planning. For the remainder of the paper we use {0, αU, 0, Ψ U} = {αU, Ψ U}. To instantiate SLA Delay Properties in SLA Calculus, request arrival and delay curves can use the same function set f ∈ F0; they only differ in their quantitative parameters. Piecewise linear functions are most convenient for the description of upper bounds within a time interval. Affine functions γr,b(t) = max(0, rt + b) match leaky buckets with rate r and burst size b [10]. In Network Calculus, T-SPEC traffic specifications for computer networks are frequently used. They combine two leaky buckets, T-SPEC(p, M, r, b)(t) = min(γp,M(t), γr,b(t)), with M as maximum packet size and p as peak rate. For delay curves in SLA Calculus we use a different interpretation, with delay time instead of packets. Due to the fluid model we set M = 0. r is the maximum long-term delay rate, b adds flexibility for variations in the request processing rate, and p is a higher delay rate for when the service provider slows down for a short time.
Example 4. We are going to relate an SLA Delay Property to the literature Web service. The service should be able to process requests at an average rate of 1.5 requests per second. We also know that sometimes user demand increases for a short time to over 4 requests per second. To include this information on rate and safety margin we formulate a request arrival curve using the T-SPEC pattern: αU = T-SPEC(4.0, 0, 1.5, 10). For the delay we use the delay rate of 2 seconds per request. However, not expecting realtime performance from a Web service, we also grant a burst in the delay flow to relax our performance requirements: during a burst of 40 seconds in the delay flow, a request may take up to 5.0 seconds. We formulate this information as a delay curve: Ψ U = T-SPEC(5.0, 0, 2, 40). Figure 6 shows both curves. 4.3
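The two curves of this example are easy to instantiate. The sketch below encodes the leaky-bucket and T-SPEC patterns exactly as defined above (fluid model, M = 0); only the helper names are invented.

```python
def leaky_bucket(r, b):
    """gamma_{r,b}(t) = max(0, r*t + b) for t > 0, and 0 for t <= 0."""
    return lambda t: max(0.0, r * t + b) if t > 0 else 0.0

def t_spec(p, M, r, b):
    """T-SPEC(p, M, r, b)(t) = min(gamma_{p,M}(t), gamma_{r,b}(t))."""
    g1, g2 = leaky_bucket(p, M), leaky_bucket(r, b)
    return lambda t: min(g1(t), g2(t))

alpha_U = t_spec(4.0, 0, 1.5, 10)  # request arrival curve of Example 4
psi_U = t_spec(5.0, 0, 2, 40)      # delay curve of Example 4
print(alpha_U(20), psi_U(20))      # 40.0 80.0
```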
Service Provider
Request flows are served by service providers.

Definition 13 (Service Provider). A service provider S accepts a request flow R(t), processes the requests and produces an outgoing request flow R∗(t). In general R(t) ≥ R∗(t) ∀t. While R traverses S, a delay flow of rate d(t) = δ(R, R∗)(t) is generated by S.

Figure 3 contains the basic system model. Arrivals (R(t)) enter the system from the left and leave after they have been served to the right (R∗(t)). The emitted delay flow is represented by the delay function D(t).

Definition 14 (Service Provider with SLA Delay Property). Let {αU, Ψ U} be an SLA Delay Property. A service provider conforms to {αU, Ψ U} if D ≤ D ⊗ Ψ U under the condition R ≤ R ⊗ αU.

Ψ U is a shaping curve for D. The delay flow is bounded similarly to a maximal f-source in [9], but without any input traffic or delay. The precondition R ≤ R ⊗ αU is the user's part of an SLA contract; it limits the request arrivals. An assignment of an SLA Delay Property {αU, Ψ U} to a service is depicted in figure 3. 4.4
Service Provider Structure
The system structure in SOA is configured by the ordering of tasks that have to be processed. In Web services, a common implementation of SOA, the ordering of tasks is called a workflow. A workflow controller invokes Web services to execute workflows, a process which is called orchestration [1]. There has also been extensive work on the optimal selection of Web services for workflows [14, 28]. As service orchestration and selection are out of the scope of this paper, we consider workflows as predetermined. Therefore the workflow controller is omitted in our model. The sequence of service requests in a workflow is mapped to a feed-forward network [9] of service providers. A workflow requires the invocation of service Si+1 after service Si. Then, in the SLA Calculus system model, the outgoing arrival flow Ri∗ of service Si is fed into service Si+1 as an arrival flow. Figure 4 shows a construction blueprint. A similar model is proposed in [28].
4.5
Combination of SLA Delay Properties
In the following we will reason about SLA Delay Properties for workflows composed of more than one service request. SOA systems can be designed in service hierarchies, with workflows presenting themselves as single services to other services. When SLA Delay Properties for services are known, requested properties for higher hierarchy levels can be formulated.

Lemma 1 (Delay Function Concatenation). Workflow W is composed sequentially from services Si, i = 1 . . . n. Each service emits a cumulated delay flow Di. The delay function of workflow W is given by

D(t) = ∑_{i=1}^{n} Di(t)    (6)
A request arrival flow traversing the system defined by a workflow passes each service. When a service processes a request it emits a delay flow as defined in the system model for service providers (Def. 13). Per definition a delay flow is a kind of arrival flow of equal time units, thus blind multiplexing can be applied. The output of an ideal multiplexer with inputs Ai, i = 1 . . . n satisfies A(t) = ∑_{i=1}^{n} Ai(t) [9]. Figure 4 illustrates the principle. With the help of lemma 1 we can formulate the concatenation of SLA Delay Properties.

Theorem 1 (SLA Delay Property Concatenation). Workflow W is composed from services Si, i = 1 . . . n with associated SLA Delay Properties pi = {αiU, ΨiU}. The composed SLA Delay Property p̄ for workflow W has a request arrival curve and an upper delay curve given by

αU(t) = (α1U ⊗ · · · ⊗ αnU)(t) and Ψ U(t) = ∑_{i=1}^{n} ΨiU(t)    (7)
Proof. R is the request flow processed by workflow W. Each request arrival curve in an SLA Delay Property acts as a constraint on R (Def. 11). In [9] this constraint is known as a maximal f-regulator with f as the limiting function. It limits R traversing the regulator to B1 = R ⊗ f1. When B1 is fed into a second f2-regulator, the arrival flow is limited to B = B1 ⊗ f2 = (R ⊗ f1) ⊗ f2 = R ⊗ (f1 ⊗ f2), using the associativity of ⊗. Repeating the concatenation for f1 . . . fn leads to Bn = R ⊗ (f1 ⊗ · · · ⊗ fn). Setting αiU = fi completes the first part of the proof. Delay curves are an upper bound for the delay flow emitted by each service (Def. 10). The second part of the theorem follows from the delay curve property (Def. 10) and lemma 1.
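On sampled curves, Theorem 1 can be applied mechanically. The sketch below reuses min_plus_conv from the section 2 sketch and follows the proof: the composed arrival curve is the (min,+) convolution of the per-service arrival curves, and the composed delay curve is the pointwise sum of the per-service delay curves.

```python
from functools import reduce

def concatenate_sla(props):
    """props: list of (alpha_U, psi_U) pairs, each a sampled curve.
    Returns the composed SLA Delay Property of a serial workflow."""
    alphas = [a for a, _ in props]
    psis = [p for _, p in props]
    alpha = reduce(min_plus_conv, alphas)              # alpha_1 ⊗ ... ⊗ alpha_n
    T = min(len(alpha), *(len(p) for p in psis))
    psi = [sum(p[t] for p in psis) for t in range(T)]  # sum of delay curves
    return alpha[:T], psi
```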
This brings up the following:

Corollary 1. Let W be a workflow with serial requests to services S1 . . . Sn. Workflow W is delay constrained by ΨWU if

DW ≤ DW ⊗ ΨWU with DW = ∑_{i=1}^{n} Di    (8)
5
Lower Service Curves for Delay Properties
When an SLA Delay Property for a single service is known or has been derived for composed workflows, a natural step is to determine a system that fulfills the property. This is a common task in capacity planning [14] or for validating whether a workflow can be executed in time on an existing system. In terms of Network Calculus this equals the derivation of a service curve β L from the SLA property {αU, Ψ U}. Let (R, R∗) be a pair of arrival and departure functions of a system. Then the problem of service curve estimation [30, 31] is to find a maximal solution for a lower service curve β L with R∗(t) ≥ (R ⊗ β L)(t) ∀t ≥ 0. The lower service curve β L describes the lower boundary of processing speed a system has to deliver to observe the SLA Delay Property. Service curves were derived from given αU and Ψ U in [6]. The method supports T-SPEC type arrival and delay curves and derives a rate-latency service curve. A flaw of the method proposed in [6] was the use of optimization to match the service rate. In the following we generalize the derivation of a service curve by dropping the limitation to T-SPEC curves. We will use an analytic method and remove the need for an optimizer. Our method is based on recent results for bandwidth estimation in networks. An approach based on (min,+) deconvolution was used in [30, 31] to estimate the bandwidth in the form of service curves for network links based on measurements. Deconvolution (Def. 2) is the dual operation to (min,+) convolution. In general, deconvolution does not invert convolution [10, 31]: f ≠ (f ⊗ g) ⊘ g. For system curves with f = β L and g = αU one cannot completely reconstruct service curves. However, Liebeherr et al. [31] showed that deconvolution indeed gives sufficient estimates for lower service curves when applied to input/output functions. Their estimation goal was to deduce an unknown lower service curve C(t) from measured cumulated arrival and departure flows R and R∗ with R∗ ≥ (R ⊗ C)(t) for all pairs of R, R∗. Deconvolution gives C̃ = R∗ ⊘ R with C̃ ≤ C and R∗(t) = (R ⊗ C̃)(t). So C̃ is a service curve that reconstructs the departure curve and can replace the unknown curve C. The bandwidth estimation in [31] is based on traces of real systems and has to assume linear and time-invariant systems. As most systems have a nonlinear input/output behavior, deconvolution still has limited use there and is replaced by other estimation schemes. However, service curve estimation for SLA Delay Properties is a special case where deconvolution can be applied, since the available information on systems are the SLA Delay Properties {αU, Ψ U} they conform to. They have a simpler structure than the time-varying measurements used in [31]. We use αU as a worst-case assumption for the input and construct a worst-case assumption on the output based on αU and Ψ U. To find a service curve that fulfills an SLA Delay Property, two steps have to be performed (see the sketch following Theorem 2):

1. For an SLA Delay Property {αU, Ψ U}, find a maximum output function R∗ with Ψ U(t) ≥ D(αU, R∗)(t).
2. Use (min,+)-deconvolution to derive a lower service curve: β L = R∗ ⊘ αU.
When defining Ψ U with piecewise linear functions like T-SPEC functions, again some difficulties arise. A delay function is a cumulative sum and thus based on integration (see Def. 10). To get the actual delay d(t), Ψ U(t) has to be differentiated, but at the intersections of two linear pieces the derivative is not continuous, and problems arise in the derivation. We use the notation d/dt f(t) = rt(f, t) for the derivative. The following theorem introduces a valid output flow for an SLA Delay Property.

Theorem 2. Let s = {αU, Ψ U} be an SLA Delay Property, where αU and Ψ U are concave. A valid output flow function B(t) with Ψ U ≥ D(αU, B) is given by

B(t) = αU(t − rt(Ψ U, t))    (9)
Proof. To prove that B(t) is an outgoing arrival flow that complies with SLA Delay Property s = {αU, Ψ U} we have to show that D(t) = ∫_0^t δ(αU, B)(x) dx ≤ Ψ U(t) for all t ≥ 0. We use rt(D, t) = d(t). By Def. 7,

rt(Ψ U, t) ≥ d(t) = inf{τ ≥ 0 : αU(t) ≤ B(t + τ)}
                  = inf{τ + t ≥ 0 : αU(t) ≤ B(t + τ)} − t      (substituting Δ = t + τ)
                  = inf{Δ ≥ 0 : αU(t) ≤ B(Δ)} − t    (10)

This enables Δ as control variable and gives a simplified notation. Arrival function αU is wide-sense increasing, so Δ ≥ t holds. Delay curve Ψ U is concave, so the delay rate between t and Δ is constant or decreasing. A constant rate is also found in the piecewise-linear functions often used for approximation. First we consider the case of a constant delay rate, rt(Ψ U, t) = rt(Ψ U, Δ):

d(t) = inf{Δ ≥ 0 : αU(t) ≤ αU(Δ − rt(Ψ U, Δ))} − t
     = inf{Δ ≥ 0 : αU(t) ≤ αU(Δ − rt(Ψ U, t))} − t
     = t + rt(Ψ U, t) − t = rt(Ψ U, t)

Now the case when the rate decreases, i.e. rt(Ψ U, Δ) < rt(Ψ U, t). Let r∗ be the value of rt(Ψ U, Δ):

d(t) = inf{Δ ≥ 0 : αU(t) ≤ αU(Δ − r∗)} − t = t + r∗ − t = r∗ < rt(Ψ U, t)

In both cases d(t) ≤ rt(Ψ U, t), which completes the proof.
The estimated bound for the output flow B(t) is not sharp. A limitation of function B(t) is that at time t only arrival and delay rate are known. Better bounds can be computed if B(t) is estimated with the knowledge of earlier delay rates. Corresponding methods will be subject to future research.
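The two-step derivation — building the worst-case output function of Theorem 2 and then deconvolving — can be sketched as follows, reusing min_plus_deconv from the section 2 sketch. The forward difference standing in for the delay rate rt(Ψ U, ·) and the clamped sampling of αU are simplifications of this sketch, not part of the analytic method.

```python
def derive_service_curve(alpha_U, psi_U):
    """Step 1: B(t) = alpha_U(t - rt(psi_U, t))  (Theorem 2).
    Step 2: beta_L = B ⊘ alpha_U via (min,+) deconvolution.
    Both curves are lists sampled at unit time steps."""
    T = min(len(alpha_U), len(psi_U))

    def rate(t):              # forward difference approximates rt(psi_U, t)
        return psi_U[min(t + 1, T - 1)] - psi_U[t]

    def alpha_at(x):          # clamped sampling of alpha_U
        if x <= 0:
            return 0.0
        return alpha_U[min(int(x), T - 1)]

    B = [alpha_at(t - rate(t)) for t in range(T)]
    return min_plus_deconv(B, alpha_U[:T])
```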
Fig. 5. Cases 1 and 2 for Theorem 2
Example 5. A service provider was asked by the university to host the literature Web service according to the SLA Delay Property. It receives the SLA Delay Property and computes the necessary service capacity by estimating an output function and applying deconvolution. The resulting lower service curve is the dashed line in figure 6.
Fig. 6. Arrival curve, delay curve and derived service curve for Web service example
6
Conclusions and Future Work
In this paper we formulate a calculus based on (min,+) filtering theory to reason about quantitative delay properties in SLAs. SLA Calculus can be used for SOA
performance modeling when only SLAs for services, instead of service rates, are known. The concept of delay flows allows one to limit service delay with delay curves and to model delay requirements in SLAs including a quantifiable amount of flexibility. Delay curves are paired with arrival curves to form SLA Delay Properties. For building SOA models, SLA Calculus can be used to reason about SLA Delay Properties in composed workflows of multiple services. Results for serial compositions are shown. Additionally, we complete the triplet of curves by deriving service curves from SLA Delay Properties. Future research will consider parallel compositions of services and their combined SLA Delay Properties. As each service in parallel is forced by delay curves to deliver some service, we expect better worst-case estimates than in queueing theory. Another interesting question is how delay curves of systems with serial composition are subject to the “pay burst only once” principle of service curves.
References

1. Peltz, C.: Web services orchestration and choreography. Computer, 46–52 (2003) 2. Hayes, B.: Cloud computing. Communications of the ACM 51, 9–11 (2008) 3. Trienekens, J.J.M., Bouman, J.J., van der Zwan, M.: Specification of Service Level Agreements: Problems, Principles and Practices. Software Quality Journal 12, 43–57 (2004) 4. Bause, F., Buchholz, P., Kriege, J., Vastag, S.: Simulation Based Validation of Quantitative Requirements in Service Oriented Architectures. In: Rossetti, M.D., Hill, R.R., Johansson, B., Dunkin, A., Ingalls, R.G. (eds.) Proceedings of the 2009 Winter Simulation Conference, pp. 1015–1026. IEEE (2009) 5. Menasce, D., Almeida, V., Dowdy, L.: Performance by Design: Computer Capacity Planning by Example. Prentice Hall (2004) 6. Vastag, S.: Modeling quantitative requirements in SLAs with Network Calculus. In: Proceedings of the 5th International ICST Conference on Performance Evaluation Methodologies and Tools (ValueTools), ENS, Cachan, France, ICST (2011) 7. Cruz, R.: A calculus for network delay, part I: Network elements in isolation. IEEE Transactions on Information Theory 37, 114–131 (1991) 8. Cruz, R.: A calculus for network delay, part II: Network analysis. IEEE Transactions on Information Theory 37, 132–141 (1991) 9. Chang, C.: Performance guarantees in communication networks. European Transactions on Telecommunications 12, 357–358 (2001) 10. Le Boudec, J.Y., Thiran, P.: Network Calculus - A Theory of Deterministic Queuing Systems for the Internet. LNCS, vol. 2050. Springer, Heidelberg (2004) 11. Sarjoughian, H., Kim, S., Ramaswamy, M., Yau, S.: A simulation framework for service-oriented computing systems. In: Mason, S.J., Hill, R.R., Mönch, L., Rose, O., Jefferson, T., Fowler, J.W. (eds.) Proceedings of the 2008 Winter Simulation Conference, pp. 845–853. IEEE (2008) 12. Vastag, S.: ProC/B for Networks: Integrated INET Models. In: Müller-Clostermann, B., Echtle, K., Rathgeb, E.P. (eds.) MMB&DFT 2010. LNCS, vol. 5987, pp. 315–318. Springer, Heidelberg (2010) 13. Urgaonkar, B., Pacifici, G., Shenoy, P., Spreitzer, M., Tantawi, A.: An analytical model for multi-tier internet services and its applications. ACM SIGMETRICS Performance Evaluation Review 33, 291–302 (2005)
14. Eckert, J., Schulte, S., Repp, N., Berbner, R., Steinmetz, R.: Queuing-based capacity planning approach for Web service workflows using optimization algorithms. In: Digital Ecosystems and Technologies, DEST 2008, pp. 313–318. IEEE (2008) 15. Baccelli, F., Cohen, G., Olsder, G., Quadrat, J.: Synchronization and Linearity. Wiley, New York (1992) 16. Altman, E., Avrachenkov, K., Barakat, C.: TCP network calculus: The case of large delay-bandwidth product. In: INFOCOM 2002, vol. 1, pp. 417–426. IEEE (2002) 17. Schmitt, J., Zdarsky, F.: The DISCO network calculator: a toolbox for worst case analysis. In: 1st International Conference on Performance Evaluation Methodologies and Tools, 8 pages. ACM (2006) 18. Undheim, A., Jiang, Y., Emstad, P.: Network Calculus approach to router modeling with external measurements. In: Communications and Networking in China, CHINACOM 2007, pp. 276–280. IEEE (2007) 19. Fidler, M., Recker, S.: Conjugate Network Calculus: A dual approach applying the Legendre transform. Computer Networks 50, 1026–1039 (2006) 20. Xie, J., Jiang, Y.: A Temporal Network Calculus Approach to Service Guarantee Analysis of Stochastic Networks. In: Proceedings of the 5th International ICST Conference on Performance Evaluation Methodologies and Tools (ValueTools), ENS, Cachan, France (2011) 21. Schmitt, J.B., Roedig, U.: Sensor Network Calculus – A Framework for Worst Case Analysis. In: Prasanna, V.K., Iyengar, S.S., Spirakis, P.G., Welsh, M. (eds.) DCOSS 2005. LNCS, vol. 3560, pp. 141–154. Springer, Heidelberg (2005) 22. Touseau, L., Donsez, D., Rudametkin, W.: Towards a SLA-based approach to handle service disruptions. In: Services Computing, SCC 2008, vol. 1, pp. 415–422. IEEE (2008) 23. Bouillard, A., Gaujal, B., Lagrange, S., Thierry, É.: Optimal routing for end-to-end guarantees using Network Calculus. Performance Evaluation 65, 883–906 (2008) 24. Thiele, L., Chakraborty, S., Naedele, M.: Real-time calculus for scheduling hard real-time systems. In: ISCAS 2000, vol. 4 (2000) 25. Chakraborty, S., Künzli, S., Thiele, L.: A general framework for analysing system properties in platform-based embedded system designs. In: Proc. 6th Design, Automation and Test in Europe (DATE), pp. 190–195 (2003) 26. Jiang, Y.: Network Calculus and Queueing Theory: Two sides of one coin. ICST ValueTools (2009) 27. Jiang, Y., Liu, Y.: Stochastic Network Calculus. Springer-Verlag New York Inc. (2008) 28. Eckert, J., Pandit, K., Repp, N., Berbner, R., Steinmetz, R.: Worst-case performance analysis of Web service workflows. In: Proceedings of the 9th International Conference on Information Integration and Web-based Application & Services (2007) 29. Molina-Jiménez, C., Pruyne, J., van Moorsel, A.: The Role of Agreements in IT Management Software. In: de Lemos, R., Gacek, C., Romanovsky, A. (eds.) Architecting Dependable Systems III. LNCS, vol. 3549, pp. 36–58. Springer, Heidelberg (2005) 30. Liebeherr, J., Fidler, M., Valaee, S.: A min-plus system interpretation of bandwidth estimation. In: 26th IEEE International Conference on Computer Communications, INFOCOM 2007, pp. 1127–1135. IEEE (2007) 31. Liebeherr, J., Fidler, M., Valaee, S.: A system-theoretic approach to bandwidth estimation. IEEE/ACM Transactions on Networking 18, 1040–1053 (2010)
Verifying Worst Case Delays in Controller Area Network*

Nikola Ivkovic1, Dario Kresic1, Kai-Steffen Hielscher2, and Reinhard German2

1 Department of Information Technologies and Computing, Faculty of Organization and Informatics, University of Zagreb, Pavlinska 2, HR-42000 Varaždin, Croatia {nikola.ivkovic,dario.kresic}@foi.hr
2 Department of Computer Networks and Communication Systems, University of Erlangen-Nuremberg, Martensstraße 3, D-91058 Erlangen, Germany {ksjh,german}@informatik.uni-erlangen.de
Abstract. The Controller Area Network (CAN) protocol was developed to fulfill high availability and timing demands in modern cars, but today it is also used in many other mission-critical applications with hard real-time requirements. We present a compact model of the CAN bus specified by a timed automaton and prove its applicability for estimating worst case delays, which are crucial for hard real-time systems. Using our model we detected flaws in previous approaches to determining the worst case delays in CAN systems. Keywords: Controller area network, CAN, real-time system, medium access, model checking, timed automata, worst case delay, latency.
1
Introduction
Modern cars are equipped with many collaborating Electronic Control Units (ECUs) which are used for various tasks like the braking system, infotainment, occupant protection, etc. The number of electronic subsystems in current vehicles keeps increasing, as most innovation in the automotive industry is achieved by adding new electronic functions and devices. Innovation happens in various fields of application like entertainment, wireless connectivity, or active and passive occupant protection. Most of these ECUs need data from sensors connected to other ECUs. Therefore, communication plays a significant role in modern vehicles. For this purpose, a number of automotive communication buses are utilized, ranging from LIN (Local Interconnect Network) [1] buses for connecting simple sensors and actuators over MOST (Media Oriented Systems Transport) [2] for infotainment applications with
Supported by the German Research Council as part of the project "Verification of Real-Time Warranties in CAN".
audio and video to FlexRay [3] with deterministic TDMA. With the properties described in detail in section 3, CAN (Controller Area Network) [4] offers sustainable performance to fulfill common communication demands in the automotive industry, mainly regarding available data rate and collision avoidance at media access. The majority of car manufacturers employ CAN-based data buses that often coexist with other bus systems. Each of them can offer specific advantages in cost, data rate, media access and multiplexing schemes, or other, mostly hardware-related peculiarities. Analyzing data traffic is necessary in order to support the design of a reliable and robust communication system. The objective of this paper is to show the applicability of timed automata and model checking techniques for a quantitative evaluation of automotive communication systems. In such systems, mean values for performance measures, as obtained from stochastic modeling approaches like queueing theory, are of minor interest, as they are not sufficient to predict whether hard real-time deadlines are met in every case. Reliable operation and avoidance of system malfunctions with fatal effects generally depend on the worst-case performance of the communication infrastructure. Using model checking for timed systems yields upper-bound delays for data transmission, which are indispensable for assessing the reliability of the system in all possible scenarios of operation, since timeliness itself is an essential aspect of hard real-time systems. This paper is organized as follows. Section 2 presents previous approaches to analyzing timing aspects of CAN communication; in section 3 the main features of CAN are briefly presented. In section 4 we introduce a CAN model based on timed automata which allows us to automatically verify hard timing bounds. In section 5, worst case delays are found, verified and analyzed for the studied systems. Section 6 summarizes the results and gives some final conclusions.
2
Related Work
Timing aspects of CAN have been the object of several studies concerning reliable data transfer in automotive applications. Response Time Analysis was used in [5] to estimate whether the deadlines of tasks are met by a given schedule. In [6] and [7], this technique was extended for an improved priority scheduling policy and automatic assignment of task and message periods. Paper [8] investigates CAN using a system representation as a composition of timed automata. In comparison to [8], our model is very compact and uses fewer states, since it is settled at a higher abstraction level. It is also easier to understand and to analyze. In [9] a tool for exploration and optimization of CAN bus design is presented. For an optimized assignment of tasks as well as message cycles, [10] applies the Earliest Deadline First algorithm in conjunction with the CAN protocol. But the drawback of all these proposals is that a static a-priori schedule for all tasks and messages is required to analyze real-time properties and to optimize traffic with respect to timing demands. In fact, due to the asynchronous wake-up of controllers along the CAN bus, such a global schedule can never be anticipated. To overcome this, the verification procedure that we propose requires
only statically assigned CAN identifiers and known cycle times at which each message will be sent. We do not need to know any global bus-wide schedule of traffic in order to obtain upper-bound delays for each priority class. Network calculus was used in [11] to obtain delay bounds in CAN. An advantage of our approach is that it follows the CAN protocol more closely and allows insight into the internal behavior of CAN, in contrast to black-box methods.
3
A Brief Description of CAN
The increasing number of interconnected electronic devices used in modern cars motivated Robert Bosch GmbH in the 1980s to develop a protocol that suits the specific needs of automotive applications. This resulted in the CAN bus, which later spread beyond the automotive industry because of its suitable characteristics for safety-critical and real-time systems. The CAN bus is now standardized as ISO 11898 [4]. It uses a differential serial line architecture with dominant and recessive bits. A CAN bus system consists of a number of nodes, also called ECUs. Every node has a CAN controller connected to the CAN bus through a transceiver, as shown in Fig. 1. The CAN controller is responsible for sending messages, generated in upper layers by a microcontroller, over the CAN bus to other CAN controllers, and for delivering received messages to its own microcontroller.
Fig. 1. ECU nodes interconnected with the CAN bus
A recessive bit represents a logical “1” and a dominant bit a logical “0”. When the bus is idle it has the recessive level. If one controller on the bus sends a dominant bit while another controller sends a recessive bit, the resulting value on the bus will be dominant. This feature implies a bit-wise arbitration scheme for media access, often called CSMA/BA (Carrier Sense Multiple Access with Collision Avoidance by Bitwise Arbitration). While sending data, each station listens to the bus, i.e. it measures its voltage level. If a collision occurs in the arbitration process, where one station tries to send a recessive bit but receives a dominant bit, it will notice that another station is sending simultaneously and will stop its own transmission immediately. This makes the arbitration non-destructive, since the controller sending the dominant bit continues sending without any negative effects on the bus, while
the station sending the recessive bit remains silent from the moment the collision occurred. A retransmission of the interrupted frame is triggered automatically once the bus is idle and thus recessive again. Before sending, each controller stays in the carrier sense phase and listens to the bus. A controller can start sending if the bus is free (recessive), but only after at least 3 bit times (a bit time, also denoted as b.t., is the time needed to transmit 1 bit on the bus) of the interframe space have passed since the last frame. This means that a pending controller that lost a previous arbitration will start sending its frame as soon as the interframe space expires. The frame starts with the start bit (SOF), which is always dominant and is used by all controllers for hard synchronization. Immediately after it follows the message identifier, transmitted from the most significant to the least significant bit. The structure of a standard CAN frame is shown in Fig. 2. CAN controllers do not have explicit addresses; instead, unique message identifiers are used to describe the content of the respective CAN frame. The frames are broadcast on the bus, and each station can decide whether the message content is relevant by examining the message identifier of the received frame. These identifiers are allocated statically during the system design phase to avoid ambiguity in the interpretation of the frame content. This requires that a system designer has a global view of the complete system during development. Two variants of CAN frames exist: standard frames with 11-bit message identifiers and extended frames with 29-bit identifiers. Both can coexist on the same bus. The payload can be of variable size, from 0 up to 64 bits.
Fig. 2. Standard CAN data frame
The media access scheme described above creates an implicit hierarchy of priorities based on the message identifier values. If several senders start to send simultaneously, the sender transmitting the highest message identifier is, due to the binary encoding, the first to send a recessive 1 bit while transmitting the identifier. This bit is overwritten by a dominant 0 bit. The controller listens to the bus at the same time as it sends its bits, and if a collision is detected (a dominant bus while it is sending a recessive bit) the controller stops sending. This arbitration process continues until only one sender remains after the arbitration fields have been sent. This must always be the one with the lowest message identifier and thus with the highest priority [12]. An NRZ (Non-Return-to-Zero) encoding is used for the line encoding of the bits, and the sender inserts a complementary stuff bit when no change in the logic level has occurred over five successive bits (including stuff bits). These stuff bits are removed automatically by the receiver of the message.
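To make the wired-AND arbitration concrete, the following sketch (plain Python; the function name and the example identifiers are our own illustration, not part of the paper) simulates one arbitration round: in every bit slot the bus carries the AND of all transmitted bits, and a station backs off as soon as it sends a recessive 1 but reads a dominant 0.

def arbitration_winner(identifiers, width=11):
    """Simulate CAN bitwise arbitration among distinct message identifiers.

    In each bit slot (MSB first) the bus level is the AND of all bits sent
    (dominant 0 overrides recessive 1); stations whose sent bit differs
    from the bus level stop transmitting. The numerically lowest
    identifier, i.e. the highest priority, always wins.
    """
    contenders = list(identifiers)
    for bit in range(width - 1, -1, -1):
        sent = {i: (i >> bit) & 1 for i in contenders}
        bus = min(sent.values())          # wired-AND of all sent bits
        contenders = [i for i in contenders if sent[i] == bus]
    assert contenders == [min(identifiers)]
    return contenders[0]

print(bin(arbitration_winner([0b10101010101, 0b00111000111, 0b00111000000])))
# -> 0b111000000: the lowest identifier wins the arbitration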
4
Timed Automata and Verification
Timed automata, as introduced in [13], enable modeling the real-time requirements of systems. In general, a timed automaton consists of a set of locations, which denote states of a system, and a set of edges between locations, which denote transitions. Locations as well as transitions may be tagged with (real-valued) constraints on clocks. Such constraints associated with locations are called invariants – time can elapse in a location only as long as its invariant remains valid. Constraints associated with transitions are called guards, or preconditions. Furthermore, transitions can also be labeled with postconditions, which are executed automatically when a transition is taken. A transition can be taken only if the current values of the clocks satisfy its clock constraint. A state of a timed automaton comprises a location and an assignment of values to clock variables. Timed automata can capture important aspects of control and real-time systems such as (real-time) liveness and fairness (both qualitative features) as well as timing delays, bounded response, etc. (examples of quantitative features). In order to perform a formal analysis of a timed automata based model, automatic verification (also known as model checking) can be used. Basically, the verification problem for state-based, infinite systems can be described as the language inclusion problem

L(A) ⊆ L(B),   (1)

where L(A) is (the language of) the given model and L(B) is (the language of) the specification, i.e. the property to be proven. This inclusion is often transformed into the so-called emptiness problem, i.e.

L(A) ∩ L(B)^C = ∅,   (2)

where L(B)^C denotes the complement of the language L(B). Equation (2) states that there is no word which is accepted by A but not accepted by B (if the intersection is not empty, then every word in the intersection is a counterexample). Constructing the intersection of the automata A and B^C (where the latter automaton often represents a logic formula specifying the given property) is generally denoted as model checking.

4.1
CAN Arbitration Model
In this section, we present a timed automata based model of the CAN protocol. This model was used to determine upper delay bounds for message delivery with model checking techniques. Our CAN model represents a system of nodes attached to transceivers which are interconnected by the CAN bus. This is modeled as an array of timed automata, i.e. with several instances of the automaton. Using one compact model for a node has an advantage over the composition of separate automata, as it shows more clearly what is happening when state transitions are performed. Moreover, a more compact automaton promises a more efficient verification, since it has fewer states and variables. Our model – as shown in Fig. 3 – was implemented with
UPPAAL [14], which allows modeling, simulation and verification of real-time systems specified as (networks of) timed automata, possibly extended with data types (bounded integers, arrays, etc.). Different instances of this timed automaton (guards represented in green, variable resets in blue, synchronization labels in light blue and invariants in violet) can communicate with each other through the global variables R (representing the highest adapter priority) and bus (representing the CAN bus state – idle or busy), and by exchanging the synchronization label free through a broadcast communication channel. An instance of the automaton represents one node and models the cyclic message generation that is done by the application process, but most of the model behavior is concerned with the CAN protocol and the CAN controller.
Fig. 3. The Controller Area Network model implemented with UPPAAL
Every instance of the timed automaton has two local clocks. The clock g is used for determining the time events at which a new message is generated and given to the controller for sending. Moreover, the clock g is used for measuring the time delay between a request to send the message and the moment when the message is received by the other transceivers. The clock t is used for modeling the bus access control of the CAN controller. Our model assumes that a node waits some undetermined amount of time before it can start to operate normally, i.e. before it can generate and send messages in predefined cycles. This aspect is modeled with the off state, from which the timed automaton is allowed to make a transition to the idle state without any time constraints. After executing this transition, the clock g is set to 0 in order to start the time measurement of the node's own cycle.
The automaton stays in the idle state waiting for its cycle time to expire; then it changes to the wait or to the start_arbit state, depending on the state of the bus variable. The bus variable represents the CAN bus state, which can be free for sending (encoded by “1”) or busy (encoded by “0”). The automaton will remain in the wait state until it receives the free label, sent by another instance after that instance completes its frame transmission and the minimum interframe space T_ifs has elapsed. As soon as the automaton enters the start_arbit state it starts the arbitration process. If its priority number p is smaller than the number currently stored in the global variable R, the automaton writes its priority number into R, sets the bus variable to 0 and goes to the finish_arbit state. Otherwise, the automaton goes to the wait state. Initially the variable R stores a number that is bigger than the one associated with the lowest priority in the system. On the transition from the start_arbit to the finish_arbit state, the global variable bus is set to 0. This is done to prevent the other instances, which might miss the start of the arbitration, from subsequently engaging in the arbitration process. (This could also be done on the transition to the wait state, but it is redundant, since at least one instance will do the same on the transition to the finish_arbit state.) When the arbitration time T_arb expires, the arbitration is finished and the lowest priority value among the competing automata is stored in the R variable. Automata whose priority differs from the value currently stored in the R variable go back to the wait state. The automaton that won the arbitration fulfills the condition that its priority, the highest one, is stored in the R variable, so it makes a transition to the sending state. By making this transition, the automaton resets the R variable to its initial value, i.e. a number larger than any priority number used in the system (representing a priority lower than the lowest one). Thus, the variable R is prepared for a future arbitration process. The automaton remains in the sending state until the value of the clock t satisfies the guard T_min ≤ t ≤ T_max, i.e. until a complete frame is sent and the automaton goes to the finish state. On the transition to the finish state the clock t is reset to 0 so that the interframe space time can be measured. When the interframe space time T_ifs expires, the automaton goes from the finish to the idle state. On the transition to the idle state the variable bus is set to 1, meaning that the bus is now free, and the label free is sent to all automata in the wait state, which causes them to change to the start_arbit state. It is easy to notice that in the transition between the sending and finish states, an automaton only waits for the clock t to satisfy T_min ≤ t ≤ T_max, and afterwards, from the finish to the idle state, it waits an additional T_ifs. An equivalent automaton can be produced if the finish state is omitted and the transition to the idle state is executed when T_min+T_ifs ≤ t ≤ T_max+T_ifs. That is, instead of sending a data frame an automaton can send an “augmented virtual frame”, i.e. an ordinary data frame augmented with the following minimal interframe space. The transition to the idle state sets the bus variable to 1 and the free label is broadcast. For the equivalent automaton with the omitted finish state, the verification procedure can be done more efficiently, but for clarity the finish state is included in the model.
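The interplay of the start_arbit and finish_arbit states with the shared variable R can be paraphrased in a few lines of plain Python (our own illustrative sketch; it abstracts from clocks and only mirrors the winner selection of the model):

R_INIT = 255          # larger than any priority number used in the system

def arbitrate(priorities):
    """Mirror one arbitration round of the model: every contender in
    start_arbit writes min(R, p) into the shared variable R; after T_arb
    the node with p == R proceeds to sending, the rest go back to wait
    (assuming distinct priority numbers, the winner is unique)."""
    R = R_INIT
    for p in priorities:          # transitions start_arbit -> finish_arbit
        R = min(R, p)
    winner = R                    # guard p == R holds for exactly one node
    waiting = [p for p in priorities if p != R]
    return winner, waiting

print(arbitrate([7, 3, 12]))      # -> (3, [7, 12]): lowest number wins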
4.2
Qualitative Properties
Based on the described CAN model we identified several properties which have to be fulfilled. These properties are specified in the UPPAAL version of Computation Tree Logic (CTL). Apart from the usual deadlock-freeness proof (UPPAAL expression: A[] not deadlock) we also automatically proved the following properties (for more details see [15]): 1) “At any time, only one node (after winning the arbitration!) may send a frame”. This safety property can be specified in UPPAAL as follows: A[] forall (i:id_t) forall (j:id_t) P(i).sending && P(j).sending imply i == j
2) “When a node is sending a frame, the bus has to be busy (i.e. bus == 0)”. This is another safety property, which can be expressed as: A[] forall (i:id_t) P(i).sending imply bus==0
3) “It is possible that several nodes compete for the bus access right”. This liveness property is specified as follows: E[] forall (i:id_t) forall (j:id_t) forall (k:id_t) P(i).start_arbit && P(j).start_arbit && P(k).start_arbit imply (i != j and j != k and i != k)
(for clarity we assumed here that three nodes are competing, but this CTL formula can easily be extended to any number of nodes) 4) “The highest-priority node will always win the arbitration and start sending”. This is also a liveness property, which can be specified as follows: P(0).finish_arbit --> P(0).sending
If we use “1” (or any other non-zero value) instead of “0”, this property – as expected – will not always be valid (because any other node can lose its arbitration process).

4.3
Time Constants
We use the bit time as the time unit in our CAN model, since the model then does not depend on the transmission rate and the times needed for frame transmissions are easy to calculate. Every automaton i has its own cycle[i] time that is statically defined during the system development phase. The time needed to transmit a minimum-sized frame is easy to determine, as it occurs when no data is sent in the data field and the bit pattern is one that does not
need the bit stuffing technique. A minimum-sized frame with the standard 11-bit identifier has 47 bits, therefore the respective time is T_min = 47 b.t. A maximum-sized frame is one that carries the maximum (8 byte) data field, with a bit pattern that causes the insertion of the maximum number of stuff bits. If five consecutive bits with the same value (including stuff bits) are encountered, then the opposite bit is inserted into the frame [4]. Bit stuffing is used only for the first part of the frame and is not used in the remaining part after the CRC field. An unstuffed frame with an 11-bit identifier may have up to 98 bits in the stuffable part of the frame. The upper bound on the frame length is achieved if the first five bits have equal values and are then followed by alternating four-bit patterns of ones and zeroes, as shown in Fig. 4. This way, 1 + ⌊(98 − 5)/4⌋ = 24 additional stuff bits are inserted. With the remaining bits, the upper bound for a CAN frame is 98 + 24 + 10 = 132 bits, therefore T_max = 132 b.t. The actual time needed to finish the arbitration process depends on the length of the arbitration field (i.e. the message identifier) of all messages competing for bus access. This time also depends on whether the winning identifier has a bit pattern that is stuffed or not. Together with the SOF bit, the arbitration field can have between 12 and 14 bits. In our practical implementation T_arb is set to 14 b.t. Since the minimum and maximum arbitration field lengths are already included in T_min and T_max, setting T_arb to 12, 13 or 14 bits does not affect the delay time of the automaton.
Stuffed frame (first 98+24=122 bits) 00000i1111o0000i1111o0000i1111o0000i1111o0000i1111o0000i1111o 0000i1111o0000i1111o0000i1111o0000i1111o0000i1111o0000i1111o1
Legend: o – stuffed zero, i – stuffed one

Fig. 4. Example of a bit pattern that gives an upper bound for the frame size
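The stuff-bit bound can be cross-checked mechanically. The following sketch (plain Python, our own construction, not part of the paper) implements the five-in-a-row stuffing rule and reproduces both the 24 stuff bits for the pattern of Fig. 4 and the resulting T_max = 132 b.t.:

def stuff(bits):
    """Insert a complementary stuff bit after every run of five equal
    consecutive bits; stuff bits themselves count towards the next run."""
    out, run = [], 0
    for b in bits:
        run = run + 1 if out and b == out[-1] else 1
        out.append(b)
        if run == 5:
            out.append(1 - b)     # complementary stuff bit
            run = 1               # the stuff bit starts a new run
    return out

# Worst-case stuffable part of Fig. 4: five equal bits, then alternating
# four-bit runs of ones and zeroes, truncated to 98 bits.
worst = ([0] * 5 + ([1] * 4 + [0] * 4) * 12)[:98]
print(len(stuff(worst)) - len(worst))   # -> 24 stuff bits
print(98 + 24 + 10)                     # -> 132 bits, i.e. T_max = 132 b.t.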
5
Verification of Worst Case Delays
Based on our CAN model presented in Section 4, we were able to find exact upper bounds for message delivery delays, without regard to the concrete priority of the given message. For such purposes UPPAAL provides a scheme for computing upper time bounds: A[](b ==> z ≤ t).

Heuristics for Probabilistic Timed Automata with Abstraction Refinement

Given a distribution μ over a set X, its support is support(μ) = {x ∈ X | μ(x) > 0}. Distr(X) is the set of all probability distributions over X. Given an element x ∈ X, by Dx : X → [0, 1] we denote the Dirac distribution, i.e. Dx(x) = 1. We find it convenient to work with distributions that are labelled by updates [29]. Given a finite alphabet U of updates and a set X, an update-labelled distribution μ is a distribution over U × X such that (u, x), (u, x′) ∈ support(μ) implies x = x′.

Expressions and evaluations. Let V be a set of variables such that every variable x ∈ V has a specific domain Dx. In our framework we restrict the domains of the variables to booleans, integers or bounded integers. A valuation of V is a function v : V → ⋃_{x∈V} Dx such that v(x) ∈ Dx. We denote by Val(V) the set of valuations of V. To avoid a cumbersome description, we assume that we can construct expressions whose free variables are in V by using the usual arithmetic operations and relations. We denote by ExprV (BExprV) the set of all (boolean) expressions over V. Given an expression e ∈ ExprV and a valuation v of V, we denote by ⟦e⟧v the evaluation of the expression in which every occurrence of a variable x is replaced by the value v(x) and the term is evaluated subsequently. We denote by e[x1 → e1, . . . , xn → en] the expression that results from simultaneously replacing every occurrence of xi by ei. Given b ∈ BExprV and v ∈ Val(V) we say that v satisfies b, denoted v |= b, if ⟦b⟧v = 1. The set of all valuations that satisfy a boolean expression b is denoted
by ⟦b⟧. An assignment is a function η : V → ExprV such that for every valuation v and every variable x ∈ V we have ⟦η(x)⟧v ∈ Dx. The set of all assignments is denoted by Assgn(V). We denote by e[η] the expression e[x1 → η(x1), . . . , xn → η(xn)], and v[η] is the valuation such that v[η](x) = ⟦η(x)⟧v.

Clock Constraints. In order to model real-time behavior, a special kind of variable is required; we call these clock variables. Given a finite set of clock variables X, a clock constraint ζ is an expression in BExprX that is generated by the following grammar rules:

ζ ::= true | false | x ≤ c | x = c | x ≥ c | ζ ∧ ζ

where x ∈ X and c ∈ N. We denote by CC(X) the set of all clock constraints over X.

Timed Guarded Commands. Given a finite set of updates U, a set of actions Σ, a finite set of discrete variables V and a finite set of clock variables X, a timed (guarded) command is a tuple c = (a, g, gt, μ, X) where a ∈ Σ is the label, g ∈ BExprV is the guard, gt ∈ CC(X) is the guard over clock variables, μ : U → Q × Assgn(V) is the probabilistic transition relation such that

∑_{u∈U, μ(u)=(p,η)} p = 1,
and X ⊆ X is the set of clocks that are reset by the action.

Definition 1 (Variable-decorated Probabilistic Timed Automata). A Variable-decorated Probabilistic Timed Automaton, or simply Probabilistic Timed Automaton (PTA), M is a tuple (V, X, I, Σ, U, Inv, TC) where:
– V is a finite set of data variables.
– X is a finite set of clock variables.
– I ∈ BExprV is the initial condition for the data variables.
– Σ is a finite set of action labels.
– U is a finite update alphabet.
– Inv ⊆ BExprV∪X is a finite set of invariants such that if inv ∈ Inv then inv is of the form p =⇒ q, where p ∈ BExprV and q ∈ CC(X).
– TC is a finite set of timed commands.
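For concreteness, the components of such a tuple might be encoded as follows (a hypothetical sketch in Python; all names, types and the example command are our own, not part of the formal definition):

from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Valuation = Dict[str, int]                           # data valuation v
Assignment = Dict[str, Callable[[Valuation], int]]   # eta: V -> Expr_V

@dataclass
class TimedCommand:
    action: str                                 # a, the action label
    guard: Callable[[Valuation], bool]          # g over data variables
    clock_guard: List[Tuple[str, str, int]]     # g_t as (clock, op, const)
    prob: Dict[str, Tuple[float, Assignment]]   # mu: update -> (p, eta)
    resets: List[str]                           # X, clocks reset on firing

# A lossy send: with probability 0.9 succeed, otherwise count a retry.
cmd = TimedCommand(
    action='send',
    guard=lambda v: v['retries'] < 3,
    clock_guard=[('c', '>=', 2)],
    prob={'ok':   (0.9, {'retries': lambda v: 0}),
          'lost': (0.1, {'retries': lambda v: v['retries'] + 1})},
    resets=['c'])
assert abs(sum(p for p, _ in cmd.prob.values()) - 1.0) < 1e-9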
Intuitively, a PTA, just like a TA, is a system whose states are pairs of a data valuation and a clock valuation, the latter satisfying the invariants. At each step the system can choose a command that is enabled in the current state and whose execution does not violate the invariant in any of the possible target states. In this case it performs the action and carries out a probabilistic experiment, which determines the successor location and the clock resets. The system can also let time pass, in which case none of the intermediate states may violate the invariant of the location. By the above definition of clock constraints, our PTA are
closed and diagonal-free by construction. That means that strict comparisons are disallowed and clocks can only be compared against constants. A strategy is a function that resolves, in each state, the non-determinism present in the model. As usual, we restrict our analysis to non-zeno, timelock-free models and consider only divergent strategies (i.e. infinite traces take infinite time). A divergent strategy and a reachability objective define a unique probability measure over the states [24]. A Probabilistic Automaton (PA) can be thought of as a PTA without timed behavior. Therefore, it can be defined along the lines of Definition 1, removing clock variables and the time action. A timed model can be translated into a discrete, non-timed model by means of a transformation called digitalization [14,20]. This technique roughly consists of introducing a new integer variable for each clock present in the original model and a new action that increments all clocks synchronously by one. In the non-probabilistic setting, Henzinger, Manna and Pnueli [14], and later Kwiatkowska et al. [20] for PTA, showed that for closed and diagonal-free models digitalization is equivalent to the dense-time semantics in the case of reachability properties.
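A minimal sketch of the digital-clocks idea (our own illustration in Python, not the formal construction of [14,20]): each clock becomes a bounded integer, and one extra tick action advances all of them synchronously, saturating one above the largest constant they are compared against.

import operator

CMAX = 4   # largest constant any clock is compared against

def tick(clocks):
    """The added digitalization action: advance all clocks synchronously by
    one time unit, saturating at CMAX + 1 (larger values are equivalent)."""
    return {x: min(v + 1, CMAX + 1) for x, v in clocks.items()}

def satisfies(clocks, guard):
    """Evaluate a closed, diagonal-free clock constraint, given as a
    conjunction of (clock, op, const) atoms with constants <= CMAX."""
    ops = {'<=': operator.le, '==': operator.eq, '>=': operator.ge}
    return all(ops[op](clocks[x], c) for x, op, c in guard)

clocks = {'x': 0, 'y': 0}
for _ in range(3):                 # let three time units elapse
    clocks = tick(clocks)
print(clocks)                                               # {'x': 3, 'y': 3}
print(satisfies(clocks, [('x', '>=', 2), ('y', '<=', 3)]))  # True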
3
Abstraction and Refinement
In this section we show how a probabilistic automaton can be abstracted using a stochastic game semantics, and we then instantiate this idea for a particular kind of abstraction based on predicates.

3.1
Abstraction Based on Stochastic Games
Abstractions based on state partitioning can be formed by over-approximating a PA [7,31]. Such abstractions can only provide safe (usually upper) bounds, but we cannot be sure about the precision of the abstraction, i.e. minimal and maximal adversaries cannot be calculated. A better approach, pioneered by Kwiatkowska et al. [17], uses stochastic games [27,5] as abstractions and exploits a separation between the non-determinism of the original model and the non-determinism introduced by the abstraction. As a consequence, we can calculate upper and lower bounds of both minimal and maximal probabilities.

Definition 2 (Stochastic Game [5]). A Stochastic Game is a tuple G = ((V, E), Vinit, V1, V2, Vp, U, δ) where:
– (V, E) is a finite directed graph with vertices V and edges E ⊆ (V × V).
– Vinit ⊆ V is the set of initial vertices.
– V1 ⊆ V are the player 1 vertices.
– V2 ⊆ V are the player 2 vertices.
– Vp ⊆ V are the probabilistic vertices.
– U is a finite alphabet.
– δ : Vp → Distr(U × V1)
such that (V1, V2, Vp) is a partition of V, E ⊆ V1 × V2 ∪ V2 × Vp ∪ Vp × V1, and δ(v)(u, v′) > 0 implies (v, v′) ∈ E. We denote by E(v) = {w | (v, w) ∈ E} the set of successors of v. As with PA and PTA, non-determinism is resolved by means of strategies. In the case of Stochastic Games, a strategy is formed by two independent strategies, one for the player 1 choices and another for player 2. In our framework we abstract PA into stochastic games [17,30].

Definition 3 (Menu-based Abstraction¹ [30]). Given a PA M = (V, I, Σ, U, C), a reachability objective F ∈ BExprV, and a partition Q of Val(V) such that for all B ∈ Q, if v, v′ ∈ B then for all c ∈ C we have v |= gc if and only if v′ |= gc, and there exist blocks B1, . . . , Bn, B′1, . . . , B′m such that ⟦I⟧ = ⋃i Bi and ⟦F⟧ = ⋃i B′i, the Menu-based Abstraction (MBA) of M with respect to Q is the stochastic game GM,Q = ((V, E), Vinit, V1, V2, Vp, U, δ) where:
– Vinit = ⟦I⟧/Q.
– V1 = Q.
– V2 = {(B, c) | B ∩ ⟦gc⟧ ≠ ∅}.
– Vp = {μ_v | (a, g, μ) ∈ C ∧ v |= g}.
The distribution function δ is the identity and the edges in E are defined by:

E = {(B, (B, c)) | B ∈ V1, B ∩ ⟦gc⟧ ≠ ∅}
  ∪ {((B, c), μ_v) | B ∈ V1, v ∈ B, c = (a, g, μ) ∈ C, v |= g}
  ∪ {(μ_v, B) | μ_v ∈ Vp, B ∈ support(μ_v)},

and μ_v ∈ Distr(U × Q) is defined by

μ_v(u, B) = p if B = v[η]/Q, and μ_v(u, B) = 0 otherwise,

where μ(u) = (p, η).

The basic idea of the abstraction is that the first player resolves the non-determinism present in the original PA, while the second player resolves the non-determinism introduced by the abstraction. The abstraction used here is – for presentation reasons – slightly simplified from the one presented in [30], since we assume that in each abstract player 1 state the same commands are enabled.

Theorem 1 (Soundness [30]). Given a PA M = (V, I, Σ, U, C), a partition Q of Val(V) that satisfies the conditions of Definition 3, and the menu-based abstraction GM,Q, then:

inf_{σ1,σ2} p_s^{σ1,σ2} ≤ p_s^min ≤ inf_{σ1} sup_{σ2} p_s^{σ1,σ2},
sup_{σ1} inf_{σ2} p_s^{σ1,σ2} ≤ p_s^max ≤ sup_{σ1,σ2} p_s^{σ1,σ2}.

¹ This abstraction is originally called Parallel Abstraction [30], but we use the more recent terminology [29].
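To illustrate how the four bounds of Theorem 1 can be computed on the game, consider the following sketch (plain Python; the toy game and all names are our own): player 1 optimizes the probability of interest, while player 2 either cooperates (outer bounds) or opposes (inner bounds).

# Toy stochastic game: one player 1 vertex 's', two player 2 vertices
# (one per enabled command), probabilistic vertices m1..m3, sinks goal/fail.
succ1 = {'s': ['c1', 'c2']}                  # player 1 -> player 2 vertices
succ2 = {'c1': ['m1'], 'c2': ['m2', 'm3']}   # player 2 -> probabilistic
delta = {'m1': [(0.5, 'goal'), (0.5, 'fail')],
         'm2': [(1.0, 'goal')],
         'm3': [(1.0, 'fail')]}

def reach(opt1, opt2, iters=50):
    """Value iteration for the probability of reaching 'goal' from 's';
    opt1/opt2 in {min, max} resolve player 1 / player 2 choices."""
    val = {'goal': 1.0, 'fail': 0.0, 's': 0.0}
    for _ in range(iters):
        vp = {m: sum(p * val[t] for p, t in d) for m, d in delta.items()}
        v2 = {c: opt2(vp[m] for m in ms) for c, ms in succ2.items()}
        val['s'] = opt1(v2[c] for c in succ1['s'])
    return val['s']

print(reach(max, min), reach(max, max))  # 0.5 1.0: bounds on p_max
print(reach(min, min), reach(min, max))  # 0.0 0.5: bounds on p_min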
Menu-based abstraction has some advantages over game-based abstraction [17] (GBA), a technique where the roles of the players are basically reversed – with the first player resolving the non-determinism due to the abstraction, and the second player resolving the model-inherent non-determinism: MBAs are generally more compact than GBAs, and they are easier to compute [29], since the need to analyse the effect of multiple actions at a time [15] can be avoided. This is contrasted by the foundational observation that GBA has the best transformer property in the sense of abstract interpretation [6], while MBA is actually suboptimal for non-cooperative strategies [29].

3.2
Predicate Abstraction
One of the most widely used techniques to build an abstract model is Predicate Abstraction, pioneered by Graf and Saïdi [10]. A partition of the state space by predicates can be constructed efficiently from a syntactic representation of the model using modern SMT solvers. This avoids the construction of the full semantics of the original model. In theory, the state space generated by Predicate Abstraction may be exponential in the number of predicates used. In practice, however, the size is considerably lower. First, not all abstract states are reachable from the initial abstract configuration. Second, the predicates are not always disjoint, yielding abstract states that are logically equivalent to false. For probabilistic models, predicate abstraction is more complex due to the fact that the post relation of an action in the original model is not deterministic [31]. The requirements of Definition 3 can be fulfilled by adding all the predicates from guards in the case of PA, and additionally all predicates from timed guards and invariants in the case of PTA.
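As an illustration of the SMT-based construction, the following sketch (our own, using the z3 solver's Python bindings; it is not the implementation used in PASS) enumerates the abstract transitions of a single guarded command that the solver cannot rule out, given two predicates over one integer variable:

# pip install z3-solver
from itertools import product
from z3 import And, Int, Not, Solver, sat, substitute

x = Int('x')
preds = [x >= 0, x >= 10]                 # abstraction predicates
guard = x >= 0                            # command:  [x >= 0]  x := x + 5
updated = [substitute(p, (x, x + 5)) for p in preds]  # predicates after update

def cube(bits, terms):
    """Conjunction fixing each predicate to the truth value in bits."""
    return And(*[t if b else Not(t) for b, t in zip(bits, terms)])

s = Solver()
for src in product([False, True], repeat=len(preds)):
    for dst in product([False, True], repeat=len(preds)):
        s.push()
        s.add(guard, cube(src, preds), cube(dst, updated))
        if s.check() == sat:              # feasible abstract transition
            print(src, '->', dst)
        s.pop()
# prints: (True, False) -> (True, False), (True, False) -> (True, True),
#         (True, True)  -> (True, True)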
3.3
Backward Refinement
Given a PA or a digitalization of a PTA together with a partition, we are in a position to construct a stochastic game that is a sound abstraction of the original model. However, the constructed abstraction might be too coarse, so that the upper and lower bounds for the reachability property we want to analyze may differ considerably. To solve this problem the abstraction has to be refined, which in our framework is equivalent to obtaining a new set of predicates. Backwards Refinement (BR) is a technique introduced by Kattenbelt et al. [16] to verify probabilistic software. It was later adapted to networks of PA using MBA in [30]. The algorithm is based on the notion of pivot blocks. A block is a pivot if the selections made by player 2 differ between the strategies obtaining the upper and the lower bound on the reachability property. Although two different strategies can yield the same probability of reaching a set of target states, it can be assumed that the decisions of player 2 in the upper and lower strategies differ only if that induces a change in the probabilities [3]. To ensure this constraint, PASS employs a modification of the value iteration algorithm [29]. The new predicates are obtained by taking the weakest preconditions of the different choices made
Table 1. Experimental results of models hand-translated into digital clocks
[Table data not reproducible from the extracted text. For each model (BRP with parameter sets 16/2/1, 16/2/4, 65/5/1 and 65/5/4; CSMA with parameters 1 and 4; Zeroconf) and each property P1–P4 (plus P* for CSMA), the table reports the number of refinements, the number of predicates, and the running times for abstraction and model checking; several instances exceed the timeout.]