FOUNDATIONS OF DEPENDABLE COMPUTING
Models and Frameworks for Dependable Systems
THE KLUWER INTERNATIONAL SERIES IN ENGINEERING AND COMPUTER SCIENCE
OFFICE OF NAVAL RESEARCH Advanced Book Series
Consulting Editor: André M. van Tilborg
Other titles in the series:

FOUNDATIONS OF DEPENDABLE COMPUTING: Paradigms for Dependable Applications, edited by Gary M. Koob and Clifford G. Lau ISBN: 0-7923-9485-2

FOUNDATIONS OF DEPENDABLE COMPUTING: System Implementation, edited by Gary M. Koob and Clifford G. Lau ISBN: 0-7923-9486-0

PARALLEL ALGORITHM DERIVATION AND PROGRAM TRANSFORMATION, edited by Robert Paige, John Reif and Ralph Wachter ISBN: 0-7923-9362-7

FOUNDATIONS OF KNOWLEDGE ACQUISITION: Cognitive Models of Complex Learning, edited by Susan Chipman and Alan L. Meyrowitz ISBN: 0-7923-9277-9

FOUNDATIONS OF KNOWLEDGE ACQUISITION: Machine Learning, edited by Alan L. Meyrowitz and Susan Chipman ISBN: 0-7923-9278-7

FOUNDATIONS OF REAL-TIME COMPUTING: Formal Specifications and Methods, edited by André M. van Tilborg and Gary M. Koob ISBN: 0-7923-9167-5

FOUNDATIONS OF REAL-TIME COMPUTING: Scheduling and Resource Management, edited by André M. van Tilborg and Gary M. Koob ISBN: 0-7923-9166-7
FOUNDATIONS OF DEPENDABLE COMPUTING
Models and Frameworks for Dependable Systems
edited by
Gary M. Koob
Clifford G. Lau
Office of Naval Research
KLUWER ACADEMIC PUBLISHERS Boston / Dordrecht / London
Distributors for North America: Kluwer Academic Publishers 101 Philip Drive Assinippi Park Norwell, Massachusetts 02061 USA Distributors for all other countries: Kluwer Academic Publishers Group Distribution Centre Post Office Box 322 3300 AH Dordrecht, THE NETHERLANDS
Library of Congress Cataloging-in-Publication Data

Foundations of dependable computing. Models and frameworks for dependable systems / [edited by Gary M. Koob, Clifford G. Lau].
p. cm. -- (The Kluwer international series in engineering and computer science ; SECS 0283)
Includes bibliographical references and index.
ISBN 0-7923-9484-4
1. Electronic digital computers--Reliability. 2. Real-time data processing. 3. Fault-tolerant computing. I. Koob, Gary M., 1958- . II. Lau, Clifford. III. Series: Kluwer international series in engineering and computer science ; SECS 0283.
QA76.5.F6238 1994
004.2'1--dc20    94-29126 CIP
Copyright © 1994 by Kluwer Academic Publishers All rights reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, mechanical, photo-copying, recording, or otherwise, without the prior written permission of the publisher, Kluwer Academic Publishers, 101 Philip Drive, Assinippi Park, Norwell, Massachusetts 02061 Printed on acid-free paper.
Printed in the United States of America
CONTENTS

Preface ........................................................... vii

Acknowledgements .................................................. xiii

Errata ............................................................ xv

1. FRAMEWORKS FOR DEPENDABLE SYSTEMS .............................. 1

1.1 A Consensus-Based Framework and Model for the Design of Responsive Computing Systems .... 3
    M. Malek

1.2 A Methodology for Adapting to Patterns of Faults .............. 23
    G. Agha and D. C. Sturman

2. REQUIREMENTS MODELS ............................................ 61

2.1 Military Fault-Tolerant Requirements .......................... 63
    T.P. Monaghan

2.3 Derivation and Use of Deadline Information in Real-Time Control Systems .... 77
    K.G. Shin and H. Kim

3. SYSTEM VALIDATION .............................................. 111

3.1 Software-Implemented Fault Injection of Transient Hardware Errors .... 113
    C.R. Yount and D.P. Siewiorek

3.2 REACT: An Integrated Tool for the Design of Dependable Computer Systems .... 169
    J.A. Clark and D.K. Pradhan

4. SYSTEM EVALUATION .............................................. 193

4.1 Measurement-Based Dependability Evaluation of Operational Computer Systems .... 195
    R.K. Iyer and D. Tang

4.2 Modeling and Evaluation of Opto-Electronic Computing Systems .... 235
    T.Y. Lin

Index ............................................................. 263
PREFACE
Dependability has long been a central concern in the design of space-based and military systems, where survivability for the prescribed mission duration is an essential requirement, and is becoming an increasingly important attribute of government and commercial systems, where reduced availability may have severe financial consequences or even lead to loss of life. Historically, research in the field of dependable computing has focused on the theory and techniques for preventing hardware and environmentally induced faults through increasing the intrinsic reliability of components and systems (fault avoidance), or surviving such faults through massive redundancy at the hardware level (fault tolerance). Recent advances in hardware, software, and measurement technology coupled with new insights into the nature, scope, and fundamental principles of dependable computing, however, contributed to the creation of a challenging new research agenda in the late eighties aimed at dramatically increasing the power, effectiveness, and efficiency of approaches to ensuring dependability in critical systems. At the core of this new agenda was a paradigm shift spurred by the recognition that dependability is fundamentally an attribute of applications and services--not platforms. Research should therefore focus on (1) developing a scientific understanding of the manifestations of faults at the application level in terms of their ultimate impact on the correctness and survivability of the application; (2) innovative, application-sensitive approaches to detecting and mitigating this impact; and (3) hierarchical system support for these new approaches. Such a paradigm shift necessarily entailed a concomitant shift in emphasis away from inefficient, inflexible, hardware-based approaches toward higher level, more efficient and flexible software-based solutions. Consequently, the role of hardware-based mechanisms was redefined to that of providing and implementing the abstractions required to support the higher level software-based mechanisms in an integrated, hierarchical approach to ultradependable system design. This shift was furthermore compatible with an expanded view of "dependability," which had evolved to mean "the ability of the system to deliver the specified (or expected) service." Such a definition encompasses not only survival of traditional single hardware faults and environmental disturbances but more complex and less well-understood phenomena, as well: Byzantine faults, correlated errors, timing faults, software design and process interaction errors, and--most significantly--the unique issues encountered in real-time
systems in which faults and transient overload conditions must be detected and handled under hard deadline and resource constraints. As sources of service disruption multiplied and focus shifted to their ultimate effects, traditional frameworks for reasoning about dependability had to be rethought. The classical fault/error/failure model, in which underlying anomalies (faults) give rise to incorrect values (errors), which may ultimately cause incorrect behavior at the output (failures), required extension to capture timing and performance issues. Graceful degradation, a long-standing principle codifying performance/dependability trade-offs, must be more carefully applied in real-time systems, where individual task requirements supersede general throughput optimization in any assessment. Indeed, embedded real-time systems--often characterized by interaction with physical sensors and actuators--may possess an inherent ability to tolerate brief periods of incorrect interaction, either in the values exchanged or the timing of those exchanges. Thus, a technical failure of the embedded computer does not necessarily imply a system failure. The challenge of capturing and modeling dependability for such potentially complex requirements is matched by the challenge of successfully exploiting them to devise more intelligent and efficient--as well as more complete--dependability mechanisms. The evolution to a hierarchical, software-dominated approach would not have been possible without several enabling advances in hardware and software technology over the past decade:

(1) Advances in VLSI technology and RISC architectures have produced components with more chip real estate available for incorporation of efficient concurrent error detection mechanisms and more on-chip resources permitting software management of fine-grain redundancy;

(2) The emergence of practical parallel and distributed computing platforms possessing inherent coarse-grain redundancy of processing and communications resources--also amenable to efficient software-based management by either the system or the application;

(3) Advances in algorithms and languages for parallel and distributed computing leading to new insights in and paradigms for problem decomposition, module encapsulation, and module interaction, potentially exploitable in refining redundancy requirements and isolating faults;

(4) Advances in distributed operating systems allowing more efficient interprocess communication and more intelligent resource management;
(5) Advances in compiler technology that permit efficient, automatic instrumentation or restructuring of application code, program decomposition, and coarse and fine-grain resource management; and

(6) The emergence of fault-injection technology for conducting controlled experiments to determine the system and application-level manifestations of faults and evaluating the effectiveness or performance of fault-tolerance methods.

In response to this challenging new vision for dependable computing research, the advent of the technological opportunities for realizing it, and its potential for addressing critical dependability needs of Naval, Defense, and commercial systems, the Office of Naval Research launched a five-year basic research initiative in 1990 in Ultradependable Multicomputers and Electronic Systems to accelerate and integrate progress in this important discipline. The objective of the initiative is to establish the fundamental principles as well as practical approaches for efficiently incorporating dependability into critical applications running on modern platforms. More specifically, the initiative sought increased effectiveness and efficiency through

(1) Intelligent exploitation of the inherent redundancy available in modern parallel and distributed computers and VLSI components;

(2) More precise characterization of the sources and manifestations of errors;

(3) Exploitation of application semantics at all levels--code, task, algorithm, and domain--to allow optimization of fault-tolerance mechanisms to both application requirements and resource limitations;

(4) Hierarchical, integrated software/hardware approaches; and

(5) Development of scientific methods for evaluating and comparing candidate approaches.

Implementation of this broad mandate as a coherent research program necessitated focusing on a small cross-section of promising application-sensitive paradigms (including language, algorithm, and coordination-based approaches), their required hardware, compiler, and system support, and a few selected modeling and evaluation projects. In scope, the initiative emphasizes dependability primarily with respect to an expanded class of hardware and environment (both physical and operational) faults. Many of the efforts furthermore explicitly address issues of dependability unique to the domain of embedded real-time systems. The success of the initiative and the significance of the research is demonstrated by the ongoing associations that many of our principal investigators have forged with a variety of military, Government, and commercial projects whose critical needs are leading to the rapid assimilation of concepts, approaches, and expertise arising from this initiative. Activities influenced to date include the FAA's Advanced Automation System for air traffic control, the Navy's AX project and Next Generation Computing Resources standards program, the Air Force's Center for Dependable Systems, the OSF/1 project, the space station Freedom, the Strategic
Defense Initiative, and research projects at GE, DEC, Tandem, the Naval Surface Warfare Center, and MITRE Corporation. This book series is a compendium of papers summarizing the major results and accomplishments attained under the auspices of the ONR initiative in its first three years. Rather than providing a comprehensive text on dependable computing, the series is intended to capture the breadth, depth, and impact of recent advances in the field, as reflected through the specific research efforts represented, in the context of the vision articulated here. Each chapter does, however, incorporate appropriate background material and references. In view of the increasing importance and pervasiveness of real-time concerns in critical systems that impact our daily lives--ranging from multimedia communications to manufacturing to medical instrumentation--the real-time material is woven throughout the series rather than isolated in a single section or volume. The series is partitioned into three volumes, corresponding to the three principal avenues of research identified at the beginning of this preface. While many of the chapters actually address issues at multiple levels, reflecting the comprehensive nature of the associated research project, they have been organized into these volumes on the basis of the primary conceptual contribution of the work. Agha and Sturman, for example, describe a framework (reflective architectures), a paradigm (replicated actors), and a prototype implementation (the Screed language and Broadway runtime system). But because the salient attribute of this work is the use of reflection to dynamically adapt an application to its environment, it is included in the Frameworks volume. Volume I, Models and Frameworks for Dependable Systems, presents two comprehensive frameworks for reasoning about system dependability, thereby establishing a context for understanding the roles played by specific approaches presented throughout the series. This volume then explores the range of models and analysis methods necessary to design, validate, and analyze dependable systems. Volume II, Paradigms for Dependable Applications, presents a variety of specific approaches to achieving dependability at the application level. Driven by the higher level fault models of Volume I and built on the lower level abstractions implemented in Volume III, these approaches demonstrate how dependability may be tuned to the requirements of an application, the fault environment, and the characteristics of the target platform. Three classes of paradigms are considered: protocol-based paradigms for distributed applications, algorithm-based paradigms for parallel applications, and approaches to exploiting application semantics in embedded real-time control systems. Volume III, System Implementation, explores the system infrastructure needed to support the various paradigms of Volume II. Approaches to implementing
support mechanisms and to incorporating additional appropriate levels of fault detection and fault tolerance at the processor, network, and operating system level are presented. A primary concern at these levels is balancing cost and performance against coverage and overall dependability. As these chapters demonstrate, low overhead, practical solutions are attainable and not necessarily incompatible with performance considerations. The section on innovative compiler support, in particular, demonstrates how the benefits of application specificity may be obtained while reducing hardware cost and run-time overhead. This first volume in the series covers system architectures or frameworks that serve as the foundation for dependable system design and the various models required at each layer of the system hierarchy and stage of its lifecycle to guide design decisions and evaluate their effectiveness. Section 1 presents two frameworks for the study and design of dependable systems. Malek emphasizes the layered view of dependability advocated throughout this series and presents the concept of universal consensus for realizing dependability at each level in the context of distributed real-time systems. Agha and Sturman introduce the concepts of reflection and encapsulation as vehicles for tuning dependability to a dynamically changing fault environment and application requirements while maintaining the transparency of dependability mechanisms to the application itself. The concepts are made concrete through the example of an actor-based language and run-time system, highlighting the importance of language hooks in granting users enhanced control over detection and recovery mechanisms. Given these frameworks, Section 2 addresses the issue of mathematically characterizing dependability requirements in a manner exploitable by them. Monaghan introduces the section by outlining the real requirements demanded by typical military systems and the difficulty of precisely translating those requirements for systems designers and verifying the results. In real-time systems, dependability encompasses timeliness as well as correctness. Shin presents an approach to deriving precise deadline constraints from the application semantics to provide dependability mechanisms with maximum flexibility based on true requirements rather than specifications of undocumented origin. Shin also extends the layered view of dependability to the larger system level: an erroneous output from an embedded computer, while technically a failure of that computer, may be still recoverable at the system level if the controlled process is robust enough. Once a system is designed using appropriate high-level abstractions and fault models, the problem remains to validate the design against the types of faults anticipated in actual operation. An emerging approach to this critical problem is fault injection, in which the response of the system to low-level injected errors is gauged. Efficiency of this process demands an intermediate model to guide injection that preserves coverage while simplifying and accelerating the testing. One such approach
and the issues involved in applying it are examined by Yount and Siewiorek in Section 3. Clark and Pradhan complete the picture by describing the REACT testbed for modeling and validating dependable system designs. Whereas the models presented thus far capture the behavior of the system in response to particular fault scenarios, a global, quantitative analysis of system dependability in terms of the probabilistic measures of reliability, availability, and performability is necessary in order to judge whether the overall requirements have been met and to guide allocation of resources to the most critical system components. Iyer and Tang take an empirical approach using data from operational systems to drive their models, identify trends, and capture the shifting focus of dependability concerns as hardware and software technology evolve. Lin takes an analytical approach to developing a quantitative method for evaluating design alternatives in the context of optoelectronic interconnection networks.

Gary M. Koob
Mathematical, Computer and Information Sciences Division
Office of Naval Research

Clifford G. Lau
Electronics Division
Office of Naval Research
ACKNOWLEDGEMENTS
The editors regret that, due to circumstances beyond their control, two planned contributions to this series could not be included in the final publications: "Compiler Generated Self-Monitoring Programs for Concurrent Detection of Run-Time Errors," by J.P. Shen and "The Hybrid Fault Effects Model for Dependable Systems," by C.J. Walter, M.M. Hugue, and N. Suri. Both represent significant, innovative contributions to the theory and practice of dependable computing and their omission diminishes the overall quality and completeness of these volumes. The editors would also like to gratefully acknowledge the invaluable contributions of the following individuals to the success of the Office of Naval Research initiative in Ultradependable Multicomputers and Electronic Systems and this book series: Joe Chiara, George Gilley, Walt Heimerdinger, Robert Holland, Michelle Hugue, Miroslaw Malek, Tim Monaghan, Richard Scalzo, Jim Smith, André van Tilborg, and Chuck Weinstock.
ERRATA
1. Due to a late editing decision, Section 2.2 was removed from this volume. The somewhat anomalous numbering scheme employed in Section 2 reflects the original organization.

2. The following notes were inadvertently omitted from Section 3.2:

• This section is based on research sponsored, in part, by the Office of Naval Research under grants N00014-91-J-1404 and N00014-92-J-1366 and conducted at the University of Massachusetts at Amherst and Texas A&M University.

• Jeffrey A. Clark is with the MITRE Corporation, Bedford, Massachusetts.

• Dhiraj K. Pradhan is with Texas A&M University, College Station, Texas.
SECTION 1
FRAMEWORKS FOR DEPENDABLE SYSTEMS
SECTION 1.1
A Consensus-Based Framework and Model for the Design of Responsive Computing Systems*

Miroslaw Malek†
Abstract

The emerging discipline of responsive systems demands fault-tolerant and real-time performance in uniprocessor, parallel, and distributed computing environments. A new proposal for a measure of responsiveness is presented, followed by an introduction of a consensus-based framework and model for responsive computing. The model, called CONCORDS (CONsensus/COmputation for Responsive Distributed Systems), is based on the integration of various forms of consensus and computation (management, then progress or recovery). The consensus tasks include clock synchronization, phase initialization and termination, diagnosis, checkpointing, resource allocation, and scheduling.

Index Terms: consensus, distributed computing, fault tolerance, operating system, parallel computing, real time, responsive systems.

* This research was supported in part by the Office of Naval Research Contract No. N00014-88-K-0543, Grant No. N00014-91-J-1858 and the Texas Advanced Technology Grant 386.
† Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, Texas 78712. Phone: 1-512-471-5704, Fax: 1-512-471-0954, Email: malek@ece.utexas.edu
1.1.1 Introduction
The integration of three or four traditionally separate disciplines of fault-tolerant computing, real-time systems, and parallel/distributed computing may provide solutions to one of the greatest challenges of this decade: namely, the design and implementation of responsive computer systems. A responsive computer system [7] is a fault-tolerant, real-time system whose objective is timely and correct program execution in uniprocessor, parallel, and distributed computing environments, even in the presence of faults. In a nutshell, responsive systems are expected to maximize the probability of correct program execution on time despite faults. Society, business owners, and government officials worldwide have been sold on the promise that open, distributed computer systems would deliver improved reliability and timeliness. The reality seems to be to the contrary, and effective solutions for making these systems responsive are badly needed. Responsive computer systems are crucial in many applications. Communication, control, avionic systems, robotics, multimedia, decision support systems, banking, air traffic control, and even point-of-sale terminals are just a few examples of systems that require a high degree of responsiveness. The objective of this chapter is to introduce a concrete model for optimizing both timeliness and classical measures of dependability, such as reliability and availability, in the multicomputing environment. Our philosophy is based on the observation that high responsiveness can only be achieved by developing an ultrareliable kernel, whose correctness is provable, and combining it with application-specific responsive system design techniques. This is due to the diversity of applications and environments that make highly responsive systems extremely difficult to design. Although there were several fault-tolerant systems designed in the past, only two of them focused directly on both fault tolerance and real time: the Multicomputer Architecture for Fault Tolerance (MAFT) [5] and the Maintainable Real-Time System (MARS) [6]. While MAFT manages redundancy via Byzantine agreement protocols, MARS uses duplicated processors and relies on a fail-silent fault model. MAFT
separates scheduling, synchronization, voting (SSV), and computing in space by using separate processors for SSV and computing. Both systems heavily rely on static scheduling, which limits their application domain. Our approach, called CORE (Consensus for REsponsiveness), intends to support fault tolerance and real time in dynamic environments where fault and task arrival times, durations and deadlines can be modeled by specific distributions. CORE is a software layer, ideally a kernel. Our approach focuses on the identification of the fundamental functions that must be incorporated in a responsive system kernel and the development of application-specific techniques for maximizing responsiveness. This approach is based on two further observations: the omnipresence of consensus¹ and the inevitability of probability. We view synchronization, initialization, termination, communication, data access and storage protocols, diagnosis, selected error detection and error correction codes, scheduling, replicated data management, checkpointing, and reconfiguration as consensus problems. Therefore, the consensus concept is fundamental in any multicomputer environment. The inherent complexity of the system, combined with the random occurrence of faults, implies that probability cannot be avoided, and, even in so-called "hard" real-time systems, only probabilistic guarantees can be given. The chapter is divided into the following parts. In Section 2, we discuss a consensus-based framework for responsive computer systems. Next, a new proposal for a measure of responsiveness is presented. Section 4 introduces the model and some alternatives for responsive computing and outlines design options for implementing them in a distributed computing environment. Section 5 gives the specification of the universal consensus, while Section 6 presents the conclusions and summarizes our model.

¹ In this case, consensus is defined as an agreement among computers.
1.1.2 The Consensus-Based Framework
The consensus-based framework [10] for responsive systems aims to define the core concepts and the functions that lead to comprehensive design methods for responsive computer systems. We propose a Consensus for REsponsiveness (CORE) software layer that may be incorporated on top of or as a kernel (or microkernel) in systems such as Unix, Mach, and Chorus in order to improve their responsiveness. Any successful design requires quantitative and/or qualitative goals that can be verified through measurement. The most successful designs are based on particular models that are accurate abstractions of reality. Of course, the ultimate model is a copy of the given system itself; however, with the high complexity of today's systems, such a model is frequently unattainable. Therefore, models for these systems tend to focus on a specific aspect of system behavior or a specific layer of system design. We propose a definition of responsiveness which combines availability and timeliness (a specific proposal for this measure is outlined in the next section) and introduce a design framework in which characteristics such as synchronicity, message order or lack of it, and bounded or unbounded communication delay are well defined for a specific environment. This framework [10] is based on the consensus problem [2] and is, in our opinion, fundamental in the design of responsive multicomputer systems. In multicomputer systems, the consensus problem is omnipresent. It is necessary for handling synchronization and reliable communication, and it appears in resource allocation, task scheduling, fault diagnosis, checkpointing, and reconfiguration. Consensus tasks take many forms in multicomputer systems. Figure 1.1.1 shows the necessary functions for fault and time management in a multicomputer environment in which each function represents a separate consensus problem, with the exception of application-specific methods at the top layer, although some of these techniques, such as coding, might also be viewed as a consensus problem. At the base of the model is the synchronization level. For a system to be responsive (i.e., fault tolerant and real time), there must be an agreement about time for task execution, fault detection and checkpointing. The next layer represents the requirement for reliable communication. Responsive computers must agree on how and when information is exchanged and how many messages can be considered delivered or lost.
[Figure 1.1.1: A framework for responsive systems design. The layers, from bottom to top, are: Synchronization; Reliable Broadcast; Fault Diagnosis or Fault Masking; Checkpointing; Fault Recovery (Reconfiguration/Repair); Responsiveness Coordinator (Scheduler and Resource Allocator); and Application-Specific Responsiveness. The consensus layers together form the kernel, or responsiveness layer.]
Also, timeliness of communication is crucial in achieving a high level of responsiveness. The third layer, checkpointing, ensures that a consistent system state can be saved after a set of operations. The fourth layer, diagnosis, is fundamental to both real time and fault tolerance, for agreements must be reached on task scheduling and on who is faulty and who is not. Finally, the fifth layer illustrates the need for agreement on resource allocation and reconfiguration for efficient task execution and recovery from potential faults. In addition, we add a scheduler that manages space (resources) and time, and an availability manager. The scheduler and availability manager form the responsiveness coordinator, which is responsible for maximizing responsiveness by ensuring that tasks are executed on time (scheduling and coordinating reconfiguration), even in the presence of faults. Application-specific responsive design methods are placed on top, as shown in Figure 1.1.1. An example implementation of this framework is illustrated in Figure 1.1.2, in which the functions in CORE are incorporated in the kernel and support applications, armed with application-specific techniques for responsive system design. With the variety and complexity of the numerous applications in multicomputer systems today, we insist on this approach, as it is our belief that general techniques have some limitations and, when used alone, cannot assure a high level of responsiveness. We believe that, although the small generic kernel may be proved to be correct, the correctness of real-world applications, in most cases, cannot be proven. Hence, application-specific techniques are necessary. In responsive system design, all of the consensus problems, as well as computations, should be accomplished in a timely and reliable manner. To design a responsive system, we need responsive synchronization, responsive communication, responsive task scheduling, responsive fault diagnosis, responsive resource allocation, and responsive reconfiguration. This means each layer should incorporate timeliness and dependability measures, as well as the relevant fault model. This requirement opens a number of research issues, and only a few of them have been studied from the perspective of responsiveness, which is discussed in the next section.
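The requirement that every layer carry its own timeliness and dependability parameters can be pictured with a small interface sketch. The fragment below is purely illustrative and is not part of CORE; all type and function names are assumptions introduced for this example.

```c
/* Illustrative sketch only: each consensus layer of the framework is
 * invoked with an explicit deadline and an explicit fault model, so that
 * timeliness and dependability are parameters of every layer rather than
 * afterthoughts.  All names below are hypothetical.
 */
typedef enum {
    FM_FAIL_STOP,      /* processors halt silently on failure      */
    FM_OMISSION,       /* messages may be lost                     */
    FM_TIMING,         /* results may arrive late                  */
    FM_BYZANTINE       /* arbitrary behavior must be tolerated     */
} fault_model;

typedef struct {
    double      deadline;  /* absolute time by which the layer must finish */
    fault_model model;     /* fault model the layer must tolerate          */
} layer_req;

/* Each framework layer exposes the same shape of interface. */
int responsive_synchronize(layer_req req);
int responsive_broadcast(layer_req req, const void *msg, unsigned len);
int responsive_diagnose(layer_req req);
int responsive_checkpoint(layer_req req);
int responsive_reconfigure(layer_req req);
```

The only point of the sketch is that a deadline and a fault model accompany every layer operation, which is what makes each layer "responsive" in the sense used above.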
[Figure 1.1.2: An example implementation of a responsive systems design framework.]
1.1.3 The Responsiveness Measure
In principle, there are two approaches to dealing with responsiveness: one, in which the optimization process takes a pair of parameters such as timeliness and availability and attempts to optimize them as a pair using, for example, an integer programming formulation of this problem, and another one, in which timeliness and availability are integrated in the form of a responsiveness measure. An optimization of a timeliness-availability (reliability) pair hypothetically gives a designer an opportunity for a separation of concerns but, in reality, there is usually a relation, often complicated, between timeliness and availability, and the designer might want to resort to an integrated form of responsiveness measure. The responsiveness measure combines the traditional measures of fault tolerance and real time: namely, availability and the ability to complete a task set by some deadline, with the objective of co-optimizing these dependent system qualities. This is done by examining the system from a task and service perspective. Consider that a system consists of a set of tasks, each having a particular execution time and deadline that may or may not be met. Whether or not a task's deadline is met depends on the ability of the system to perform the task set, given that the system hardware and software have some probability of failing. By taking this approach, we acknowledge that responsiveness is the ability to correctly execute the system's required functions (i.e., its task set or a service) in a timely manner, even in the presence of faults. Responsiveness, then, is dependent on the user's needs with respect to timeliness, a realistic fault model, and his perception of correctness, which should ultimately be the system's goals, and is the probability that these needs will be fulfilled. Unless these needs are very simple, responsiveness will be less than one. The user must balance how often the system can fail with how much the system will cost. This makes building a responsive system an iterative process of the user specifying what task set must be performed with a particular responsiveness, and the system designer creating a system at a certain cost that will meet the user's requirements. Thus, the goal for the responsiveness measure is that it be sufficiently intuitive so the user may easily use it to specify his requirements, and sufficiently rigorous so the designer may analytically evaluate the system's responsiveness.
[Figure 1.1.3: Combining system attributes with service attributes.]

Notice that reliability and other specific measures of dependability have been traditionally perceived as system attributes, while timeliness has been viewed as a service attribute (see Figure 1.1.3). The challenge is to optimize the two measures in tandem or to combine them to maximize the probability of correct execution on time. In general, one may assume that the system responsiveness is a function of timeliness and some dependability measure such as availability, which are time dependent, and can be defined as r(t) = f[a(t), p(t)]. Deriving a precise responsiveness criterion for a multicomputer environment, as well as methods for evaluating and measuring responsiveness, is still an open issue. If a single-processor system is required to run a single task, its responsiveness r(t) is the probability a(t) that the processor performs the task without error (i.e., the instantaneous availability, or reliability, of the processor) multiplied by the probability p(t) that the processor can complete the task before its deadline (i.e., the task is schedulable at a given time t). In short, r(t) = a(t)p(t). If this system is required to perform n tasks, the probability that a particular task is completed is related to its governing scheduling policy and the other tasks in the system. In addition, if multiple tasks are executed on a single or multiple processor system, the responsiveness measure could be defined as
r(t) = (1/n) Σ_{i=1}^{n} a_i(t) p_i(t),

where a_i is the lifetime availability of task i and p_i is the probability that task i will be completed before its deadline [9]. Availability, a_i, is calculated using the standard evaluation techniques for the configuration of equipment performing task i and the duration of task i. Timeliness, p_i, is the probability that task i will complete before its deadline, given that it is part of some task set scheduled on the system. Note that responsiveness r(t) varies between zero and one, with higher values of r(t) indicating a greater likelihood that the task set will be completed within its deadlines. For a gracefully degradable system, such as a consensus-based system, a_i and p_i may be calculated to any desired precision by computing a weighted average of these probabilities over the various configurations. For example, a four-processor, single-fault-tolerant, consensus-based system would have a_i and p_i calculated for the two cases when there are four and three fault-free processors (two, one or zero fault-free processors amounts to a system failure and a_i = 0), then weighted and averaged depending on the expected time the system will spend in these various configurations. We view the proper definition of responsiveness as the key to any optimization that might be done in responsive systems design. Only experience will show whether r(t) = f[a(t), p(t)] is a measure that can accurately portray a system's ability to satisfy the user's needs, yet can still be useful to the system designer. It should be noted that responsiveness is a function of the current time t, a task set T (with its timing characteristics), the available time to deadline (d = t_d - t), and an execution strategy (replication in time or replication in space). One of the principal goals in responsive computing is to accurately define responsiveness and make it useful, easily modeled, and easily measurable. Complete and accurate evaluation of responsiveness may require sophisticated models that can cross the bounds of the design hierarchy without the usual state space explosion. We are currently studying system and service models that focus on responsiveness. Since it seems that no existing modeling paradigms are suitable, we extend Petri Nets [3], limit the state explosion by using partitioning and hierarchy, and incorporate both exponential and delay functions.
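To make the weighted-average calculation concrete, the following C sketch evaluates the responsiveness of one task for the four-processor, single-fault-tolerant example above. The configuration weights and the per-configuration availability and schedulability values are invented inputs, and all names are hypothetical; a real evaluation would obtain them from the reliability and scheduling models discussed in the text.

```c
#include <stdio.h>

/* One row per system configuration: the expected fraction of time spent in
 * that configuration, and the availability a_i and schedulability p_i of the
 * task while in it.  Configurations with two or fewer fault-free processors
 * count as system failure (a_i = 0).
 */
struct config {
    double weight;  /* expected fraction of time in this configuration      */
    double avail;   /* a_i(t): probability the task executes without error  */
    double timely;  /* p_i(t): probability the task meets its deadline      */
};

static double task_responsiveness(const struct config *c, int n)
{
    double r = 0.0;
    for (int i = 0; i < n; i++)
        r += c[i].weight * c[i].avail * c[i].timely;
    return r;   /* lies between 0 and 1 */
}

int main(void)
{
    /* Hypothetical numbers for the four- and three-processor cases. */
    struct config cfg[] = {
        { 0.90, 0.999, 0.98 },  /* four fault-free processors   */
        { 0.08, 0.995, 0.93 },  /* three fault-free processors  */
        { 0.02, 0.0,   0.0  },  /* two or fewer: system failure */
    };
    printf("r(t) = %.4f\n", task_responsiveness(cfg, 3));
    return 0;
}
```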
1.1.4 The Model
The model for responsive computing should facilitate the goals of timely and correct execution of programs in parallel and distributed environments. The proposed model, called CONCORDS (CONsensus/COmputation model for Responsive Distributed Systems), can be described as follows. In CONCORDS, computing is viewed as a collection of tasks (units of computation) whose execution time and deadline are specified. Tasks may arrive periodically or sporadically at random at one or a number of processing elements (PEs) with an ability to send messages among themselves. In order to perform responsive computing, both logical and physical synchronization are necessary. Reliable communication, fault diagnosis and checkpointing are a must. All of these processes can be executed as consensus. Therefore, the consensus plus, of course, computation and recovery, are fundamental in responsive systems design. A parallel or distributed computation is simply a group of these PEs working towards a common goal. We extend Valiant's Bulk-Synchronous Parallel model (BSP) [11] and model such a computation as a sequence of two alternating phases: namely, local execution and scheduling/consensus. The local execution might represent progress (a computation towards achieving expected results) or a recovery from a fault. A computation involves each PE performing its portion of a number of tasks (including recovery procedures) until it must send or receive information from the other PEs involved. Regardless of the nature of this information, an agreement is needed, i.e., consensus is executed. The UNIversal CONsensus protocol, UNICON [1], is a recognition of this, as any nonlocal requirements of a PE are met by a particular instance of it. Therefore, we may model a distributed computation as an alternating sequence of local executions and calls to the UNICON protocol. These phases (consensus/computation) alternate indefinitely or until the application is completed. As an illustration, consider Figure 1.1.4, which shows the execution profiles of four processors A, B, C and D. The column below a processor label represents when consensus is being performed (marked in white) and when a local computation is being performed (marked in gray).
[Figure 1.1.4: A computation/consensus view of a distributed program. Execution profiles of four processors over time; consensus phases (e.g., clock synchronization and diagnosis, checkpointing, computation termination) alternate with local computation, and horizontal bars connect the processors participating in each consensus.]
A horizontal bar is used to collect related executions of consensus. For example, the first consensus, resulting in clock synchronization and diagnosis, involves all four processors, as marked by the horizontal bar connecting all four columns. As Valiant argues that the BSP model eases the programmer's task in writing parallel computations, we feel this computation/consensus paradigm will ease the development of responsive operating systems and applications. A consensus-based approach towards the development of responsive operating systems was presented in [10]. The primary rationale is that we can never rely on a single resource; thus, it is necessary to progress via consensus. Therefore, the kernel of a responsive operating system consists of a number of consensus tasks, as illustrated in Figure 1.1.1. We are currently working on an implementation of universal consensus [1] within the CORE framework, expecting to obtain some definite advantages. First, we can encompass many differing fault models in the system model, and second, UNICON, once a fault model is defined, can be based on a single algorithm. The result is a simplicity and functionality that allows for more reliable coding, low installation overhead, and ease of software maintenance. Moreover, we anticipate that our model will ease the estimation of timeliness of the operating system tasks and applications. An open question is that of efficiency. While the UNICON protocols, by their very nature of combining numerous system functions in a single message or pass throughout the network, are bound to be efficient, the restrictive model of consensus and computation phases might not be, especially in a large system. To improve system efficiency, the following variations of the CONCORDS model may alleviate the problem:

1. Application-specific CONCORDS. In this model, processors involved in a specific application (computation) perform consensus/computation phases alternately, regardless of other processors in the system. The consensus, scheduling, and reconfiguration tasks might be more efficient, as they involve only part of a system.

2. Exception-driven CONCORDS. In this variation, consensus occurs only on demand, i.e., if one of the processes/processors raises an exception. Note that this model is mainly applicable to a fail-stop fault model where a faulty processor stops and notifies others about its failure.
While the suggested alternative models may improve efficiency, the potential gains may disappear due to the potentially high overhead of concurrency control for some applications. Another interesting problem, from a scheduling perspective, is to analyze and carry out a comparative analysis of dynamic versus static scheduling and then compare them with respect to a single global versus multiple task queue(s). Also, placement of software for support of responsive computing remains an open problem. Three possibilities exist: a kernel with an operating system on top of it, a microkernel cooperating with an operating system, and a responsiveness layer on top of an operating system. It remains an open question as to which approach is most cost effective but, obviously, all three of them trade off cost and effectiveness of maximizing responsiveness.
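The alternating structure that CONCORDS imposes on each processing element can be summarized in a short control-loop sketch. This is only an illustration of the phase alternation described above, written under a fail-stop assumption; all function names are hypothetical placeholders rather than part of CORE or UNICON.

```c
/* Illustrative CONCORDS-style main loop for one processing element.
 * All functions are hypothetical placeholders.
 */
extern int local_execute(void);   /* progress: run local portions of tasks   */
extern int local_recover(void);   /* recovery after a detected fault         */
extern int fault_detected(void);  /* nonzero if this PE must recover         */
extern int unicon_round(void);    /* one scheduling/consensus phase; returns
                                     nonzero when the application terminates */

void concords_loop(void)
{
    int done = 0;
    while (!done) {
        /* Local-execution phase: progress or recovery. */
        if (fault_detected())
            local_recover();
        else
            local_execute();

        /* Consensus phase: synchronization, diagnosis, checkpointing,
         * scheduling, and so on, realized as an instance of universal
         * consensus among the PEs involved. */
        done = unicon_round();
    }
}
```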
1.1.5 Specification of Universal Consensus
We assume that all of the processors in the system are (logically) completely connected by an underlying communication medium. Moreover, there exists a distribution of message delays such that a processor may estimate the time that a message was sent from the time it was received and from which processor it was sent [4]. We assume that there exist mechanisms for scheduling and preempting tasks on a processor. We also assume that the number of faults tolerated is bounded according to the fault model employed. The processor clocks are assumed to be synchronized by UNICON's synchronization task. For consensus to take place, at least five questions must be answered. First, who is to be involved in the consensus? The answer to this would be a list of processors. Second, what are the characteristics of those involved? That is, the fault models of the processors are needed. Third, what is to be agreed upon? The consensus task must know what information is to be collected. Fourth, what is the desired timeliness of the task? Fifth, what structure should the consensus take? In other words, do all the processors involved need all of the consensus
results, or will a subset of them be sufficient? From the defining requirements for consensus, we envision the declaration of UNICON as

unicon(con_id *id, PE_set members, con_top topology, con_type type, con_time priority)

where id is set to the identification number of this instance of consensus and members is the set of processors involved in the consensus. We assume that the set members refers to a specific set of physical processors, but it could refer to a logical set maintained by the operating system. The structure, topology, possesses two fields: namely, structure, which takes values from the set {single-level, partitioned, unspecified}, and restriction, which takes values from the set {global, application-specific}. The variable, type, is enumerated from a set including {synchronization, configuration, diagnosis, communication scheduling, checkpointing, termination}. This set can obviously be extended. We may also consider the case in which type is a union of the various members of this set, such as when multiple purposes may be served by a single instance of UNICON. For this chapter we assume that each consensus has a distinct goal but, in practice, various information items could be combined in the consensus messages in order to perform multiple functions at once. The parameter, priority, determines the timeliness requirements of the responsive consensus. From this simple system call, it is possible to invoke any number of the consensus protocols. By examining the fault models of the processors in members, the system can choose a suitable consensus algorithm; by examining the number of processors in members, the system can decide whether a partitioning method is useful or, alternatively, the structure may be specified with topology. The consensus algorithm is abstracted across synchronization, configuration, diagnosis, communication scheduling, checkpointing, termination and consensus information requirements with the type variable. The priority parameter defines the timeliness requirements of a particular instance of UNICON. It is a structure with members time, which is an absolute, real-time value as set by the synchronization process;
periodic, which is a boolean stating whether or not the consensus should be scheduled periodically; duration, which, in the case that the consensus should be scheduled periodically, is the length of that period; and sched, which is of the set {urgent, deadline, asap, lazy}. Note that specifying that the consensus should be completed in 10 seconds implies that time should be set to the current time plus 10 seconds. Also, if a consensus is periodic, then after each deadline the time variable should be increased by duration and the task rescheduled. The meanings of these elements are as follows:

urgent: The consensus is to be completed by time even if this implies that all other scheduled tasks must be preempted. If time has passed, then the consensus task should be the sole purpose of the processor until its completion.

deadline: The consensus should be scheduled during the spare capacity of the system to complete by time. If time passes before completion, then the incomplete results should be provided.

asap: The consensus should be begun and completed as soon after time as the processor's schedule allows. By analogy to an interrupt hierarchy, it could be called a polite consensus.

lazy: The consensus should be scheduled during the processor's spare capacity, but if it is not completed by time then it should be abandoned with the incomplete results reported.

As we assume that all processor clocks are synchronized to within some bound called synch_error, time has meaning, although somewhat inconsistent, throughout the consensus. When we initiate UNICON, therefore, it is necessary to let other processes know that a result is needed by time - synch_error in order that the result will be available at the initiating process by time.
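Collected into one place, the interface just described might look like the following C sketch. Only the unicon() signature and the field names quoted in the text are taken from the chapter; the concrete types, enumerator spellings, and layout are assumptions made for illustration, not the actual CORE definitions.

```c
/* Sketch of the UNICON interface described above.  Only the unicon()
 * signature is taken from the text; the remaining definitions are
 * illustrative assumptions.
 */
typedef int           con_id;
typedef unsigned long PE_set;            /* e.g., a bit mask of processors */

typedef enum { SINGLE_LEVEL, PARTITIONED, UNSPECIFIED } con_structure;
typedef enum { GLOBAL, APPLICATION_SPECIFIC }           con_restriction;

typedef struct {
    con_structure   structure;           /* single-level, partitioned, ... */
    con_restriction restriction;         /* global or application-specific */
} con_top;

typedef enum {
    SYNCHRONIZATION, CONFIGURATION, DIAGNOSIS,
    COMMUNICATION_SCHEDULING, CHECKPOINTING, TERMINATION
} con_type;

typedef enum { URGENT, DEADLINE, ASAP, LAZY } con_sched;

typedef struct {
    double    time;       /* absolute real-time deadline                  */
    int       periodic;   /* boolean: reschedule periodically?            */
    double    duration;   /* period length if periodic                    */
    con_sched sched;      /* urgent, deadline, asap, or lazy              */
} con_time;

int unicon(con_id *id, PE_set members, con_top topology,
           con_type type, con_time priority);
```

Under these assumed definitions, a periodic clock-synchronization consensus among four processors, to be completed within ten seconds of spare capacity and repeated every minute, might be requested roughly as unicon(&id, 0x0F, (con_top){SINGLE_LEVEL, GLOBAL}, SYNCHRONIZATION, (con_time){now + 10.0, 1, 60.0, DEADLINE}), where now and the bit-mask encoding of members are likewise assumptions of this sketch.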
A prototype for CORE runs as a set of library functions on top of AIX (the IBM version of the UNIX operating system) on a network of RS/6000 workstations. It provides the application developer with support for incorporating fault tolerance and real-time requirements. The earliest-deadline-first scheduler had to be adapted to coexist with the AIX system. UNICON's protocol is implemented as a set of functions that are called by the system and the application to perform communication and consensus. We anticipate that the prototype will enable us to perform experiments that will give us insights into an ultimate implementation of CORE as a UNIX microkernel.
1.1.6 Conclusion
There is an urgent need to provide the user community with responsive processing in parallel and distributed computing environments. The thesis of this chapter is that the proposed framework and model will facilitate the development of systems which would improve the feasibility of achieving two crucial qualities, reliability and timeliness. The concepts of consensus and computation progress are fundamental. The universal consensus algorithm for synchronization, reliable communication, diagnosis, scheduling, checkpointing, and reconfiguration, combined with a consensus/computation model, illustrates a promising approach to responsive computing. In addition, a responsive systems layer such as CORE should be integrated with application-specific methods such as those presented in [7, 8], because high responsiveness can only be achieved by combining an ultrareliable layer (ideally a kernel) with application-specific techniques. As computer systems proliferate and our dependence on them increases, responsiveness will become the most sought-after quality in computer and communication systems.
Acknowledgment

I would like to acknowledge my student Mike Barborak for contributing to the specification of universal consensus.
References

[1] M. Barborak and M. Malek. UNICON — a UNIversal CONsensus for responsive computer systems. Technical Report TR-92-36, Department of Computer Sciences, The University of Texas at Austin, October 1992.

[2] M. Barborak, M. Malek, and A.T. Dahbura. Consensus problem in fault-tolerant computing. ACM Computing Surveys, 25(2):171-220, June 1993.

[3] G. Brat and M. Malek. Incorporating delays and exponential distributions in Petri Nets for responsive systems. Technical report, The University of Texas at Austin, March 1993.

[4] F. Cristian. Probabilistic clock synchronization. Distributed Computing, 3:146-158, 1989.

[5] R. Kieckhafer et al. The MAFT architecture for distributed fault tolerance. IEEE Transactions on Computers, 37(4):398-405, April 1988.

[6] H. Kopetz et al. Distributed fault-tolerant real-time systems: The MARS approach. IEEE Micro, 9(1):25-40, February 1989.

[7] L. Laranjeira, M. Malek, and R. Jenevein. On tolerating faults in naturally redundant algorithms. In the 10th Symposium on Reliable Distributed Systems, pages 118-127, September 1991. Pisa, Italy.

[8] L. Laranjeira, M. Malek, and R. Jenevein. Nest: A nested-predicate scheme for fault tolerance. IEEE Transactions on Computers, 42(11):1303-1324, November 1993.

[9] M. Malek. Responsive systems: A challenge for the nineties. In Euromicro 90, Sixteenth Symposium on Microprocessing and Microprogramming, pages 9-16, August 1990. Amsterdam, The Netherlands, Keynote Address.

[10] M. Malek. A consensus-based framework for responsive computer system design. In The NATO Advanced Study Institute on Real-Time Systems, October 5-18, 1992. Springer-Verlag, St. Martin, West Indies.

[11] L. Valiant. A bridging model for parallel computation. Communications of the ACM, 33(8):103-111, August 1990.
SECTION 1.2
A Methodology for Adapting to Patterns of Faults*

Gul Agha and Daniel C. Sturman†
Abstract

In this paper, we present a language framework for describing dependable systems. Our framework emphasizes modularity and composition. The dependability and functionality aspects of an application may be described independently, providing separation of design concerns. Furthermore, the dependability protocols of an application may be constructed bottom-up as simple protocols that are composed into more complex protocols. Composition makes it easier to reason about the behavior of complex protocols and supports the construction of generic reusable dependability schemes. A significant aspect of our language framework is that dependability protocols may be loaded into a running application and installed dynamically. Dynamic installation makes it possible to impose additional dependability protocols on a server as clients with new dependability demands are integrated into a system. Similarly, if a given dependability protocol is only necessary during some particular phase of execution, it may be installed during that period only.
1.2.1 Introduction
A number of systems have been developed to support the development of dependable computing applications. Such support is given in terms of failure semantics, which specify legal ways in which a component can fail [11].

* The research described has been made possible by support from the Office of Naval Research (ONR contract numbers N00014-90-J-1899 and N00014-93-1-0273), by an Incentives for Excellence Award from the Digital Equipment Corporation Faculty Program, and by joint support from the Department of Defense Advanced Research Projects Agency and the National Science Foundation (NSF CCR 90-07195).
† Authors' address: Department of Computer Science, University of Illinois at Urbana-Champaign, 1304 W. Springfield Avenue, Urbana, Illinois 61801, USA. Email: {agha | sturman}@cs.uiuc.edu
Failure semantics are enforced through the use of dependability protocols which guarantee that the probability of a failure of a type not specified in the semantics is acceptably small. However, existing systems assume that the failure semantics of a service are static and, therefore, the dependability protocols used may be fixed. In many computer systems, it is either unsatisfactory to adhere to a static failure semantics or impossible to adequately enforce the semantics with a fixed group of dependability protocols. We illustrate this concept with two example systems:

• Consider an embedded system which is required to function over a long duration, yet is fault-prone due to the uncertain environment in which it operates. If this system is physically isolated, such as in the control system of a satellite, physical modification of system components is often infeasible. In such a system, a change in the physical environment may result in protocols designed for the old environment failing to uphold the failure semantics in the new environment. A different group of dependability protocols may then be required to enforce the desired failure semantics of the system.

• Consider an open system. Open systems allow interactions with the external environment; in particular, new services may be added or deleted dynamically from the system in response to external events. Consequently, it may not be possible to statically determine the system configuration. Without knowing the system configuration, it may not be possible to determine what failure semantics a process must have, or what protocols are necessary to enforce these semantics, until after the process actually joins the system. Furthermore, the addition of new services may require a change in the failure semantics of existing components. For example, a file server may initially address safety only by check-pointing the files to stable storage. New clients that are added to the system, however, may require the server to also provide persistence, and a protocol to support replication may need to be added.

In this paper, we describe a methodology for the modular specification of systems that adapt to patterns of faults. We call the resulting systems adaptively dependable. We present a methodology which allows the transparent installation and reuse of dependability protocols as well as their dynamic installation in a system. Our methodology, when combined with a suitably structured exception handling mechanism and fault detection, allows for the development of
fault handlers which can maintain consistent failure semantics within a changing environment and can alter failure semantics as system needs change. We have provided programmer support for our methodology in the language Screed, which is implemented on our run-time system Broadway. We employ reflection as the enabling technology for dynamic installation of dependability protocols. Reflection means that an application can reason about and manipulate a representation of its own behavior. This representation is called the application's meta-level. The components of an object that may be customized at the meta-level are referred to as the meta-architecture. In our case, the meta-level contains a description which implements the failure semantics of an executing application; reflection thus allows dynamic changes in the execution of an application with respect to dependability. Besides supporting dynamic installation, our meta-architecture supports transparency and reuse of dependability protocols. For example, the meta-architecture allows protocols to be expressed as abstract operations on messages. Since the individual fields of a particular message are never examined, the same protocol may be used with different applications. Given this technique for dynamic modification of the dependability protocols used in a system, we describe how fault detection and exception handling may be used in conjunction with our meta-architecture to support adaptive dependability. We model both failures and exceptions as objects. Each type of fault which may be detected is described as a specific system exception. We construct managers (exception handlers with meta-level capabilities) to address system exceptions. Managers serve three purposes:

• A manager may correct for recoverable faults. The corrections allow the system to continue to function despite a fault. This role is generally referred to as performing forward error recovery.

• Managers provide failure prevention. When a manager discovers a pattern of component failures, it dynamically installs protocols which mask future failures or facilitate future fault correction by expanding the set of recoverable faults. In this manner, we have taken forward error recovery one step further; rather than simply adjusting the system state, the actual dependability characteristics of the system may be modified.

• Managers support reconfiguration of the dependability protocols in a system. This may be done either to alter the system's failure semantics or to correctly enforce these semantics once the environment changes.

Thus,
we can develop dependable long-duration systems whose fault patterns are not known at start-up time.

A prototype implementation of a run-time system which tests these ideas is described: the system Broadway supports our meta-architecture as well as failure detection and a set of system exceptions. On top of Broadway, we have implemented the language Screed. Screed is a prototype concurrent actor language which provides complementary constructs for both fault detection through exception handling and dynamic installation of protocols through a meta-architecture. Screed is presented as a demonstration of how such constructs may be added to existing languages.

This paper is organized as follows. Section 1.2.2 discusses related research in the areas of reflection, exception handling, and languages for fault tolerance. Section 1.2.3 provides a brief description of the concepts of reflection, the Actor model, and object-orientation. Section 1.2.4 provides a guide to the syntax of Screed to assist in understanding our examples. Section 1.2.5 discusses our meta-level architecture and how it may be used to construct dependability protocols. We also discuss the effect of our meta-level architecture on protocol performance. Section 1.2.6 describes exception handling in Screed and how exception handling may be used in conjunction with our meta-level architecture to implement adaptively dependable systems. We then illustrate this technique with an example of a system adapting to a change in environment. In Section 1.2.7, we discuss how these mechanisms combine to support adaptive dependability.
1.2.2
Related Work
A number of languages and systems offer support for constructing fault-tolerant systems. In Argus [23], Avalon [15] and Arjuna [31], the concept of nested transactions is used to structure distributed systems. Consistency and resilience are ensured by atomic actions whose effects are check-pointed at commit time. The focus in [27], [9] and [7] is to provide a set of protocols that represent common communication patterns found in fault-tolerant systems. None of the above systems supports the factorization of fault tolerance characteristics from the application-specific code. In [38] and [28], replication can be described separately from the service being replicated. Our approach is more flexible: fault tolerance schemes may not only be described separately, they may be attached and detached dynamically. Another unique aspect of our approach is that different fault tolerance schemes may be composed in a modular fashion. For example, check-pointing may be composed with replication without requiring that the representation of either protocol know about the other.
Non-reflective systems which support customization do so only on a system-wide basis. For example, customization in a micro-kernel based system [1] affects all the objects collectively. In an object-oriented system such as Choices [8], frameworks may be customized for a particular application. However, once customized, the characteristics may not change dynamically. Reflection in an object-based system allows customization of the underlying system independently for each object. Because different protocols are generally required for very specific subsets of the objects in a system, this flexibility is required for implementing dependability protocols.

Reflection has been used to address a number of issues in concurrent systems. For example, the scheduling problem of the Time Warp algorithm for parallel discrete event simulation is modeled by means of reflection in [40]. A reflective implementation of object migration is reported in [37]. Reflection has been used in the Muse Operating System [39] for dynamically modifying the system behavior. Reflective frameworks for the Actor languages MERING IV and Rosette have been proposed in [16] and [35], respectively. In MERING IV, programs may access meta-instances to modify an object or meta-classes to change a class definition. In Rosette, the meta-level is described in terms of three components: a container, which represents the acquaintances and script; a processor, which acts as the scheduler for the actor; and a mailbox, which handles message reception.

The concept of unifying exception handling and fault detection was originally proposed in [30] and then refined in [29]. In these papers, detected failures are considered asynchronous events much as exceptional conditions are treated in distributed programming languages. Therefore, exception handling constructs provide a natural way to incorporate failure-response code into an application.

Goodenough introduced the idea of exceptions and exception handling in [19]. Since then, many different exception handling mechanisms have been proposed. Exception handling constructs have been developed for object-based languages such as Clu [24] and Ada [12]. Dony [13] describes an approach for object-oriented languages and its implementation in Smalltalk. In this approach, exceptions are implemented as objects much as we do. Exception handling for C++ is discussed in [33]. A good overview of techniques proposed for other object-oriented languages can be found in [14]. A critical difference between object-oriented approaches to exception handling and non-object-oriented approaches such as CLU [24] or Ada [12] is that, in the latter, the exception object is represented by a set of parameters to a function. Therefore, on generating the signal, a parameter list must provide all possible information used by the handler.
For concurrent systems, another technique has been proposed for languages which use RPC communication [10]; the technique is based on synchronized components, which allows the exception handling constructs to be closer to those of a sequential system than of an asynchronous system. Exception handling mechanisms have been proposed for other Actor languages. An exception handling mechanism was proposed for ABCL/1 and for Acore [21, 26]; the mechanism uses complaint addresses to support exception handling. A complaint address is a specific location, specified with each message, to which all signals are dispatched.
1.2.3
Background
Before discussing our meta-architecture and how we use it to support adaptive dependability, we first discuss in greater detail some concepts that are important in our framework. The organization of this section is as follows. First, we briefly discuss some of the advantages of object-oriented programming and how they are useful with our methodology. Secondly, we describe the Actor model of concurrent computation; we chose the Actor model as the basis of our work due to the ease with which it may be extended. Finally, we give a more in-depth discussion of reflection and how it relates to a programming language.
Object Orientation

In an object-oriented language, a program is organized as a collection of objects. Each object is an encapsulated entity, representing an instance of an abstract data type. The local data comprising each object may only be accessed through an interface specified as a set of methods. The operations carried out by a method are not visible outside the object. Objects communicate with messages which invoke a method in the receiving object. The local data of another object cannot otherwise be accessed or modified. Objects are instantiated from classes. A class is a user-defined abstraction. Classes may be thought of as types and objects as elements of that type. Instantiation is the creation of an object of a particular class. Classes contain the description (code) of the methods and of the instance variables for objects instantiated from that class. Classes may inherit from other classes. Inheritance provides the inheriting class with the properties (the methods and instance variables) of the ancestor class. The inheriting class can then utilize these properties as well as augment them with new instance variables or methods. Methods may be inherited directly or redefined, facilitating code reuse.
Object-oriented languages allow for modular development of systems. The implementation of each component is hidden from other components: only the interface is known. In this way, a component's implementation may change without affecting other components. Code may also be reused efficiently since components may share code by inheriting from a common ancestor class. Note that our use of classes and inheritance differs from that in sequential object-oriented languages in that we do not support class variables.
The Actor Model

We illustrate our approach using the Actor model [2, 3]. Actors can be thought of as an abstract representation for multicomputer architectures. An actor is an encapsulated object that communicates with other actors through asynchronous point-to-point message passing. Specifically, an actor language supports three primitive operators:

send: Actors communicate through asynchronous, point-to-point message passing. The send operator is used to communicate a message asynchronously to another actor. Each message invokes a method (or procedure) at the destination. Upon reception at the destination, the message is buffered in a mail queue. Each actor has a unique mail address which is used to specify a target for communication. Mail addresses may also be communicated in a message, allowing for a dynamic communication topology.

new: Actors may dynamically create other actors. The new operator takes an actor behavior (class name) as a parameter, creates a new actor with the correct behavior, and returns its mail address. The mail address is initially known only by the creating actor. However, the creator may subsequently include this new mail address in future messages.

become: The become operator marks the end of state modifications in the execution of a method. Once a become has executed in a method, the actor may continue to modify state local to the method. However, such state changes do not affect the way in which the actor may process the next message. Therefore, once this operator is executed, the actor may begin processing its next pending message. Judicious use of the become operator may improve performance by allowing internal concurrency, i.e., multiple threads of execution within a single actor.

It is important to note that the idea of using reflection to describe dependability is not tied to any specific programming language. Our methodology
assumes only that these three operators are in some way incorporated into the language; we require that new actors may be created dynamically and that the communication topology of a system is reconfigurable. In fact, the actor operators may be used to extend almost any standard sequential language to provide coordination and communication in a distributed environment; local computations may still be expressed in terms of the sequential language. The level at which the sequential and actor constructs are integrated determines the amount of concurrency available in the system. An actor language may be used to "wrap" existing sequential programs, serving as an interconnection language. With this approach, each method in an actor class invokes a subroutine, or set of routines, written in a sequential language and dispatches messages based on the values returned. Such an approach was taken by the Carnot project at MCC [34]. In Carnot, the actor language Rosette "glues" sequential components together to facilitate heterogeneous distributed computing. A complementary approach is to actually integrate the actor operators into an existing language. Broadway, the run-time platform we use to implement the ideas in this paper, supports C++ calls for both send and new; the become operator is implicit at the end of each method. Using Broadway, developers of distributed programs may use a well-known language, C++, to develop distributed programs. Actor operators have also been combined with functional languages. Specifically, actor operators have been added to the call-by-value λ-calculus [5]. In this case, the local computation is modeled as a sequential functional computation. An operational semantics is developed for the resulting language. The semantics supports operational reasoning. In [36], the semantics is extended to support formal reasoning about meta-architectures such as the one we describe here. If necessary, the actor operators may also be extended to support more complex functionality. In particular, the communication model may be modified to support more complex message passing constructs. The asynchronous point-to-point communication model for actors has been extended to include pattern-based multicasts using ActorSpaces [6]. Furthermore, remote procedure calls may be transformed into a set of asynchronous messages using a concurrent analog of the continuation passing style [22]. Synchronization constraints [17] and multi-object constraints [18] are two other extensions of the actor operators which greatly simplify distributed programming. Constraints allow the programmer to specify "when" asynchronous
events may occur based on the state of a single object or the occurrence of other events in the system. Using these techniques, the non-determinism of asynchronous communication may be constrained to maintain a consistent system state without requiring an overly restrictive communication model.
Reflection
Figure 1.2.1: Through reflection, an application may modify the system by modifying its meta-objects. Meta-objects are a system-level description of the base-level application objects.
Reflection means that a system can manipulate a causally connected description of itself [32, 25]. Causal connection implies that changes to the description have an immediate effect on the described object. In a reflective system, a change in these descriptions, or meta-objects, results in a change in how objects are implemented. The object for which a meta-object represents certain aspects of the implementation is called the base object. This relationship is shown in Figure 1.2.1. Meta-objects may be thought of as objects which logically belong in the underlying run-time system. For example, a meta-object might control the message lookup scheme that maps incoming messages to operations in the base object. Another meta-object may modify how values are read from memory. Using reflection, such implementation-level objects can be accessed and examined,
and user-defined meta-objects may be installed, yielding a potentially customizable run-time system within a single language framework. The reflective capabilities which are provided by a language are referred to as the meta-level architecture of the language. The meta-level architecture may provide variable levels of sophistication, depending on the desired level of customization. The most general meta-level architecture is comprised of complete interpreters, thus allowing customization of all aspects of the implementation of objects. In practice, this generality is not always needed and, furthermore, defining a more restrictive meta-level architecture may allow reflection to be realized in a compiled language. The choice of a meta-level architecture is part of the language design. Customizability of a language implementation must be anticipated when designing the run-time structure. Although a restrictive meta-level architecture limits flexibility, it provides greater safety and structure. If all aspects of the implementation were mutable, an entirely new semantics for the language could be defined at run-time; in this case, reasoning about the behavior of a program would be difficult. We limit our meta-level to contain only the aspects that are relevant to dependability. Application-specific functionality is described in the form of base objects and dependability protocols are described in terms of meta-objects. Thus, dependability is modeled as a special way of implementing the application in question. Our methodology gives modularity since functionality and dependability are described in separate objects. Since meta-objects can be defined and installed dynamically, the objects in a system can dynamically change the protocols enforcing their failure semantics as system needs change. Furthermore, new dependability protocols may be defined while a system is running and put into effect without stopping and recompiling the system. For example, if a communication line within a system shows potential for unacceptable error rates, more dependable communication protocols may be installed without stopping and recompiling the entire system. Since meta-objects are themselves objects, they can also have meta-objects associated with them, giving customizable implementation of meta-objects. In this way, meta-objects realizing a given dependability protocol may again be subject to another dependability protocol. This scenario implies a hierarchy of meta-objects where each meta-object contributes a part of the dependability characteristics for the application in question. Each meta-object may be defined separately and composed with other meta-objects in a layered structure supporting reuse and incremental construction of dependability protocols. Because installation of a malfunctioning meta-level may compromise the dependability of a system, precautions must be taken to protect against erroneous
or malicious meta-objects. To provide the needed protection of the meta-level, we introduce the concept of privileged objects called managers. Only managers may install meta-objects. Using operating system terminology, a manager should be thought of as a privileged process which can dynamically load new modules (meta-objects) into the kernel (meta-level). It should be observed that, because of the close resemblance to the operating system world, many of the operating system protection strategies can be reused in our design. We will not discuss particular mechanisms for enforcing the protection provided by the managers in greater detail here. Because only managers may install meta-objects, special requirements can be enforced by the managers on the structure of objects which may be installed as meta-objects. For example, managers may only allow installation of meta-objects instantiated from special verified and trusted libraries. Greater or fewer restrictions may be imposed on the meta-level, depending on the dependability and security requirements that a given application must meet.
1.2.4
Screed
Screed is an object-oriented actor language that compiles applications for Broadway. Screed will be used to illustrate examples in this paper. Screed is an object-oriented language: programs are written in terms of class definitions. Each class defines a single actor behavior and consists of a set of variable declarations and a set of method definitions. Screed supports inheritance. A class for which a parent class is not specified will, by default, inherit from the system-defined Object class. At any point, a parent method may be referenced by sending a message to the "object" parent. Inheritance is specified when the class is defined:

class MyMailQueue : MailQueue {
    ... instance variables ...

    get() { ... method body ... }
    put() { ... method body ... }
}

In this example, the class MyMailQueue with the methods get and put is defined. It inherits from the class MailQueue.
Classes may be instantiated using the new command, which returns a new actor address. For example:

foo = new MyMailQueue;

This statement creates a new actor with the behavior MyMailQueue and returns the address of this actor, which is assigned to foo. There are five primitive types in Screed. The types int, real, and string are self-explanatory. The type actor holds the address of any actor, regardless of class. The type method can have the value of any legal method name. In addition, one-dimensional arrays of any of these types may be specified. Arrays are defined and used as in C++:

actor myReplicas[5];
myReplicas[2] = new ...;

Actors communicate through asynchronous message passing. In the current implementation of Broadway, message ordering (from a given source to the same destination) is guaranteed, although actor semantics enforce no such requirement. Messages are sent by specifying a method and an actor address with the appropriate parameters:

foo.get();

In this case, the method get is invoked on the actor foo without any parameters. Since methods are first-class values, it would be possible to specify a variable instead of the name of a particular method. Parameters are dynamically type-checked upon reception at the message destination. Note that since we are using asynchronous message passing, this method invocation does not block. Although asynchronous message passing provides improved performance and concurrency, a drawback is the difficulty in providing return values: since the method does not block upon sending a message, it is necessary to specify a return address and method in the message itself. Therefore, method invocations may return a value, thereby acting as a remote procedure call (rpc). For example:

A
x = foo.get();
B

With rpc-communication, the current method invocation will block. The instructions in A will execute, followed by a message send to foo. B will not execute until a return value arrives and the value is assigned to x.
In an asynchronous system, the programmer may want to prevent certain methods from executing based on an actor's state. Therefore, we support synchronization constraints [17] in Screed. Using synchronization constraints, the programmer is able to specify exactly which methods may not be invoked. Maximal concurrency is then preserved since only the minimal synchronization, as specified by the programmer rather than the language, will be enforced. The other constructs which comprise expressions in Screed (if, while, etc.) are similar to those in C; we do not describe them further.
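To make this syntax concrete, the fragment below combines class definition, actor creation, an asynchronous send, and an rpc-style send. It is only an illustrative sketch: the Counter class, its methods, and the use of a return statement to supply the rpc result are our own assumptions rather than code taken from the Screed implementation.

class Counter : Object {
    int count;

    increment() {
        count = count + 1;
    }

    value() {
        /* assumed: return supplies the value for an rpc-style caller */
        return count;
    }
}

/* In some client method: */
actor c;
int v;

c = new Counter;    /* create a new actor and keep its mail address   */
c.increment();      /* asynchronous send; does not block              */
v = c.value();      /* rpc-style send; blocks until the reply arrives */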
1.2.5
Meta-level Architecture for Ultra-dependability
In this section we introduce MAUD (Meta-level Architecture for Ultra Dependability) [4]. MAUD supports the development of reusable dependability protocols. These protocols may then be installed during the execution of an application. MAUD has been implemented on Broadway, our run-time environment for actors. We begin with a discussion of MAUD's structure. We then discuss how transparency and reusability of protocols are supported by MAUD and provide an example to illustrate the concepts. We finish this section by demonstrating how MAUD also allows the composition of protocols and give an example of composition.
A Meta-Level Architecture

As previously mentioned, MAUD is designed to support the structures that are necessary to implement dependability. In MAUD, there are three meta-objects for each actor: the dispatcher, the mail queue, and the acquaintances. In the next three paragraphs we describe the structure of meta-objects in MAUD. Note that MAUD is a particular system developed for use with actors. It may be possible, however, to develop similar systems for other models.

The dispatcher and mail queue meta-objects customize the communication primitives of objects so that their interaction can be modified for a variety of dependability characteristics. The dispatcher meta-object is a representation of the implementation of the message-send action. Whenever the base object issues a message send, the run-time system calls the transmit method on the installed dispatcher. The dispatcher performs whatever actions are needed to send the given message. Installing dispatchers to modify the send behavior makes it possible to implement customized message delivery patterns.
A mail queue meta-object represents the mail queue holding the incoming messages sent to an actor. A mail queue is an object with get and put operations. After installation of a mail queue meta-object, its get operation is called by the run-time system whenever the base object is ready to process a message. The put operation on a mail queue is called by the run-time system whenever a message for the base object arrives. By installing a mail queue at the meta-level, it is possible to customize the way messages flow into the base object.

The acquaintances meta-object is a list representing the acquaintances of a base object. In an actor system, all entities are actors. Although they may be implemented as local state, even primitive data objects, such as integers or strings, are considered acquaintances in an actor system. Therefore, in an actor language the acquaintances and the mail queue comprise the complete state of an actor. The acquaintances meta-object allows for check-pointing of actors.

Meta-objects are examined and installed by means of meta-operations. Meta-operations are defined in the class called Object, which is the root of the inheritance hierarchy. All classes in the system inherit from Object, implying that meta-operations can be called on each actor in the system. The meta-operations change_mailQueue and change_dispatcher install mail queues and dispatchers for the object on which they are called. Similarly, the meta-operations get_mailQueue, get_dispatcher and get_acquaintances return the meta-objects of a given actor. If no meta-objects have been previously installed, an object representing the built-in, default implementation is returned. Such default meta-objects are created in a lazy fashion when a meta-operation is actually called.
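As an illustration of how these meta-operations might be invoked, the following sketch shows a manager replacing the dispatcher of a target actor. Only the meta-operation names come from the description above; the Manager and LoggingDispatcher classes, and the idea of logging outgoing messages, are hypothetical.

class Manager : Object {
    install_logging(actor target) {
        actor oldDisp;
        actor newDisp;

        /* returns the default dispatcher object if none was installed */
        oldDisp = target.get_dispatcher();

        newDisp = new LoggingDispatcher;    /* hypothetical protocol dispatcher */
        target.change_dispatcher(newDisp);
    }
}

Recall from the previous section that only managers are permitted to install meta-objects in this way.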
Transparency and Reuse

By describing our dependability protocols in terms of meta-level dispatchers and mail queues, we are able to construct protocols in terms of operations on messages where we treat each message as an integral entity. There are several advantages to developing dependability protocols in this manner. The first advantage is the natural way in which protocols may now be expressed. When dependability protocols are described in the literature, they are described in terms of abstract operations on messages, i.e. the contents of the messages are not used to determine the nature of the manipulation to be performed. Therefore, it is logical to code protocols in a manner more closely resembling their natural-language description. Secondly, because the protocols are expressed in terms of abstract messages
and because every object may have a meta-level mail queue and dispatcher, a library of protocols may be developed which may be used with any object in the system. Such a library would consist of protocols expressed in terms of a mail queue and dispatcher pair. The meta-objects may then be installed on any object in the system. Since the protocols deal only with entire messages, the actual data of such messages is irrelevant to the operation of the protocol. Only fields common to every message, such as source, destination, time sent, etc., need be inspected. The libraries could also be used with other systems, allowing the reuse of dependability protocols. One set of developers could be responsible for the dependability of multiple software systems and develop a protocol library for use with all of them. Since protocols implemented with MAUD are transparent to the application, other development teams, who are responsible for development of the application programs, need not be concerned with dependability. In the final system, protocols from the library may be installed on objects in the application, providing dependability in the composed system.

Example 1: A Replicated Server

In this section, we provide an example of how a protocol may be described using MAUD. In a distributed system, an important service may be replicated to maintain availability despite processor faults. We show how MAUD can be used in an actor domain to develop a modular and application-independent implementation of a protocol which uses replication to protect against crash failures. The protocol we describe is quite simple: each message sent to the server is forwarded to a backup copy of the server. In this manner, there is an alternate copy of the server in case of a crash. Reply messages from both the original and backup servers are then tagged, and the client eliminates duplicate messages. Figure 1.2.2 shows the resulting actions occurring when a message is sent to the replicated service. The original server is actor S1. When a message is received by the Forwarder, the message is forwarded to the backup S2. S2 is initialized with the same behavior and state as S1. Since they will receive the same messages in the same order, their state will remain consistent. Therefore, any replies will be identical and in the same order. The replies are tagged by the dispatchers of class Tagger and only the first copy of each message is passed on to the client by the Eliminator. Forwarding messages to the backup server is implemented using a meta-level mail queue. The Screed code for this mail queue is presented in Figure 1.2.3. Using a dispatcher, each reply message of the server is tagged to allow the
Figure 1.2.2: When a message is sent by the clients A or B to the replicated service S1, the message is received by the Forwarder and a copy is forwarded to the backup S2. When the servers reply, the Tagger dispatchers tag each message so that the Eliminator mail queues may remove duplicate results. If S1 crashes, manager actors will install S2 as the new server.
elimination of duplicate replies by the client. A mail queue at the client performs this duplicate elimination. The code for this mail queue is shown in Figure 1.2.4. We assume that managers themselves install the appropriate meta-objects realizing a given dependability protocol. Therefore, we specify the relevant dependability protocols by describing the behavior of the initiating manager as well as the installed mail queues and dispatchers. A manager in charge of replicating a service takes the following actions to achieve the state shown in Figure 1.2.2:

1. The specified server is replicated by a manager by creating an actor with the same behavior and state.

2. A mail queue is installed for the original server to make it act as the Forwarder described above.

3. The mail queues of the original clients are modified to act as the Eliminator described above.
class Forwarder : MailQueue {
    actor backup;
    actor server;

    put(msg m) {
        m.base_send();
        m.set-dest(backup);
        m.send();
} } Figure 1.2.3: Code for the server-end mail queue which implements replication, The mail queue Forwarder sends a copy of each message to a backup copy of the server.
4. The dispatchers of the servers are changed to tag all messages so that the Eliminator may remove copies of the same message. 5. Upon detection of a crash of 5i, the manager takes appropriate action to ensure all further requests to the server are directed to 52- The manager may also create another backup at this time. Although this example is simple, it does illustrate some of the benefits of our approach. The manager initiating the replication protocol needs no advance knowledge of the service to be replicated nor does the replicated service need to know that it is being replicated. Because the clients using the replicated service are not modified in any way, this gives us the flexibility to dynamically replicate and unreplicate services while the system is running.
Composition of Dependability Characteristics In some cases, dependability can only be guaranteed by using several different protocols. For example, a system employing replication to avoid possible processor faults may also need to guarantee consensus on multi-party transactions through the use of three-phase commit or some similar mechanism. Unfortunately, writing one protocol which has the functionality of multiple protocols can lead to very complex code. In addition, the number of possible permutations of protocols grows exponentially — making it necessary to predict all possibly needed combinations in a system. Therefore, it is desirable to be able to compose two protocols written independently. In some cases this may not
40 class Eliminator : Mailq { int tag; actor members[NUMREP]; actor client; /* No get method is required since we use * the default behavior inherited from Mailq •/ put(msg m) { int i;
for (i=0; i < NUMREP; i = i + 1) if (m.getjsrcO == members[i]) / * Since t h e message was from a r e p l i c a , + we know t h a t the f i r s t argument i s a t a g and • t h e second i s t h e o r i g i n a l message. */ if (m.argCO] < t a g ) / * Discard message */ return; e l s e if (m[0] == t a g ) { self.enqueue(m[l]); t a g = t a g + 1;
} } } Figure 1.2.4: Code for the server-end mail queue which implements replication. The mail queue Eliminator removes tags (which have been added to all server replies by some other dispatcher) and takes the first message labeled by a new tag.
41 add_mailq ( a c t o r aMailq) { if (mailq == n i l ) { s e l i . chauige j n a i l q ( a M a i l q ) ; e l s e mailq.addjnailqCaMailq); ) add_dispatcher ( a c t o r aDispatcher) { i l ( d i s p a t c h e r == n i l ) { s e l l . changejdispatcher(aDispatcher); else dispatcher.addjiispatcher(aDispatcher); } Figure 1.2.5: The additional methods which must be inherited to allow for protocol composition.
be possible due to a conflict in the semantics of the two protocols. In other cases, performance may depend greatly on the way in which two protocols are composed. For many common protocols such as replication, checksum error detection, message encryption, or check-pointing, composition is possible. Because the meta-components of an object are themselves objects in a reflective system, there is a general solution for composing two protocols using MAUD. A simple change to the meta-operations inherited from the Object class, along with a few restrictions on the construction of mail queues and dispatchers, allows us to layer protocols in a general way. Figure 1.2.5 shows how an add-mailq method could be expressed in terms of the other meta-operations to allow layering. Because the mail queue and the dispatcher are objects, we can send a message to install meta-objects customizing their mail queue or dispatcher. By adding protocols in the above manner, the outer mail queue functionality will be performed on incoming messages before they are passed on to the "inner" mail queues. For the send behaviors, the process is reversed with the innermost send behavior being performed first and the outermost behavior last, thereby creating an onion-like model with the newest layer closest to the outside world. To preserve the model, however, several restrictions must be applied to the behavior of dispatchers and mail queues. We define the partner of a mail queue as being the dispatcher which handles the output of a protocol and the partner of a dispatcher as being the mail queue which receives input for the protocol. In Figure 1.2.6, B and C are partners as well as E and D. Each pair implements one protocol. It is possible for a meta-object to have a null partner.
42
Figure 1.2.6: Partrieis and Owner relationships. A is the owner of all other actors in the figure. Dispatcher B and mail queue C are partners as well as dispatcher D^ and mail queue E.
The owner application of a meta-ohject is inductively defined as either its base object, If its base object is not »• ineta-object,; or the owner application, of its base object. For example, in figure 1.2.6, A is, the owner application of meta-objects B, C, D, and E. With the above definition we can restrict the commnnication behavior of the actors so that: • A mail'queue or dispatcher may send or receive messages from its partner or an object created "by itself or its partner. • A dispatcher may send messages tcj the outside world, i.e. to an, object which is not a mail queue or dispatcher of the owner application (although, the message might be sent through the dispatcher's dispatcher). A dispatcher may receive t r a n s m i t messages from its base object and otherwis,c may only .re.ceive messages from its m,ail quetie partner. Therefore, a dispatcher with a null mail queue partner may only receive t r a n s m i t mes,sag,e,s from its base object or eomniunicate with actors it created. • A mail 'queue may receive messages from the outside world (through itS' own mail queue), and send pmt messages when responding to get, messages from itsbase object. Mail queues may otherwise only send messages
43 to its dispatcher partner or actors it created. Therefore, a mailq queue with a null dispatcher partner may only send put messages to its base object or communicate with actors it created. • Objects created by a mail queue or dispatcher may communicate with each other, their creator, or their creator's partner. Because of the above restrictions, regardless of the number of protocols added to an object there is exactly one path which incoming messages follow — starting with the outermost mail queue — and exactly one path for outgoing messages in each object — ending with the outermost dispatcher. Therefore, when a new dispatcher is added to an object, all outgoing messages from the object must pass through the new dispatcher. When a new mail queue is installed, it will handle all incoming messages before passing them down to the next layer. Thus, a model of objects resembling the layers of an onion is created; each addition of a protocol adds a new layer in the same way regardless of how many layers currently exist. With the above rules, protocols can be composed without any previous knowledge that the composition was going to occur and protocols can now be added and removed as needed without regard not just to the actor itself, but also without regard to existing protocols. In Figure 1.2.6, actors B and C are initially installed as one "layer." Messages come into the layer only through C and leave through B. Therefore, D and E may be installed with the add-mailq and add-dispatcher messages as if they were being added to a single actor. Now messages coming into the composite object through E are then received by C. Messages sent are first processed by B and then by D. Example 2: Composing Two Protocols Figure 1.2.7 shows the result of imposing the protocol described in Example 1 on a set of actors already using a checksum routine to guarantee message correctness. Originally, each actor had a corresponding Check-In mail queue and aCheck-Out dispatcher. When server Si is replicated, its meta-level objects are also replicated. The Forwarder mail queue is installed as the meta-level mail queue of Si's mail queue. It will forward all messages to S2. A Tagger dispatcher is installed for each of the two servers and the Eliminator mail queue removes duplicate messages at the client. Although this protocol would be difficult to write eis one entity, composition allows their modular, and therefore simpler, development. In terms of our onion-layer model, each Check-In/Check-Out pair forms a layer. For example, the innermost layer for server Si consists of a Check-Out
44
Key: Dispmchei-rirr-T _
Mai'Queue.
m- Message send m Causal Cotmeclion
Figure 1.2.7: System resulting from the composition of a replication protoco.1 and message checksum protocol. W]:ien a message is sent by the client A (1), the Check-Out dispatcher adds the checksum information to the message. The message is then forwarded to tlie replica as describe in Example 1 (2-3). The checksum inform,ation is removed by tiie Check-In mail queue(4) and the messages are processed, resulting in a reply (5). The reply messages both have the checksum (6) information added before they are tagged and sent to the client (7). At the client, duplicate messages are removed, the checksum information is checked, and the message is delivered.
45 dispatcher and a Check-In mail queue. The outermost layer at Si is comprised of a Tagger dispatcher and a Forwarder mail queue. The client A also has two layers. However, its outer layer consists solely of the Eliminator, this mail queue has a null dispatcher partner. Similarly, at server 52, the outermost layer consists only of a Tagger dispatcher with a null mail queue partner. As can be seen in the above example, the onion-layer model only provides consistency for mail queue and dispatcher installation at a single node: a manager that follows the above rules may still install protocols incorrectly. Such an error may occur if the protocols are installed in one order at one node and in a different order at another node. For example, if the manager installed the Ehmma/or mail queue at client A as the innermost layer rather than the outermost, the system would not operate correctly. An area of current research is developing methods for specifying managers which simplify protocol installation and guarantee global consistency.
Performance In implementing MAUD in Broadway, we have found that, in most cases, the additional message overhead caused by reflection is small compared to the actual cost in messages accrued by the protocols themselves. Using .MAUD, there is an additional in messages upon message reception, where n is the number of protocols composed together to provide dependability for a single object. Upon reception of a message, the message is routed up the chain of meta-level mail queues (n messages) and then worked its way down through a series of get and put messages. For message transmission, there are n additional transmit messages. Since each object is usually protected by only a small (1 or 2) number of protocols, this cost is not great. Since meta-level objects are most likely to be local to the base actor, most messages to meta-objects will be local and inexpensive. Furthermore, we use caching of the highest level mail queue to eliminate n of the messages: the system records the address of the top level mail queue and directs all messages intended for the base object to this mail queue. To preserve correctness with caching, meta-object installation is made atomic. Once a protocol installation begins, no messages are processed until the protocol is installed. This optimization is especially critical if some meta-objects need to be on a separate node. Placement of meta-objects on a different node from the base object is only done when physical separation is necessary for dependability: in this case, the inter-node communication from meta-mail queue to base-object
46 or base-object to meta-dispatcher would normally be required by the protocol, regardless of implementation technique. On the other hand, the communication cost from base-object to meta-mail queue is only due to the nature of using reflection. Therefore, caching eliminates this additional expense.
1,2.6
Exception Handling
Given a meta-level such as MAUD, it is still necessary for a programming language to provide flexible constructs supporting adaptive dependability. In particular, it is important to convey information to the correct entities when system failures occur. We have chosen exception handling as the medium through which managers are informed of problems in the system. This technique has been used extensively with forward error recovery; we simply extend the notion by having our managers prevent future failures through dynamic protocol installation. In this section, we describe the exception handling mechanism in Screed, our prototype actor language. To support adaptive dependability, faults and exceptions have been unified as one concept and exception handlers may be shared between objects. Broadway provides a set of system exceptions, some of which are notifications of failures. For example, when an actor attempts to communicate with an unreachable node, a crash exception is generated. We begin with a discussion of the general structure of exception handling in Screed followed by a specific illustration of the syntax used. We then show how this structure may be used with the meta-architecture to design adaptively dependable systems.
Exception Handling C o m p o n e n t s Exceptions are signaled whenever an unexpected condition is encountered. An exception may be signaled either by the run-time system or by the application. The former are referred to EIS system exceptions and the latter as user-defined exceptionsExceptions in Screed are represented as objects, as proposed in [13] for sequential object-oriented languages. Although none of the other concurrent languages discussed above have taken this approach, we feel representing exceptions as objects allows for more flexible and efficient exception handling; all the information needed by a handler is contained in one object. All system exceptions are derived, through inheritance, from the class exception. Userdefined exceptions may inherit from the exception class or from any other node
47
on the system exception inheritance tree. Below, we discuss the parties involved in the generation of an exception and then the structure of system exceptions. There are four roles involved in handling any exceptional condition: invoker, signaler, exception, and handler (see Figure 1.2.8). Each role is represented as an object in the system. The invoker initiates the method of the signaler which results in an exception. The occurrence of an exception generates a signal. When a signal occurs, a new exception object is created. The signaler notifies the appropriate handler object of the exception's mail address. The handler must then diagnose the exception and perform any appropriate actions for handling the exception. Exception handlers are constructed by the programmer as Screed actorclasses. For each exception a handler accepts, a method must exist with the same name as the exception and which takes an instance of the exception class as a parameter. In all other ways, handlers are identical to other actor classes: they may have a set of instance variables, inherit from other classes, and may communicate with any of their acquaintances. They may also have other, nonexception methods.
invoker
signaler /^ ^
Q \
*\ handler//
J
• I
'^z^^-**•
I *" V^ exception
Actor creation Message sent
Message may be sent
Figure 1.2.8: The four roles involved with an exceptional condition. The invoker initiated the method in the signaler which caused the exception. An object of the appropriate exception class is created and the handler is notified of the exception's mail address. The handler may then interact with the invoker and/or the signaler as well as the exception object to resolve the exception.
All exceptions must inherit from the class exception. When an exception is signaled, an object of the appropriate exception class is instantiated and initialized with any information needed by the handler to process the exception. Some of the initialization fields are supplied by the run-time system. These
48 fields are contained in the exception class from which all exception objects inherit, and are utilized through the methods inherited from the exception class. Additional arguments for the initialization of an exception may be specified by the objects raising a signal. For example, an arithmetic exception which is initiated by an application could be initialized when signaled with the values of the operands in the arithmetic operation. This exception object would still have default values specified by the system. Methods defined in the exception class make use of the system-supplied values. These methods are: name returns the name of the exception as a method value. Since method names are first-class values in Screed, this method enables the automatic calling of the correct method to handle it. invoker returns the mail address of the actor which invoked the method resulting in the generation of the signal. signaler returns the mail address of the signal generator. source returns the name of the method in which the signal was generated. arguments returns a list of the arguments that were passed to the method in which the signal was generated. request returns TRUE if the invoker is waiting on a reply, FALSE otherwise. reply allows a handler to reply to a request that was interrupted by the signal. The reply method can be used to supply an acceptable value to the invoker, thereby allowing the continuation of the computation. Each exception handler may utilize only a few of these fields. However, since our environment is asynchronous, we want to preserve all available information. There are no guarantees that this information will be retained by either the invoker or the signaler. Use of exception objects provides us with the flexibility to include a large amount of information without creating complicated function calls or messages: all the information is packed into an object and is referenced through a standard interface. In a procedural approach, long parameters lists would be necessary to achieve the same effect. Broadway currently supports three different system exceptions. All three inherit directly from the class exception. A bad-method exception is instantiated when an actor receives a message it cannot processes. The bad-method
49 exception class provides the behavior of the destination actor. In general, there is very little the run-time system can do to correct such an error, but this information allows a handler to provide meaningful error messages to the user. An a r i t h m e t i c exception is generated whenever Broadway traps an arithmetic error. Currently, this exception provides the state under which the exception occurred. We plan to expand this exception to include a string representing the expression being evaluated. Broadway also provides some failure detection capabilities. Each node on Broadway has a failure detector which uses a waich-dog timer approach to detect the failure of, or inability to communicate with, other nodes. A crash exception is generated whenever an actor attempts to communicate with an actor on a non-existent or unreachable node. A crash exception consists of the original message and the identity of the node which cannot be reached. Notice that, although Broadway has detected a component failure, it is treated similar to any other system exception. It is also possible for an object to subscribe to a failure detector. In this case, the subscriber's handler will automatically receive an exception whenever a failure is detected, even if the object did not try to communicate with the failed node. Besides detecting node crashes, Broadway will also handle the failure of individual actors. If an actor crashes due to an error that is trapped by Broadway, that actor address will be marked as a creish. Currently, only arithmetic errors are trapped by Broadway and, therefore, this is the only manner in which a single actor may crash. If the defunct actor receives a message, a d e a d - a c t o r exception will be generated. The dead-actor exception inherits from the crash exception. It also contains a reference to the exception generated when the actor crashed. (Currently, this is always an a r i t h m e t i c exception.)
Exception Handling in Screed In this section, we describe our two syntactic additions to Screed which enable exception handling: the handle statement which associates exceptions with handlers, and the s i g n a l statement which generates an exception. In Screed, handlers can be associated with exceptions for either entire actor classes or for arbitrary code segments within a method. Figure 1.2.9 gives the syntax for a handle statement. The statement defines a scope over which specific exceptions are associated with a particular handler. If any method invocation contained within the code block of the handle statement results in an exception, the signal is routed to the correct handler as specified by the with
50 handle (exceptionl, exception^
exception2 with with handler2,
handler!,
)
{ /* Any block ol code goes here */
} Figure 1.2.9: The structure of a handle block in Screed. exceptionl, exception^ are actor class names, handler is the name of an object.
bindings. As explained above, the exceptions are specified as cleiss names and the handlers are addresses of objects. Handler statements may be nested. In this case, when an exception is generated, the innermost scope is searched first for an appropriate handler. If a handler for the exception does not exist then higher level scopes are checked. handle (eirithmetic with arithhemdler, bad-method with aborthandler) { actor A; actor B; actor E; A = new complex(2,3); B = A.divide(C); handle (arithmetic with myhandler) E = B.divide(D); rayNum = r e s ; } } Figure 1.2.10: An example of handler scopes and their effect. The outermost handle statement provides handlers for arithmetic and bad-method exceptions. The inner statement overrides the outer scope in that all arithmetic exceptions will be handled by myhandler. Figure 1.2.10 demonstrates the scoping rules. In the scope of the outer handle statement, if in computing B (by dividing A by C), an arithmetic exception is generated (possibly by dividing by zero), the signal will be passed
51 to arithhcindler. The computation of E through the division of B by D, however, is in the scope of the second handle statement. Therefore, any arithmetic signals generated by this action are sent to rayhandler. Conversely, if our complex objects do not have a divide method, our actions will generate a bad-method signal which will be handled by aborthandler. Unlike the complaint address based schemes[21, 26], our syntactic mechanisms do not require explicit specification of a handler's address with each message. For any given scope, including a single message send, handlers — our equivalent of complaint addresses — may be specified for each individual exception or for any group of exceptions. One handler need not be specified for all exceptions. Additionally, our method takes greater advantage of the available inheritance mechanisms as well as the general structure of object-oriented languages: both exceptions and handlers are expressed as objects in our system. The above constructs work well within methods. However, there are two levels of scoping above the method level in Screed: the global and class levels. Exception handling at the class level is specified through the use of a handler statement which encloses several method definitions. In this manner, exception handling may be specified for an entire class by enclosing all methods in one handler statement. Such a construction does not prohibit handler statements inside the methods. A handle statement may not be defined across class boundaries as that would require the use of shared variables between class instances. However, to provide exception handling at the global level, Screed supports the systemdefined handler class Default-Handler. An instance of this class handles all signals which are not caught by another handler. Default system behavior is for a signal to be simply reported to the terminal. Delault-Heindler may be overwritten by a programmer defining a custom class of the same name. In this way, a final level of exception handling may be defined by the programmer. This type of facility is especially useful for writing debuggers. Any exception not defined in a custom Default-Handler class is handled by the system-default. Note that the system creates only one instance of the Default-Handler class: all otherwise unhandled signals are delivered to this instance. As mentioned previously, exceptions may be generated as user-defined signals. A signal is generated by a s i g n a l statement. signal
exception-class-name(_args . . . ) ;
The s i g n a l statement generates a message to the appropriate exception handler. The arguments are used for initialization of the exception as defined by
52 the interface of the particular exception class. The signal does not interrupt the flow of control in the code, although a Screed return statement could follow the signal to end the method. In many cases, it is necessary for the signaler of the exception to await a response from the handler before proceeding, signal statements are treated, syntactically, as message sends to a handler. Therefore, signal statements may act as an remote procedure call in the same manner as Screed message-sends. Thus, the handler may return a value to be used by the signaler. Such a case would be:
res = signal div-zero(); For this example, the exception handler would return an actor address as the value res. Then, the rest of the signalling method may compute. In other systems, a special construct exists for generating signals within the current context, i.e. generate a signal which is caught by the handle statement in whose scope the statement occurs. An example of such a construct would be the exit statement in Clu [24]. In Screed, such a construct in not necessary: the actor can explicitly send a message to the appropriate exception handler.
1.2.7
Supporting Adaptive Dependability
A significant difference between exception handling in Screed and other languages is the use of third-party exception handlers. In languages such as CLU [24], SR [20], and Ada [12], exception handling routines are defined within the scope of the invoking objects. We refer to this approach as two-party exception handling (the invoker and the signaler) and our approach as three-party exception handling (the invoker, the signaler and an independent handler). We have found that two-party exception handling is unsatisfactory for modeling the complex relationships between objects such as those required for adaptively dependable systems. The key difference between two- and three-party systems is in the potential knowledge of a given object. With two-party exception handling, the exception handler, which is also the invoker, may know only of exceptions its invocations generated. Therefore, in such a system it is very difficult to develop a global picture of the fault pattern in the system. In a three-party system, such monitoring may be naturally expressed in terms of the exception handler since it may handle exceptions for a large group of objects or even the entire system. Furthermore, an autonomous exception handler may subscribe to any available
53
Backup Node
^r:^-^^ Aoplicaiion
/
'^l
Fa'liire Detector Manager
Figure 1.2.11: A prototypical adaptively dependable system. In this case, the manager M receives input in the form of exceptions from applicatioE objects and .notices from the failure detector. Upon, determination that the Primary Node is unstable, M allocates the Backup Node and creates the appropriate objects to replace the Primary Node. Note that, in actuality, M is probably a replicated entity to ensure its resilience and availability. failure detectors, thereby augmenting the knowledge received through exception signals. A third-party exception handler may also be designated as an object with special rights. In this manner the system may be safely modified in response to exceptions and failures. Since it is dangerous to allow the arbitrary modification of one actor by another, most two-party systems, can express reconfiguration of the system only by mimicking a three-party system, i.e. they must notify a third object with special rights of the exceptions they encounter and this object 'may then reconfigure the system. Thus in adaptively dependable systems, the resulting system architectures will look qaite similar tO' Figure 1.2.,11. Such a system may allow the dynamic installation of dependability protocols or may simply support the reconfiguration of several objects m rcspofise to exceptiens. In either case, the system will
54 hiH' a mil ; 0 there exists a 0 such that \\x{ko)-x,\\ 0 \\x(ko)-Xe\\')-Xe\\—'0
as A : — • oc.
(2.3.5)
85 In linear time-invariant systems, stability can be checked simply by using the pole positions of the controlled process in the presence of random computer failures. Using this information one can derive hard deadlines stochastically or deterministically with the sample(s) and the ensemble average of the controlled process: D(N)
= inf supjiV : ||A(.V)|| < 1},
(2.3.6)
Cert V
where A(A'') is the eigenvalue of the controlled process in the presence of computer failures of the maximum duration NTs and Cenv represents all the environmental characteristics t h a t cause controller-computer failures. Consider a state trajectory evolved from time kt) to kj. Let A'^(A*) and UA be the allowed state space at time k and the admissible input space, respectively. If a computer failure, which occurred at ^i (A'o < ki < kj) and was detected I\\Ts later, is recovered within N2Ts, then the control input during these jV =
Ni+Nn
sampling periods is: u^ (k) = u{k)lAnk,{Ni)
+ u(ki + Ny)nk,+NA''^2),
k:)Bl\+4>(l- T,) active period (Fig. 2.3.3). The (asymptotic or global) stability condition discussed thus far is therefore no longer applicable.
Instead, the
terminal state constraints can be used to test whether or not the system leaves its allowed state space. Note that every critical process must operate within the state space circumscribed by given constraints, i.e., the allowed state space. When the control input is not updated correctly for a period exceeding the hard deadline, the system may leave the allowed state space, thus causing a dynamic failure. T h e allowed state space consists of two sets of states A'^ and X\
defined
as follows:
•
X\:
the set of states in which the system must stay to avoid an
immediate
dynamic failure, e.g., a civilian aircraft flying upside down is viewed as an immediate dynamic failure. This set can usually be derived o priori
from
the physical constraints. •
X\:
the set of states that can lead to meeting the terminal constraints
with appropriate control inputs. This set is determined by the terminal constraints, the dynamic equation, and the control algorithm used.
95 The system must not leave X\ nor X]^ in order to prevent catastrophic failure. Assuming that some computer failure may not be detected upon its occurrence but every detected failure can always be recovered successfully, we can consider three cases for the analysis of the effects of computer failures during a finite time: (i) delay: when a computer failure is detected upon its occurrence, (ii) disturbance: when a computer failure is not detected until its disappearance, and (iii) disturbance and delay: when a computer failure is detected at some time after its occurrence but before its disappearance. Let ko,kf, Ni, and A'^2 denote the indices for the failure occurrence time, the mission completion time, and the period of disturbance, the period of delay measured in sampling periods, respectively, where N = N1 + N2, 0 < Ni,N2 < N. The dynamic equation for a one-shot event model is; - /)nte+N,(^2)] (2.3.28) where TIkgiN) is the rectangular function as defined in Section 2, and A'^i and A'^2 are random variables and determined by the conditional probability of successful detection (d) \i N is given; :E(^+1)
= Axik)+B
PT[NI
[u{k) + (uiko) - uik))Uko(Ni) +
- i] =
Fi[N2 = i] = Pr[7Vi=7V]
=
d(l -dy
«(A:)(/A
0
Pmax},
x) -1/2 ,
(2.3.36)
and m{x) = 2k(x'^x)''
^.
101
1
J u I y
1
V
1
1
0.5
1
1
1.5
2
'
3.5
xl
Figure 2.3.6 Hard deadlines of state trajectory 1 of Fig. 2.3.5 in the absence of delay without terminal constraints.
The intersection of these constraints with the admissible control set Q results in a polygonal control space: A= The Optimal
Decision
Strategies
^(x,v)r]Q. (ODS) in [13] can be used to solve such a
constrained minimization problem, similarly to a class of pointwise-optima] control laws constrained by hard control bounds. The hard deadline is derived using a pointwise-optimal control law with the OVS. Since the computationtime delay causes the failure to update the control input, a collision (or a dynamic failure) occurs if the computation-time delay is longer than a certain threshold, which is a function of the system state and time. In this example, the state constraints change with system state (time), i.e., the state-dependent control constraints. Thus, the control input must be updated on the basis of new information to avoid any collision. The trajectories derived in both the absence and presence of delay DT^, and the hard deadlines on these trajectories in the absence of delay are plotted in Figs. 2.3.5 and 2.3.6.
102 Controlled Process: System Dynamics
"
Controller Computer:
Environment: Fault behaviors
Hard Deadline
Control Algorithms
H/W Design
S/W Design 4 Evaluation (Reliability)
Figure 2.3.7 The source and application of hard-deadline information in a real-time control system.
2.3.8
Application of Deadline Information
T h e hard-deadline information allows us to deduce the timing constraints of the controlled process. It can be used to evaluate the fault-tolerance requirements of a real-time control system in terms of its timing constraints. Fig. 2.3.7 shows the source and application of hard-deadline information. As we shall see, this deadline information about the controlled process is quite useful for the design and evaluation of a controller computer. When designing a controller computer, one has to make many design decisions in the context of controlled processes that are characterized by their hard deadlines and cost functions [18], including:
hardware design issues dealing with the number of processors and the type of interconnection network to be used, and how to synchronize the processors.
103 •
software design issues related to the implementation of control algorithms, task assignment and scheduling, redundancy management, error detection and recovery.
Since the timing constraints of the controlled processes are manifested as hard deadlines, the deadline information is also essential to evaluate the system reliability, an important yardstick to measure the goodness of the controller computer. To illustrate the general idea of applying the knowledge of the deadline information (i.e., system inertia), we consider two specific examples; (i) a design example that optimizes time-redundancy recovery methods such as retry or rollback, and (ii) an evaluation example that assesses the system reliability. Example 8.1: When an error is detected the simplest recovery method is to re-execute the previous instruction, called simply retry, which is effective in case of immediate error detection [10, 11]. When retrying an instruction, one must determine a retry period, which is long enough for the present fault(s) to die away. If the retry does not succeed in recovering from the error, we have to use an alternative recovery method like rollback or restart. So, the retry period must also be short enough not to miss the deadline by considering the amount of time to be taken by the subsequent recovery method in case of an unsuccessful retry. Let Tt, TQ, and tr be the "nominal" task execution time in the absence of error, the actual task execution time, and the retry period, respectively. Then, one can obtain a set of samples ofTa'. Ta E {Tt,Tt +
-^,if^T,+tr)-{-Tt,{T+T,-\-trHT,-^~,2{f+T,+tr)+Tt,---]-
where T,, T, and j - are the resetting time, the mean occurrence time of an error, and the mean active duration of a fault. Since Ta has discrete values, the probability mass function (pmf) of Ta is: /^^
=
PT[Ta = k{f + Z+tr)
+ Tt+6^l
=
p',+'iTt)il-p,(U))''a-Pe{Tt)y-'ps{irY,
0 write (addr, data © 2'), V i output (addr, data) => output (addr © 2", data), V i output (addr, data) => output (addr, data ® 2'), V i read (addr) => read (addr ® 2'), V i read (addr) => read (addr) ® 2', V i input (addr) =* input (addrffi2'), V i input (addr) => input (addr) © 2', V i
Table 3.1.4: Summary of RTL fault model (Sheet 1 of 3)
142 H/W element
Transient hardware fault
ALU
Level change on input
A + B + Co =* (A e 2') + B + CQ, V i*"
Level change on result
A + B + Co => (A + B + CQ) © 2', V i
Original RTL => faulted RTL
A + B + Co => A + (B © 2') + Co, V i
A op B =* (A op B) © 2', V i, where op is a logical of>eration
Level change on carry input
A+B+Co=*A+B+-Co
Level change on carry output (or carry look-ahead)
A + B + Co => A + B + CQ + 2', V i''
Level change on carry output and result
A + B + Co=> (A+B + Co± 2') © 2', V i
Incorrect condition code calculation
Covered by status line faults in control section
Incorrect function performed
A opj B => A opj B, V j ?t i Aop B => A A op B => B opj A=> opj A, V j ?i i op A=> A
Instruction Decoder
Incorrect preprocessing function performed
A op B => A op A AopB =i.BopB A-B=>B-A A op B => -A op B A op B =s> A op -B A op B => 0 op B A op B => A op 0
Incorrect instruction
Decode/case => all incorrect case statements executed
Table 3.1.4: Summary of RTL fault model (Sheet 2 of 3)
143 H/W element
Transient hardware fault
Shifter
Level change on input or output
AopB=!>(AopB)e2\Vi''
Incorrect decode of shift amount
A op B => A op i, V i ;t B such that 0 < i < length of A
Incorrect condition code calculation
Covered by status line faults in control section
Incorrect function performed
A opi B => A opj B, V j ?i i
Extraneous command issued
None
Missing command
Many covered previously; others are: expr => no-op (evaluate only) test (expr) => no-op write (addr, data) => no-op output (addr, data) => no-op
Incorrect command
Many covered previously; others are: write (addr, data) => output (addr, data) output (addr, data) => write (addr. data) read (addr) => input (addr) input (addr) => read (addr) sign extend expr to length i => extend expr to length i extend expr to length i => sign extend expr to length i Many others not covered
Level change on status signal
Logical operations, comparisons, and condition codes => invert result
Control lines
Original RTL => faulted RTL
A op B =*• (A ® 2") op B, where n is the position of the sign bit in A, and op is shift right algebraic
Table 3.1.4: Summary of RTL fault model (Sheet 3 of 3) "Only single bit faults considered; this is true for all "XOR" faults. Subtraction and unary minus are covered by these faults because A - B is implemented as A -I- -B -I- 1, and -A is implemented as 0 -h -A + 1. '^A fault in carry bit i is actually modeled by iteratively simulating each slice of the bitslice chain and XORing the carry bit between slices i and i-i-l. The "± 2'" term indicates that 2' is added if the carry bit was 0 and became I due to the fault; 2' is subtracted if the bit was I and became 0. A is the value to shift, B is the amount, and op specifies direction and type.
144 3.1.3.5 Operation The operation of ASPHALT is outlined in Figure 3.1.10 and broadly consists of four phases: reading the RTL code and the initial machine state, the golden run, the faulty runs, and post-processing. begin read RTL code and initial irachine state sinvLLate golden run: begin initialize registers sinulate instructicn and keep RTL executicn trace save final state and outputs end siitulate faulty runs: for each RTL cperaticn i in executicn trace for each possible fault j on node i begin initialize registers sinulate instruction vd.th fault j on i save final state and oitputs aid past-processing: begin ooll.^)se error states report statistics End Old
Figure 3.1.10: Pseudo-code for ASPHALT operation Reading the RTL Code and Initial State. ASPHALT starts by reading a description of the processor like the fragments shown in Figure 3.1.5. Then it inputs the values for all the visible registers captured from the actual processor just before the instruction to be faulted. It is also given access to the processor's memory image to obtain instruction and data values. The Golden Run. To obtain a basis for comparison, ASPHALT first simulates the instruction without any faults. Such a fault-free standard is called a "golden" run. ASPHALT initializes itself by loading its internal copies of the visible registers
145 with the initial machine state. All other registers are set to zero. It then executes main and any routines called by main. During this execution, ASPHALT keeps an RTL execution trace, which is a list of all the RTL operations evaluated during the golden run. The final machine state and a list of all memory and I/O writes is saved. The actual processor memory or I/O space is not written to during the simulation. The Faulty Runs. Now ASPHALT is ready to inject the RTL faults. It begins this process by sequentially stepping through the execution trace from the golden run. ASPHALT makes one faulty run for each possible fault for any given RTL operation. For example, if the current operation in the trace is the access of a 16-bit register from a 4-register file, 21 faulty runs are made: one for the XORing of each of the 16 bits, one for setting the register to all ones, one for setting it to all zeros, and one to access each of the three incorrect registers in the file. When each run completes, the final machine state and a list of memory and I/O writes is saved for that run. Alternatively, if the injected error causes one of the built-in fatal-error routines to be called, the type of error is saved; address, bus, program, privilege, or opcode trap. Post-Processing (Error Collapsing). When all faulty runs are completed, there exists a list of machine states and outputs, one for each injected fault. It is quite probable that some of these states will be identical. When performing fault-injection studies, it would be pointless to inject the same error more than once, so the list is collapsed into a set of unique errors. There are two ways in which errors are considered identical: equivalent fatal errors or equivalent states. If two injected faults cause the same fatal-error routine to be called, they are considered equivalent. For example, if two bit-flips in the instruction register cause the opcode trap to be invoked, the error states would be considered equivalent and would be identified only as an opcode-type error. It is also quite probable that two faults cause the exact same final machine state and outputs to be generated, i.e., all the bits in all the registers are the same, and the same data are written to the same memory and I/O addresses. In this case, these error states are considered equivalent and are identified by the state and outputs. A special case of this is the unique state which is identical to the state at the end of the golden run. The faults that generated this state are considered overwritten because they can never manifest into errors. A diagram showing the error collapsing process is shown in Figure 3.1.11.
146 Injected RTL Faults
Resultant States
Overwritten Faults
Equivalent Fatal Errors
Equivalent Error States
Figure 3.1.11: Error collapsing In this example, there are ten Injected faults, three of which are owerwrltten, leawing seven non-overwritten faults. Of tliese seven, there are only five unique states because two faults produced tlie same fatal error, and two produced equivalent non-fatal error states. Only the equivalent fatal errors and equivalent error states need to be used In a fault-injection experiment.
The following section will describe experiments used to verify the preceding RTL fault model and injection metliodology. 3.1.4 EXPEMMENTATION The main goal of the experiments presented in this chapter is to determine the extent to which the RTL fault model covers actual hardware faults. To estimate the error manifestations of actual hardware faults, a gate-level simulation model was used. Any of the methods from Table 2-2 on page 11 could have been used for fault injection at the gate level, but the advantage of full control over fault types and injections made software simulation the preferred choice. Figure 3.1.12 shows an overview of the experimentation process. First, both the gate-level and RTL models arc initialized with the same instruction to be faulted and the same machine state. Second, both models are exhaustively injected with
147 transient faults: gate-level faults into the gate-level model and RTL faults into the RTL model. This produces a set of machine states for each model. Finally, these two sets can be compared.
• Instruction to be taulted • Initial machine state (registers and memory)
f Gate-level simulation J
^ ^ L simulation ^
Set of resultant machine states
Set of resultant machine states
Comparison
Figure 3.1.12: Overview of experiments
Figure 3.1.13 shows a Venn diagram of this comparison. The intersection between the two sets is the covered faults. Removing the covered faults from the gate-level set leaves the non-covered faults. Removing them from the RTL set leaves the overhead, those faults generated by the RTL model but not produced by the hardware. The goal, of course, is to maximize the coverage while minimizing non-coverage and overhead. 3.1.4.1 ROMP Overview The IBM RISC-Oriented Micro Processor (ROMP) was chosen as a basis for the experiments for a number of factors: •
The gate-level model had already been developed by another researcher [9]. Therefore, since it was developed by a disinterested party, the model was not intentionally or unintentionally written to conform to the implementation assumptions of the RTL fault model.
•
There exists a fault-injection test-bed based on the IBM RT [3]. The IBM RT employs the ROMP as its CPU.
148 Resultant states, from
non-covered
__________^_^__
Resultant states from
1 covered
overhead
Figure 3,1.13: Comparison of results Coverage proportion = covered / (covered + non-GO¥erecl|. Overhead proportion = overhead / (overhead + covered).
•
The ROMP contains a number of interesting features: RISC technology, pipelining, both hard-wired and micro-coded control logic, and an asynchronous instructioa pre-fetch buffer.
Salient architectural features of the ROMP include sixteen 32-bit general purpose registers and a four gigabyte virtual address space broken into 16 segments of 256 megabytes each. The foUowins description of the processor is taken directly from [17]: The RDMP chip is a pipelined processor capable of executing a new instruction every cycle. While one instruction is being executed, the next instruction can be decoded, and following instructions can be fetched from memory and buffered. A one-cycle execution rate is prevented when either an instruction requiring multiple execution cycles is executed, or a hold-off occurs [i.e., when the processor waits for an instruction or data from memory). Most instructions take only one cycle, but Store and I/O Write instructions each take two cycles...
149 The ROMP processor is partially microprogrammed. It uses ROM for control during the execution cycles, but hturdwired control logic is used for instruction prefetching and for memory data requests, sincc: those operations are usually overlapped with tlie execution of other instructions. [Figure 3.1,14] shows, a block diagrain, of the ROMP processor data flow.
Storaae Channel (RSC)
FIgyre 3..t.14: ROMP data flow Taken from [9], as adapted from the original [17]. Major sections include the instfuction pre-fetch buffer (IPB), the micro-instruclion fetch (MIF), data fetch and storage (DFS), the ALU and shifter fALU), and the ROMP storage channel interface (RSei).
150 Instruction Fetching The instruction fetch area includes the Instruction Prefetch Buffers (IPBs), the IPB Multiplexer (MUX), and the Instruction Address Register (lAR) and its incrementers. Four IPBs are provided to keep the processor supplied with instructions. Instructions are prefetched whenever an IPB is available and there is no other request for use of the RSC [ROMP Storage Channel]. Every cycle, each IPB that is waiting for an instruction ingates the data from the RSC. During the following cycle, the tag associated with that data is examined to determine if it was addressed to any of the IPBs. If so, then that IPB will hold its contents until that instruction word is gated by the MUX to the decode circuits... Execution Unit The execution unit includes the register file, the AI and BI latches, the ALU and shifter, and the ALU output latch. It also includes the MQ [Multiplier Quotient] register and the Condition Status register, which are both SCRs [System Control Registers]. To support a one-cycle execution rate, a 4-port register file is used... Two of the ports of the register file are used to read two operands simultaneously into the AI and BI latches. Another port is used to write the result back from the ALU output latch, and the fourth port is used to write data from memory or I/O... The ALU is used for the execution of all arithmetic and logical instructions, and for address calculations when executing Load, Store, I/O and Branch Instructions... RSC Interface The RSC interface consists of the request area and the receive area. The request area arbitrates for use of the RSC, and it receives the acknowledgment signals after requests are sent. [Read requests are sent with a 5-bit tag which uniquely identifies the data when they return from the memory or yo.] The RSC receive area contains buffer registers from one incoming data word and tag. Each cycle, they capture whatever data word and tag is on the RSC... During the following cycle, the tag is examined to determine if the word is addressed to the ROMP, and if so, whether it is an instruction or data. If it is an instruction, the tag will also identify which IPB has been assigned to it. If it is data, the tag will point to one of two descriptors which will control the alignment by the formatter and store it into the proper location in the register file...
151 Control Unit The control unit includes the microcode ROM, the Control Register (C Reg), the instruction decoders, and the ROM and register file address latches and control circuits. It also includes circuits for detecting and handling interrupts, machine checks, and program checks, as well as the SCRs associated with these events. The ROM contains 256 control words of 34 bits each... The control words are needed only for the execution cycles of the instructions, since instruction prefetching is controlled by hardwired circuits, and the last control word of each instruction controls the decoding of the next instruction... The control words are fetched from the ROM into the C Reg during the cycle prior to the one when they are executed. During the last execution cycle of any instruction, the next instruction is selected from one of the IPBs by the MUX and decoded. This decode cycle is used to simultaneously fetch a control word from the ROM and fetch the two operands from the register file into the AI and BI latches. The operation code is taken directly form the output of the MUX and used as the ROM address... Also during the decode cycle, the register address for the result, called the destination address, is put into a two-stage pipeline to be used two cycles later for storing the result from the ALU output latch into the register file. The ROMP provides 118 instructions in ten classes as shown in Table 3.1.5. The branch instructions include "Branch with Execute" versions which allow overlap of the fetch of the branch target instruction with the instruction following the branch instruction (called the subject instruction). This eliminates dead cycles that would normally occur when flushing the pipeline. 3.1.4.2 The ROMP Hardware Model The gate-level model of the ROMP was written in Verilog, a hardware description language and simulator, in approximately 2600 lines of code. To generate error manifestations from the gate-level model, there must be a gatelevel fault model and a methodology of fault injection. Fault Model. There are two portions of the fault model: gate-level faults and behavioral faults for the registers. The gate-level faults are single fixed-at-0 and fixed-at-1 faults. Fixed-at faults are distinguished from stuck-at faults by being transient as opposed to permanent. The lines are fixed to a logical zero or one value for one clock cycle using a forcing
152
Instruction class Memory Access Address Computation Branch and Jump Traps Moves and Inserts Arithmetic Logical Shift System Control Input and Output Total
Number of instructions 17 8 16 3 13 21 16 15 7 2 118
Table 3.1.5: ROMP instruction classes feature of the Verilog simulator. Faults are injected into only one line at a time; no multiple faults are simulated. These faults are injected exhaustively into all nets in the model (spatial) and during all clock cycles of the instruction at the fault injection location (temporal). Since the registers are modeled at a behavioral level, a behavioral fault model is necessary. The model is based on the list of transient hardware faults from the "Register" section of Table 4-1. The following faults are injected exhaustively: Missed load: The register is not loaded when it should be. Extraneous load: The register is loaded when it should not be. (Recall that these faults are not directly emulated in the RTL fault model.) Level change in storage: A bit in the register is flipped. It is not necessary to model level changes in the input or output lines, because they are accessible at the gate-level. The procedure in Figure 3.1.15 is used to inject faults into a certain net or register. The process is repeated for all nets and registers in the Verilog model.
Actually, this "feature" was intended to be used as a debugging tool.
153 begin initialize irBchine state siimlate instructicn without faults (goldai run) recxjrd state and outputs at each cycle for i = each cycle during instructicn begin for j = each fault for the currait net or reg begin initialize itachine state siimlate instructicn with fault j in cycle i record state and outputs aid and end
Figure 3.1.15: Pseudo-code for gate-level fault injection
3.1.4.3 Results Faults were injected into the ROMP "a" instruction (add) following the procedure outlined above. The results are shown in Table 3.1.6. First, by looking at the row labeled "Total gate level," note that the gate-level injection into the Verilog model produced 711 error states. Of these, 692 were successfully reproduced by the simpler RTL model. Recall from the development of the RTL fault model that the faults for the control section were more general than those for the data section. The effects of this are evident in the results. Almost all (0.9970) of the data errors were reproduced, but only about two thirds (0.6885) of the control errors were covered. Fortunately, there are only 61 control errors compared to 669 data errors, so the average (0.9733) is weighted heavily by the data coverage. An error coverage of over 0.97 is encouraging for this simple instruction, but how do these results compare to those for different opcodes? Similar experiments were done using 91 of the ROMP's 118 instructions by injecting over 1.3 million gatelevel faults and 176 thousand RTL faults. The results arc shown in Figure 3.1.16. The mean coverage is 0.970 with a standard deviation of 0.0138. Also note that the mean overhead is 0.205, indicating that about 20% of the errors generated by ASPHALT were not generated by the Verilog model. Detailed data for each instruction may be found in [40]. A Study of the Outlying Data. This section looks at the instructions which produced the lowest coverage values. The sources of repeating non-covered faults are discussed. To improve the coverage values, the RTL model is modified based on the observations, and the effects of the new model are presented.
154
Total faults injected
Source
Faults causing fatal errors
Equivalent error states
Error states covered by RTL injection
Gate level, control section
4452
60
61
42 (68.85%)
Gate level, data section
7686
4
669
667 (99.70%)
12138
64
711
692 (97.33%)
40
834
N.A.
Total gate level RTL level
1718
Table 3.1.6: Injection into "a" instruction The first three rows show data from injection into the gate-level model: the first row shows the results of injection into the control section only; the second, data; and the third, both. The fourth row shows data from injection with ASPHALT. The first column of data shows the total number of faults injected over all cycles of the "a" instruction. The second column shows the number of those faults which caused fatal errors such as addressing or opcode traps. The third column shows the number of equivalent error states as illustrated in Figure 3.1.11. The fourth column shows how many of the gate-level equivalent error states were also generated by ASPHALT.
From the graphs of the data, the four leftmost points are immediately identified as the most extreme outliers. These values are all between 0.90 and 0.94, whereas the other 87 are between 0.94 and 0.99. These values were produced by the four branch-and-link instructions: branch and link immediate (bali), branch and link (balr), and the two versions which execute the subject instruction (balix and balrx).^ Is there a property of the RTL fault model or the RTL ROMP model which accounts for these relatively low coverage values? The branch-and-link instructions are so-called because they copy the value of the updated lAR into a link register before branching, thereby providing a return address to be used at the end of a subroutine. An analysis of the resultant machine states from the gate-level simulation which were not covered by the RTL fault model reveals two separate effects: one for bali and balix and another for balr and balrx. These are discussed in turn below.
Actually, there are six branch-and-link instructions in the ROMP. The two branch and link absolute instructions (bala and balax) were not implemented in the Verilog model.
155 Coverage
Figure 3.1.16: Dot chart of coverage data for 91 instructions The opcodes are listed along the left-hand side.
156 First, in the set of non-covered states from both bali and balix, there were 32 single-bit faults in both lAR and the link register. These faults were caused by singlebit faults in lAR and fixed-at faults in the ALU inputs. If these 32 errors were covered in bali, for example, the coverage would rise from 0.915 to 0.962. Figure 3.1.17 shows the key RTL transfers that are executed by the RTL model during bali. From this sequence, it is evident that no single RTL fault can affect both lAR and rl5, the link register, due to the temporary variable instruction_address used in the RTL code. By analyzing the microcode from the Verilog model, one can see that no such temporary variable is used. Instead, the following two transfers are made in parallel; rl5
*r-
IAR + 4
(3.1.1)
lAR
4-
lAR + immediate_valuc.
(3.1.2)
Thus, a single-bit fault can affect both rl5 and lAR. To compensate for this discrepancy, a change is introduced in the RTL model as shown in Figure 3.1.18 to force a single-bit fault in the temporary register to affect rl5 and lAR. 1. 2. 3. 4. 5. 6.
instruction_address rocessor and memory modules must be operational. In order to simplify the analysis, the fault-tolerance mechanisms of this architecture were assumed to have perfect coverage and never fail. The reliabilities of the processor and memory modtdes were defined to be exponential: i ? p ( 0 = e-
Hu{i) = e
-
AM(
m)
with failure rates Ap and AM, respectively. An expression for llie reliability of a 2-of-3 system is obtained by summing the probabilities of no failed modules, one failed processor or memory module, and both a failed processor and memory module:
/i2-of-3( A tuple reflects the occurrence of one or more errors of the same type in rapid succession. It can be represented by a record containing information such as the number of entries in the tuple and the time duration of the tuple. Different systems may need different time intervals in data coalescing. A recent study [14] defined two kinds of mistakes that can be made in data coalescing: collision and truncation. A collision occurs when the detection times of two faults are close enough (within AT") such that they are combined into a tuple. A truncation occurs when the time between two reports caused by a single fault is greater than AT" such that the two reports are split into different tuples. If AT is large, collisions are likely to occur. If AT" is small, truncations are likely to occur. The study found that there is a time-intervals threshold beyond which collisions are rapidly increased. Based on this observation, the study proposed a statistical models which can be used to select an appropriate time interval. In our experience, collision is not a big problem if the error type and device information is used in data coalescing as shown in the above coalescing algorithm. Truncation is usually not considered to be a problem [14].
202
4.1.5 Preliminary Analysis Once coalesced data is obtained, basic dependability characteristics of the measured system can be identified by a preliminary statistical analysis. Commonly used measures in the analysis include error/failure frequency, TTE or TTF distribution, and error/failure hazard rate function. In the following discussion, data from a VAXcluster system [52] is used to illustrate analysis methods.
4.1.5.1 Basic Statistics It is important but easy to obtain basic statistics from the measured data such as frequency, percentage, and probability. These statistics provide an overall picture of the measured system. Often, dependability bottlenecks can be identified by analysis of these statistics. Table 4.1.4 shows the error/failure statistics for the measured VAXcluster. In the table, UO errors include disk, tape, and network errors. Machine errors include CPU and memory errors. Software errors are software-related errors. The 95% confidence intervals for the percentage and probability estimates are also provided in the table. Two bottlenecks can be identified from the table. First, the major error category is I/O errors (93%), i.e., errors from shared resources. This category of error has a very high recovery probability (0.996). However, these errors still result in nearly 43% of all failures. This result indicates that, although the system is generally robust to the impact of 1/0 errors, the shared resources still constitute a major reliability bottleneck due to the sheer number of errors. Improving such a system may require using an ultra-reliable network and a disk system to reduce the raw error rate, not just providing high recoverability. Table 4.1.4 Error/Failure Statistics for the VAXcluster Error
Recovery
Failure
Category Count
Percentage
Count
Percentage
Probability
I/O Machine Software Unknown
25807 1721 69 191
92.87±0.30 6.19±0.28 0.25±0.06 0.69±0.10
105 5 62 73
42.86+6.20 2.04±1.77 25.3115.44 29.80+5.73
0.996±0.001 0.970±0.002 0.101±0.071 0.61810.069
All
27788
100.0
245
100.0
0.991+0.001
203 Secondly, although software errors constitute only a small part of all errors (0.3%), they result in significant failures (25%). This is because software errors have a very low recovery probability (0.1). This software failure estimation is conservative because there are significant unknown failures (30%). Some of these unknown failures could be attributed to software problems. Thus, software-related problems are severe in the measured system.
4.1.5.2 Empirical TTE Distributions and Hazard Rates TTE/TTF probability distributions and error/failure hazard rates are commonly used to investigate how errors and failures occur across time. It is relatively easy to obtain empirical TTE/TTF distributions from data. Figure 4.1.2 shows the empirical TTE distribution function, f{t), for a measured VAXcluster system [52]. Notice that the logarithmic coordinate is used for f(t) because of the big contrast between the largest and smallest values. It is seen that about 67% of the TBEs are less than one minute. Most of these instances are "time between errors of two different machines" because errors of the same type occurring within a 5-minute interval of each other on the same machine have been coalesced into a single error event. This fact implies that errors arc likely to occur on the different machines in the measured system within a very short period of time. The hazard rate characterizes error/failure intensity in time. It can be considered to be the probability that an error (failure) will occur within the coming unit of time, given that no error (failure) has occurred since the start of the system or the last error (failure) occurrence. The mathematical definition of the hazard rate [42] is as follows: h{t) =
Pr(error in (t, t+dt)} _ f(t) Prfno errors in (0. t}J dt 1 -F(f)
1.000 Mean= 12.9 Median = 0.08 Std. Dev. = 46.1
0.100-
/(') 0.0100.001 0
10
20 30 / (minutes)
40
50
Figure 4.1.2 VAXcluster Empirical TTE Distribution
(4.1.1)
204 Figure 4.1.3 shows the empirical failure hazard rates computed from the VAXcluster failure data. The high hazard rate near the origin, i.e., the high probability that the second failure will occur within a short time after a failure occurrence, indicates that failures in the VAXcluster tend to occur in bursts. The most likely for a second failure is the first two hours after a failure occurrence. Failure bursts have been observed by many studies [4], [17], [23]. Actually, in an early study of transient errors [34], the Weibull distribution with a decreasing failure rate identified for the interarrival time of failures caused by transient errors implicated the existence of failure bursts. 0.40.3hU) 0.20.100
\ . Un__n.^-n, ,.-^_n—.
c
10
20 30 / (hours)
^j~i
40
1 —
5
Figure 4.1.3 VAXcluster Empirical Failure Hazard
4.1.5.3 Analytical TTE Distributions A realistic, analytical form of TTE distributions is essential in modeling and evaluating computer system dependability. Often, for simplicity or due to lack of information, TTEs are assumed to be exponentially distributed. Early measurementbased studies found that the Weibull distribution with decreasing failure rate is representative of the time between failures (TBF) in a measured DEC computer system [34] and a measured IBM operating system [22]. A recent comparative study of the dependability of the Tandem GUARDIAN, VAX VMS, and IBM MVS operating systems showed that the software TTE in a single machine can be represented by a multi-stage gamma distribution and the software TTE in multicomputers can be represented by a hyperexponential distribution [27]. In this section, we discuss these two types of distributions. We first explain how a TTE distribution is obtained from a multicomputer system. The above GUARDIAN and VMS operating systems were running on multicomputers. Typically, all the constituent machines work in a similar environment and run the same version of the operating system. The whole system is treated as a single entity in which multiple instances of an operating system are running concurrently. Every software error in the system is sequentially ordered, and a distribution
205 is constructed. The constructed TTE distribution reflects the software error characteristics for the whole system. We will call this distribution the multicomputer software TTE distribution. Figure 4.1.4 shows the analytical TTE or TTH (Time To Halt) distributions fitted using SAS for the three measured systems. All the three empirical distributions failed to fit simple exponential functions. The fitting was tested using the Kolmogorov-Smirnov or Chi-square test at a 0.05 significance level. The two-phase hyperexponential distribution provided satisfactory fits for the VAXcluster and Tandem multicomputer software TTE distributions. An attempt to fit the MVS TTE distribution to a phase-type exponential distribution led to a large number of stages. As a result, the following multi-stage gamma distribution was used:
(a) IBM MVS Software TTE Distribution /{f) = 0.748-g(f;2.1,-1)+ 0.55-s((; 0.5,0)+ 0.069-sO; 3.5..1) + 0.030-,?((;5.0,8) + 0.098's((;5.0,1.7)
/(')
200 I (minutes) (b) VAXcluster Software TTE Distribution
300
400
(c) Tandem Software TTH Distribution 12-
1
osJ /(')
fU) = a,Xte'*'' + 02X26'^'' a 1=0.67 ,i,=0.20 ^2=0.33 ^2=2.75
/(0 = « ! AjC .08-
m .04-
04
^'' + CC2^2^ '*''
a,=0.87 02=0.13
,i,=0.10 /i2=2.78
^ ^ _ _ 00
.0010 15 t (days)
20
25
10 15 / (days)
20
Figure 4.1.4 Analytical Software TTE Distributions Extracted from Data
25
206 n
fit) = Z a,g{t\ff,-.Si),
(4.1.2)
(=1
where a, > 0, J) a, = 1, and
t < s,
g(t;a,s)--
na) It was found that a 5-stage gamma distribution provided a satisfactory fit, which means that the software TTE distribution on the MVS has a complicated mode. Figures 4.1.4(b) and 4.1.4(c) show that the multicomputer software TTE distribution can be modeled as a probabilistic combination of two exponential random variables, indicating that there are two dominant error modes. The higher error rate, /I2, with occurrence probability a2, captures both the error bursts (multiple errors occurring on the same operating system within a short period of time) and concurrent errors (multiple errors on different copies of an operating system within a short period of time) on these systems. The lower error rate, A], with occurrence probability tt], captures regular errors and provides an inter-burst error rate. Error bursts can be explained as repeated occurrences of the same software problem or as multiple effects of an intermittent hardware fault on the software. Actually, software error bursts have been observed in laboratory experiments reported in [4]. The study showed that, if the input sequences of the software under investigation are correlated (rather than independent), one can expect more "bunching" of failures than those predicted using a constant failure rate assumption. In an operating system, input sequences (user requests) are highly likely to be correlated. Hence, a defect area may be triggered repeatedly.
4.1.6 Dependency Analysis Many underlying dependencies exist among measured parameters and components, such as the dependency between workload and failure rate and the dependency among failures on different components. Understanding such dependency is important for improving system dependability and developing realistic models and hence better designs. In this regard, the workload/failure dependency issue was studied in the early 1980s and the correlated failure issue was investigated recently.
207 Dependency between workload and failure was addressed in two approaches: statistical quantification of the dependence between workload and failure rate [5], [22] and stochastic modeling of failures as functions of workload [6], [7]. Both approaches demonstrated the strong correlation between workload and failure rate. This result indicated that dependability models cannot be considered representative unless the system workload is taken into account. Based on this result, several workload-dependent analytical models have been proposed [1], [10], [36]. Recent measurements on VAXclusters [48], [58] and Tandem machines [25] found that correlated failures significantly exist in distributed systems. Further studies showed that even a small correlation can have major impact on system dependability [9], [49], [50]. Neither traditional models that assume failure independence nor those that are believed to take correlation into account are representative of the actual occurrence process of the observed correlated failures in thefield[53]. In the following subsections, dependency analysis is illustrated through three examples: 1) using a workload hazard model to analyze the dependency between workload and software failures in an IBM 3081 system, 2) using the correlation analysis method to analyze dependency between errors on different machines and its impact on dependability in a VAXcluster system, and 3) using the factor analysis method to analyze the multi-way dependency among failures on multiple processors in a Tandem fault-tolerant system.
4.1.6.1. Workload/Failure Dependency In [19], a load hazard model was introduced to measure the risk of a failure as the system activity increases. The proposed model is similar to the hazard rate defined in Eq (4.1.1). Given a workload variable X, the load hazard is defined as z(x) =
Pr[failure in load interval (x, x + Ax)] gix) = Pr[no failure in load interval (0, x)] Ax 1 — G{x)
(4.1.4)
where g(x) is the p.d.f. of the variable "a failure occurs at a given workload value x" and G{x) is the corresponding c.d.f. That is, f{x) g{x) = Pr[ failure occurs \ X = x] = l{x)
,
(4.1.5)
where l{x) is simply the p.d.f. of the workload in consideration: l{x) = Pr[X = x],
(4.1.6)
208 and f{x) is the joint p.d.f. of the system state (failure state or non-failure state) and the workload: f{x) = Prlfailure occurs & X = x] .
(4.1.7)
A constant hazard rate implies that failures are occurring randomly with respect to the workload. An increasing hazard rate on the increase of X implies that there is an increasing failure rate with increasing workload. The load hazard model was applied to the software failure and workload data collected from an IBM 3081 system running the VM operating system. Based on the collected data, lix), fix), g(x}, and z(x) were computed for each workload variable. Figure 4.1.5 shows the zix) plots for three selected workload variables: (1)
OVERHEAD — fraction of CPU time spent on the operating system;
(2)
PAGEIN — number of page reads per second by all users;
(3)
SIO (Start I/O) — number of input/output operations per second.
The regression coefficient, R^, which is an effective measure of the goodness of fit, is also provided in the figure. The hazard plots show that the workload parameters appear to be acting as stress factors, i.e., the failure rate increases as the workload increases. The effect is particularly strong in the case of the interactive workload measures OVERHEAD and SIO. The correlation coefficients of 0.95 and 0.91 show that the failure closely fit an increasing load hazard model. The risk of a failure also increases with increased PAGEIN, although at a somewhat lower correlation (0.82). Note that the vertical scale on these plots is logarithmic, indicating that the relationship between the load hazard z{x) and the workload variable is exponential, i.e., the risk of a software failure increases exponentially with increasing workload.
+^^+
z(.x)
zM
I0-'
10-
0
0.2 0.4 0.6 O.i X = OVERHEAD
0
20 40 60 80 100 X = PAGEIN
•
0
R- = 0.9\
+
50
100 KiO 200 x = SIO
Figure 4.1.5 Workload Hazard Plots for the IBM 3081 System
209 4.1.6.2 Correlation Analysis Recent measurements on VAXclusters [48] [58] and Tandem machines [25] [26] found that correlated failures are not negligible (about 10% to 30% of all failure data) in distributed systems. By a correlated failure we mean the occurrence of more than one failure on the different components of a computer system within a small time window. Figure 4.1.6 shows a scenario of correlated failures in a measured VAXcluster system [51]. In the figure, Europa, Jupiter, and Mercury are machine names in the VAXcluster. A dashed line represents that the corresponding machine is in a failure state. At some time, a network error (netl) was reported from the CI (Computer Interconnect) port on Europa. This resulted in a software failure (softl) 13 seconds later. Twenty-four seconds after the first network error (netl), additional network errors (net2,net3) were reported on the second machine (Jupiter), which was followed by a software failure (soft2). The error sequence on Jupiter was repeated (net4,net5,soft3) on the third machine (Mercury). The three machines experienced software failures concurrently for 45.5 minutes. When errors/failures on two components are related, the correlation coefficient is a commonly used measure to quantify such dependence. Given random variables Xi and X2, the correlation coefficient between Xi and X2 is defined as [42] Cor(Xi,X2)-
netl .softl Europa _ | I 'l3sec.'
Jupiter _ j 24 sec.
net2
net3
1
1
9 sec.
60 sec.
(4.1.8)
reboot I '
47.83 min.
Mercury _ |
Note:
,
soft2
reboot
1
10 sec.'
L 47.33 min. nel4 net.^ soft3 I I I 78 sec.' 11 sec.
45.5 itiin.
reboot \ 4 sec.
softl, soft2, soft3 — Exception while above asynchronous system traps delivery or on interrupt stack, netl. net3, net.') — Port will be re-started. net2, net4 — Virtual circuit timeout.
Figure 4.1.6 A Scenario of Correlated Failures
L
210 where ^\ and jij are the means of Xj and X2, and a\ and Oi are the standard deviations of X\ and Xj. When calculating correlation coefficients, estimates of these parameters from samples can be used. The first step in correlation analysis is building a data matrix from the measured data. Assume that there are n components in the measured system and the measured period is divided into m equal intervals of A/ (e.g., 5 minutes). An mx« data matrix can then be constructed in the following way. The n columns of the matrix represent the n components in the measured system. The m rows of the matrix represent the m time intervals. Element (;, j) of the matrix is set to the number of errors occurring within interval / on component j . Column j can be regarded as a sample of the random variable, Xj, which represents the state of component j in the system. The second step is calculating correlation coefficients using Eq. (4.1.8) based on the data matrix. Each time, we pick up two columns (X, and Xj) to calculate Cor{X,, Xj). This step can be automated by using a statistical package such as SAS. Table 4.1.5 shows the average correlation coefficients of the 21 pairs of machines in a VAXcluster for different types of errors and failures [52]. Generally, the error correlation is high (0.62) and the failure correlation is low (0.06). Disk and network errors are strongly correlated, because the processors in the system heavily use and share the disks and the network concurrently. Table 4.1.5 Average Correlation Coefficients for VAXcluster Errors Error All
CPU
Memory
0.62
0.03
0.01
Failure
Disk
Network
Software
All
0.78
0.70
0.02
0.06
Although the failure correlation is low, further studies showed that even such a low correlation can have big impact on system dependability [9], [49], [50]. It can degrade system reliability and unavailability by several orders of magnitude. In [49], a formula was derived to calculate unavailability for l-out-of-2 systems with correlated failures and was validated against real data. The formula relates the correlation coefficient discussed above to the system unavailability:
Un = pW^i}^-U^Wy^)
+ f/|t/2 >
(4.1.9)
where ^712 is the unavailability of the l-out-of-2 system, U\ and U2 are the unavailability of components 1 and 2 in the system, respectively, and p is the correlation coefficient between the two components, i.e., Cor{X\ Xi). When ^ = 0, the formula
211 reduces to the traditional formula (t/12 = lJ\Ui) for calculating the joint failure probability of two independent components. When p > 0, even if it is small (e.g., 0.01), the first term in the formula can be dominant because U\ and Ui arc usually much smaller than p in modern computer systems. For the VAXcluster data, if each pair of processors is treated as a l-out-of-2 system, the traditional formula underestimates unavailability, on the average, by two orders of magnitude.
4.1.6.3 Factor Analysis If errors/failures on more than two components are related, the correlation coefficient is not enough to quantify the dependence among these components (multiway correlation). In such a case, ihe factor analysis method can be used to uncover the multiway correlation. Factor analysis is a statistical technique to quantify multiway dependency among variables [8]. The method attempts to find a set of unobserved common factors that link together the observed variables. Consequently, it provides insight into Ihe underlying structure of the data. Let X = (A|, . . . , Xp) be a normalized random vector. We say that the /:-factor model holds for X if X can be written in the form X = AF+E,
(4.1.10)
where A = (Ay) (i = I,...,p; j = \,...,k) is a matrix of constants called factor loadings, and F = (/,,...,/(,) and E = (C],... ,ei^) are random vectors. The elements of F are called common factors, and the elements of E are called unique factors (error terms). These factors are unobservable variables. It is assumed that all factors (both common and unique factors) are independent of each other and that the common factors are normalized. Each variable x^ {i - 1,..., p), can then be expressed as x, = 'LAijfj + e^,
(4.1.11)