FOUNDATIONS OF DEPENDABLE COMPUTING System Implementation
THE KLUWER INTERNATIONAL SERIES IN ENGINEERING AND COMPUTER SCIENCE

OFFICE OF NAVAL RESEARCH Advanced Book Series

Consulting Editor: André M. van Tilborg
Other titles in the series:

FOUNDATIONS OF DEPENDABLE COMPUTING: Models and Frameworks for Dependable Systems, edited by Gary M. Koob and Clifford G. Lau. ISBN: 0-7923-9484-4

FOUNDATIONS OF DEPENDABLE COMPUTING: Paradigms for Dependable Applications, edited by Gary M. Koob and Clifford G. Lau. ISBN: 0-7923-9485-2

PARALLEL ALGORITHM DERIVATION AND PROGRAM TRANSFORMATION, edited by Robert Paige, John Reif and Ralph Wachter. ISBN: 0-7923-9362-7

FOUNDATIONS OF KNOWLEDGE ACQUISITION: Cognitive Models of Complex Learning, edited by Susan Chipman and Alan L. Meyrowitz. ISBN: 0-7923-9277-9

FOUNDATIONS OF KNOWLEDGE ACQUISITION: Machine Learning, edited by Alan L. Meyrowitz and Susan Chipman. ISBN: 0-7923-9278-7

FOUNDATIONS OF REAL-TIME COMPUTING: Formal Specifications and Methods, edited by André M. van Tilborg and Gary M. Koob. ISBN: 0-7923-9167-5

FOUNDATIONS OF REAL-TIME COMPUTING: Scheduling and Resource Management, edited by André M. van Tilborg and Gary M. Koob. ISBN: 0-7923-9166-7
FOUNDATIONS OF DEPENDABLE COMPUTING System Implementation
edited by
Gary M. Koob
Clifford G. Lau
Office of Naval Research
KLUWER ACADEMIC PUBLISHERS
Boston / Dordrecht / London
Distributors for North America: Kluwer Academic Publishers, 101 Philip Drive, Assinippi Park, Norwell, Massachusetts 02061 USA

Distributors for all other countries: Kluwer Academic Publishers Group, Distribution Centre, Post Office Box 322, 3300 AH Dordrecht, THE NETHERLANDS
Library of Congress Cataloging-in-Publication Data

Foundations of dependable computing. System implementation / edited by Gary M. Koob, Clifford G. Lau.
p. cm. -- (The Kluwer international series in engineering and computer science; SECS 0285)
Includes bibliographical references and index.
ISBN 0-7923-9486-0
1. Electronic digital computers--Reliability. 2. Real-time data processing. 3. Fault-tolerant computing. 4. Systems engineering. I. Koob, Gary M., 1958- . II. Lau, Clifford. III. Series: Kluwer international series in engineering and computer science; SECS 0285.
QA76.5.F624 1994
004.2'2--dc20
94-29138
CIP
Copyright © 1994 by Kluwer Academic Publishers
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, mechanical, photo-copying, recording, or otherwise, without the prior written permission of the publisher, Kluwer Academic Publishers, 101 Philip Drive, Assinippi Park, Norwell, Massachusetts 02061.
Printed on acid-free paper. Printed in the United States of America
CONTENTS
Preface .......... vii
Acknowledgements .......... xiii

1. DEPENDABLE COMPONENTS .......... 1

1.1 Self-Checking and Self-Exercising Design for Hierarchic Long-Life Fault-Tolerant Systems .......... 3
    D.A. Rennels and H. Kim

1.2 Design of Self-Checking Processors Using Efficient Berger Check Prediction Logic .......... 35
    T.R.N. Rao, G-L Feng, and M.S. Kolluru

2. DEPENDABLE COMMUNICATIONS .......... 69

2.1 Network Fault-Detection and Recovery in the Chaos Router .......... 71
    K.W. Bolding and L. Snyder

2.2 Real-Time Fault-Tolerant Communication in Distributed Computing Systems .......... 87
    K.G. Shin and Q. Zheng

3. COMPILER SUPPORT .......... 133

3.1 Speculative Execution and Compiler-Assisted Multiple Instruction Recovery .......... 135
    W.K. Fuchs, N.J. Alewine, and W-M Hwu

3.2 Compiler Assisted Synthesis of Algorithm-Based Checking in Multiprocessors .......... 159
    P. Banerjee, V. Balasubramanian, and A. Roy-Chowdhury

4. OPERATING SYSTEM SUPPORT .......... 213

4.1 Application-Transparent Fault Management in Fault-Tolerant Mach .......... 215
    M. Russinovich, Z. Segall, and D.P. Siewiorek

4.2 Constructing Dependable Distributed Systems Using Consul .......... 243
    R.D. Schlichting, S. Mishra, and L.L. Peterson

4.3 Enhancing Fault-Tolerance of Real-Time Systems Through Time Redundancy .......... 265
    S.R. Thuel and J.K. Strosnider

Index .......... 319
PREFACE
Dependability has long been a central concern in the design of space-based and military systems, where survivability for the prescribed mission duration is an essential requirement, and is becoming an increasingly important attribute of government and commercial systems, where reduced availability may have severe financial consequences or even lead to loss of life. Historically, research in the field of dependable computing has focused on the theory and techniques for preventing hardware and environmentally induced faults through increasing the intrinsic reliability of components and systems (fault avoidance), or surviving such faults through massive redundancy at the hardware level (fault tolerance). Recent advances in hardware, software, and measurement technology, coupled with new insights into the nature, scope, and fundamental principles of dependable computing, however, contributed to the creation of a challenging new research agenda in the late eighties aimed at dramatically increasing the power, effectiveness, and efficiency of approaches to ensuring dependability in critical systems. At the core of this new agenda was a paradigm shift spurred by the recognition that dependability is fundamentally an attribute of applications and services, not platforms. Research should therefore focus on (1) developing a scientific understanding of the manifestations of faults at the application level in terms of their ultimate impact on the correctness and survivability of the application; (2) innovative, application-sensitive approaches to detecting and mitigating this impact; and (3) hierarchical system support for these new approaches. Such a paradigm shift necessarily entailed a concomitant shift in emphasis away from inefficient, inflexible, hardware-based approaches toward higher-level, more efficient and flexible software-based solutions. Consequently, the role of hardware-based mechanisms was redefined to that of providing and implementing the abstractions required to support the higher-level software-based mechanisms in an integrated, hierarchical approach to ultradependable system design. This shift was furthermore compatible with an expanded view of "dependability," which had evolved to mean "the ability of the system to deliver the specified (or expected) service." Such a definition encompasses not only survival of traditional single hardware faults and environmental disturbances but more complex and less well understood phenomena as well: Byzantine faults, correlated errors, timing faults, software design and process interaction errors, and, most significantly, the unique issues encountered in
real-time systems in which faults and transient overload conditions must be detected and handled under hard deadline and resource constraints. As sources of service disruption multiplied and focus shifted to their ultimate effects, traditional frameworks for reasoning about dependability had to be rethought. The classical fault/error/failure model, in which underlying anomalies (faults) give rise to incorrect values (errors), which may ultimately cause incorrect behavior at the output (failures), required extension to capture timing and performance issues. Graceful degradation, a long-standing principle codifying performance/dependability trade-offs, must be more carefully applied in real-time systems, where individual task requirements supersede general throughput optimization in any assessment. Indeed, embedded real-time systems, often characterized by interaction with physical sensors and actuators, may possess an inherent ability to tolerate brief periods of incorrect interaction, either in the values exchanged or the timing of those exchanges. Thus, a technical failure of the embedded computer does not necessarily imply a system failure. The challenge of capturing and modeling dependability for such potentially complex requirements is matched by the challenge of successfully exploiting them to devise more intelligent and efficient, as well as more complete, dependability mechanisms. The evolution to a hierarchical, software-dominated approach would not have been possible without several enabling advances in hardware and software technology over the past decade:

(1) Advances in VLSI technology and RISC architectures have produced components with more chip real estate available for incorporation of efficient concurrent error detection mechanisms and more on-chip resources permitting software management of fine-grain redundancy;

(2) The emergence of practical parallel and distributed computing platforms possessing inherent coarse-grain redundancy of processing and communications resources, also amenable to efficient software-based management by either the system or the application;
(3) Advances in algorithms and languages for parallel and distributed computing leading to new insights in and paradigms for problem decomposition, module encapsulation, and module interaction, potentially exploitable in refining redundancy requirements and isolating faults;

(4) Advances in distributed operating systems allowing more efficient interprocess communication and more intelligent resource management;

(5) Advances in compiler technology that permit efficient, automatic instrumentation or restructuring of application code, program decomposition, and coarse- and fine-grain resource management; and

(6) The emergence of fault-injection technology for conducting controlled experiments to determine the system- and application-level manifestations of faults and evaluating the effectiveness or performance of fault-tolerance methods.
In response to this challenging new vision for dependable computing research, the advent of the technological opportunities for realizing it, and its potential for addressing critical dependability needs of Naval, Defense, and commercial systems, the Office of Naval Research launched a five-year basic research initiative in 1990 in Ultradependable Multicomputers and Electronic Systems to accelerate and integrate progress in this important discipline. The objective of the initiative is to establish the fundamental principles as well as practical approaches for efficiently incorporating dependability into critical applications running on modern platforms. More specifically, the initiative sought increased effectiveness and efficiency through (1) intelligent exploitation of the inherent redundancy available in modern parallel and distributed computers and VLSI components; (2) more precise characterization of the sources and manifestations of errors; (3) exploitation of application semantics at all levels (code, task, algorithm, and domain) to allow optimization of fault-tolerance mechanisms to both application requirements and resource limitations; (4) hierarchical, integrated software/hardware approaches; and (5) development of scientific methods for evaluating and comparing candidate approaches. Implementation of this broad mandate as a coherent research program necessitated focusing on a small cross-section of promising application-sensitive paradigms (including language, algorithm, and coordination-based approaches), their required hardware, compiler, and system support, and a few selected modeling and evaluation projects. In scope, the initiative emphasizes dependability primarily with respect to an expanded class of hardware and environment (both physical and operational) faults. Many of the efforts furthermore explicitly address issues of dependability unique to the domain of embedded real-time systems. The success of the initiative and the significance of the research is demonstrated by the ongoing associations that many of our principal investigators have forged with a variety of military, Government, and commercial projects whose critical needs are leading to the rapid assimilation of concepts, approaches, and expertise arising from this initiative. Activities influenced to date include the FAA's Advanced Automation System for air traffic control, the Navy's AX project and Next Generation Computing Resources standards program, the Air Force's Center for Dependable Systems, the OSF/1 project, the space station Freedom, the Strategic
Defense Initiative, and research projects at GE, DEC, Tandem, the Naval Surface Warfare Center, and the MITRE Corporation. This book series is a compendium of papers summarizing the major results and accomplishments attained under the auspices of the ONR initiative in its first three years. Rather than providing a comprehensive text on dependable computing, the series is intended to capture the breadth, depth, and impact of recent advances in the field, as reflected through the specific research efforts represented, in the context of the vision articulated here. Each chapter does, however, incorporate appropriate background material and references. In view of the increasing importance and pervasiveness of real-time concerns in critical systems that impact our daily lives, ranging from multimedia communications to manufacturing to medical instrumentation, the real-time material is woven throughout the series rather than isolated in a single section or volume. The series is partitioned into three volumes, corresponding to the three principal avenues of research identified at the beginning of this preface. While many of the chapters actually address issues at multiple levels, reflecting the comprehensive nature of the associated research project, they have been organized into these volumes on the basis of the primary conceptual contribution of the work. Agha and Sturman, for example, describe a framework (reflective architectures), a paradigm (replicated actors), and a prototype implementation (the Screed language and Broadway runtime system). But because the salient attribute of this work is the use of reflection to dynamically adapt an application to its environment, it is included in the Frameworks volume. Volume I, Models and Frameworks for Dependable Systems, presents two comprehensive frameworks for reasoning about system dependability, thereby establishing a context for understanding the roles played by specific approaches presented throughout the series. This volume then explores the range of models and analysis methods necessary to design, validate, and analyze dependable systems. Volume II, Paradigms for Dependable Applications, presents a variety of specific approaches to achieving dependability at the application level. Driven by the higher-level fault models of Volume I and built on the lower-level abstractions implemented in Volume III, these approaches demonstrate how dependability may be tuned to the requirements of an application, the fault environment, and the characteristics of the target platform. Three classes of paradigms are considered: protocol-based paradigms for distributed applications, algorithm-based paradigms for parallel applications, and approaches to exploiting application semantics in embedded real-time control systems. Volume III, System Implementation, explores the system infrastructure needed to support the various paradigms of Volume II. Approaches to implementing
support mechanisms and to incorporating additional appropriate levels of fault detection and fault tolerance at the processor, network, and operating system level are presented. A primary concern at these levels is balancing cost and performance against coverage and overall dependability. As these chapters demonstrate, low-overhead, practical solutions are attainable and not necessarily incompatible with performance considerations. The section on innovative compiler support, in particular, demonstrates how the benefits of application specificity may be obtained while reducing hardware cost and run-time overhead. This third volume in the series completes the picture established in the first two volumes by presenting detailed descriptions of techniques for implementing the dependability infrastructure of the system at the operating system, run-time environment, communications, and processor levels. Section 1 presents design approaches for implementing concurrent error detection in processors and other hardware components. Rennels and Kim apply the principles of self-checking, self-exercising design at the processor level. Rao et al. use an extension of the well-known Berger code as the mathematical foundation for the construction of self-checking ALUs. These components provide the fundamental building blocks of dependable systems and the abstractions required by higher-level software-implemented protocols. The field of fault-tolerant computing was once dominated by concerns over the dependability of the processor. In modern parallel and distributed systems, the reliability of the network is just as critical. Communications dependability, like processor dependability, is more appropriately addressed at the lower layers of the system to minimize impact on performance and to simplify higher-level protocols and algorithms. In the first chapter of Section 2, Bolding and Snyder describe how the inherent attributes of chaotic routing, an approach proposed for its performance advantages in parallel systems, may be adapted to support dependability as well. The use of non-determinism in chaotic routing is the key to realizing its goal of high throughput, but is inappropriate for real-time systems where low latency and predictability are the primary concerns. Shin and Zheng offer the concept of redundant real-time channels as a solution to the problem of dependable end-to-end communications in distributed systems under hard real-time constraints. Compiler technology has emerged within the past decade as a dominant factor in effectively mapping application demands onto available resources to achieve high utilization. The chapters in Section 3 explore the intersection of compiler optimizations for high performance and compiler transformations to support dependability goals. Fuchs et al. describe the similarities between compiler transformations intended to support instruction-level recovery from transient errors and the state management requirements encountered when speculative execution is employed in super-scalar architectures. Banerjee et al. adapt parallelizing compiler technology to the automation
of algorithm-based fault tolerance. The approach considers not only transformations for generating checks, but also partitioning, mapping, and granularity adjustment strategies that balance performance and dependability requirements. Operating system support is central to the design of modern dependable systems. Support is required for service abstractions, communication, checkpointing and recovery, and resource/redundancy management. All of these are considered in Section 4. Russinovich et al. describe how various error detection and recovery mechanisms may be efficiently integrated into an operating system built on the popular Mach microkernel in a manner transparent to the application itself but customizable to its requirements. Protocol-based paradigms for distributed systems all seem to rely on a number of fundamental protocols such as atomic multicast and group membership. Schlichting et al. have organized these operations into a comprehensive communications "substrate". By exploiting the interdependencies among them, an efficient implementation has been constructed. The problem of resource management in a real-time distributed system becomes even more complex when deadline-constrained redundancy and fault management must be supported. Thuel and Strosnider present a framework for managing redundancy, exceptions, and recovery under hard real-time constraints.

Gary M. Koob
Mathematical, Computer and Information Sciences Division
Office of Naval Research
Clifford G. Lau
Electronics Division
Office of Naval Research
ACKNOWLEDGEMENTS
The editors regret that, due to circumstances beyond their control, two planned contributions to this series could not be included in the final publications: "Compiler Generated Self-Monitoring Programs for Concurrent Detection of Run-Time Errors," by J.P. Shen, and "The Hybrid Fault Effects Model for Dependable Systems," by C.J. Walter, M.M. Hugue, and N. Suri. Both represent significant, innovative contributions to the theory and practice of dependable computing and their omission diminishes the overall quality and completeness of these volumes. The editors would also like to gratefully acknowledge the invaluable contributions of the following individuals to the success of the Office of Naval Research initiative in Ultradependable Multicomputers and Electronic Systems and this book series: Joe Chiara, George Gilley, Walt Heimerdinger, Robert Holland, Michelle Hugue, Miroslaw Malek, Tim Monaghan, Richard Scalzo, Jim Smith, André van Tilborg, and Chuck Weinstock.
SECTION 1
DEPENDABLE COMPONENTS
SECTION 1.1
Self-Checking and Self-Exercising Design for Hierarchic Long-Life Fault-Tolerant Systems

David Rennels and Hyeongil Kim

Abstract

This research deals with fault-tolerant computers capable of operating for extended periods without external maintenance. Conventional fault-tolerance techniques such as majority voting are unsuitable for these applications, because performance is too low, power consumption is too high, and an excessive number of spares must be included to keep all of the replicated systems working over an extended life. The preferred design approach is to operate as many different computations as possible on single computers, thus maximizing the amount of processing available from limited hardware resources. Fault-tolerance is implemented in a hierarchic fashion. Fault recovery is either done locally within an afflicted computer or, if that is unsuccessful, by the other working computers when one fails. Concurrent error detection is required in the computers making up these systems, since errors must be quickly detected and isolated to allow recovery to begin. This chapter discusses ways of implementing concurrent error detection (i.e., self-checking) and in addition providing self-exercising capabilities that can rapidly expose dormant faults and latent errors. The fundamentals of self-checking design are presented along with an example -- the design of a self-checking self-exercising memory system. A new methodology for implementing self-checking in asynchronous subsystems is discussed along with error simulation results to examine its effectiveness.
1.1.1 Introduction

There is a class of multicomputer applications that require long unmaintained operation in the range of a decade or more. The obvious example is remote sensing, e.g., space satellites. Important new applications of this type are expected for computers that are embedded in long-life host systems where maintenance is expensive or
1. Computer Science Department, University of California at Los Angeles. This work was supported by the Office of Naval Research, grant N00014-91-J-1009.
inconvenient, e.g., transportation systems and patient monitoring. They are becoming practical because the rapidly decreasing cost of hardware makes it cost-effective to employ redundancy during the construction of a computer to avoid maintenance costs later and to avoid the inconvenience and danger of later breakdowns. Many of these applications also require minimization of power, weight, and volume. In spacecraft this is due to limited resources, and in ground-based systems this may be due to battery limitations or the wish to reduce heat to improve reliability of densely packaged systems. Technology for fabricating these systems is moving toward very densely packaged systems using stacked multichip modules and stacked chips. These packaging techniques provide greatly improved performance while minimizing power, weight, and volume, and we expect that they will also become a preferred way of making systems when their high volume leads to low cost. One side effect of extremely densely packaged computers is that they are very expensive to take apart and repair. Conversely, the cost of adding redundant chips is relatively small. Thus the trend will be to add in the needed redundancy at the time of fabrication to guarantee long reliable life. This technology leads to building highly modular systems in which many small computer modules work together to solve large computing problems. Here the designers aim at as high a performance as possible, and this means that as many computers as possible should be doing different parts of a problem. To optimize the design of highly reliable long-life systems it is necessary to use hardware in a highly efficient fashion. If one uses the classical fault-tolerance techniques of triplication and voting, performance is reduced by an unacceptable factor of three. Furthermore, for long-life unmaintained applications, a sufficient amount of spare hardware must be included to guarantee that a triplicated system (i.e., three systems) will be operational at the end of mission. This is unacceptable in terms of power, weight, volume, and performance, so we must look for a more hardware-efficient way of providing fault tolerance and long unmaintained life [1].
1.1.2 Hierarchic Fault-Tolerant Designs

One approach to doing this is to design hierarchic fault-tolerant systems in which the amount of redundancy is varied to match the criticality of various computations in a computer system. Critical computations such as the operating system and a subset of critical applications programs are replicated (i.e., run on two or more computers) to allow computations to continue uninterrupted in the presence of faults. Less critical applications are run in single computers with rollback or checkpointing for recovery in order to maximize performance.
Figure 1.1.1 shows the recommended approach in a graphic fashion. The area represents the complete set of programs that are needed for the fault-tolerant system and is divided into three sub-areas 1, 2, and 3, which we will call levels since they are associated with different levels of protection.
• Level 1 programs (or area 1) provide critical functions that require very strong fault-tolerance protection. These include operating system and fault-recovery functions as well as applications programs that cannot tolerate delays or data loss when a fault occurs. Executive and fault-recovery programs of level one are used to manage recovery from faults in the lower-level programs of areas 2 and 3. These programs must be run redundantly on replicated hardware to allow continued operation when a module fails.
[Figure 1.1.1: Applications in an Hierarchic Fault-Tolerant System. The figure divides the application set into three regions: massive redundancy (voted or duplex self-checking), giving recovery with no program interruption or delay; standby redundancy with rollback/roll-forward, giving recovery with program rollback (tens of milliseconds delay); and standby redundancy in which data is re-computed, giving recovery by restart/roll-forward (seconds of delay plus data loss).]
• Level 2 consists of those programs that can accept delays in recovery on the order of milliseconds. Here programs can be run singly, but program rollback techniques are required to quickly restart computations from a recent point where the state has been saved. To do this, concurrent error detection is required in all processors. That is, error detection circuits must be placed in all of the computer modules that can detect errors as soon as they occur. This prevents damaged data from propagating to external modules or across rollback points before an error is detected, which would make recovery very difficult. Program rollback imposes extra complexity in program writing and requires duplicate storage of rollback recovery variables.
• Level 3 consists of less-critical programs that can accept considerable fault-recovery times (e.g., seconds), and which can be restarted after a fault is detected and corrected. These programs can be run on single machines, and fault recovery does not add particularly difficult problems in programming. Concurrent error detection hardware is highly recommended so that hardware errors can be reliably detected and isolated.
Of course, the diagram of Figure 1.1.1 is a simplification. The number of programs and their required fault-tolerance, speed, and memory resources varies with different applications. But many of the computers in long-life dedicated systems have only a minority of programs that require the highest levels of redundancy for fault protection, and a majority of programs can be run in simplex using the more heavily protected executive functions to manage their recovery. This can lead to a fault-tolerant design that uses hardware most efficiently, getting a maximum of computation out of the hardware available and thus maintaining performance over long time periods with a limited amount of spare resources. However, an underlying requirement in these applications is concurrent error detection. The computer modules and interconnection structures should be designed so that they can detect internal logic errors as soon as they manifest themselves -- before they multiply and propagate. Thus they can be safely depended upon to execute programs singly, but quickly detect errors and notify more heavily protected higher-level functions to institute recovery actions should a fault occur.
1.1.3 Design of Modules with Concurrent Error Detection

The basic techniques and theory for implementing self-checking concurrent error detection were developed a number of years ago by Carter et al. at IBM Research [2]. They defined a technique for designing self-checking checkers that could not only detect errors in the circuit being checked but could also detect faults in the checkers themselves. This solved the fundamental problem of who checks the checker. The approach was to use dual-rail logic where check signals are represented as two wires. The values 0,1 and 1,0 on the pairs indicate correct operation, while the values 1,1 or 0,0 indicate either an error in the circuit being checked or an error by the checker. When no errors occur, the check signals representing a "good" circuit alternate between 0,1 and 1,0 in a way that exercises and checks the check circuits. This is best explained by the examples shown in Figure 1.1.2. Most concurrent error checks consist of checking separable codes on data (e.g., parity and arithmetic codes) or comparing the outputs of duplicated circuits. For self-checking designs the checkers are modified to present a set of output pairs that take on the 0,1 and 1,0 values for correct data and the 0,0 or 1,1 values for errors, as shown in Figures 1.1.2(a) and 1.1.2(b). Carter et al. demonstrated what they called a "Morphic AND" circuit that reduces two 2-wire pairs of complementary signals to a single pair of complementary signals that also takes on values of 0,1 and 1,0 (see Figure 1.1.2(c)). If an input error occurs (an error in one of the circuits being checked), 0,0 or 1,1 occurs in one of the input pairs,
and the outputs will also take on the error value of 0,0 or 1,1. This circuit has the self-checking property that for every stuck-at signal in the checker there is at least one correct set of (complementary 0,1 or 1,0) input pairs that will cause a 1,1 or 0,0 output. That means that not only does the checker detect errors in the circuit being checked; there is also good data that will flush out an error in the checker by causing an error signal of 1,1 or 0,0 to occur in the outputs. A tree of Morphic AND circuits (see Figure 1.1.2(d)) will reduce a set of error checks to a single complementary pair that takes on values 1,0 or 0,1 if no error occurs, or 1,1 or 0,0 if an error occurs in the circuits being checked or in the checker.
[Figure 1.1.2: Self-Checking Checkers. (a) A self-checking odd parity check; (b) duplication and comparison, with one logic module's outputs inverted; (c) a Morphic AND reduction circuit; (d) a tree of Morphic AND circuits.]

This design methodology resulted in a large number of papers that showed how to implement self-checking concurrent error detection in a variety of error detection circuits. The author developed a self-checking computer module that provided concurrent error detection using this type of self-checking logic.
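To make the dual-rail convention concrete, the following Python sketch (ours, not part of the original text) models a two-rail Morphic AND cell and a reduction tree, and illustrates the self-checking property by injecting stuck-at faults at a cell's outputs (a simplification of the full internal fault model):

    # Behavioral model of Carter-style dual-rail checking. A check signal is
    # a pair of wires: (0,1) or (1,0) means "good"; (0,0) or (1,1) signals an
    # error, whether it arose in the checked circuit or inside the checker.

    from functools import reduce

    def morphic_and(a, b):
        """Two-rail checker cell: reduces two dual-rail pairs to one. Valid
        (complementary) inputs give a valid output; any 0,0 or 1,1 input
        forces a 0,0 or 1,1 output."""
        a0, a1 = a
        b0, b1 = b
        return (a0 & b0 | a1 & b1, a0 & b1 | a1 & b0)

    def checker_tree(pairs):
        """Reduce a set of dual-rail check signals to one master pair."""
        return reduce(morphic_and, pairs)

    def is_error(pair):
        return pair[0] == pair[1]           # 0,0 or 1,1 flags an error

    # All checks good: the master pair stays complementary.
    assert not is_error(checker_tree([(0, 1), (1, 0), (0, 1), (1, 0)]))
    # One miscompare (1,1) anywhere propagates to the master pair.
    assert is_error(checker_tree([(0, 1), (1, 1), (0, 1), (1, 0)]))

    # Self-checking property: for a cell output stuck at either value, some
    # valid input combination yields an invalid (0,0 or 1,1) output, so good
    # data eventually flushes out faults in the checker itself.
    valid = [(0, 1), (1, 0)]
    for stuck_bit in (0, 1):
        for stuck_val in (0, 1):
            exposed = False
            for a in valid:
                for b in valid:
                    z = list(morphic_and(a, b))
                    z[stuck_bit] = stuck_val        # inject stuck-at fault
                    exposed |= (z[0] == z[1])
            assert exposed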
1.1.3.1 The Fault-Tolerant Building Block Computer

As an example of a modular multicomputer architecture that uses self-checking logic design to provide computer modules with high-coverage concurrent error detection, we can turn to the JPL Fault-Tolerant Building-Block Computer. This system, built fifteen years ago at the Jet Propulsion Laboratory, used Self-Checking Computer Modules (SCCMs) [3]. It is still a good example of the type of computer architecture that can be used to support hierarchic applications in a cost-effective fashion. The SCCM, shown in Figure 1.1.3, was experimentally verified by inserting errors.
[Figure 1.1.3: A Self-Checking Computer Module. The block diagram shows external 1553A buses connected through bus adaptors (BAs) and a bus controller (BC) to the bus interface building blocks; a memory interface building block providing Hamming correction over a redundant memory (16 data bits, 6 Hamming bits, and spare bit planes); duplicated CPUs checked by the core building block, which also performs bus check, bus arbitration, processor check, and reset/rollback; I/O building blocks with DMA request/grant and priority daisy chains; and internal fault indicators collected over the internal bus.]
The SCCM is a small 16-bit computer which is capable of detecting its own malfunctions. It contains I/O and bus interface logic which allows it to be connected to other SCCMs to form fault-tolerant multicomputers. The SCCM contains commercially available microprocessors, memories, and four types of building-block circuits as shown in Figure 1.1.3. The building blocks are: 1) an error detecting (and correcting) memory interface building block (MI-BB), 2) a programmable bus interface building block (BI-BB), 3) a core building block (Core-BB), and 4) an I/O building block (IO-BB). A typical SCCM consists of 2 microprocessors, 23 RAMs, 1 MI-BB, 3 BI-BBs, 2 IO-BBs, and a single Core-BB. Although these circuits were designed before large ASICs became commercially available, they now could be implemented as single-chip ASICs. The building-block circuits control and interface the various processor, intercommunication, memory, and I/O functions to the SCCM's internal bus. Each building block is responsible for detecting faults in its associated circuitry and then signaling the fault condition to the Core-BB by means of an internal fault indicator. The MI-BB implements fault detection and correction in the memory, as well as providing detection of faults in its own internal circuitry. Similarly, the BI-BB and IO-BB provide intercommunications and I/O functions, along with detecting faults within themselves and their associated communications circuitry. The Core-BB checks the processing function by running two CPUs in synchronism and comparing their outputs. It is also responsible for fault collection and fault handling within the SCCM. The approach to concurrent error detection uses the Carter self-checking design methodology. For data paths and memory, error codes are checked as shown in Figure 1.1.2(a) to obtain 2-wire complementary error signals. Irregular circuits are duplicated and compared, with the outputs inverted on one module to obtain complementary pairs as shown in Figure 1.1.2(b). Finally, the error signals are combined as shown in Figure 1.1.2(c) to obtain a self-checked master fault indicator. Local fault-tolerance is handled by the Core-BB. It receives two-wire 1-out-of-2 coded fault indicators from the other building-block circuits, and it also checks internal bus information for proper coding. Upon detecting an error, the Core-BB disables the external bus interface and I/O functions, isolating the SCCM from its surrounding environment. The Core-BB then attempts a rollback or restart of the processor. Repeated errors result in the disabling of the faulty SCCM by its Core-BB. Most transient errors can be corrected locally within the SCCM, and RAM chip faults can be handled using the SEC/DED code and spare bit-plane replacement. Faults that cannot be corrected within an SCCM must be handled by redundant SCCM modules in the system.
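The fault-handling sequence just described can be summarized as a small state machine. The following Python sketch is illustrative only; the retry limit and method names are our assumptions, not part of the actual Core-BB design:

    # Illustrative sketch of the Core-BB fault-handling sequence described
    # above: isolate the SCCM on an error, attempt rollback or restart, and
    # disable the module if errors repeat (redundant SCCMs then take over).

    class CoreBB:
        def __init__(self, max_retries=3):        # retry limit is assumed
            self.retries = 0
            self.max_retries = max_retries
            self.disabled = False

        def on_fault_indicator(self, pair):
            """Building blocks report status as 1-out-of-2 coded pairs:
            (0,1) or (1,0) means good; (0,0) or (1,1) means fault."""
            if pair[0] == pair[1] and not self.disabled:
                self.isolate()                     # cut off bus and I/O first
                self.retries += 1
                if self.retries > self.max_retries:
                    self.disabled = True           # repeated errors: give up
                else:
                    self.rollback_or_restart()

        def isolate(self):
            print("external bus interface and I/O disabled")

        def rollback_or_restart(self):
            print("processor rollback or restart attempted")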
Modular computer architectures containing self-checking processor modules with duplex-and-compare CPUs, coding on buses and memories, and redundant memory chips have been further developed by several manufacturers (Harris, Honeywell, IBM, and TRW) for DoD systems.
1.1.4 Self-Checking, Self-Exercising Logic

The methodology of designing self-checking (or at least mostly self-checking) logic circuits with high-coverage error detection is well established. Given that concurrent error detection is used in a design, remaining difficulties in achieving fault-tolerance include the problems of fault dormancy and error latency. Fault dormancy occurs in circuits that have failed, but their failed state happens to correspond to the currently correct value of data. Therefore the fault will not be discovered until the data value is changed in the normal course of processing. Examples include a stuck-at-one memory cell that is currently supposed to hold a one, or a pattern-sensitive logic error that will not show up until surrounding circuits take on a particular pattern. Error latency occurs when there is a data error, but a delay occurs before it is detected by a checking circuit. Bit-flips in seldom-read memory locations fall into this category. There is a danger that when an error is detected, another dormant fault or latent error may already exist in a fault-tolerant system, which will upset the recovery algorithm and cause the system to fail. Therefore it is a good practice to design a system in such a way that dormant faults and latent errors will be flushed out quickly. We call this approach self-checking and self-exercising (SCSE) design. Our research shows that SCSE design is relatively inexpensive, given that concurrent error detection logic is already in place. The additional cost is one of introducing periodic test patterns to expose these errors and faults. Design examples are discussed below.

1.1.4.1 A Self-Checking Self-Exercising Memory System

In the last several years a SCSE memory system was designed, laid out for VLSI, and simulated at UCLA [4]. As shown in Figure 1.1.4, it consists of a set of N x 1 RAM chips and a Memory Interface Building Block (MIBB) chip. The memory system employs an odd-weight SEC/DED code for error detection and correction, and each bit position is stored in a separate chip so that any single chip failure will at most damage one bit in any word. Two spare bit planes (i.e., sets of RAM chips that can replace those in any failed bit position) are included for long-life applications. Each RAM is organized as N one-bit words. The cell array is square, meaning that the upper half of the address bits select a row address (RAS) of sqrt(N) cells, and the lower half of the address bits select a bit in the row, i.e., a column address (CAS).
[Figure 1.1.4: A Self-Checking Self-Exercising Memory System. The RAM chips include odd-column and even-column parity, plus address parity checking. The Memory Interface Building Block (MIBB) sits between the CPU bus (32 data + 2 parity bits, with R/W, MST, and CPL control) and the RAM array (32 information bits, 7 parity bits, 2 spare bits), providing SEC/DED, sweeping, spare RAM replacement, and control, and reporting uncorrectable errors and correctable-error interrupts.]

The novel feature of this design is that both the RAM chips and the MIBB contain additional circuitry to provide concurrent self-testing and self-checking. Two parity bits are added to each row of storage cells inside the memory chips; one parity bit is used to check all odd-numbered bits in the row, and the other parity bit checks all even bits in the row. The extra parity bits add two columns to the array as shown in the figure. This on-chip RAM checking allows background scrubbing to be carried out by the MIBB by interleaving check cycles with normal program execution, as is summarized below.
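As an illustration of the row-parity scheme, the following Python sketch (ours; the row width and parity sense are arbitrary here, and the real design computes these checks in on-chip logic) builds the two per-row check bits and performs the comparison a check cycle would make:

    # Illustrative model of on-chip row parity (not the actual RAM logic).
    # Each stored row carries one check bit over its odd-numbered cells and
    # one over its even-numbered cells, so a check cycle can detect a single
    # flipped cell anywhere in the row without knowing which word it is in.

    def row_parities(cells):
        """Return (even_parity, odd_parity) over the cells of one row."""
        even = odd = 0
        for i, bit in enumerate(cells):
            if i % 2 == 0:
                even ^= bit
            else:
                odd ^= bit
        return even, odd

    def check_row(cells, stored_even, stored_odd):
        """Check-cycle comparison: True if the row is still consistent."""
        return row_parities(cells) == (stored_even, stored_odd)

    row = [1, 0, 1, 1, 0, 0, 1, 0]       # one bit from each of 8 words
    pe, po = row_parities(row)
    assert check_row(row, pe, po)
    row[3] ^= 1                           # a single-cell transient upset
    assert not check_row(row, pe, po)     # exposed on the next sweep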
All the RAM chips can be commanded by the MIBB to perform a CHECK CYCLE (typically taking less than a microsecond) during which a row of each RAM's memory cell array (specified by the most significant half of the address, RAS) is read out into a data register, the parity is checked, and the data bits are stored back into the row either unchanged or with one of three permutations to be described later. If a chip detects an internal error, it is signalled via its data line. The MIBB interleaves these check cycles with normal reads and writes (approximately every 100th cycle), and they check each row in sequence. It is not possible using row parity to determine which cell in the RAM's row (and therefore which word) has an error. The data that is read out of a row of cells inside each RAM corresponds to sqrt(N) words, and an error exists in one of the words (bits). Therefore the MIBB-directed recovery sequence consists of accessing every word containing bits in the erroneous RAM row and reading it out one at a time. This is done by holding the upper address bits (RAS) constant and sequencing through all values of the lower address bits (CAS). During these accesses, the SEC/DED code across all RAMs is used to find the error and write back corrected information. Both the SWEEP checking and the correction process (when needed) are interleaved with normal program execution. A typical interleaving sequence is shown in Table 1.1.1. Note that the correction cycles are only invoked if a RAM chip has signalled a row error. Using this technique, transient errors in the RAMs can be detected and corrected within a few milliseconds, without significantly affecting ongoing processing.
    Error-free sweep          Sweep with an error in row 3
    Row      Cycle            Row / Action              Cycle
    1          100            1                           100
    2          200            2                           200
    3          300            3   ERROR                   300
    4          400            Correct cycle R3, W1        400
    5          500            Correct cycle R3, W2        500
    ...        ...            ...                         ...
    128      12800            Correct cycle R3, W128    13200
    1        12900            4                         13300
    2        13000            5                         13400

Table 1.1.1: Interleaving Sweep and Correction Cycles
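The interleaving policy of Table 1.1.1 can be expressed in a few lines of Python (an illustration, not the MIBB's actual control logic; the 100-cycle interval and the 128 rows and words per row follow the table):

    # Illustrative model of interleaved sweeping: every 100th memory cycle
    # is stolen for a check of the next row; once a row signals a parity
    # error, the stolen cycles become correction cycles that read and
    # rewrite each word of that row under the SEC/DED code.

    import itertools

    NUM_ROWS = 128
    WORDS_PER_ROW = 128
    SWEEP_INTERVAL = 100           # one stolen cycle per 100 normal cycles

    def stolen_cycles(error_row=None):
        """Yield (cycle, action); error_row, if given, reports a row-parity
        error when swept, as row 3 does in Table 1.1.1."""
        row, cycle = 1, 0
        pending = []
        while True:
            cycle += SWEEP_INTERVAL
            if pending:                            # corrections take priority
                yield cycle, pending.pop(0)
            elif row == error_row:
                yield cycle, f"sweep row {row}: ERROR"
                pending = [f"correct R{row}, W{w}"
                           for w in range(1, WORDS_PER_ROW + 1)]
                error_row = None                   # assume a single upset
                row = row % NUM_ROWS + 1
            else:
                yield cycle, f"sweep row {row}"
                row = row % NUM_ROWS + 1

    for cycle, action in itertools.islice(stolen_cycles(error_row=3), 5):
        print(cycle, action)
    # 100 sweep row 1 ... 300 sweep row 3: ERROR, 400 correct R3, W1, ...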
When the memory is initially loaded, sweeping is turned off and data is stored in non-permuted form. A series of special Regenerate Parity (RP) check cycles is executed to initialize the parity bits in all rows in the memory chips. Then sweeping is turned on and the MIBB executes sweeps consisting of a series of interleaved check cycles as previously described. During any single sweep, the RAMs can be commanded to permute the data (i.e., flip selected bits before writing a row back) in the specified row in one of four ways. One of the four row-permutations is chosen for writing back even rows, and one (possibly different) is chosen for writing back odd rows.

Permutations: (1) No Change, (2) Invert Odd Bits, (3) Invert Even Bits, (4) Invert All Bits

We will call a set of two row-permutations (one for odd rows and one for even rows) a sweep-permutation, or simply a permutation. The permutation capability is used to test for permanent faults as discussed below.
• Stuck-at-One/Stuck-at-Zero Cells - If a cell is stuck at the same value as the data stored in it, there will not be an error until an attempt is made to store the other value (one or zero) that it is incapable of storing. To detect this type of fault, sweep cycles are carried out that invert all bits in each row as they are written back. Thus the memory contents alternate between true and complement form, exposing the stuck cells in at most two sweeps.
• Coupling (e.g., shorts) Between Cells - This is the situation where a cell is sensitive to the value in a neighboring cell. With a series of four permutations it is possible to exercise each cell in the following way: (i) its neighbors (above, below, right, and left) take on opposite logical values, (ii) then the cell takes on the opposite logic value, (iii) then the neighbor cells return to their original logic value, and (iv) finally the cell returns to its original state.
At any given time there are two permutations in memory, separated by the next row scheduled to have a sweep cycle. The MIBB must keep track of the Next Row to be Swept (NRS); the last permutation will be found in the NRS and following rows, and the new permutation in rows already swept (RAS < NRS). On a normal read or write, the MIBB must determine if the word being accessed is stored in true or complement form, since the effect of permutations is to complement the bits at some addresses but not at others. To do this it must determine what permutation has been applied to the row being accessed, and whether the bit being accessed is in an odd- or even-numbered column. Using the permutation information, the MIBB then determines if the word (i.e., the corresponding cell position) is inverted or not, and if necessary, re-inverts the data as it arrives at the MIBB.
The process is quite simple. A 4-bit state machine determines the current permutation that is being applied, and the last permutation is saved. The four bits indicate bit inversions in: (i) odd rows, odd columns; (ii) odd rows, even columns; (iii) even rows, odd columns; and (iv) even rows, even columns. When a normal read or write cycle is initiated, the row address (RAS) portion of the address is compared with NRS to determine which permutation to use. Then the least significant bits of the RAS and CAS portions of the address (which specify whether the row and column addresses in the memory array are odd or even) are used to index the four bits that specify that permutation. The correct bit in the state machine is then accessed to determine whether the word is or is not inverted (a behavioral sketch follows below).

In order to prevent speed degradation, row-parity checking and generation are only done in the RAMs during sweep cycles. They are not done in the RAMs during normal reads and writes. During reads the SEC/DED code will detect and correct errors. During writes, the parity bit being stored is exclusive-OR'd with the appropriate (even- or odd-position-covering) parity bit to update it.

There are several advantages of this self-checking self-exercising memory system architecture. First, it provides the standard self-checking design to detect errors and a Hamming code to correct single errors. Second, transient errors can be detected and corrected within a few milliseconds, allowing recovery in very high transient and burst-error environments. This has been modeled and its error recovery properties discussed by Meyer [5]. Finally, it reduces the latency of permanent faults by rapidly exposing them. This can be done about a thousand times faster than can be done with software, and at relatively low cost.
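Returning to the inversion bookkeeping sketched above, the following Python fragment (ours, not the actual MIBB logic) captures the NRS comparison and the four-bit permutation lookup:

    # Illustrative model of the MIBB's inversion bookkeeping. A sweep
    # permutation is four bits telling whether cells are inverted at
    # (odd row, odd col), (odd row, even col), (even row, odd col), and
    # (even row, even col). Rows already swept (RAS < NRS) hold the new
    # permutation; the NRS and following rows still hold the last one.

    def is_inverted(ras, cas, nrs, new_perm, old_perm):
        """True if the cell at row address RAS, column address CAS is
        currently stored in complement form. new_perm/old_perm are dicts
        keyed by (row_is_odd, col_is_odd)."""
        perm = new_perm if ras < nrs else old_perm
        return perm[(ras & 1 == 1, cas & 1 == 1)]

    # Example: the sweep in progress inverts all bits; the last one did not.
    invert_all = {(r, c): True for r in (False, True) for c in (False, True)}
    no_change = {(r, c): False for r in (False, True) for c in (False, True)}
    nrs = 40                                     # next row to be swept
    assert is_inverted(10, 5, nrs, invert_all, no_change)      # already swept
    assert not is_inverted(90, 5, nrs, invert_all, no_change)  # not yet swept

On a normal access, the selected bit then tells the MIBB whether to re-invert the data as it arrives.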
1.1.5 Concurrent Error Detection Using Self-Timed Logic

It is difficult to avoid noticing the similarity of the complementary-pair coding used to detect completion in Differential Cascode Voltage Switch (DCVS) self-timed logic, and the complementary signalling used in the classic paper describing the self-checking logic of Carter et al. [2]. It has been recognized by many that the redundancy inherent in self-timed logic can be used for testing and error detection. Fault modeling and simulation of DCVS circuits has been discussed, and the capability of DCVS circuits to provide on-line testability with their complementary outputs has been explored [6,7]. The testability of DCVS EX-OR gates and DCVS parity trees has been analyzed, and methods for testing DCVS one-count generators and adders have been presented [8,9]. Easily testable DCVS multipliers have been presented in which all detectable stuck-at, stuck-on, and stuck-open faults are detected with a small set of test vectors, and the impact of multiple faults on DCVS circuits has been explored [10,11]. And a
technique for designing self-checking circuits using DCVS logic was recently presented [12]. There has also been a great deal of interest lately in asynchronous logic design in the VLSI community [13,14]. Self-timed logic is typically more complex than synchronous logic with similar functionality, but it offers potential advantages. These include higher speed, the avoidance of clock skew in large chips, better layout topology, and the ability to function correctly with slow components. For several years we have been studying the applicability of a modified form of DCVSL logic for use in designing processors and associated logic with concurrent error detection. Although this logic is not self-checking in the formal sense, for all practical purposes it provides the same capability of detecting faults in the checkers as well as faults in the circuits being checked. One of the primary motivations for this is the need for lower power in spacecraft applications. The current way of providing high-coverage concurrent error detection in processors is to duplicate and compare them. This presents a considerable power overhead. Synchronous processors consume relatively high power (at least several watts) in clock distribution and output drivers, and that power is doubled in a duplex configuration. And as chips get larger and contain more circuitry, the clocking problem becomes more severe. A single self-checking asynchronous chip is expected to require considerably less power than two duplicated synchronous chips. Another reason we began this study was to explore the feasibility of using asynchronous logic in the interface/controller chip of the self-checking self-exercising memory system previously described at FTCS-21 [4]. The synchronous nature of that design caused what we felt were unnecessary delays in synchronizing between independent clocks, so we decided to examine an asynchronous design as an alternative. A key requirement of the design was concurrent error detection. Thus our approach has been from an architectural viewpoint.

1.1.5.1 The Starting Point - Synchronous Circuits with Concurrent Error Detection

A common way to make synchronous hardware systems with concurrent error detection is to duplicate the circuit modules, run them with the same clock, and compare their outputs. Duplication and comparison has become widely accepted and is supported in chip sets by commercial manufacturers (e.g., Intel) and DoD-supported projects such as the GVSC and RH-32 processors. One way to do this in a self-checking fashion is to invert the outputs of one module to obtain morphic (1,0 or 0,1) pairs
and compare its outputs with the corresponding outputs of the other module using a tree of self-checking comparators of the type introduced by Carter et al. many years ago [15,16]. These checkers are self-checking with respect to stuck-at faults, and will, in most cases, detect transient errors that cause the coding to be incorrect at clock transitions. They may fail under Byzantine conditions where noise or marginal circuit conditions cause the checker to see correct coding while the circuit receiving the output data sees an incorrect value. When one examines conventional duplicated synchronous systems, the cost of concurrent error detection is a doubling of active circuits plus the addition of comparators. The overhead of error detection in asynchronous designs should be similar to the synchronous case, and it is already an integral part of the design. This is discussed below.

1.1.5.2 The Analogous Asynchronous Design Style -- Differential Cascode Voltage Switch Logic (DCVSL)

An asynchronous design requires redundant encoding that can provide completion information as part of the logic signals. A logic module is held in an initial condition until a completion signal arrives from a previous module indicating that its inputs are ready. Then it is started, and when its outputs indicate completion, other modules may be started in turn. In general this requires a form of encoding that allows the receiver to verify that the data is ready. One way to do this in self-timed designs is to use a form of 1-out-of-2 coding. Output signals from various logic modules are sent as a set of two-wire pairs, taking on the values 0,0 before starting, and 0,1 or 1,0 after completion. Such logic can be implemented using differential cascode voltage switch logic (DCVSL). A typical DCVSL gate is shown in Figure 1.1.5(a). DCVSL is a differential precharged form of logic. When the Req (request) signal is low, the PMOS pull-up transistors precharge points a and c to Vdd. At this time the circuit is in an initial state, and its outputs are 0,0. The circuit block B contains two complementary functions. One pulls down, and the other is an open circuit. When the input signals are ready, Req is raised, the pull-ups are turned off, the NMOS pull-down transistor connects points b and d to ground, and the circuit computes an output value. The side that forms a closed circuit produces a zero, and the side that remains an open circuit remains precharged. The outputs, driven by inverters, go from 0,0 to either 1,0 or 0,1, and the completion signal CPL is generated from either a logical OR or the exclusive OR of the outputs. As in the case of duplex self-checking synchronous circuits, the functions are duplicated in true and complement form, but here the state 0,0 on a
signal pair is a valid setup signal, and the arrival of complementary values signals completion.
[Figure 1.1.5: Differential Cascode Voltage Switch Logic. (a) A single circuit; (b) an iterative circuit.]

It is intuitively obvious that DCVSL can provide a degree of error detection. Consider a single DCVSL circuit (Figure 1.1.5(a)). The circuit block B can be designed so that a fault or error will only affect one of the two sides (true or complement) and thus will only affect one output [6]. The fault will cause a detectable output pair of 0,0
(detected by timeout due to no completion) or 1,1. Given that faults in individual DCVSL circuits produce detectable (1,1 or 0,0) outputs, it is helpful to examine a combinational network made up of a number of these circuits.

1.1.5.3 A DCVSL Combinational Network -- An Adder Example

Figure 1.1.6 shows the DCVSL circuits used in a ripple-carry adder. The sum and carry circuits are shown in Figures 1.1.6(a) and 1.1.6(b). The complete adder is shown as a multi-module circuit in Figure 1.1.6(c), and a 4-bit adder of this type is used as the function block in the circuit simulations to be described later.
[Figure 1.1.6: A DCVSL Combinational Function Block (Adder). (a) DCVSL carry gate; (b) DCVSL sum gate; (c) a multi-bit adder.]
Fault Effects in Multi-Module DCVSL Circuits - When a good circuit receives incorrect signal input pairs from a faulty circuit (i.e., 0,0 or 1,1), it will produce either the correct output or an output of 0,0 or 1,1, because DCVSL circuits have paths corresponding to each minterm that either pull down one side or the other. A (0,0) input pair will disable some minterms and can only prevent a side from being pulled down, producing an error output of 0,0. Similarly, a (1,1) on an input pair will activate additional minterms and can only produce an error output of 1,1. These errors will be either masked or propagated through multiple levels of DCVSL and be detectable at the output.

1.1.5.4 An Asynchronous System with Concurrent Error Detection

We now show how these DCVSL function blocks are combined into a larger sequential system. Figure 1.1.7 shows a 2-stage micropipeline. Each stage begins with a register made up of two latches for each complementary pair of input signals received from DCVSL circuits. The register data is sent to a DCVSL combinational circuit that we will call a computational block (a DCVSL 4-bit adder in our simulation studies). A checker is provided at the output of the computational block.
FIGURE 1.1.7: A 2-Stage Asynchronous Circuit (Micropipeline)
The checker circuit is shown in Figure 1.1.8(c). It is logically equivalent to a tree of morphic AND gates of the kind used in synchronous self-checking checkers by Carter et al. The outputs (z, ~z) are complementary if all of the input pairs are complementary, and if any input pair is 0,0 or 1,1 the output takes on that same value.
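A truth-table-equivalent sketch of the morphic AND tree (ours; not Carter's gate-level realization) shows both properties: complementary inputs reduce to a complementary pair, while any 0,0 or 1,1 pair propagates to the root.

    from functools import reduce

    def morphic_and(p, q):
        (a, an), (b, bn) = p, q
        return ((a and b) or (an and bn),      # z
                (a and bn) or (an and b))      # ~z

    def checker_tree(pairs):
        return reduce(morphic_and, pairs)

    z, zn = checker_tree([(1, 0), (0, 1), (1, 0), (0, 1)])
    assert z != zn                                   # complementary: completion
    assert checker_tree([(0, 0), (1, 0)]) == (0, 0)  # no completion: timeout
    assert checker_tree([(1, 1), (0, 1)]) == (1, 1)  # error signalled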
FIGURE 1.1.8: The Checker Circuits. (a) The Basic Checker Circuit (CPL output augmented for delay-insensitive operation); (b) A Simple RC Timer; (c) The Checker-Tree.
This checker provides a completion signal if all the complementary input pairs have a one. If one of the input circuits fails to complete and generates a 0,0 signal pair, then (z, ~z) = (0,0), completion is prevented, and the checker uses a simple time-out counter (see Figure 1.1.8(b)) to signal the error. If one of the computational circuits generates a 1,1 pair, then (z, ~z) = (1,1) and the error signal is also generated.

Note that the checker is partially self-checking because (since it is logically a tree of Carter's self-checking morphic AND gates) for any stuck-at signal in the tree, there is a set of "good" inputs that will cause that internal signal pair to take on the values 0,0 or 1,1. Any 0,0 pair generated in the checker tree (due to a checker fault) will result in a 0,0 output from the checker and a timeout error. Similarly, an error signal of 1,1 generated anywhere in the checker tree will result in a 1,1 output from the tree and an error signal. One of the reasons that it is not fully self-checking is that there are error conditions that can generate a premature completion signal when an internal variable is stuck at one. If a member of a complementary pair w,x inside the tree (say w) is stuck at one and the input signals would normally generate w=1 and x=0, the tree can generate a premature completion signal without waiting for the circuits preceding w,x. This premature completion signal may cause a data word to be sent to a succeeding logic stage before some of the circuits have finished setting up. This leads to 0,0 pairs being loaded into registers of the following circuit, causing it to halt and time out.

The following is a simplified view of the control sequence (see Figure 1.1.7). Rin is raised if input data is ready and the first stage is not busy, and this loads the input register R1. The arrival of data in R1 causes the computational block B1 to be started. If no error has occurred, the checker signals completion when data is available from B1. An interlock is then set if stage 2 is busy. When stage 2 finishes, it loads the data out of B1 into R2, and the DATA-IN detect from R2 causes B1 and C1 to be precharged and R1 to be cleared to all zeros. Then the first stage is released by removing its BUSY signal, allowing it to accept more data.

1.1.5.5 Control Circuits

A detailed view of the control circuits is shown in Figure 1.1.9. The circuits in Figure 1.1.9 can be viewed as multiple stages of logic where each stage consists of a latch followed by a block of DCVSL computational logic. The stages can operate concurrently and form a pipeline. We started with an interconnection and synchronization circuit by Meng [17]. The objective was to add whatever was necessary to provide concurrent error detection in the system. In the original design, the registers were implemented with edge-triggered flip-flops whose inputs used only the "true" signal
from each input pair. The Q and ~Q flip-flop outputs provided the complementary signals for the following DCVSL computational block. This had two problems. First, the one-out-of-two checking code was lost at the registers, so a flip-flop error was undetectable. Second, loss of a clock to one or more flip-flops was also undetectable.
FIGURE 1.1.9: The Handshaking Control Circuits

The modified circuit in Figure 1.1.9 uses essentially the same handshaking conventions as Meng. However, to improve the testability and fault tolerance of the control and synchronization circuit, we modified it in the following fashion. The register has two gated latches for each DCVSL output pair, and thus it accepts 2-wire complementary inputs instead of single-line inputs. All latches are reset (to illegal 0,0 pairs) after their contents have been used, to give a better chance of detecting errors in the registers if they are clocked when data is changing, or if some of the latch pairs fail to be reloaded. The dual gated latches are simpler than the single positive-edge-triggered flip-flops used in the original design. The main synchronization signals are:
• Rin - a data-ready signal generated by the checker, signalling completion of a computational block.
• Aout - a register-loaded signal; it indicates that every signal pair in a register has at least one "one". It is the logical AND of the OR of each latch pair of signals.

• Rout (or Request) - a signal releasing the pull-ups and causing computation in the DCVSL computational logic. This signal is interlocked by the C-gate so that the logic cannot be started until the input register is loaded and the following register is free. Similarly, it cannot be released to reset the computational block until data has been loaded into the output register and the input register is reset (a behavioral sketch of this interlock follows the list).

• Ain - the same as Aout; it is the busy signal from the next latch.
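The C-gate referred to above behaves as a Muller C-element: its output goes to the common value of its inputs and otherwise holds state. A minimal behavioral sketch (ours, with hypothetical names):

    class CGate:
        """Muller C-element: output follows the inputs only when they agree."""
        def __init__(self):
            self.q = 0

        def step(self, a, b):
            if a == b:
                self.q = a
            return self.q              # holds its last value when a != b

    c = CGate()
    assert c.step(1, 0) == 0           # input loaded, but next stage busy
    assert c.step(1, 1) == 1           # both conditions met: raise Rout
    assert c.step(0, 1) == 1           # held until both sides release
    assert c.step(0, 0) == 0           # full handshake completes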
FIGURE 1.1.10: Transition Graph for a Full Handshake

The transition graph of the signals in Figure 1.1.9 is shown in Figure 1.1.10, which shows a full handshake between function blocks as explained in Meng [17]. This is the synchronizing function performed by the control circuits. Both the positive and negative values of the control signals are shown by the superscripts + and -. Arrows show signal conditions that must be true before the following transition is allowed to proceed. A careful examination of the graph shows that this provides the appropriate interlocking, so that a module on the right has to complete before the module on the left is allowed to take the next computational step. The Rout- to Rout+ step then provides the reset to pull up the DCVSL functional block before the next computation is started.

Stuck-at Faults - The interlocking nature of the feedback control signals causes the circuit to "hang up" and stop if one of the signals Ain, Aout, Rin, Completion, ... sticks
at a one or zero value (see Figure 1.1.10). A time-out counter is employed to detect the stopped condition.

In nearly all cases, stuck-at values in a register or computational block will cause a detectable value of 0,0 or 1,1 to appear at the checker. This occurs because the dual-rail DCVSL logic block circuits pass on an uncoded (0,0 or 1,1) output when an uncoded input (0,0 or 1,1) occurs. When input signals occur that would normally cause a stuck circuit to go to the other value, its complementary circuit takes on the same value, generating an uncoded signal that passes through the computational block to the checker.

The reset signal sets all register pairs to 0,0 to enable detection of faults caused by the inability to clock one or more sets of latches. It is redundant, so that if it sticks at zero, a second fault must occur before an error is generated. If it sticks at one, the register will be permanently reset to 0,0 pairs, Aout will never go high, and the circuit will stop.

As soon as the latch complete signal goes high, the load signal to the register goes low so that the latched data are not disturbed by changing data from the preceding stage. If the load sticks at zero, the registers will be permanently reset and Aout will not be generated, halting the circuit. A stuck-at-one load signal will cause the register not to be held constant while the computational block is working. The C-gate preceding a register normally prevents the register from being reloaded while the outputs of the circuit that sent it inputs are being reset to 0,0. The stuck-at-one load will allow the register to change while the following computational block is using its data. The results, though difficult to predict, are likely to produce a detectable coding error in the following stage.
1.1.6 Simulation of the Self-Timed Micropipeline

There is no easy way to analyze the effect of errors in an asynchronous circuit. To help in analyzing the response of the circuit to transient errors, we simulated the 2-stage micropipeline circuit by injecting randomly generated faults at the points shown in Figure 1.1.11. The circuit was divided into two sections of control logic and data logic. The control logic is the same as that shown in Figure 1.1.9, and the data path logic at each stage is a 4-bit DCVSL adder. Transient error insertion points in the control logic are shown as exclusive-OR gates in the diagram. In addition, two types of errors were injected into the data path section: i) data latched in the register, and ii) data generated by the computational block. The simulation was done using Lsim, Mentor Graphics Corp.'s mixed-mode simulator. The simulation circuit was built as a netlist and in the M modeling language.
FIGURE 1.1.11: Experimental Fault Insertion Points

In order to apply randomly generated transient errors and data patterns to the circuit, an input deck generation program was written in C. The timing and duration of the transient errors are determined from the random numbers and are in the range of 10-90 nsec. We assume that there is only one transient error at a time. Lsim generates time trace outputs. Inputs to the adder circuit are randomly varied, and we can easily determine the expected time sequence of the output variables. The effects of errors on the values or the timing of the outputs can then be analyzed. As currently implemented, analysis of output traces is partially done manually; therefore, the number of fault insertions is relatively small. We are currently working on a more automated way of analyzing data, and more extensive testing will be done in the future. However, the results are highly encouraging, with no undetected errors in the faults simulated so far. Since we are trying to determine the fault detection coverage of this circuit, a successful test occurs if: i) a fault produces a correct output, though it may delay the circuit less than the time-out count, or ii) a fault produces bad output and it is detected. An unsuccessful test occurs if an undetected error is found.
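The input deck can be pictured as a random schedule of single transients; the sketch below (in Python rather than the original C program, with names of our own choosing) mirrors the stated parameters: one error at a time, a random insertion point, and a 10-90 ns duration.

    import random

    def make_deck(n_errors, insertion_points, spacing=200):
        t, deck = 0, []
        for _ in range(n_errors):
            t += random.randint(100, spacing)     # ensures one error at a time
            deck.append({"point": random.choice(insertion_points),
                         "start": t,
                         "duration": random.randint(10, 90)})   # ns
        return deck

    def classify(output_correct, error_signalled, delayed):
        """Map a trace outcome onto the result categories used below."""
        if output_correct:
            return "tolerated (delayed)" if delayed else "no effect"
        return "error detected" if error_signalled else "UNDETECTED"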
The following are the results of the simulations so far. These results are divided into several categories: i) no effect; ii) tolerated (delayed) - no errors; the circuit halted for a short time until the fault was removed; iii) tolerated - an error occurred in an internal variable but it was not used by the following stage (e.g., the data had already been taken); iv) error detected - explicitly detected by a timeout or an error signal from the checker.

1.1.6.1 Error Simulation Results

Transient errors were simulated in the data section as described below.

Simulation of "0" transient errors in the data path - There are 32 data bits in the 2-stage simulated circuit. Each stage has 4 bits of latched data, 4 bits of complement latched data, 4 bits of output data (from the computational block), and 4 bits of complement output data. In the simulation, one of the 32 data bits is randomly selected and is temporarily stuck at 0 at random times, with random duration of less than 90 time steps. Normally each calculation cycle is about 100 time steps, and 20 transient errors are injected during 5000 time steps. If the circuit is idle for more than 200 time steps, the monitoring circuit issues a timeout signal. There was no undetected error in the simulation, and the effects of the transient errors are classified as:

a. no effect: 515 cases
b. tolerated (delayed): 158 cases
c. tolerated: 102 cases
d. timeout error detected: 22 cases

Simulation of logic "1" transient errors in the data path - This simulation setup is almost the same as for the "0" transient errors, but this time the selected data bit is temporarily stuck at 1. There were no undetected errors in the simulation, and the transient errors are classified as:

a. no effect: 285
b. tolerated (ignored): 285
c. tolerated (delayed): 136
d. error detected: 254
Control Section Errors - This was a simulation of transient errors in selected control signals. The duration of each transient error was selected randomly between 10 ns and 30 ns. There were no undetected errors.

a. no effect: 58
b. tolerated (delayed): 592
c. timeout error detected: 60
d. error detected by checker: 29

1.1.6.2 Observations on Error Effects

After examining in detail the traces of logic signals in the error simulations, we have made the following observations about the effects of the fault categories modeled so far. There were many cases where the inserted transient held a signal at its correct value, so there was no error. Those cases are uninteresting, so we will look at the cases where a real error occurred.

Transient Errors Making Data Bits 0

1. Errors in Registers/Latches
• If a "zero" transient error occurs in data being latched into a register, the OR-AND circuit delays issuing a latch completion signal (Aout) until the transient error disappears. Either correct data is eventually loaded or a timeout occurs. That means this error is tolerated by the circuit.
• If the error occurs after the latch completion signal (Aout) is generated, the following computational block starts, but there is one 0,0 pair in its input data. Therefore the computational block generates a 0,0 output, and the checker also produces a 0,0, preventing a completion signal. This error is eventually detected by the timer circuit, which generates a time-out signal.
• If a latched data bit is affected after the following computational block generates a correct output and a completion signal, then the computational block produced a correct output even though one of the input data bits later changed to 0, and the error has no effect.
2. Computational Block Errors

• If one of the output bits of a computational block is pulled to zero before the completion signal is generated, then the generation of the completion signal is delayed until the transient error disappears, and then a correct output is generated. This error is tolerated and causes only a small time delay in the circuit.
• If a transient error occurs after a correct output has been generated by the computational block but before the data is latched in the register of the next stage, then the latch completion signal (Aout) of the next stage is not generated until the transient error disappears and correct output is generated and latched into the register. If the transient error persists long enough to activate the timeout circuit, then a timeout signal is generated. So we can say this error is tolerated or detected.
• If the error happens after the output data is latched in the register of the next stage and before the computational block is initialized (evaluation signal is high), then the error has no effect on the circuit. This error is also tolerated by the circuit.
As explained above, it appears that all the transient errors that make a bit in the data path circuits go to 0 can be tolerated or detected by the timeout circuit.

Transient Errors Making Data Bits One

1. Errors in Register Latches
• If a data bit in the register is flipped to "one" during initialization, it cannot cause Aout to be asserted, since the other register pairs are zeros. Therefore the error will be overwritten when the register is loaded, and there will be no effect.
• When a data bit in the register is hit by the error after initialization of the register, the affected data bit remains 1 and, depending on the incoming data, this error will have no effect (if the bit of incoming data is 1), or it can cause a 1,1 input pair, leading to an incorrect output (1,1) from the computational block and a (1,1) from the checker, which signals an error.
2. Errors in the Computational Blocks

• If the computational block is hit by the transient error during the precharge state, the affected bit is restored to 0 by the precharge, and the error is tolerated.
• If a data bit of the computational block is affected by the error during evaluation, then the output of the computational block will take on (1,1) and be detected by the checker. If a data bit of the computational block is hit before the precharge state and after the output data is latched in the following stage, then the checker generates an error signal even though the error was not passed on to the next stage.
As shown above, all the logic "1" transient errors inserted so far have been tolerated -- either causing no error (other than short delays) or being detected by a checker.
Transient Errors in the Control Signals - We have not yet done an exhaustive analysis of the effects of errors on all of the control signals, but we have looked at the ones on which errors were inserted. They are briefly (and informally) described below.

1. Transient errors in the Rin signal - These are tolerated due to the characteristics of the C-gate and the AND gate generating the load signal. If there is a transient error in the Rin signal when the request signal (Rout) is high, then the output of the C-gate cannot change. If it occurs when the Rout signal is low, this means that the next stage is awaiting data. The output of the C-gate goes high and the load signal is generated prematurely. Here, the register will wait until coded data arrives before generating an Aout signal and starting the next circuit. Thus the error is masked. If the error happens when the Aout signal is high, the transient error in the Rin signal is masked by the AND gate. If Rin goes to zero prematurely, the C-gate does not allow the latch signal to drop until the register is loaded (when Aout is asserted). If Rin is either held to 0 or 1 for a long period of time, the circuit simply stops computing until the transient goes away. We are finding many cases where the circuit simply stops when an error occurs. If the error goes away, the computations continue without error. If it lasts too long, a timeout error is generated.

2. Transient Errors on the Load Signal - During error-free operation, this signal should be generated when correct morphic input data is available at the input port of the register and the following computational block is reset. At this time, the register has been previously reset to all-zero pairs. We find that if it is raised prematurely (while the following computational block is reset and waiting for data), the circuit simply waits until the correctly coded data arrives, because Aout will not be asserted. If an error causes the load to be raised while the following computational block is evaluating, a detectable coding error will be created in the register. (If any incoming bit pair is nonzero and has a different value from the current register, a detectable 1,1 value will be latched.) If the load signal goes low because of a transient error before correct data is latched into the register, then the generation of the Aout signal is delayed until the load signal goes back to high and correct data is latched.

3. Data Register Reset - The data register of the circuit is reset after the use of the data in the register and before the new data is latched. If a faulty reset signal is applied after the register has been reset and before new data arrives, then there is no effect on the circuit. If the register has data arriving but the completion signal has not been set, and the reset signal goes high because of a transient error, then the circuit will wait for the reset transient to go away and for the data to arrive. If it takes too long, a timeout will occur. If a reset occurs after the computational block has started, the computational
block cannot generate a correct output. In that case a timeout error is detected. Thus we can say that a transient error on the reset signal is tolerated or detected by the circuit. If the reset signal fails to go to "one" due to a transient, the Aout signal fails to be reset, and this is also detected because the circuit times out.

4. Transients in Aout - The Aout signal of the current stage is the Ain signal of the previous stage. A logic "one" transient error on the Aout signal can cause an early start of a following computational block by prematurely raising its Rout signal when the computational block has already evaluated previous data and precharged the functional circuit. In this case the computational block does not generate morphic outputs until correct input data comes from the register. Thus the circuit waits until correct data arrives or times out. If there is a "one" transient error on the Aout signal when the Rout signal of a previous stage is high, and the output of the functional block of the previous stage has not yet been generated, then the register of the previous stage is reset, forcing the computational block inputs to zero, and eventually a timeout error occurs.

5. Transients on Rout - The Rout signal starts evaluation of the computational block. If a "one" transient error affects the Rout signal when the Rout signal should be low, then the error on the Rout signal does not have any effect on the circuit (the input register is supplying 0,0, so the circuit will remain precharged). If the Rout signal is high when a "zero" transient error hits the signal, then the effect of the error depends on the timing of the Rout signal. If the Rout signal is just beginning to go to one, then some computational block outputs will be 0,0; no completion will be generated, and the circuit waits for the transient to go away. But if the Rout signal is hit by a transient error at the end of evaluation, then evaluation starts again and never finishes. In this case a timeout error is detected.

At least in the cases of the transient errors in the control signals that we have studied so far, they are tolerated -- producing delays or error signals.
1.1.7 Conclusions

Having done experimental fault insertions on both the JPL-STAR and Fault-Tolerant Building Block Computers, the author certainly understands that inserting a few hundred errors does not adequately determine coverage for any system [18]. But the fact that we have found no undetected errors so far tends to indicate that the basic design approach is sound. We will not be surprised to find (as occurred in the STAR machine) a few signals that are not adequately covered, requiring us to modify the design to improve their coverage.
This study indicates that self-timed design techniques can be adapted to fault-tolerant systems, and that they offer considerable potential in the implementation of modules that have concurrent error detection. Of course, self-timed logic is a matter of religion to many, and it is not clear to what degree it will ever displace conventional clocked CMOS designs. We make no projections here, but only note that asynchronous design is very interesting, and its fault-tolerance properties need to be explored from an architecture perspective. The cost of this approach is reasonable, and we are optimistic that this design style will become more important as fault-tolerant systems are made from larger chips with smaller feature sizes.

There are many interesting problems still unexplored. First, more extensive simulation experiments and analysis are needed to prove the effectiveness of these design techniques. Another interesting problem is to compare the power consumption, in watts per MIP, of duplicated synchronous processors vs. self-checking self-timed designs. Another intriguing question is the possibility of implementing error recovery in the form of microrollback in micropipelines. By latching old values at each stage, it may be possible to restart and correct computations when an error has occurred. But these interesting problems must be left for subsequent investigations.
1.1.8 References

1. Rennels, D. and J. Rohr, "Fault-Tolerant Parallel Processors for Avionics with Reduced Maintenance," Proc. 9th Digital Avionics Systems Conference, Virginia Beach, Virginia, October 1990.

2. W.C. Carter, A.B. Wadia, and D.C. Jessep, Jr., "Computer Error Control by Testable Morphic Boolean Functions - A Way of Removing Hardcore," Proc. 1972 Int. Symp. Fault-Tolerant Computing, pages 154-159, Newton, Massachusetts, June 1972.

3. Rennels, D., "Architectures for Fault-Tolerant Spacecraft Computers," Proc. of the IEEE, 66(10):1255-1268, October 1978.

4. David A. Rennels and Hyeongil Kim, "VLSI Implementation of a Self-Checking Self-Exercising Memory System," Proc. 21st Int. Symp. Fault-Tolerant Computing, pages 170-177, Montreal, Canada, June 1991.

5. Meyer, J. and L. Wei, "Influence of Workload on Error Recovery in Random Access Memories," IEEE Trans. Computers, April 1988, pp. 500-507.

6. Z. Barzilai, V.S. Iyengar, B.K. Rosen, and C.M. Silberman, "Accurate Fault Modeling and Efficient Simulation of Differential CVS Circuits," International Test Conference, pages 722-729, Philadelphia, PA, Nov 1985.

7. R.K. Montoye, "Testing Scheme for Differential Cascode Voltage Switch Circuits," IBM Technical Disclosure Bulletin, 27(10B):6148-6152, Mar 1985.

8. Niraj K. Jha, "Fault Detection in CVS Parity Trees: Application to SSC CVS Parity and Two-Rail Checkers," Proc. 19th Int. Symp. Fault-Tolerant Computing, pages 407-414, Chicago, IL, June 1989.

9. Niraj K. Jha, "Testing of Differential Cascode Voltage Switch One-Count Generators," IEEE Journal of Solid-State Circuits, 25(1):246-253, Feb 1990.

10. Andres R. Takach and Niraj K. Jha, "Easily Testable DCVS Multiplier," IEEE International Symposium on Circuits and Systems, pages 2732-2735, New Orleans, LA, June 1990.

11. N. Kanopoulos and N. Vasanthavada, "Testing of Differential Cascode Voltage Switch (DCVS) Circuits," IEEE Journal of Solid-State Circuits, 25(3):806-813, June 1990.

12. N. Kanopoulos, Dimitris Pantzartzis, and Frederick R. Bartram, "Design of Self-Checking Circuits Using DCVS Logic: A Case Study," IEEE Transactions on Computers, 41(7):891-896, July 1992.

13. Alain J. Martin, Steven M. Burns, T.K. Lee, Drazen Borkovic, and Pieter J. Hazewindus, "The Design of an Asynchronous Microprocessor," Technical Report Caltech-CS-TR-89-2, CSD, Caltech, 1989.

14. Gordon M. Jacobs and Robert W. Brodersen, "A Fully Asynchronous Digital Signal Processor Using Self-Timed Circuits," IEEE Journal of Solid-State Circuits, 25(6):1526-1537, Dec 1990.

15. W.C. Carter and P.R. Schneider, "Design of Dynamically Checked Computers," Proc. IFIP Congress 68, pages 878-883, Edinburgh, Scotland, Aug 1968.

16. Richard M. Sedmak and Harris L. Liebergot, "Fault Tolerance of a General Purpose Computer Implemented by Very Large Scale Integration," IEEE Transactions on Computers, 29(6):492-500, June 1980.

17. Teresa H. Meng, Synchronization Design for Digital Systems, Kluwer Academic Publishers, 1991.

18. A. Avizienis and D. Rennels, "Fault-Tolerance Experiments with the JPL-STAR Computer," Dig. of the 6th Annual IEEE Computer Society Int. Conf. (COMPCON), San Francisco, 1972, pp. 321-324.
SECTION 1.2
DESIGN OF SELF-CHECKING PROCESSORS USING EFFICIENT BERGER CHECK PREDICTION LOGIC

T. R. N. Rao, Gui-Liang Feng, and Mahadev S. Kolluru

Abstract

Processors with concurrent error detection (CED) capability are called self-checking processors. CED is a very important and necessary feature in VLSI microprocessors that are integral to ultradependable real-time applications. The design of a self-checking reduced instruction set computer (RISC) requires state-of-the-art techniques in computer architecture, implementation, and self-checking design. Among the components of a processor, the most difficult circuits to check are the arithmetic and logic units (ALUs). In this chapter, we shall concentrate on the design of a self-checking ALU. We introduce a new totally self-checking (TSC) ALU design scheme called Berger check prediction (BCP). Using BCP, the self-checking processor design can be made very efficient. Also, we discuss the theory involving the use of a reduced Berger code for a more efficient BCP design. A novel design for a Berger code checker based on a generalized code partitioning scheme is discussed here, and is used to efficiently implement the Berger code checking.
Key words: Self-Checking, Berger Code, Berger Code Partitioning, Fault-Tolerance, Self-Checking ALU, BCP, Reduced Berger Check.
The authors are with the Center for Advanced Computer Studies, University of Southwestern Louisiana, Lafayette, LA 70504. This work was supported by the Office of Naval Research under Grant N00014-91-J-1067.
1.2.1 Introduction

The complexity of an IC chip increases significantly as a result of the advent of very large scale integrated (VLSI) technology. A modern microprocessor built on a single VLSI IC chip is much more complex than a medium-scale computer built just a few years ago. The high density and small feature size contribute to the increasing vulnerability to α-particles. Since future VLSI circuits will be denser, with smaller feature sizes, permanent and transient faults are more likely to occur in future VLSI circuits than in those of the present time. Concurrent error detection (CED) is thus vital for the success of future development in VLSI.

The high-performance (in terms of speed, throughput, etc.) requirement of processors makes it necessary to adopt a RISC-based processor design. The design philosophy of RISC architectures is to analyze the target applications to determine the most frequently used operations, and then to optimize the data path design to execute these instructions as quickly as possible. This philosophy of RISC is applicable to special-purpose processors as well as to large general-purpose computers. The design of a self-checking reduced instruction set computer (RISC) requires state-of-the-art techniques in computer architecture, implementation, and self-checking design.

A self-checking processor (SCP) is a processor that is designed to have the concurrent error detection (CED) capability. CED is an error detection process that is designed to operate concurrently with the normal processor operations. CED is a very important and necessary feature in the VLSI microprocessors that are integral to real-time applications, since the error latency time will be very small. The incorporation of CED capability enables fast error recovery and also helps in preventing system crashes. An SCP can be very effective in fault-tolerant computer system design. The important classes of SCP include the totally self-checking (TSC) processor and the strongly fault-secure (SFS) processor. A typical TSC or SFS processor consists of a TSC or SFS functional circuit, and a TSC or strongly code-disjoint (SCD) code checker. The study of a self-checking processor utilizes such self-checking circuits to form a complete CPU.

Due to its operative nature, the arithmetic and logic unit (ALU) is the most difficult functional circuit to check among the components of a processor. It is well known that in the presence of a fault, arithmetic operations may generate arithmetic errors, which are of the form S' = S ± 2^i, where S' denotes the corrupted sum, S is the uncorrupted sum, and i is the faulty bit position. Residue codes and AN codes are the two major classes of arithmetic error-control codes [15, 17]. Several non-arithmetic codes, such as parity codes [19], parity-based linear codes [5],
two-rail codes [6], and Berger codes [4, 8], have also been studied for applications in arithmetic operations. Reed-Muller codes are the only non-duplicated codes that can handle logical operations. A number of codes, such as the parity codes [19], residue codes [16], two-rail codes, and Berger codes [8], have also been studied to protect logical operations from failures. Among the existing codes and designs, two-rail code encoded ALUs and Berger encoded ALUs are the only known self-checking ALUs. The designs of TSC two-rail encoded ALUs are widely used in self-checking processor designs, such as the ones in [6, 10, 11, 12]. The SFS Berger check prediction ALU [8] is the only known technique for a self-checking ALU, other than the duplication or two-rail methods. A self-checking processor with an SFS Berger check prediction ALU is more efficient in terms of redundancy than one with a two-rail encoded ALU.

Berger code is the only known systematic all-unidirectional error detecting code. It has two different encoding schemes: B0 and B1. The B0 encoding scheme uses the binary representation of the number of 0's in the information bits as the check symbol, and the B1 encoding scheme uses the 1's complement of the number of 1's in the information bits as the check symbol. Both encoding schemes are valid for the detection of all unidirectional errors. Several papers have dealt with the design of Berger code checkers [1, 2, 3, 9, 14]. In this discussion, we examine the application of a reduced Berger code to the design of a self-checking ALU. The reduced Berger code may use only two check bits, regardless of the information length. Since a Berger code requires ⌈log2(n + 1)⌉ check bits for n information bits (where ⌈x⌉ denotes the smallest integer that is greater than or equal to x), the application of the reduced Berger code yields a more efficient implementation.

Section 1.2.2 presents a brief review of the terminology most commonly used in the areas of self-checking and coding theory. In Section 1.2.3, we discuss the Berger codes and the Berger check prediction scheme. Section 1.2.4 presents the formulation of the check prediction equations of the reduced Berger code for various arithmetic and logical operations. The circuit design of the proposed reduced Berger check prediction ALU is also given in Section 1.2.4. Then, Section 1.2.5 demonstrates the VLSI design of a reduced Berger code encoded Manchester carry chain ALU. In Section 1.2.6, we present the theory of generalized Berger code partitioning, which provides a foundation for the design presented here. Section 1.2.7 describes design alternatives based on the generalized Berger code partitioning, followed finally by some concluding remarks.
1.2.2 Conceptual Overview

It is desirable to design circuits that will indicate any malfunction during normal operation and will not produce an erroneous result without an error indication. Some of the arithmetic and non-arithmetic types of codes are briefly reviewed in this section.
1.2.2.1 Definitions and Terminology

We present here some commonly encountered terminology and a few definitions from checker design and coding theory.

Definition 1: Concurrent Error Detection (CED) is an error detection capability that is designed to operate concurrently with the normal processor operations.

Definition 2: A Self-Checking Processor (SCP) is a processor that is designed to have the concurrent error detection capability.

A model for the self-checking circuit G is illustrated in Figure 1.2.1. The circuit comprises a functional circuit L and a check circuit CK.
Figure 1.2.1. Model for a self-checking circuit

Let us consider a combinational circuit L that produces an output vector Y(X, f), which is a function of the input vector X and a fault f ∈ F, a specified set of faults
in the circuit. The absence of a fault is termed a null fault and is denoted by λ. If the circuit L has n inputs, then the input space ΩX of L is the set of all 2^n input vectors. Similarly, if the circuit L has r outputs, then the set of all 2^r output vectors is the output space ΩY of L. Further, let N be the subset of inputs (referred to as the input codespace) received by the logic circuit under normal operation, and S be the subset of outputs (referred to as the output codespace) of the circuit under normal operation.

Definition 3: A circuit L is fault-secure (FS) for an input set N and a fault set F if, for any input X in N and for any fault f in F, Y(X, f) = Y(X, λ), or Y(X, f) ∉ S.

A fault-secure circuit is depicted in Figure 1.2.2.
Figure 1.2.2. Fault-secure circuit
Definition 4: A circuit L is self-testing (ST) for an input set N and a fault set F if, for every fault f in F, there is some input X in N such that Y(X, f) is not in S.

The model for a self-testing circuit is shown in Figure 1.2.3.
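Definitions 3 and 4 can be checked exhaustively for a small circuit, given its output function under each fault. A sketch (ours, with f = None standing for the null fault λ):

    def fault_secure(N, F, S, Y):
        """Definition 3: every faulty output is correct or a noncodeword."""
        return all(Y(x, f) == Y(x, None) or Y(x, f) not in S
                   for x in N for f in F)

    def self_testing(N, F, S, Y):
        """Definition 4: every fault is exposed by some codeword input."""
        return all(any(Y(x, f) not in S for x in N) for f in F)

    def totally_self_checking(N, F, S, Y):       # Definition 5, below
        return fault_secure(N, F, S, Y) and self_testing(N, F, S, Y)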
Figure 1.2.3. Self-testing circuit

Definition 5: A circuit L is said to be totally self-checking (TSC) if it is both self-testing and fault-secure.

If each input in N occurs during the normal course of operations of the circuit, then the self-testing property guarantees that all faults in F produce detectable errors during normal operation. In general, self-testing is a difficult condition to satisfy properly, in comparison to fault-secureness. This is due to the fact that in some cases the inputs in N required to detect every fault in F do not exist.

The concept of strongly fault-secure (SFS) was proposed in [20]. According to this concept, a fault sequence {f1, f2, ..., fn}, where fi ∈ F, 1 ≤ i ≤ n, is defined to represent the event where fault f1 occurs, followed by the occurrence of fault f2, and so on until fn occurs. At this instant the effect of the entire fault sequence is present on the system. A line stuck at 0 or 1 is assumed to remain stuck at that value; further, the faults are assumed to occur one at a time, and the time interval between any two such fault occurrences is assumed to be sufficient for all the input code combinations to be applied to the circuit.

Definition 6: Assume that the circuit L always gives correct codewords for a sequence of fewer than m faults, 2 ≤ m ≤ n; that is,

Y(X, {f1, f2, ..., fm-1}) = Y(X, λ)

Also assume that for the m-fault sequence {f1, f2, ..., fm-1, fm}
there is some X ∈ N such that Y(X, {f1, f2, ..., fm}) is not in S.
Then L is said to be strongly fault-secure (SFS) for {f1, f2, ..., fm}. We can verify that any SFS circuit satisfies the TSC conditions. Furthermore, if a circuit is not SFS, it is always possible to produce an erroneous code output prior to a noncode output. Hence, SFS circuits form the largest class of circuits that satisfy the TSC conditions.

Fault-secureness and self-testing specify only the behavior of the circuit for the codeword inputs (the inputs are always assumed to be error-free). However, we also need to study the circuit behavior for noncodeword inputs.

Definition 7: A circuit L with output codespace S is error-secure (ES) for input noncodespace ΩX − N if, for any input Xe in ΩX − N, where Xe = Xp + E, Xp ∈ N, E ≠ 0,

Y(Xe, λ) ∉ S, or Y(Xe, λ) = Y(Xp, λ)
Figure 1.2.4. Error-secure circuit
This behavior is illustrated in Figure 1.2.4. If an error-secure circuit receives an erroneous input, it either passes a noncodeword on to the subsequent circuit blocks, or masks (i.e., corrects) the error in the input word.

Definition 8: A circuit L with output codespace S is error-preserving (EP) or code-disjoint (CD) for input noncodespace ΩX − N if, for any input X in ΩX − N, Y(X, λ) ∉ S.

An error-preserving circuit is error-secure, but not vice versa. Under fault-free operation, a code-disjoint circuit maps all of its input noncodespace members to noncodeword outputs.

Definition 9: Assume that the circuit L is such that for a sequence of fewer than m faults, 2 ≤ m ≤ n, in F, and for all noncodeword inputs X ∈ ΩX − N,

Y(X, {f1, f2, ..., fm-1}) ∉ S
Also assume that for the m-fault sequence {f1, f2, ..., fm-1, fm} the circuit L is self-testing. This implies

Y(X, {f1, f2, ..., fm}) is not in S for some X ∈ N

Then the circuit L is said to be strongly code-disjoint (SCD) for {f1, f2, ..., fm}.
1.2.2.2 Arithmetic Codes

Codes used for arithmetic operations are modeled as integers in some ring Z_M, with modulus M. The codewords are to be closed under addition of integers. Any integer N can be expressed as a polynomial in radix r as

N = a_{n-1} r^{n-1} + a_{n-2} r^{n-2} + ... + a_1 r + a_0, for a_i ∈ {0, 1, ..., r-1}.

The integer is written in the form of an n-tuple,
Definition 10: The arithmetic weight W_ar(N) of an integer N is the smallest number of nonzero terms in a minimal expression for N of the form

N = ±a_{n-1} r^{n-1} ± a_{n-2} r^{n-2} ± ... ± a_1 r ± a_0

where a_i ∈ {0, 1, ..., r-1} and r is the radix of the system. For example, the integer 31 in radix 2 is written as 31 = (11111)_2, and its arithmetic weight, obtained from the minimal expression 31 = 2^5 - 2^0, is W_ar(31) = 2. We can clearly see that the arithmetic weight of an integer N depends on the radix r of the system.

Definition 11: The arithmetic distance between two integers N1 and N2, denoted D_ar(N1, N2), is given by the arithmetic weight of N1 - N2. For example, D_ar(31, 39) = W_ar(31 - 39) = W_ar(-2^3) = 1.

The arithmetic distance so defined corresponds with the errors that occur in arithmetic units, based on error propagation. A code C with minimum arithmetic distance d_min provides the following properties:

1. Code C can detect up to d arithmetic errors for d < d_min.
2. Code C can correct t or fewer arithmetic errors if and only if d_min ≥ 2t + 1.
3. Code C can correct up to t errors and detect up to d errors, with d > t, if and only if d_min ≥ d + t + 1.
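In radix 2 the minimal signed-digit expression can be obtained from the non-adjacent form (a standard construction, not discussed in the original), so the arithmetic weight is easy to compute; a small sketch of ours:

    def arithmetic_weight(n):
        """W_ar(n) in radix 2, via the minimal-weight non-adjacent form."""
        n, w = abs(n), 0
        while n:
            if n & 1:
                n -= 2 - (n % 4)       # signed digit +1 or -1
                w += 1
            n //= 2
        return w

    assert arithmetic_weight(31) == 2          # 31 = 2^5 - 2^0
    assert arithmetic_weight(31 - 39) == 1     # D_ar(31, 39) = W_ar(-2^3)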
1.2.2.3 Arithmetic Code Classes

For arithmetic codewords, carries occur between information digits and check digits and may cause problems in handling. We should hence make a clear distinction between systematic and separate codes.
Definition 12: An arithmetic code is said to be systematic if for each codeword (of n digits) there are k specified positions called information digits, and the remaining n - k positions are known as the check digits.

Definition 13: An arithmetic code is said to be a separate code if the code is systematic and the addition structure for the code has separate adders for the information and check digits. This implies that no carries occur between the information and the check digits during addition.

There are three distinct classes of arithmetic codes: (1) AN codes, which are non-systematic; (2) residue codes, which are also referred to as separate codes; and (3) systematic AN codes. Systematic AN codes are not separate. This indicates that codewords are integers and there could be carry propagation between the information and check parts. We shall briefly describe the AN codes and the residue codes. Also, parity codes and two-rail codes are briefly reviewed here.
1.2.2.4 AN Codes

In AN codes, A denotes the generator of the code and N is the represented information. Every codeword is an integer of the form AN. For information N ∈ Z_m, where M = A × m, the integers m and M are called the range of information and the modulus of the code, respectively. There are m codewords {0, A, 2A, ..., (m-1)A}. Each codeword is represented as a binary n-tuple, and hence n satisfies the following inequality:

2^{n-1} ≤ A(m-1) < 2^n

and n is called the length of the code. For two codewords AN1 and AN2, their sum

R = |AN1 + AN2|_M = A × |N1 + N2|_m

is also a codeword. Thus, AN codes are linear, since the sum of two codewords is also a codeword. If an error e is introduced in the addition, then the erroneous result is given by the relation:

|R + e|_M = |AN3 + e|_M

To check for errors, we find the syndrome of R, denoted as follows:

S(R) = |R|_A = | |AN3 + e|_M |_A
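A toy example (ours; the parameter values are illustrative) with A = 3 shows the mechanics: a single arithmetic error ±2^i changes the result by a quantity that is not a multiple of A, so the syndrome R mod A becomes nonzero.

    A, m = 3, 8                      # generator and information range
    M = A * m                        # code modulus

    encode   = lambda n: (A * n) % M
    add      = lambda r1, r2: (r1 + r2) % M
    syndrome = lambda r: r % A

    r = add(encode(3), encode(4))    # codeword for N3 = 7
    assert syndrome(r) == 0          # error-free: syndrome is zero

    r_bad = (r + 2 ** 2) % M         # arithmetic error e = +2^2
    assert syndrome(r_bad) == 1      # nonzero syndrome: error detected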
1.2.2.5 Residue Codes

Systematic codes that have separate adders for the information and check parts are called separate or residue codes.

Error detection using residue codes: In this case, the applied codewords are of the form [N, C(N)]. The code is closed (or preserved) under the addition operation if and only if C(N) is a residue check of N for some base b, and the operation * in the checker is an addition modulo b. The separate adder and checker circuit is depicted in Figure 1.2.5.
Figure 1.2.5. Separate Adder and Checker circuit

For a detailed study of arithmetic codes, the reader is advised to refer to [15].
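A sketch of the separate adder and checker of Figure 1.2.5 (ours; the base b is chosen only for illustration):

    b = 3                                        # residue check base

    def checked_add(n1, n2):
        s  = n1 + n2                             # main adder
        cs = (n1 % b + n2 % b) % b               # separate check adder
        return s, cs, (s % b) != cs              # mismatch raises the error flag

    s, cs, err = checked_add(10, 5)
    assert not err                               # 15 mod 3 == (1 + 2) mod 3
    assert ((s ^ 0b100) % b) != cs               # a flipped sum bit is caught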
1.2.2.6 Parity Codes

Codes which use the concept of parity are termed parity codes or parity-based codes. Here, we shall briefly review the concept of parity. The parity of an
integer can be either odd or even. An even parity is referred to as parity 0, while parity 1 denotes an odd parity. The parity of a certain group of bits refers to the parity of the number of elements of value 1 in the group. Furthermore, the parity of the sum of two blocks (of digits) is equal to the sum of their parities. A block of bits of length k can be converted to a block of length k + 1 having a desired parity, 0 or 1, by adding a 0 or 1 parity bit. The parity bit is obtained by XORing all the k bits of the original block.

Let us briefly see how the parity concept is used in determining the existence of an odd number of errors in a received block. Let the information block I to be transmitted be of length k. The transmitter appends the parity bit to the information part of length k, thus converting it into a block of length n = k + 1 having parity 0. The receiver knows that any transmitted block (of length n = k + 1) has parity 0. Upon receiving a block, the receiver checks its parity, and the received block is determined to be erroneous if and only if its parity is 1. The process performed at the transmitting end is termed the encoding process, while that performed at the receiving end is referred to as the decoding process. Figure 1.2.6 illustrates the encoding and decoding processes.
Figure 1.2.6. Basic Error Detection scheme
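The encoding and decoding of Figure 1.2.6 amount to a few XORs; a minimal sketch of ours:

    def encode(info_bits):
        p = 0
        for bit in info_bits:
            p ^= bit                     # parity bit = XOR of the k bits
        return info_bits + [p]           # block of length n = k + 1, parity 0

    def is_erroneous(block):
        p = 0
        for bit in block:
            p ^= bit
        return p == 1                    # parity 1: odd number of errors

    block = encode([1, 0, 1, 1])
    assert not is_erroneous(block)
    block[2] ^= 1                        # a single transmission error
    assert is_erroneous(block)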
1.2.2.7 Two-Rail Codes

One of the techniques used to encode information bits is the two-rail encoding scheme. In a two-rail encoding scheme, bits of information occur as complementary pairs, (0, 1) and (1, 0); the pairs (0, 0) and (1, 1) denote errors. Thus, the same circuit that checks a two-rail encoding can be used to combine several self-checking checker output pairs into one pair. A self-checking two-rail code checker will map m input
pairs, referred to as {(a0, b0), (a1, b1), ..., (a_{m-1}, b_{m-1})}, to one output pair referred to as (z0, z1). The output pair must be complementary if and only if each of the input pairs is complementary. Figure 1.2.7 shows the schematic for a two-rail code checker with duplication check. The duplicated check is sometimes designed in a complementary form to prevent identical failure states in both the functional circuit and the duplicated circuit. In Figure 1.2.7, the output of one of the two identical circuits is inverted and fed into a self-checking two-rail code checker. The circuit acts as a self-testing comparator.
Figure 1.2.7. Two-rail code checker for duplication check
1.2.3 Berger Codes

Berger code is the only known systematic all-unidirectional error detecting code. It has two different encoding schemes: B0 and B1. The B0 encoding scheme uses the binary representation of the number of 0's in the information bits as the check symbol, and the B1 encoding scheme uses the 1's complement of the number of 1's in the information bits as the check symbol. Both encoding schemes are valid for all-unidirectional error detection. Let us consider the B0 encoding scheme here.
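Both encodings are immediate to compute; a sketch of ours for k information bits, using C = ⌈log2(k + 1)⌉ check bits:

    def b0_check(bits):
        """B0: binary count of the 0's in the information bits."""
        return bits.count(0)

    def b1_check(bits):
        """B1: 1's complement of the count of 1's, in C check bits."""
        C = len(bits).bit_length()           # equals ceil(log2(k + 1))
        return ~bits.count(1) & ((1 << C) - 1)

    info = [1, 0, 1, 1, 0, 1, 0]             # k = 7, so C = 3
    assert b0_check(info) == 3
    assert b1_check(info) == 0b011           # complement of 4 in 3 bits

    # Unidirectional (all 1 -> 0) errors lower the 1-count and raise the
    # 0-count, so the errors cannot be compensated within the check symbol.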
Using the standard (n, k) block code approach, for the codeword [a0, a1, ..., a_{k-1}, ...] [...] If 2^C > n, then a reduced Berger code is just a Berger code.

Let us consider the addition of two n-bit numbers, x = (xn, ..., x2, x1) and y = (yn, ..., y2, y1), to obtain a sum s = (sn, ..., s2, s1) with internal carries c = (cn, ..., c2, c1), where the input carry cin = c0 and the output carry cout = cn. For each 1 ≤ i ≤ n, we have

xi + yi + c_{i-1} = 2 ci + si .

Thus

Σ_{i=1..n} xi + Σ_{i=1..n} yi + Σ_{i=1..n} c_{i-1} = 2 Σ_{i=1..n} ci + Σ_{i=1..n} si

Since W(x) = Σ_{i=1..n} xi, we have

W(x) + W(y) + W(c) + c0 - cout = 2 W(c) + W(s)    (1.2.2)

Since B(x) = n - W(x), and from Eq. (1.2.2), we have

B(s) = B(x) + B(y) - B(c) - c0 + cout    (1.2.3)

From Definition 14, we have

R(s) = ( R(x) + R(y) - R(c) - c0 + cout ) mod 2^C    (1.2.4)

In the two's complement ALU design, the subtraction s = x - y is handled by performing the addition s = x + ~y + 1, where ~y denotes the bitwise complement of y. However, if a carry input is required for the subtraction, then the carry input to the adder must be inverted to obtain the result s = x - y - cin. Thus, we assume that the inputs to the ALU are x, y, and cin for the general case, and x, ~y, and ~cin during the two's complement subtraction. Similarly, we have

R(s) = ( R(x) - R(y) - R(c) + ( n mod 2^C ) - ~c0 + cout ) mod 2^C    (1.2.5)

If ( n mod 2^C ) = 0, then we have

R(s) = ( R(x) - R(y) - R(c) - ~c0 + cout ) mod 2^C    (1.2.6)

The RBCP scheme can be easily extended to other arithmetic operations, such as the 1's complement subtraction, sign-magnitude subtraction, multiplication, and division.

There are 16 possible logical operations on two operands, including six trivial operations (0, 1, x, y, ~x, ~y) and ten nontrivial operations. Here we only examine three basic logical operations: AND (∩), OR (∪), and XOR (⊕). We can easily verify the following:

xi ∩ yi = xi + yi - ( xi ∪ yi )    (1.2.7)

xi ∪ yi = xi + yi - ( xi ∩ yi )    (1.2.8)

xi ⊕ yi = xi + yi - 2 ( xi ∩ yi )    (1.2.9)

Similarly, we have

for the AND operation, R(s) = ( R(x) + R(y) - R(x ∪ y) ) mod 2^C    (1.2.10)

for the OR operation, R(s) = ( R(x) + R(y) - R(x ∩ y) ) mod 2^C    (1.2.11)

for the XOR operation, R(s) = ( R(x) + R(y) - 2 R(x ∩ y) + ( n mod 2^C ) ) mod 2^C    (1.2.12)

If ( n mod 2^C ) = 0, then equation (1.2.12) reduces to

R(s) = ( R(x) + R(y) - 2 R(x ∩ y) ) mod 2^C    (1.2.13)

In the following, we describe the design of a circuit to implement these equations. For practical reasons, we consider a 32-bit ALU.
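Before turning to the circuit, Eqs. (1.2.4) and (1.2.10)-(1.2.13) can be verified exhaustively at a small width; the sketch below (ours) takes R(v) = B(v) mod 2^C and uses n = 4 and C = 2, so that n mod 2^C = 0.

    from itertools import product

    n, C = 4, 2
    MOD = 2 ** C

    def R(v):                                # reduced Berger check of n-bit v
        return (n - bin(v).count("1")) % MOD

    for x, y, c0 in product(range(2 ** n), range(2 ** n), (0, 1)):
        s, cout = (x + y + c0) % 2 ** n, (x + y + c0) >> n
        carries, ci = 0, c0                  # pack c1..cn into an n-bit vector
        for i in range(n):
            ci = ((x >> i & 1) + (y >> i & 1) + ci) >> 1
            carries |= ci << i
        assert R(s) == (R(x) + R(y) - R(carries) - c0 + cout) % MOD   # (1.2.4)
        assert R(x & y) == (R(x) + R(y) - R(x | y)) % MOD             # (1.2.10)
        assert R(x | y) == (R(x) + R(y) - R(x & y)) % MOD             # (1.2.11)
        assert R(x ^ y) == (R(x) + R(y) - 2 * R(x & y)) % MOD         # (1.2.13)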
1.2.4.2 The Design of RBCP ALU

The design of a reduced Berger check prediction circuit for a typical arithmetic and logic unit (ALU) will be given in this section. The ALU has three control signals: a0, a1, and a2. When a0 is "0", the ALU performs the arithmetic operations. When a0 is "1", the ALU performs the logic operations. The signals a1 and a2 select a specified arithmetic or logical operation to be performed. The operations determined by these three control signals are listed in Table 1.2.1.
Controls            Function
a0  a1  a2
-----------------------------------------
0   0   X           S = X + Y + cin
0   1   X           S = X - Y - cin
1   0   0           S = X ∩ Y
1   0   1           S = X ⊕ Y
1   1   0           S = X ∪ Y
1   1   1           S = ~X

X: don't care

Table 1.2.1 Function Table for RBCP ALU in Figure 1.2.9
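A behavioral sketch (ours) of Table 1.2.1 as reconstructed above; per the text, the subtraction row is realized by feeding the adder x, the complement of y, and the inverted carry:

    def alu(a0, a1, a2, x, y, cin, width=32):
        mask = (1 << width) - 1
        if a0 == 0:                          # arithmetic (a2 is a don't-care)
            if a1 == 0:
                return (x + y + cin) & mask              # S = X + Y + cin
            return (x + (~y & mask) + (cin ^ 1)) & mask  # S = X - Y - cin
        return {(0, 0): x & y,               # S = X AND Y
                (0, 1): x ^ y,               # S = X XOR Y
                (1, 0): x | y,               # S = X OR Y
                (1, 1): ~x & mask}[(a1, a2)] # S = NOT X

    assert alu(0, 0, 0, 7, 5, 0) == 12
    assert alu(0, 1, 0, 7, 5, 0) == 2
    assert alu(1, 0, 1, 0b1100, 0b1010, 0) == 0b0110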
The RBCP ALU circuit is shown in Figure 1.2.9. The circuit translating the external signals (a0,