INDUSTRIAL INFORMATION TECHNOLOGY SERIES
Series Editor: RICHARD ZURAWSKI

Published Books
Industrial Communication Technology Handbook, Edited by Richard Zurawski
Embedded Systems Handbook, Edited by Richard Zurawski

Forthcoming Books
Electronic Design Automation for Integrated Circuits Handbook, Luciano Lavagno, Grant Martin, and Lou Scheffer
© 2006 by Taylor & Francis Group, LLC
EMBEDDED SYSTEMS HANDBOOK

Edited by
RICHARD ZURAWSKI

Boca Raton  London  New York

A CRC title, part of the Taylor & Francis imprint, a member of the Taylor & Francis Group, the academic division of T&F Informa plc.
Published in 2006 by
CRC Press, Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2006 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group

No claim to original U.S. Government works
Printed in the United States of America on acid-free paper
10 9 8 7 6 5 4 3 2 1

International Standard Book Number-10: 0-8493-2824-1 (Hardcover)
International Standard Book Number-13: 978-0-8493-2824-4 (Hardcover)
Library of Congress Card Number 2005040574

This book contains information obtained from authentic and highly regarded sources. Reprinted material is quoted with permission, and sources are indicated. A wide variety of references are listed. Reasonable efforts have been made to publish reliable data and information, but the author and the publisher cannot assume responsibility for the validity of all materials or for the consequences of their use.

No part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Library of Congress Cataloging-in-Publication Data

Embedded systems handbook / edited by Richard Zurawski.
p. cm.
Includes bibliographical references and index.
ISBN 0-8493-2824-1 (alk. paper)
1. Embedded computer systems--Handbooks, manuals, etc. I. Zurawski, Richard.
TK7895.E42E64 2005
004.16--dc22   2005040574

Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com
and the CRC Press Web site at http://www.crcpress.com

Taylor & Francis Group is the Academic Division of T&F Informa plc.
To my wife, Celine
International Advisory Board

Alberto Sangiovanni-Vincentelli, University of California, Berkeley, U.S. (Chair)
Giovanni De Micheli, Stanford University, U.S.
Stephen A. Edwards, Columbia University, U.S.
Aarti Gupta, NEC Laboratories, Princeton, U.S.
Rajesh Gupta, University of California, San Diego, U.S.
Axel Jantsch, Royal Institute of Technology, Sweden
Wido Kruijtzer, Philips Research, The Netherlands
Luciano Lavagno, Cadence Berkeley Laboratories, Berkeley, U.S., and Politecnico di Torino, Italy
Robert de Simone, INRIA, France
Grant Martin, Tensilica, U.S.
Pierre G. Paulin, ST Microelectronics, Canada
Antal Rajnák, Volcano AG, Switzerland
Françoise Simonot-Lion, LORIA, France
Thomas Weigert, Motorola, U.S.
Reinhard Wilhelm, University of Saarland, Germany
Lothar Thiele, Swiss Federal Institute of Technology, Switzerland
Preface

Introduction

The purpose of the Embedded Systems Handbook is to provide a reference useful to a broad range of professionals and researchers from industry and academia involved in the evolution of concepts and technologies, as well as the development and use of embedded systems and related technologies.

The book provides a comprehensive overview of the field of embedded systems and applications. The emphasis is on advanced material to cover recent significant research results and technology evolution and developments. It is primarily aimed at experienced professionals from industry and academia, but will also be useful to novices with some university background in embedded systems and related areas. Some of the topics presented in the book have received limited coverage in other publications, owing either to the fast evolution of the technologies involved, material confidentiality, or limited circulation in the case of industry-driven developments.

The book covers extensively the design and validation of real-time embedded systems, design and verification languages, operating systems and scheduling, timing and performance analysis, power aware computing, security in embedded systems, the design of application-specific instruction-set processors (ASIPs), system-on-chip (SoC) and network-on-chip (NoC), testing of core-based ICs, networked embedded systems and sensor networks, and embedded applications, including in-car embedded electronic systems, intelligent sensors, and embedded web servers for industrial automation. The book contains 46 contributions, written by leading experts from industry and academia directly involved in the creation and evolution of the ideas and technologies treated in the book.

Many of the contributions are from industry and industrial research establishments at the forefront of the developments shaping the field of embedded systems: Cadence Systems and Cadence Berkeley Labs (USA), CoWare (USA), Microsoft (USA), Motorola (USA), NEC Laboratories (USA), Philips Research (The Netherlands), ST Microelectronics (Canada), Tensilica (USA), Volcano (Switzerland), etc. The contributions from academia and governmental research organizations are represented by some of the most renowned institutions, such as Columbia University, Duke University, Georgia Institute of Technology, Princeton University, Stanford University, University of California at Berkeley/Riverside/San Diego/Santa Barbara, University of Texas at Austin/Dallas, Virginia Tech, Washington University — from the United States; Delft University of Technology (The Netherlands), IMAG (France), INRIA/IRISA (France), LORIA-INPL (France), Malardalen University (Sweden), Politecnico di Torino (Italy), Royal Institute of Technology — KTH (Sweden), Swiss Federal Institute of Technology — ETHZ (Switzerland), Technical University of Berlin (Germany), Twente University (The Netherlands), Universidad Politecnica de Madrid (Spain), University of Bologna (Italy), University of Nice Sophia Antipolis (France), University of Oslo (Norway), University of Pavia (Italy), University of Saarbrucken (Germany), University of Toronto (Canada), and many others.

The material presented is in the form of tutorials, surveys, and technology overviews. The contributions are grouped into sections for cohesive and comprehensive presentation of the treated areas. The reports on recent technology developments, deployments, and trends frequently cover material released to the profession for the first time.

The book can be used as a reference (or prescribed text) for university (post)graduate courses. Section I (Embedded Systems) provides "core" material on embedded systems. Selected illustrations of actual applications are presented in Section VI (Embedded Applications). Sections II and III (System-on-Chip Design, and Testing of Embedded Core-Based Integrated Circuits) offer material on recent advances in system-on-chip design and testing of core-based ICs. Sections IV and V (Networked Embedded Systems, and Sensor Networks) are suitable for a course on sensor networks.
The handbook is designed to cover a wide range of topics that comprise the field of embedded systems and applications. The material covered in this volume will be of interest to a wide spectrum of professionals and researchers from industry and academia, as well as graduate students, in the fields of electrical and computer engineering, computer science, software engineering, and mechatronic engineering. It is an indispensable companion for those who seek to learn more about embedded systems and applications, and those who want to stay up to date with recent technical developments in the field. It is also a comprehensive reference for university or professional development courses on embedded systems.
Organization

Embedded systems is a vast field encompassing numerous disciplines. Not every topic, however important, can be covered in a book of reasonable volume without superficial treatment. Choices need to be made with respect to the topics covered, the balance between research material and reports on novel industrial developments and technologies, the balance between so-called "core" topics and new trends, and other aspects. The "time-to-market" is another important factor in making those decisions, along with the availability of qualified authors to cover the topics.

One of the main objectives of any handbook is to give a well-structured and cohesive description of the fundamentals of the area under treatment. It is hoped that the section Embedded Systems has achieved this objective. Every effort was made to ensure that each contribution in this section contains introductory material to assist beginners with navigation through more advanced issues. This section does not strive to replicate or replace university-level material, but rather tries to address more advanced issues, and recent research and technology developments.

To make this book timely and relevant to a broad range of professionals and researchers, the book includes material reflecting state-of-the-art trends, covering topics such as the design of ASIPs, SoC communication architectures including NoC, the design of heterogeneous SoCs, and the testing of core-based integrated circuits. This material reports on new approaches, methods, technologies, and actual systems. The contributions come from the industry driving those developments, industry-affiliated research institutions, and academic establishments participating in major research initiatives.

Application domains have had a considerable impact on the evolution of embedded systems, in terms of required methodologies and supporting tools, and resulting technologies. A good example is the accelerated evolution of SoC design to meet the demands for computing power posed by DSP, network, and multimedia processors. SoCs are slowly making inroads into the area of industrial automation to implement complex field-area intelligent devices which integrate intelligent sensor/actuator functionality by providing on-chip signal conversion, data and signal processing, and communication functions. There is a growing tendency to network field-area intelligent devices around industrial communication networks. Similar trends appear in automotive electronic systems, where the Electronic Control Units (ECUs) are networked by means of safety-critical communication protocols, such as FlexRay, for the purpose of controlling vehicle functions such as electronic engine control, anti-lock braking, active suspension, etc. The design of this kind of networked embedded system (this also includes hard real-time industrial control systems) is a challenge in itself due to the distributed nature of the processing elements, the shared communication medium, and safety-critical requirements. With the automotive industry increasingly keen on adopting mechatronic solutions, it was felt that exploring, in detail, the design of in-vehicle electronic embedded systems would be of interest to the readers of this book.

The applications part of the book also touches the area of industrial automation (networked control systems), where the issues are similar. In this case, the focus is on the design of web servers embedded in intelligent field-area devices, and the security issues arising from internetworking. Sensor networks are another example of networked embedded systems, although the "embedding" factor is not so evident as in other applications, particularly for wireless and self-organizing networks where the nodes may be embedded in an ecosystem, a battlefield, or a chemical plant, for instance. The area of
wireless sensor networks has now evolved into relative maturity. Owing to its novelty and growing importance, it has been included in the book to give a comprehensive overview of the area and to present new research results which are likely to have a tangible impact on further developments and technology. The specifics of the design automation of integrated circuits have been deliberately omitted from this book to keep the volume at a reasonable size, and in view of the publication of another handbook which covers these aspects in a comprehensive way: The Electronic Design Automation for Integrated Circuits Handbook, CRC Press, FL, 2005, Editors: Luciano Lavagno, Grant Martin, and Lou Scheffer.

The aim of the Organization section is to provide highlights of the contents of the individual chapters to assist readers in identifying material of interest, and to put the topics discussed in a broader context. Where appropriate, a brief explanation of the topic under treatment is provided, particularly for chapters describing novel trends, and with novices in mind. The book is organized into six sections: Embedded Systems, System-on-Chip Design, Testing of Embedded Core-Based Integrated Circuits, Networked Embedded Systems, Sensor Networks, and Embedded Applications.
I Embedded Systems

This section provides a broad introduction to embedded systems. The presented material offers a combination of fundamental and advanced topics, as well as novel results and approaches, to cover the area fairly comprehensively. The presented topics include issues in real-time and embedded systems, design and validation, design and verification languages, operating systems, timing and performance analysis, power aware computing, and security.
Real-Time and Embedded Systems

This subsection provides a context for the material covered in the book. It gives an overview of real-time and embedded systems and their networking to include issues, methods, trends, applications, etc.

The focus of the chapter Embedded Systems: Toward Networking of Embedded Systems is on networking of embedded systems. It briefly discusses the rationale for the emergence of these kinds of systems, their benefits, types of systems, the diversity of application domains and requirements arising from that, as well as security issues. Subsequently, the chapter discusses the design methods for networked embedded systems, which fall into the general category of system-level design. The methods overviewed focus on two separate aspects, namely the network architecture design and the system-on-chip design. The design issues and practices are illustrated by examples from the automotive application domain. After that, the chapter introduces selected application domains for networked embedded systems, namely: industrial and building automation control, and automotive control applications. The focus of the discussion is on the networking aspects. The chapter gives an overview of the networks used in industrial applications, including industrial Ethernet and its standardization process; building automation control; and networks for automotive control and other applications from the automotive domain — but the emphasis is on networks for safety-critical solutions. Finally, general aspects of wireless sensor/actuator networks are presented, and illustrated by an actual industrial implementation of the concept. At the end of the chapter, a few paragraphs are dedicated to the security issues for networked embedded systems.

An authoritative introduction to real-time systems is provided in Real-Time in Embedded Systems.
The chapter covers extensively the areas of design and analysis, with some examples of analysis, as well as tools; operating systems (an in-depth discussion of real-time embedded operating systems is presented in the chapter Real-Time Embedded Operating Systems: Standards and Perspectives); scheduling (the chapter Real-Time Embedded Operating Systems: The Scheduling and Resource Management Aspects presents an authoritative description and analysis of real-time scheduling); communications, including descriptions of selected fieldbus technologies and Ethernet for real-time communications; and component-based design, as well as testing and debugging. This is essential reading for anyone interested in the area of real-time systems.
Design and Validation of Embedded Systems

The subsection Design and Validation of Embedded Systems contains material presenting design methodology for embedded systems and supporting tools, as well as selected software and hardware implementation aspects. Models of Computation (MoC) — essentially abstract representations of computing systems — are used throughout to facilitate the design and validation stages of systems development; approaches to validation, as well as available methods and tools, are also covered. The verification methods, together with an overview of verification languages, are presented in the subsection Design and Verification Languages. In addition, the subsection presents novel research material, including a framework used to introduce different models of computation particularly suited to the design of heterogeneous multiprocessor SoCs, and a mathematical model of embedded systems based on the theory of agents and interactions.

A comprehensive introduction to the design methodology for embedded systems is presented in the chapter Design of Embedded Systems. It gives an overview of the design issues and stages. Then, the chapter presents, in quite some detail, functional design, function/architecture and hardware/software codesign, and hardware/software coverification and hardware simulation. Subsequently, the chapter discusses selected software and hardware implementation issues. While discussing different design stages and approaches, the chapter also introduces and evaluates supporting tools.

An excellent introduction to the topic of models of computation, particularly for embedded systems, is presented in the chapter Models of Embedded Computation. The chapter introduces the origin of MoC, and the evolution from models of sequential and parallel computation to attempts to model heterogeneous architectures.
In the process, the chapter discusses, in relative detail, selected nonfunctional properties such as power consumption, component interaction in heterogeneous systems, and time. It also presents a new framework used to introduce four different models of computation, and shows how different time abstractions can serve different purposes and needs. The framework is subsequently used to study the coexistence of different computational models, specifically the interfaces between two different MoCs and the refinement of one MoC into another. This part of the chapter is particularly relevant to the material on the design of heterogeneous multiprocessor SoCs presented in the section System-on-Chip Design.

A comprehensive survey of selected models of computation is presented in the chapter Modeling Formalisms for Embedded System Design. The surveyed formalisms include Finite State Machines (FSM), Finite State Machines with Datapath (FSMD), Moore machines, Mealy machines, Codesign Finite State Machines (CFSM), Program State Machines (PSM), the Specification and Description Language (SDL), Message Sequence Charts (MSC), Statecharts, Petri nets, synchronous/reactive models, discrete event systems, dataflow models, etc. The presentation of individual models is augmented by numerous examples.

The chapter System Validation briefly discusses approaches to requirements capture, analysis, and validation, and surveys available methods and tools to include: descriptive formal methods such as VDM, Z, B, RAISE (Rigorous Approach to Industrial Software Engineering), CASL (Common Algebraic Specification Language), SCR (Software Cost Reduction), and EVES; deductive verifiers: HOL, Isabelle, PVS, Larch, Nqthm, and Nuprl; and state exploration tools: SMV (Symbolic Model Verifier), Spin, COSPAN (COordination SPecification Analysis), MEIJE, CADP, and Murphi. It also presents a mathematical model of embedded systems based on the theory of agents and interactions.
To underline the novelty of this formalism, classical theories of concurrency are surveyed, including process algebras, temporal logic, timed automata, (Gurevich's) ASM (Abstract State Machine), and rewriting logic. As an illustration, the chapter presents a specification of a simple scheduler.
Design and Verification Languages

This section gives a comprehensive overview of languages used to specify, model, verify, and program embedded systems. Some of those languages embody different models of computation discussed in the previous section. A brief overview of Architecture Description Languages (ADL) is presented in
Embedded Applications (Automotive Networks); the use of this class of languages, in the context of describing in-car embedded electronic systems, is illustrated through the EAST-ADL language.

An authoritative introduction to a broad range of languages used in embedded systems is presented in the chapter Languages for Embedded Systems. The chapter surveys some of the most representative and widely used languages. Software languages: assembly languages for complex instruction set computers (CISC), reduced instruction set computers (RISC), digital signal processors (DSPs), and very long instruction word processors (VLIWs), and for small (4- and 8-bit) microcontrollers; the C and C++ languages; Java; and real-time operating systems. Hardware languages: Verilog and VHDL. Dataflow languages: Kahn Process Networks and Synchronous Dataflow (SDF). Hybrid languages: Esterel, SDL, and SystemC. Each group of languages is characterized by its specific application domains and illustrated with ample code examples.

An in-depth introduction to synchronous languages is presented in The Synchronous Hypothesis and Synchronous Languages. Before introducing the synchronous languages, the chapter discusses the concept of the synchronous hypothesis: the basic notion, mathematical models, and implementation issues. Subsequently, it overviews the structural languages used for modeling and programming synchronous applications. Imperative languages, Esterel and SyncCharts, provide constructs to deal with control-dominated programs. Declarative languages, Lustre and Signal, are particularly suited for applications based on intensive data computation and dataflow organization. Future trends are also covered.

The chapter Introduction to UML and the Modeling of Embedded Systems gives an overview of the use of UML (Unified Modeling Language) for modeling embedded systems.
The chapter presents a brief overview of UML and discusses UML features suited to represent the characteristics of embedded systems. The UML constructs, the language use, and other issues are introduced through an example of an automatic teller machine. The chapter also briefly discusses a standardized UML profile (a specification language instantiated from the UML language family) suitable for modeling of embedded systems.

A comprehensive survey and overview of verification languages is presented in the chapter Verification Languages. It describes languages for verification of hardware, software, and embedded systems. The focus is on the support that a verification language provides for dynamic verification based on simulation, as well as static verification based on formal techniques. Before discussing the languages, the chapter provides some background on verification methods. This part introduces the basics of simulation-based verification, formal verification, and assertion-based verification. It also discusses selected logics that form the basis of the languages described in the chapter: propositional logic, first-order predicate logic, temporal logics, and regular and ω-regular languages. The hardware verification languages (HVLs) covered include e, OpenVera, Sugar/PSL, and ForSpec. The languages for software verification overviewed include programming languages (C/C++ and Java) and modeling languages (UML, SDL, and Alloy). Languages for SoC and embedded systems verification include system-level modeling languages: SystemC, SpecC, and SystemVerilog. The chapter also surveys domain-specific verification efforts, such as those based on Esterel and hybrid systems.
Operating Systems and Quasi-Static Scheduling

This subsection offers a comprehensive introduction to real-time and embedded operating systems, covering fundamentals and selected advanced issues. To complement this material with new developments, it gives an overview of the operating system interfaces specified by the POSIX 1003.1 international standard related to real-time programming, and introduces a class of operating systems based on virtual machines. The subsection also includes research material on quasi-static scheduling.

The chapter Real-Time Embedded Operating Systems: Standards and Perspectives provides a comprehensive introduction to the main features of real-time embedded operating systems. It overviews some of the main design and architectural issues of operating systems: system architectures, the process and thread model, processor scheduling, interprocess synchronization and communication, and network support. The chapter presents a comprehensive overview of the operating system interfaces specified by
the POSIX 1003.1 international standard related to real-time programming. It also gives a short description of selected open-source real-time operating systems, including eCos, µClinux, RT-Linux and RTAI, and RTEMS. The chapter also presents a fairly comprehensive introduction to a class of operating systems based on virtual machines.

Task scheduling algorithms and resource management policies, put in the context of real-time systems, are the main focus of the chapter Real-Time Embedded Operating Systems: The Scheduling and Resource Management Aspects. The chapter discusses in detail periodic task handling, including Timeline Scheduling (TS), Rate-Monotonic (RM) scheduling, the Earliest Deadline First (EDF) algorithm, and approaches to handling tasks with deadlines less than their periods; and aperiodic task handling. Protocols for accessing shared resources discussed include the Priority Inheritance Protocol (PIP) and the Priority Ceiling Protocol (PCP). Novel approaches for handling transient overloads and execution overruns in soft real-time systems working in dynamic environments, which provide efficient support for real-time multimedia systems, are also mentioned in the chapter.

The chapter Quasi-Static Scheduling of Concurrent Specifications presents methods aimed at efficient synthesis of uniprocessor software, with an aim to improve the speed of the scheduled design. The proposed approach starts from a specification represented in terms of concurrent communicating processes, derives an intermediate representation based on Petri nets or Boolean Dataflow Graphs, and finally attempts to obtain a sequential schedule to be implemented on a processor. The potential benefits result from the replacement of explicit communication among processes by data assignment, and a reduced number of context switches due to a reduction in the number of processes.
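The classic utilization-based schedulability tests behind the RM and EDF policies named above can be sketched in a few lines. The following is an illustrative sketch, not material from the chapter; the task set is hypothetical, and the RM test is the standard sufficient (not necessary) Liu and Layland bound:

```python
def utilization(tasks):
    """Total CPU utilization of a set of (wcet, period) tasks."""
    return sum(c / t for c, t in tasks)

def rm_bound(n):
    """Liu-Layland sufficient bound for Rate-Monotonic: n * (2**(1/n) - 1)."""
    return n * (2 ** (1.0 / n) - 1)

def rm_schedulable(tasks):
    """Sufficient (not necessary) RM test: U <= n(2^(1/n) - 1).

    Failing this test does not prove the set unschedulable under RM;
    an exact answer requires response-time analysis.
    """
    return utilization(tasks) <= rm_bound(len(tasks))

def edf_schedulable(tasks):
    """Exact EDF test when deadlines equal periods: U <= 1."""
    return utilization(tasks) <= 1.0

# Hypothetical task set: (worst-case execution time, period) pairs.
tasks = [(1, 4), (2, 6), (3, 10)]
print(rm_schedulable(tasks))   # False: U = 0.883 exceeds the n=3 bound of ~0.780
print(edf_schedulable(tasks))  # True: U <= 1
```

The example also illustrates why EDF dominates RM in this deadline-equals-period setting: any task set with utilization at most one is feasible under EDF, while the RM bound tends to ln 2 ≈ 0.693 as the number of tasks grows.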
Timing and Performance Analysis

Many embedded systems, particularly hard real-time systems, impose strict restrictions on the execution time of tasks, which are required to complete within certain time bounds. For this class of systems, schedulability analysis requires the upper bounds for the execution times of all tasks to be known in order to verify whether the system meets its timing requirements.

The chapter Determining Bounds on Execution Times presents the architecture of the aiT timing-analysis tool and the approach to timing analysis implemented in the tool. In the process, the chapter discusses cache-behavior prediction, pipeline analysis, path analysis using integer linear programming, and other issues. The use of this approach is put in the context of determining upper bounds. In addition, the chapter gives a brief overview of other approaches to timing analysis.

The validation of nonfunctional requirements of selected implementation aspects, such as deadlines, throughputs, buffer space, power consumption, etc., comes under performance analysis. The chapter Performance Analysis of Distributed Embedded Systems discusses the issues behind performance analysis and its role in the design process. It also surveys a few selected approaches to performance analysis for distributed embedded systems, including simulation-based methods, holistic scheduling analysis, and compositional methods. Subsequently, the chapter introduces the performance network approach, influenced, as stated by the authors, by the worst-case analysis of communication networks. The presented approach allows one to obtain upper and lower bounds on quantities such as end-to-end delay and buffer space; it also covers all possible corner cases independent of their probability.
Power Aware Computing

Embedded nodes, or devices, are frequently battery powered. The growing power dissipation, with the increase in density of integrated circuits and clock frequency, has a direct impact on the cost of packaging and cooling, as well as reliability and lifetime. These and other factors make design for low power consumption a high priority for embedded systems.

The chapter Power Aware Embedded Computing presents a survey of design techniques and methodologies aimed at reducing static and dynamic power dissipation. The chapter discusses energy and power modeling, including instruction
level and function level power models, micro-architectural power models, memory and bus models, and battery models. Subsequently, the chapter discusses system/application-level optimizations which explore different task implementations exhibiting different power/energy versus quality-of-service characteristics. Energy-efficient processing subsystems (voltage and frequency scaling, dynamic resource scaling, and processor core selection) are also overviewed in the chapter. Finally, the chapter discusses energy-efficient memory subsystems: cache hierarchy tuning, novel horizontal and vertical cache partitioning schemes, dynamic scaling of memory elements, software-controlled memories, scratch-pad memories, improving access patterns to on-chip memory, special-purpose memory subsystems for media streaming, code compression, and interconnect optimizations.
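The voltage and frequency scaling techniques mentioned above rest on the first-order dynamic-power model of CMOS circuits, P ≈ αCV²f. The sketch below illustrates that model only; all parameter values are invented for illustration and are not taken from the chapter:

```python
def dynamic_power(alpha, cap, volt, freq):
    """First-order CMOS switching power: activity * capacitance * V^2 * f."""
    return alpha * cap * volt ** 2 * freq

def task_energy(alpha, cap, volt, freq, cycles):
    """Energy to execute a fixed cycle count at (volt, freq): P * (cycles / f)."""
    return dynamic_power(alpha, cap, volt, freq) * (cycles / freq)

# Illustrative numbers only: halving V and f cuts power by roughly 8x,
# but the task takes twice as long, so energy falls by ~4x -- for a fixed
# cycle count, energy scales with V^2 alone.
e_full = task_energy(alpha=0.5, cap=1e-9, volt=1.2, freq=1e9, cycles=1e9)
e_half = task_energy(alpha=0.5, cap=1e-9, volt=0.6, freq=0.5e9, cycles=1e9)
print(e_half / e_full)  # ~0.25
```

This is why scaling voltage together with frequency, rather than frequency alone, is the lever that makes DVFS attractive: slowing down without lowering the voltage leaves the energy per task essentially unchanged.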
Security in Embedded Systems

There is a growing trend toward networking of embedded systems. Representative examples of such systems can be found in the automotive, train, and industrial automation domains. Many of those systems are required to be connected to other networks, including LANs, WANs, and the Internet. For instance, there is a growing demand for remote access to process data at the factory floor. This, however, exposes systems to potential security attacks, which may compromise their integrity and cause damage. The limited resources of embedded systems pose a considerable challenge for the implementation of effective security policies which, in general, are resource demanding.

An excellent introduction to the security issues in embedded systems is presented in the chapter Design Issues in Secure Embedded Systems. The chapter outlines security requirements in computing systems, classifies the abilities of attackers, and discusses security implementation levels. The security constraints in embedded systems design discussed include energy considerations, processing power limitations, flexibility and availability requirements, and cost of implementation. Subsequently, the chapter presents the main issues in the design of secure embedded systems. It also covers, in detail, attacks on, and countermeasures for, cryptographic algorithm implementations in embedded systems.
II System-on-Chip Design Multi-Processor Systems-on-Chip (MPSoCs), which combine the advantages of parallel processing with the high integration levels of SoCs, emerged as a viable solution to meet the demand for computational power required by applications such as network and media processors. The design of MPSoCs typically involves the integration of heterogeneous hardware and software IP components. However, the support for reuse of hardware and software IP components is limited, thus potentially making the design process labor-intensive, error-prone, and expensive. Selected component-based design methodologies for the integration of heterogeneous hardware and software IP components are presented in this section, together with other issues such as the design of application-specific instruction-set processors (ASIPs), communication architectures including network-on-chip (NoC), and platform-based design, to name a few. Those topics are presented in nine chapters introducing the SoC concept and design issues; the design of ASIPs; SoC communication architectures; principles and guidelines for NoC design; platform-based design principles; converter synthesis for incompatible protocols; a component-based design automation approach for multiprocessor SoC platforms; an interface-centric approach to the design and programming of embedded multiprocessors; and an exploration multiprocessor SoC platform developed by STMicroelectronics. A comprehensive introduction to the SoC concept, in general, and design issues is provided in the chapter System-on-Chip and Network-on-Chip Design. The chapter discusses the basics of SoC; IP cores and virtual components; introduces the concept of architectural platforms and surveys selected industry offerings; and provides a comprehensive overview of the SoC design process. A retargetable framework for ASIP design is presented in A Novel Methodology for the Design of Application-Specific Instruction-Set Processors.
The framework, which is based on machine descriptions in the LISA language, allows for automatic generation of software development tools including a high-level language (HLL) C compiler, assembler, linker, simulator, and graphical debugger frontend. In addition, synthesizable
© 2006 by Taylor & Francis Group, LLC
hardware description language code can be derived for architecture implementation. The chapter also gives an overview of various machine description languages in the context of their suitability for ASIP design; it discusses the ASIP design flow and the LISA language. On-chip communication architectures are presented in the chapter State-of-the-Art SoC Communication Architectures. The chapter offers an in-depth description and analysis of the three architectures most relevant from industrial and research viewpoints: the ARM-developed AMBA (Advanced Microcontroller Bus Architecture), together with the new interconnect schemes Multi-Layer AHB and AMBA AXI; the IBM-developed CoreConnect; and the STMicroelectronics-developed STBus. In addition, the chapter surveys other architectures such as Wishbone, Sonics SiliconBackplane Micronetwork, Peripheral Interconnect Bus (PI-Bus), Avalon, and CoreFrame. The chapter also offers an analysis of selected architectures and extends the discussion of on-chip interconnects to NoCs. Basic principles and guidelines for NoC design are introduced in Network-on-Chip Design for Gigascale Systems-on-Chip. The chapter discusses the rationale for the design paradigm shift of SoC communication architectures from shared buses to NoCs, and briefly surveys related work. Subsequently, it presents details of the NoC building blocks: the switch, the network interface, and switch-to-switch links. In discussing the design guidelines, the chapter uses a case study of a real NoC architecture (Xpipes), which employs some of the most advanced concepts in NoC design. It also discusses the issue of heterogeneous NoC design and the effects of mapping the communication requirements of an application onto a domain-specific NoC. An authoritative discussion of the platform-based design (PBD) concept is provided in the chapter Platform-Based Design for Embedded Systems.
The chapter introduces PBD principles and outlines the interplay between micro-architecture platforms and the Application Program Interface (API), or programmer model, which is a unique abstract representation of the architecture platform via the software layer. The chapter also introduces three applications of PBD: network platforms for communication protocol design, fault-tolerant platforms for the design of safety-critical applications, and analog platforms for mixed-signal integrated circuit design. An approach to the synthesis of interface converters for incompatible protocols in component-based design automation is presented in Interface Specification and Converter Synthesis. The chapter surveys several approaches for synthesizing converters, illustrated by simple examples. It also introduces more advanced frameworks, based on abstract algebraic solutions, that guarantee converter correctness. The chapter Hardware/Software Interface Design for SoC presents a component-based design automation approach for MPSoC platforms. It briefly surveys basic concepts of MPSoC design and discusses some related platform- and component-based approaches. It provides a comprehensive overview of hardware/software IP integration issues, including bus-based and core-based approaches, integrating software IP, communication synthesis (the concept is presented in detail in Interface Specification and Converter Synthesis), and IP derivation. The focal point of the chapter is a new component-based design methodology and design environment for the integration of heterogeneous hardware and software IP components. The presented methodology, which adopts the automatic communication synthesis approach and uses a high-level API, generates both hardware and software wrappers, as well as a dedicated operating system for programmable components. The IP integration capabilities of the approach and accompanying software tools are illustrated by redesigning a part of a VDSL modem.
The chapter Design and Programming of Embedded Multiprocessors: An Interface-Centric Approach presents a design methodology for implementing media processing applications as MPSoCs centered around the Task Transaction Level (TTL) interface. The TTL interface can be used to build executable specifications; it also provides a platform interface for implementing applications as communicating hardware and software tasks on a platform infrastructure. The chapter introduces the TTL interface in the context of these requirements, and discusses mapping technology that supports structured design and programming of embedded multiprocessor systems. The chapter also presents two case studies of implementations of the TTL interface on different architectures: a multi-DSP
architecture, using an MP3 decoder application to evaluate the implementation; and a smart-imaging multiprocessor. The StepNP™ flexible MPSoC platform, developed by STMicroelectronics, and its key architectural components are described in A MultiProcessor SoC Platform and Tools for Communications Applications. The platform was developed with the aim of exploring tool and architectural issues in a range of high-speed communications applications, particularly the packet processing applications used in network infrastructure SoCs. Subsequently, the chapter reviews the MultiFlex modeling and analysis tools developed to support the StepNP platform. The MultiFlex environment supports two parallel programming models: a distributed system object component (DSOC) message passing model and a symmetrical multiprocessing (SMP) model using shared memory. It maps these models onto the StepNP MPSoC platform. The use of the platform and supporting environment is illustrated by two examples, mapping IPv4 packet forwarding and traffic management applications onto the StepNP platform. Detailed results are presented and discussed for a range of architectural parameters.
III Testing of Embedded Core-Based Integrated Circuits Ever-increasing circuit densities and operating frequencies, as well as the use of SoC designs, have resulted in enormous test data volumes for today's embedded core-based integrated circuits. According to the Semiconductor Industry Association, in the International Technology Roadmap for Semiconductors (ITRS), 2001 Edition, the density of ICs can reach 2 billion transistors per square centimeter, and 16 billion transistors per chip are likely by 2014. Based on that, according to some estimates (A. Khoche and J. Rivoir, "I/O bandwidth bottleneck for test: is it real?" Test Resource Partitioning Workshop, 2002), the test data volume for ICs in 2014 is likely to increase 150-fold relative to 1999. Other problems include the growing disparity between the performance of the design and that of the automatic test equipment, which makes at-speed testing, particularly of high-speed circuits, a challenge and results in increasing yield loss; the high cost of manually developed functional tests; and the growing cost of high-speed, high pin-count testers. This section contains two chapters introducing new techniques addressing some of the issues indicated above. The chapter Modular Testing and Built-In Self-Test of Embedded Cores in System-on-Chip Integrated Circuits presents a survey of techniques that have been proposed in the literature for reducing test time and test data volume. The techniques surveyed rely on modular testing of embedded cores and built-in self-test (BIST). The material on modular testing of embedded cores in a system-on-a-chip describes wrapper design and optimization, test access mechanism (TAM) design and optimization, test scheduling, integrated TAM optimization and test scheduling, and modular testing of mixed-signal SoCs.
In addition, the chapter reviews a recent deterministic BIST approach in which a reconfigurable interconnection network (RIN) is placed between the outputs of the linear-feedback shift register (LFSR) and the inputs of the scan chains in the circuit under test. The RIN, which consists only of multiplexer switches, replaces the phase shifter that is typically used in pseudo-random BIST to reduce correlation between the test data bits that are fed into the scan chains. The proposed approach does not require any circuit redesign and has minimal impact on circuit performance. Hardware-based self-testing (BIST) techniques have limitations due to performance, area, and design time overheads, as well as problems caused by the application of nonfunctional patterns (which may result in higher power consumption during testing, over-testing, yield loss problems, etc.). Embedded software-based self-testing has the potential to alleviate the problems caused by using external testers, as well as structural BIST problems. Embedded software-based self-testing utilizes on-chip programmable resources (such as embedded microprocessors and DSPs) for on-chip test generation, test delivery, signal acquisition, response analysis, and even diagnosis. The chapter Embedded Software-Based Self-Testing for SoC Design discusses processor self-test methods targeting stuck-at faults and delay faults; presents a brief description of a processor self-diagnosis method; presents methods for self-testing of buses and global
interconnects, as well as of other nonprogrammable IP cores on an SoC; describes instruction-level design-for-testability (DfT) methods based on the insertion of test instructions to increase fault coverage and reduce test application time and test program size; and outlines DSP-based self-testing of analog/mixed-signal components.
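The pseudo-random pattern source in the BIST schemes above, an LFSR feeding the scan chains, can be sketched in a few lines of Python (the 4-bit width and tap positions here are illustrative, not taken from the chapter):

```python
def lfsr_patterns(seed, taps, nbits, count):
    """Fibonacci-style LFSR: shift left, feeding back the XOR of the tapped bits."""
    state = seed
    out = []
    for _ in range(count):
        out.append(state)
        fb = 0
        for t in taps:
            fb ^= (state >> t) & 1          # XOR the tapped bit positions
        state = ((state << 1) | fb) & ((1 << nbits) - 1)
    return out

# 4-bit maximal-length LFSR for the polynomial x^4 + x^3 + 1 (taps at bits 3, 2):
seq = lfsr_patterns(seed=0b0001, taps=(3, 2), nbits=4, count=15)
assert len(set(seq)) == 15  # cycles through every nonzero 4-bit state
```

A maximal-length register of n bits visits all 2^n - 1 nonzero states before repeating, which is what makes such a small structure usable as a test-pattern generator.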
IV Networked Embedded Systems Networked embedded systems (NES) are essentially spatially distributed embedded nodes (implemented on a board or, in the future, on a single chip) interconnected by means of a wireline and/or wireless communication infrastructure and protocols, interacting with the environment (via sensor/actuator elements) and with each other, possibly with a master node performing control and coordination functions over computing and communication in order to achieve certain goals. An example of a networked embedded system is an in-vehicle embedded network comprising a collection of electronic control units (ECUs) networked by means of safety-critical communication protocols, such as FlexRay or TTP/C, for the purpose of controlling vehicle functions such as electronic engine control, the anti-lock braking system, active suspension, etc. (for details of automotive applications see the last section in the book). An excellent introduction to NES is presented in the chapter Design Issues in Networked Embedded Systems. This chapter outlines some of the most representative characteristics of NES and surveys potential applications. It also explains design issues for large-scale distributed NES, such as environment interaction, life expectancy of nodes, communication protocols, reconfigurability, security, energy constraints, operating systems, etc. Design methodologies and tools are discussed as well. The topic of middleware for NES is addressed in Middleware Design and Implementation for Networked Embedded Systems. This chapter discusses the role of middleware in NES and the challenges in its design and implementation, such as remote communication, location independence, reuse of the existing infrastructure, providing real-time assurances, providing a robust DOC middleware, reducing the middleware footprint, and support for simulation environments.
The focal points of the chapter are the sections describing the design and implementation of nORB (a small footprint real-time object request broker tailored to specific embedded sensor/actuator applications), and the rationale behind the adopted approach, namely to address the NES design and implementation challenges.
V Sensor Networks Distributed (wireless) sensor networks are a relatively new and exciting proposition for collecting sensory data in a variety of environments. The design of this kind of network poses a particular challenge due to limited computational power and memory, bandwidth restrictions, power consumption restrictions if battery-powered, communication requirements, and unattended operation in inaccessible and/or hostile environments, to name a few. This section provides a fairly comprehensive discussion of the design issues related to, in particular, self-organizing wireless networks. It introduces fundamental concepts behind sensor networks; discusses architectures, energy-efficient Medium Access Control (MAC), time synchronization, distributed localization, routing, distributed signal processing, and security; and surveys selected software solutions. A general introduction to the area of wireless sensor networks is provided in Introduction to Wireless Sensor Networks. A comprehensive overview of the topic is provided in Issues and Solutions in Wireless Sensor Networks, which introduces fundamental concepts, selected application areas, design challenges, and other relevant issues. The chapter Architectures for Wireless Sensor Networks provides an excellent introduction to various aspects of the architecture of wireless sensor networks. It includes a description of a sensor node architecture and its elements: sensor platform, processing unit, communication interface, and power source. In addition, it presents a mathematical model of power consumption by a node, to account for energy consumption by the radio, processor, and sensor elements. The chapter also discusses architectures
of wireless sensor networks developed following the protocol stack approach and the EYES project approach. In the context of the EYES project approach, which consists of only two key system abstraction layers, namely the sensor and networking layer and the distributed services layer, the chapter discusses the distributed services that are required to support applications for wireless sensor networks and the approaches adopted by various projects. Energy efficiency is one of the main issues in developing MAC protocols for wireless sensor networks. This is largely due to unattended operation and battery-based power supply, as well as the need for collaboration arising from the limited capabilities of individual nodes. Energy-Efficient Medium Access Control offers a comprehensive overview of the issues involved in the design of MAC protocols. It contains a discussion of MAC requirements for wireless sensor networks, such as hardware characteristics of the node, communication patterns, and others. It surveys 20 medium access protocols specially designed for sensor networks and optimized for energy efficiency. It also discusses the qualitative merits of different organizations: contention-based, slotted, and TDMA-based protocols. In addition, the chapter provides a simulation-based comparison of the performance and energy efficiency of four MAC protocols: Low Power Listening, S-MAC, T-MAC, and L-MAC. Knowledge of time at a sensor node may be essential for the correct operation of the system. The Time Division Multiple Access (TDMA) scheme (adopted in the TTP/C and FlexRay protocols, for instance; see the section on automotive applications) requires the nodes to be synchronized. The time synchronization issues in sensor networks are discussed in Overview of Time Synchronization Issues in Sensor Networks. The chapter introduces the basics of time synchronization for sensor networks.
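The pairwise synchronization underlying protocols of the NTP/TPSN family reduces to a two-way timestamp exchange between a sender and a receiver; a minimal sketch (the timestamp values below are hypothetical):

```python
def two_way_sync(t1, t2, t3, t4):
    """Classic two-way timestamp exchange (as used by NTP and TPSN).

    t1: request sent (sender clock)     t2: request received (receiver clock)
    t3: reply sent (receiver clock)     t4: reply received (sender clock)
    Returns (clock_offset, one_way_delay), assuming the link is symmetric.
    """
    offset = ((t2 - t1) - (t4 - t3)) / 2
    delay = ((t2 - t1) + (t4 - t3)) / 2
    return offset, delay

# Receiver clock runs 5 ms ahead of the sender; true one-way delay is 2 ms.
offset, delay = two_way_sync(t1=100.0, t2=107.0, t3=108.0, t4=105.0)
assert (offset, delay) == (5.0, 2.0)
```

The symmetric-delay assumption is exactly what the asymmetric delays discussed in the chapter violate, which is one motivation for receiver-to-receiver schemes such as RBS.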
It also describes design challenges and requirements in developing time synchronization protocols, such as the need to be robust, energy-aware, able to operate correctly in the absence of time servers (server-less), lightweight, and to offer a tunable service. The chapter also overviews factors influencing time synchronization, such as temperature, phase noise, frequency noise, asymmetric delays, and clock glitches. Subsequently, different types of timing techniques are discussed: the Network Time Protocol (NTP), the Timing-sync Protocol for Sensor Networks (TPSN), Reference-Broadcast Synchronization (RBS), and the Time-Diffusion Synchronization Protocol (TDP). Knowledge of the location of nodes is essential for the base station to process information from sensors and to arrive at valid and meaningful results. The localization issues in ad hoc wireless sensor networks are discussed in Distributed Localization Algorithms. The focus of this presentation is on three distributed localization algorithms for large-scale ad hoc sensor networks which meet the basic requirements for self-organization, robustness, and energy efficiency: ad hoc positioning by Niculescu and Nath, N-hop multilateration by Savvides et al., and robust positioning by Savarese et al. The selected algorithms are evaluated by simulation. In order to forward information from a sensor node to the base station or another node for processing, the node requires routing information. The chapter Routing in Sensor Networks provides a comprehensive survey of routing protocols used in sensor networks.
The presentation is divided into flat routing protocols: Sequential Assignment Routing (SAR), directed diffusion, the minimum cost forwarding approach, the Integer Linear Program (ILP)-based routing approach, Sensor Protocols for Information via Negotiation (SPIN), geographic routing protocols, the parametric probabilistic routing protocol, and Min-MinMax; and cluster-based routing protocols: Low Energy Adaptive Clustering Hierarchy (LEACH), the Threshold-sensitive Energy Efficient sensor Network protocol (TEEN), and a two-level clustering algorithm. Due to their limited resources, sensor nodes frequently provide incomplete information on the objects of their observation. Thus the complete information has to be reconstructed from data obtained from many nodes, which frequently provide redundant data. Distributed data fusion is one of the major challenges in sensor networks. The chapter Distributed Signal Processing in Sensor Networks introduces a novel mathematical model for distributed information fusion, which focuses on solving a benchmark signal processing problem (spectrum estimation) using sensor networks. With the deployment of sensor networks in areas such as the battlefield or the factory floor, security becomes of paramount importance, and a challenge. The existing solutions are impractical due to the limited capabilities (processing power, available memory, and available energy) of sensor nodes. The chapter
Sensor Network Security gives an introduction to selected security challenges in wireless sensor networks: denial of service and routing security, energy-efficient confidentiality and integrity, authenticated broadcast, alternative approaches to key management, and secure data aggregation. Subsequently, it discusses in detail some of the proposed approaches and solutions: the SNEP and µTESLA protocols for confidentiality and integrity of data, and the LEAP protocol and probabilistic key management for key management, to name a few. The chapter Software Development for Large-Scale Wireless Sensor Networks presents basic concepts related to software development for wireless sensor networks, as well as selected software solutions. The solutions include: TinyOS, a component-based operating system, and related software packages; Maté, a byte-code interpreter; and TinyDB, a query processing system for extracting information from a network of TinyOS sensor nodes. SensorWare, a software framework for wireless sensor networks, provides querying, dissemination, and fusion of sensor data, as well as coordination of actuators. MiLAN (Middleware Linking Applications and Networks), a middleware concept, aims to exploit the information redundancy provided by sensor nodes. EnviroTrack, a TinyOS-based application, provides a convenient way to program sensor network applications that track activities in their physical environment. SeNeTs, a middleware architecture for wireless sensor networks, is designed to support the pre-deployment phase. The chapter also discusses software solutions for the simulation, emulation, and testing of large-scale sensor networks: the TinyOS SIMulator (TOSSIM), a simulator based on the TinyOS framework; EmStar, a software environment for developing and deploying applications for sensor networks consisting of 32-bit embedded Microserver platforms; and SeNeTs, a test and validation environment.
VI Embedded Applications The last section in the book, Embedded Applications, focuses on selected applications of embedded systems. It covers the automotive field, industrial automation, and intelligent sensors. The aim of this section is to introduce examples of actual embedded applications in fast-evolving areas which, for various reasons, have not received proper coverage in other publications, particularly in the automotive area.
Automotive Networks The automotive industry is aggressively adopting mechatronic solutions to replace or duplicate existing mechanical/hydraulic systems. Embedded electronic systems, together with dedicated communication networks and protocols, play pivotal roles in this transition. This subsection contains three chapters that offer a comprehensive overview of the area by presenting topics such as networks and protocols, operating systems and other middleware, scheduling, safety and fault tolerance, and actual development tools used by the automotive industry. This section begins with a contribution entitled Design and Validation Process of In-Vehicle Embedded Electronic Systems, which provides a comprehensive introduction to the use of embedded systems in automobiles and their design and validation methods and tools. The chapter identifies and describes a number of specific application domains for in-vehicle embedded systems, such as power train, chassis, body, and telematics and human-machine interface (HMI). It then outlines some of the main standards used in the automotive industry to ensure interoperability between components developed by different vendors; this includes networks and protocols, as well as operating systems. The surveyed networks and protocols include (for details of networks and protocols see The Industrial Communication Technology Handbook, CRC Press, 2005, Richard Zurawski, editor) Controller Area Network (CAN), Vehicle Area Network (VAN), J1850, TTP/C (Time-Triggered Protocol), FlexRay, Local Interconnect Network (LIN), Media Oriented Systems Transport (MOST), and IDB-1394. This material is followed by a brief introduction to OSEK/VDX (Offene Systeme und deren Schnittstellen für die Elektronik im Kraftfahrzeug), a multitasking operating system that has become a standard for automotive applications in Europe. The chapter introduces a new language, EAST-ADL, which offers support for an unambiguous description of in-vehicle embedded electronic
systems at each level of their development. The discussion of the design and validation process and related issues is facilitated by a comprehensive case study drawn from an actual PSA Peugeot-Citroën application. This case study is essential reading for those interested in the development of this kind of embedded system. The planned adoption of X-by-wire technologies in automotive applications has pushed the automotive industry into the realm of safety-critical systems. There is a substantial body of literature on safety-critical issues and fault tolerance, particularly as applied to components and systems. Less has been published on the safety-relevant communication services and fault-tolerant communication systems mandated by X-by-wire technologies in automotive applications. This is largely due to the novelty of fast-evolving concepts and solutions, pursued mostly by industrial consortia. Those two topics are presented in detail in Fault-Tolerant Services for Safe In-Car Embedded Systems. The material on safety-relevant communication services discusses some of the main services and functionalities that the communication system should provide to facilitate the design of fault-tolerant automotive applications. This includes services supporting reliable communication, such as robustness against electromagnetic interference (EMI), time-triggered transmission, global time, atomic broadcast, and avoiding "babbling idiots." Also discussed are higher-level services that provide fault-tolerant mechanisms belonging conceptually to the layers above MAC in the OSI reference model, namely group membership service, management of node redundancy, support for functioning modes, etc. The chapter also discusses fault-tolerant communication protocols, including TTP/C, FlexRay, and variants of CAN (TTCAN, RedCAN, and CANcentrate).
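The CAN variants above all inherit CAN's collision resolution: bitwise arbitration in which a dominant 0 bit overrides a recessive 1 on the bus, so the frame with the numerically lowest identifier always wins without any frame being destroyed. A toy model in Python (the identifier values are arbitrary):

```python
def can_arbitrate(pending_ids, id_bits=11):
    """Model of CAN bitwise arbitration on 11-bit identifiers.

    Nodes transmit the identifier MSB-first; a node that sends recessive (1)
    but reads dominant (0) back from the wired-AND bus drops out immediately.
    """
    contenders = list(pending_ids)
    for bit in range(id_bits - 1, -1, -1):
        bus = min((i >> bit) & 1 for i in contenders)   # wired-AND: 0 dominates
        contenders = [i for i in contenders if (i >> bit) & 1 == bus]
    return contenders[0]    # survivor(s) share the lowest identifier

# The lowest identifier (highest priority) wins the bus:
assert can_arbitrate([0x65A, 0x123, 0x3FF]) == 0x123
```

This priority scheme is also the source of the "babbling idiot" hazard mentioned above: a faulty node that keeps transmitting a low identifier can starve every other node, which is why the time-triggered protocols add bus guardians.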
The Volcano concept for the design and implementation of in-vehicle networks using the standardized CAN and LIN communication protocols is presented in the chapter Volcano — Enabling Correctness by Design. This chapter provides an in-depth description of the Volcano approach and of a suite of software tools, developed by Volcano Communications Technologies AG, which supports requirements capture, model-based design, automatic code generation, and system-level validation capabilities. This is an example of an actual development environment widely used by the automotive industry.
Industrial Automation The current trend toward flexible and distributed control and automation has accelerated the migration of intelligence and control functions to field devices, particularly sensors and actuators. The increased processing capabilities of those devices were instrumental in the emergence of a trend toward networking of field devices around industrial data networks, thus making access to any device from any place in the plant, or even globally, technically feasible. The benefits are numerous, including increased flexibility, improved system performance, and ease of system installation, upgrade, and maintenance. Embedded web servers are increasingly used in industrial automation to provide a Human-Machine Interface (HMI), which allows for web-based configuration, control, and monitoring of devices and industrial processes. An introduction to the design of embedded web servers is presented in the chapter Embedded Web Servers in Distributed Control Systems. The focus of this chapter is on Field Device Web Servers (FDWS). The chapter provides a comprehensive overview of the context in which embedded web servers are usually implemented, as well as the structure of an FDWS application, presenting its component packages and the relationship between the content of the packages and the architecture of a typical embedded site. All this is discussed in the context of an actual FDWS implementation and application deployed at one of the Alstom (France) sites. Remote access to field devices may lead to many security challenges. Embedded web servers typically run on processors with limited memory and processing power. These restrictions necessitate the deployment of lightweight security mechanisms. Vendor-tailored versions of standard security protocol suites such as Secure Sockets Layer (SSL) and IP Security Protocol (IPSec) may still not be suitable due to their excessive demand for resources.
In applications restricted to the Hypertext Transfer Protocol (HTTP), Digest Access Authentication (DAA), which is a security extension to HTTP, offers an alternative and viable solution. Those issues are discussed in the chapter HTTP Digest Authentication for Embedded Web
Servers. This chapter overviews mechanisms and services, as well as potential applications of HTTP Digest Authentication. It also surveys selected embedded web server implementations for their support for DAA. This includes Apache 2.0.42, Allegro RomPager 4.05, and GoAhead 2.1.2.
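The computation at the core of DAA is modest enough for a constrained device: three MD5 hashes over short strings. A sketch of the base (no-qop) response calculation from RFC 2617, with illustrative credential and nonce values:

```python
import hashlib

def md5_hex(s: str) -> str:
    return hashlib.md5(s.encode()).hexdigest()

def digest_response(username, realm, password, method, uri, nonce):
    """HTTP Digest response, base flavor (RFC 2617 without the qop extension)."""
    ha1 = md5_hex(f"{username}:{realm}:{password}")   # the long-lived secret hash
    ha2 = md5_hex(f"{method}:{uri}")                  # fingerprint of the request
    return md5_hex(f"{ha1}:{nonce}:{ha2}")            # value the client sends back

# Illustrative values in the style of the RFC examples:
resp = digest_response("Mufasa", "testrealm@host.com", "Circle Of Life",
                       "GET", "/dir/index.html",
                       "dcd98b7102dd2f0e8b11d0f600bfb0c093")
```

The password never crosses the wire, and the server-chosen nonce limits replay, which is what makes this scheme attractive where a full SSL/IPSec stack does not fit.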
Intelligent Sensors The advances in the design of embedded systems, availability of tools, and falling fabrication costs allowed for cost-effective migration of the intelligence and control functions to the field devices, particularly sensors and actuators. Intelligent sensors combine computing, communication, and sensing functions. The trend for increased functional complexity of those devices necessitates the use of formal descriptive techniques and supporting tools throughout the design and implementation process. The chapter Intelligent Sensors: Analysis and Design tackles some of those issues. It reviews some of the main characteristics of the generic intelligent sensor formal model; subsequently, it discusses an implementation of the model using the CAP language, which was developed specifically for the design of intelligent sensors. A brief introduction to the language is also provided. The whole development process is illustrated by using an example of a simple distance measuring system comprising an ultrasonic transmitter and two receivers.
Locating Topics To assist readers with locating material, a complete table of contents is presented at the front of the book. Each chapter begins with its own table of contents. Two indexes are provided at the end of the book: the index of authors contributing to the book, together with the titles of their contributions, and a detailed subject index.
Richard Zurawski
Acknowledgments My gratitude goes to Luciano Lavagno, Grant Martin, and Alberto Sangiovanni-Vincentelli, who provided advice and support while I was preparing this book. This book would never have had a chance to take off without their assistance. Andreas Willig helped with identifying some of the authors for the section on Sensor Networks. I would also like to thank the members of the International Advisory Board for their help with the organization of the book and the selection of authors. I have received tremendous cooperation from all contributing authors, and I would like to thank all of them for that. I would like to express gratitude to my publisher, Nora Konopka, and the other Taylor and Francis staff involved in the production of the book, particularly Jessica Vakili, Elizabeth Spangenberger, and Gail Renard. My love goes to my wife, who tolerated the countless hours I spent preparing this book.
About the Editor

Dr. Richard Zurawski is president of ISA Group, San Francisco and Santa Clara, CA, involved in providing solutions to Fortune 1000 companies. Prior to that, he held various executive positions with San Francisco Bay Area companies. Dr. Zurawski is a cofounder of the Institute for Societal Automation, Santa Clara, a research and consulting organization.

Dr. Zurawski has close to thirty years of academic and industrial experience, including a regular professorial appointment at the Institute of Industrial Sciences, University of Tokyo, and a full-time R&D advisor position with Kawasaki Electric Corp., Tokyo. He provided consulting services to the Kawasaki Electric, Ricoh, and Toshiba Corporations, Japan, and participated in the 1990s in a number of Japanese Intelligent Manufacturing Systems programs.

Dr. Zurawski has served as editor at large for IEEE Transactions on Industrial Informatics and as associate editor for IEEE Transactions on Industrial Electronics; he also served as associate editor for Real-Time Systems: The International Journal of Time-Critical Computing Systems, Kluwer Academic Publishers. He was a guest editor of four special sections in IEEE Transactions on Industrial Electronics and a guest editor of a special issue of the Proceedings of the IEEE dedicated to industrial communication systems. In 1998, he was invited by IEEE Spectrum to contribute material on Java technology to "Technology 1999: Analysis and Forecast Issues." Dr. Zurawski is series editor for The Industrial Information Technology Series, Taylor & Francis Group, Boca Raton, FL.

Dr. Zurawski has served as a vice president of the Institute of Electrical and Electronics Engineers (IEEE) Industrial Electronics Society (IES), and was on the steering committee of the ASME/IEEE Journal of Microelectromechanical Systems. In 1996, he received the Anthony J. Hornfeck Service Award from the IEEE Industrial Electronics Society. He has served as a general, program, and track chair for a number of IEEE conferences and workshops, and has published extensively on various aspects of formal methods in the design of real-time, embedded, and industrial systems, MEMS, parallel and distributed programming and systems, as well as control and robotics. He is the editor of The Industrial Information Technology Handbook (2004) and The Industrial Communication Technology Handbook (2005), both published by Taylor & Francis Group.

Dr. Zurawski received his M.Sc. in informatics and automation from the University of Mining and Metallurgy, Krakow, Poland, and his Ph.D. in computer science from La Trobe University, Melbourne, Australia.
Contributors
Parham Aarabi, Department of Electrical and Computer Engineering, University of Toronto, Ontario, Canada
José L. Ayala, Dpto. Ingenieria Electronica, E.T.S.I. Telecomunicacion, Ciudad Universitaria s/n, Madrid, Spain
João Paulo Barros, Universidade Nova de Lisboa, Faculdade de Ciências e Tecnologia, Dep. Eng. Electrotécnica, Caparica, Portugal
Ali Alphan Bayazit, Princeton University, Princeton, New Jersey
Luca Benini, Dipartimento Elettronica Informatica Sistemistica, University of Bologna, Bologna, Italy
Essaid Bensoudane, Advanced System Technology, STMicroelectronics, Ontario, Canada
Davide Bertozzi, Dipartimento Elettronica Informatica Sistemistica, University of Bologna, Bologna, Italy
Jan Blumenthal, Institute of Applied Microelectronics and Computer Science, Dept. of Electrical Engineering and Information Technology, University of Rostock, Rostock, Germany
Gunnar Braun, CoWare Inc., Aachen, Germany
Giorgio C. Buttazzo, Dip. di Informatica e Sistemistica, University of Pavia, Pavia, Italy
Luca P. Carloni, EECS Department, University of California at Berkeley, Berkeley, California
Wander O. Cesário, SLS Group, TIMA Laboratory, Grenoble, France
Krishnendu Chakrabarty, Department of Electrical and Computer Engineering, Duke University, Durham, North Carolina
S. Chatterjea, Faculty of Electrical Engineering, Mathematics, and Computer Science, University of Twente, Enschede, The Netherlands
Kwang-Ting (Tim) Cheng, Department of Electrical and Computer Engineering, University of California, Santa Barbara, California
Ivan Cibrario Bertolotti, IEIIT — National Research Council, Turin, Italy
Anikó Costa, Universidade Nova de Lisboa, Faculdade de Ciências e Tecnologia, Dep. Eng. Electrotécnica, Caparica, Portugal
Mario Crevatin, Corporate Research, ABB Switzerland Ltd, Baden-Dattwil, Switzerland
Fernando De Bernardinis, EECS Department, University of California at Berkeley, Berkeley, California
Erwin de Kock, Philips Research, Eindhoven, The Netherlands
Giovanni De Micheli, Gates Computer Science, Stanford University, Stanford, California
Robert de Simone, INRIA, Sophia-Antipolis, France
Eric Dekneuvel, University of Nice Sophia Antipolis, Biot, France
S. Dulman, Faculty of Electrical Engineering, Mathematics, and Computer Science, University of Twente, Enschede, The Netherlands
Stephen A. Edwards, Department of Computer Science, Columbia University, New York, New York
Gerben Essink, Philips Research, Eindhoven, The Netherlands
A. G. Fragopoulos, Department of Electrical and Computer Engineering, University of Patras, Patras, Greece
Shashidhar Gandham, The Department of Computer Science, The University of Texas at Dallas, Richardson, Texas
Christopher Gill, Department of Computer Science and Engineering, Washington University, St. Louis, Missouri
Frank Golatowski, Institute of Applied Microelectronics and Computer Science, Dept. of Electrical Engineering and Information Technology, University of Rostock, Rostock, Germany
Luís Gomes, Universidade Nova de Lisboa, Faculdade de Ciências e Tecnologia, Dep. Eng. Electrotécnica, Caparica, Portugal
Aarti Gupta, NEC Laboratories America, Princeton, New Jersey
Rajesh Gupta, Department of Computer Science and Engineering, University of California at San Diego, San Diego, California
Sumit Gupta, Tallwood Venture Capital, Palo Alto, California
Marc Haase, Institute of Applied Microelectronics and Computer Science, Dept. of Electrical Engineering and Information Technology, University of Rostock, Rostock, Germany
Gertjan Halkes, Faculty of Electrical Engineering, Mathematics, and Computer Science, Delft University of Technology, Delft, The Netherlands
Matthias Handy, Institute of Applied Microelectronics and Computer Science, Dept. of Electrical Engineering and Information Technology, University of Rostock, Rostock, Germany
Hans Hansson, Department of Computer Science and Engineering, Mälardalen University, Västerås, Sweden
Øystein Haugen, Department of Informatics, University of Oslo, Oslo, Norway
P. Havinga, Faculty of Electrical Engineering, Mathematics, and Computer Science, University of Twente, Enschede, The Netherlands
Tomas Henriksson, Philips Research, Eindhoven, The Netherlands
Andreas Hoffmann, CoWare Inc., Aachen, Germany
T. Hoffmeijer, Faculty of Electrical Engineering, Mathematics, and Computer Science, University of Twente, Enschede, The Netherlands
J. Hurink, Faculty of Electrical Engineering, Mathematics, and Computer Science, University of Twente, Enschede, The Netherlands
Margarida F. Jacome, Department of Electrical and Computer Engineering, University of Texas at Austin, Austin, Texas
Omid S. Jahromi, Bioscrypt Inc., Markham, Ontario, Canada
Axel Jantsch, Department for Microelectronics and Information Technology, Royal Institute of Technology, Kista, Sweden
A. A. Jerraya, SLS Group, TIMA Laboratory, Grenoble, France
J. V. Kapitonova, Glushkov Institute of Cybernetics, National Academy of Science of Ukraine, Kiev, Ukraine
Alex Kondratyev, Cadence Berkeley Labs, Berkeley, California
Wido Kruijtzer, Philips Research, Eindhoven, The Netherlands
Koen Langendoen, Faculty of Electrical Engineering, Mathematics, and Computer Science, Delft University of Technology, Delft, The Netherlands
Michel Langevin, Advanced System Technology, STMicroelectronics, Ontario, Canada
Luciano Lavagno, Cadence Berkeley Laboratories, Berkeley, California; and Dipartimento di Elettronica, Politecnico di Torino, Italy
A. A. Letichevsky, Glushkov Institute of Cybernetics, National Academy of Science of Ukraine, Kiev, Ukraine
Marisa López-Vallejo, Dpto. Ingenieria Electronica, E.T.S.I. Telecomunicacion, Ciudad Universitaria s/n, Madrid, Spain
Damien Lyonnard, Advanced System Technology, STMicroelectronics, Ontario, Canada
Yogesh Mahajan, Princeton University, Princeton, New Jersey
Grant Martin, Tensilica Inc., Santa Clara, California
Birger Møller-Pedersen, Department of Informatics, University of Oslo, Oslo, Norway
Ravi Musunuri, The Department of Computer Science, The University of Texas at Dallas, Richardson, Texas
Nicolas Navet, Institut National Polytechnique de Lorraine, Nancy, France
Gabriela Nicolescu, Ecole Polytechnique de Montreal, Montreal, Quebec, Canada
Achim Nohl, CoWare Inc., Aachen, Germany
Mikael Nolin, Department of Computer Science and Engineering, Mälardalen University, Västerås, Sweden
Thomas Nolte, Department of Computer Science and Engineering, Mälardalen University, Västerås, Sweden
Claudio Passerone, Dipartimento di Elettronica, Politecnico di Torino, Turin, Italy
Roberto Passerone, Cadence Design Systems, Inc., Berkeley Cadence Labs, Berkeley, California
Hiren D. Patel, Electrical and Computer Engineering, Virginia Tech, Blacksburg, Virginia
Maulin D. Patel, The Department of Computer Science, The University of Texas at Dallas, Richardson, Texas
Pierre G. Paulin, Advanced System Technology, STMicroelectronics, Ontario, Canada
Chuck Pilkington, Advanced System Technology, STMicroelectronics, Ontario, Canada
Claudio Pinello, EECS Department, University of California at Berkeley, Berkeley, California
Dumitru Potop-Butucaru, IRISA, Rennes, France
Antal Rajnák, Advanced Engineering Labs, Volcano Communications Technologies AG, Tagerwilen, Switzerland
Anand Ramachandran, Department of Electrical and Computer Engineering, University of Texas at Austin, Austin, Texas
Niels Reijers, Faculty of Electrical Engineering, Mathematics, and Computer Science, Delft University of Technology, Delft, The Netherlands
Alberto L. Sangiovanni-Vincentelli, EECS Department, University of California at Berkeley, Berkeley, California
Udit Saxena, Microsoft Corporation, Seattle, Washington
Guenter Schaefer, Institute of Telecommunication Systems, Technische Universität Berlin, Berlin, Germany
D. N. Serpanos, Department of Electrical and Computer Engineering, University of Patras, Patras, Greece
Marco Sgroi, EECS Department, University of California at Berkeley, Berkeley, California
Sandeep K. Shukla, Electrical and Computer Engineering, Virginia Tech, Blacksburg, Virginia
Françoise Simonot-Lion, Institut National Polytechnique de Lorraine, Nancy, France
YeQiong Song, Université Henri Poincaré, Nancy, France
Weilian Su, Broadband and Wireless Networking Laboratory, School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, Georgia
Venkita Subramonian, Department of Computer Science and Engineering, Washington University, St. Louis, Missouri
Jacek Szymanski, ALSTOM Transport Centre, Meudon-la-Forêt, France
Jean-Pierre Talpin, IRISA, Rennes, France
Lothar Thiele, Department Information Technology and Electrical Engineering, Computer Engineering and Networks Laboratory, Swiss Federal Institute of Technology, Zurich, Switzerland
Pieter van der Wolf, Philips Research, Eindhoven, The Netherlands
V. A. Volkov, Glushkov Institute of Cybernetics, National Academy of Science of Ukraine, Kiev, Ukraine
Thomas P. von Hoff, Corporate Research, ABB Switzerland Ltd, Baden-Dattwil, Switzerland
A. G. Voyiatzis, Department of Electrical and Computer Engineering, University of Patras, Patras, Greece
Flávio R. Wagner, UFRGS — Instituto de Informática, Porto Alegre, Brazil
Ernesto Wandeler, Department Information Technology and Electrical Engineering, Computer Engineering and Networks Laboratory, Swiss Federal Institute of Technology, Zurich, Switzerland
Yosinori Watanabe, Cadence Berkeley Labs, Berkeley, California
Thomas Weigert, Global Software Group, Motorola, Schaumburg, Illinois
Reinhard Wilhelm, University of Saarland, Saarbruecken, Germany
Richard Zurawski, ISA Group, San Francisco, California
Contents

SECTION I Embedded Systems

Real-Time and Embedded Systems
1 Embedded Systems: Toward Networking of Embedded Systems, Luciano Lavagno and Richard Zurawski . . . 1-1
2 Real-Time in Embedded Systems, Hans Hansson, Mikael Nolin, and Thomas Nolte . . . 2-1

Design and Validation of Embedded Systems
3 Design of Embedded Systems, Luciano Lavagno and Claudio Passerone . . . 3-1
4 Models of Embedded Computation, Axel Jantsch . . . 4-1
5 Modeling Formalisms for Embedded System Design, Luís Gomes, João Paulo Barros, and Anikó Costa . . . 5-1
6 System Validation, J.V. Kapitonova, A.A. Letichevsky, V.A. Volkov, and Thomas Weigert . . . 6-1

Design and Verification Languages
7 Languages for Embedded Systems, Stephen A. Edwards . . . 7-1
8 The Synchronous Hypothesis and Synchronous Languages, Dumitru Potop-Butucaru, Robert de Simone, and Jean-Pierre Talpin . . . 8-1
9 Introduction to UML and the Modeling of Embedded Systems, Øystein Haugen, Birger Møller-Pedersen, and Thomas Weigert . . . 9-1
10 Verification Languages, Aarti Gupta, Ali Alphan Bayazit, and Yogesh Mahajan . . . 10-1

Operating Systems and Quasi-Static Scheduling
11 Real-Time Embedded Operating Systems: Standards and Perspectives, Ivan Cibrario Bertolotti . . . 11-1
12 Real-Time Operating Systems: The Scheduling and Resource Management Aspects, Giorgio C. Buttazzo . . . 12-1
13 Quasi-Static Scheduling of Concurrent Specifications, Alex Kondratyev, Luciano Lavagno, Claudio Passerone, and Yosinori Watanabe . . . 13-1

Timing and Performance Analysis
14 Determining Bounds on Execution Times, Reinhard Wilhelm . . . 14-1
15 Performance Analysis of Distributed Embedded Systems, Lothar Thiele and Ernesto Wandeler . . . 15-1

Power Aware Computing
16 Power Aware Embedded Computing, Margarida F. Jacome and Anand Ramachandran . . . 16-1

Security in Embedded Systems
17 Design Issues in Secure Embedded Systems, A.G. Voyiatzis, A.G. Fragopoulos, and D.N. Serpanos . . . 17-1

SECTION II System-on-Chip Design
18 System-on-Chip and Network-on-Chip Design, Grant Martin . . . 18-1
19 A Novel Methodology for the Design of Application-Specific Instruction-Set Processors, Andreas Hoffmann, Achim Nohl, and Gunnar Braun . . . 19-1
20 State-of-the-Art SoC Communication Architectures, José L. Ayala, Marisa López-Vallejo, Davide Bertozzi, and Luca Benini . . . 20-1
21 Network-on-Chip Design for Gigascale Systems-on-Chip, Davide Bertozzi, Luca Benini, and Giovanni De Micheli . . . 21-1
22 Platform-Based Design for Embedded Systems, Luca P. Carloni, Fernando De Bernardinis, Claudio Pinello, Alberto L. Sangiovanni-Vincentelli, and Marco Sgroi . . . 22-1
23 Interface Specification and Converter Synthesis, Roberto Passerone . . . 23-1
24 Hardware/Software Interface Design for SoC, Wander O. Cesário, Flávio R. Wagner, and A.A. Jerraya . . . 24-1
25 Design and Programming of Embedded Multiprocessors: An Interface-Centric Approach, Pieter van der Wolf, Erwin de Kock, Tomas Henriksson, Wido Kruijtzer, and Gerben Essink . . . 25-1
26 A Multiprocessor SoC Platform and Tools for Communications Applications, Pierre G. Paulin, Chuck Pilkington, Michel Langevin, Essaid Bensoudane, Damien Lyonnard, and Gabriela Nicolescu . . . 26-1

SECTION III Testing of Embedded Core-Based Integrated Circuits
27 Modular Testing and Built-In Self-Test of Embedded Cores in System-on-Chip Integrated Circuits, Krishnendu Chakrabarty . . . 27-1
28 Embedded Software-Based Self-Testing for SoC Design, Kwang-Ting (Tim) Cheng . . . 28-1

SECTION IV Networked Embedded Systems
29 Design Issues for Networked Embedded Systems, Sumit Gupta, Hiren D. Patel, Sandeep K. Shukla, and Rajesh Gupta . . . 29-1
30 Middleware Design and Implementation for Networked Embedded Systems, Venkita Subramonian and Christopher Gill . . . 30-1

SECTION V Sensor Networks
31 Introduction to Wireless Sensor Networks, S. Dulman, S. Chatterjea, and P. Havinga . . . 31-1
32 Issues and Solutions in Wireless Sensor Networks, Ravi Musunuri, Shashidhar Gandham, and Maulin D. Patel . . . 32-1
33 Architectures for Wireless Sensor Networks, S. Dulman, S. Chatterjea, T. Hoffmeijer, P. Havinga, and J. Hurink . . . 33-1
34 Energy-Efficient Medium Access Control, Koen Langendoen and Gertjan Halkes . . . 34-1
35 Overview of Time Synchronization Issues in Sensor Networks, Weilian Su . . . 35-1
36 Distributed Localization Algorithms, Koen Langendoen and Niels Reijers . . . 36-1
37 Routing in Sensor Networks, Shashidhar Gandham, Ravi Musunuri, and Udit Saxena . . . 37-1
38 Distributed Signal Processing in Sensor Networks, Omid S. Jahromi and Parham Aarabi . . . 38-1
39 Sensor Network Security, Guenter Schaefer . . . 39-1
40 Software Development for Large-Scale Wireless Sensor Networks, Jan Blumenthal, Frank Golatowski, Marc Haase, and Matthias Handy . . . 40-1

SECTION VI Embedded Applications

Automotive Networks
41 Design and Validation Process of In-Vehicle Embedded Electronic Systems, Françoise Simonot-Lion and YeQiong Song . . . 41-1
42 Fault-Tolerant Services for Safe In-Car Embedded Systems, Nicolas Navet and Françoise Simonot-Lion . . . 42-1
43 Volcano — Enabling Correctness by Design, Antal Rajnák . . . 43-1

Industrial Automation
44 Embedded Web Servers in Distributed Control Systems, Jacek Szymanski . . . 44-1
45 HTTP Digest Authentication for Embedded Web Servers, Mario Crevatin and Thomas P. von Hoff . . . 45-1

Intelligent Sensors
46 Intelligent Sensors: Analysis and Design, Eric Dekneuvel . . . 46-1
I Embedded Systems
Real-Time and Embedded Systems

1 Embedded Systems: Toward Networking of Embedded Systems, Luciano Lavagno and Richard Zurawski
2 Real-Time in Embedded Systems, Hans Hansson, Mikael Nolin, and Thomas Nolte
1
Embedded Systems: Toward Networking of Embedded Systems

Luciano Lavagno
Cadence Berkeley Laboratories and Politecnico di Torino

Richard Zurawski
ISA Group

1.1 Networking of Embedded Systems . . . 1-1
1.2 Design Methods for Networked Embedded Systems . . . 1-3
1.3 Networked Embedded Systems . . . 1-5
    Networked Embedded Systems in Industrial Automation • Networked Embedded Systems in Building Automation • Automotive Networked Embedded Systems • Sensor Networks
1.4 Concluding Remarks . . . 1-14
References . . . 1-14
1.1 Networking of Embedded Systems

The last two decades have witnessed a remarkable evolution of embedded systems: from systems assembled from discrete components on printed circuit boards (as many still are) to systems assembled from Intellectual Property (IP) components "dropped" onto the silicon of a system on a chip. Systems on a chip offer the potential for embedding complex functionalities and for meeting the demanding performance requirements of applications such as DSPs, network processors, and multimedia processors. Another phase in this evolution, already in progress, is the emergence of distributed embedded systems, frequently termed networked embedded systems, where the word "networked" signifies the importance of the networking infrastructure and communication protocol. A networked embedded system is a collection of spatially and functionally distributed embedded nodes, interconnected by means of a wireline or wireless communication infrastructure and protocols, interacting with the environment (via sensor/actuator elements) and with each other, and possibly with a master node performing some control and coordination functions, so as to coordinate computing and communication and achieve certain goals. Networked embedded systems appear in a variety of application domains, such as automotive, train, aircraft, office building, and industrial, primarily for monitoring and control, as well as for environment monitoring and, in the future, environment control. There have been various reasons for the emergence of networked embedded systems, influenced largely by their application domains. The benefit of using distributed systems, and the evolutionary need to replace point-to-point wiring connections in these systems by a single bus, are among the most important.
Advances in the design of embedded systems, the availability of tools, and the falling fabrication costs of semiconductor devices and systems have allowed intelligence to be infused into field devices such as sensors and actuators. The controllers used with these devices typically provide on-chip signal conversion, data processing, and communication functions. The increased functionality, processing, and communication capabilities of controllers have been largely instrumental in the emergence of a widespread trend toward networking of field devices around specialized networks, frequently referred to as field area networks.

Field area networks, or fieldbuses [1] (a fieldbus is, in general, a digital, two-way, multi-drop communication link), are networks connecting field devices such as sensors and actuators with field controllers (for instance, Programmable Logic Controllers [PLCs] in industrial automation, or Electronic Control Units [ECUs] in automotive applications), as well as man-machine interfaces, for instance, dashboard displays in cars. The benefits of using these specialized networks are numerous, including increased flexibility attained through the combination of embedded hardware and software, improved system performance, and ease of system installation, upgrade, and maintenance. In automotive and aircraft applications, for instance, they allow mechanical, hydraulic, and pneumatic systems to be replaced by mechatronic systems, in which mechanical or hydraulic components are typically confined to the end-effectors.

Unlike Local Area Networks (LANs), field area networks, due to the nature of the communication requirements imposed by applications, tend to have low data rates and small data packets, and typically require real-time capabilities that mandate determinism of data transfer.
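To make the notion of a fieldbus data transfer concrete, the following sketch wraps an application payload directly in a minimal data-link frame, reflecting the collapsed protocol stacks typical of field area networks. The frame fields here (start delimiter, address, length, 8-bit checksum) are purely hypothetical and follow no particular fieldbus standard.

```python
# Minimal sketch of a collapsed fieldbus stack: the application layer hands
# its payload directly to the data-link layer, with no network/transport
# layers in between.  The frame layout is illustrative, not standardized.

def dl_frame(dst_addr: int, payload: bytes) -> bytes:
    """Wrap an application-layer payload in a minimal data-link frame."""
    header = bytes([0x7E, dst_addr, len(payload)])   # start, address, length
    checksum = sum(header[1:] + payload) & 0xFF      # 8-bit sum, illustrative only
    return header + payload + bytes([checksum])

def dl_unframe(frame: bytes) -> tuple[int, bytes]:
    """Validate and strip the frame; return (address, payload)."""
    if frame[0] != 0x7E:
        raise ValueError("bad start delimiter")
    addr, length = frame[1], frame[2]
    payload = frame[3:3 + length]
    if (sum(frame[1:3 + length]) & 0xFF) != frame[-1]:
        raise ValueError("checksum mismatch")
    return addr, payload

# Application layer: a small sensor reading sent to the node at address 5.
frame = dl_frame(5, b"\x01\x9a")
assert dl_unframe(frame) == (5, b"\x01\x9a")
```

The point of the sketch is the small frame size and short code path: with no routing or end-to-end layers, a field device can frame and transmit a reading in a few instructions, which is what makes deterministic transfer times achievable.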
However, data rates above 10 Mbit/sec, typical of LANs, have already become commonplace in field area networks. These specialized networks support various communication media, such as twisted-pair cables, fiber optic channels, power line communication, radio frequency channels, and infrared connections. Based on the physical media employed, they can, in general, be divided into three main groups: wireline-based networks, using media such as twisted-pair cables, fiber optic channels (in hazardous environments such as chemical and petrochemical plants), and power lines (in building automation); wireless networks, supporting radio frequency channels and infrared connections; and hybrid networks composed of wireline and wireless segments. Although the use of wireline-based field area networks is dominant, wireless technology offers a range of incentives in a number of application areas. In industrial automation, for instance, wireless device (sensor/actuator) networks can provide support for the mobile operation required by mobile robots, and for monitoring and control of equipment in hazardous and difficult-to-access environments. In a wireless sensor/actuator network, stations may interact with each other on a peer-to-peer basis and with a base station. The base station may have its transceiver attached to the cable of a (wireline) field area network, giving rise to a hybrid wireless–wireline system [2]. A separate category is wireless sensor networks, mainly envisaged for monitoring purposes, which are discussed in detail in this book.

The variety of application domains imposes different functional and nonfunctional requirements on the operation of networked embedded systems. Most of them are required to operate in a reactive way; for instance, systems used for control purposes.
With that comes the requirement for real-time operation, in which systems must respond within a predefined period of time, mandated by the dynamics of the process under control. A response, in general, may be periodic, to control a specific physical quantity by regulating dedicated end-effector(s); aperiodic, arising from unscheduled events such as an out-of-bounds state of a physical parameter or any other kind of abnormal condition; or sporadic, with no period but with a known minimum time between consecutive occurrences. Broadly speaking, systems that can tolerate a delay in response are called soft real-time systems; in contrast, hard real-time systems require deterministic responses to avoid changes in the system dynamics that may have a negative impact on the process under control and, as a result, may lead to economic losses or cause injury to human operators. Representative examples of systems imposing hard real-time requirements on their operation are fly-by-wire in aircraft control and steer-by-wire in automotive applications. The need to guarantee a deterministic response mandates the use of appropriate scheduling schemes, which are frequently implemented in application-domain-specific real-time operating systems or custom-designed "bare-bone" real-time executives. Most of these issues (real-time scheduling and real-time operating systems) are discussed in a number of chapters in this book.

Networked embedded systems used in safety-critical applications such as fly-by-wire and steer-by-wire require a high level of dependability, to ensure that a system failure does not lead to a state in which human life, property, or the environment is endangered. The dependability issue is critical for technology deployment; various solutions are discussed in this chapter in the context of automotive applications. One of the main bottlenecks in the development of safety-critical systems is the software development process; this issue is briefly discussed in this chapter in the context of the automotive application domain.

As opposed to applications mandating hard real-time operation, such as the majority of industrial automation controls or safety-critical automotive control applications, building automation control systems seldom need hard real-time communication; their timing requirements are much more relaxed. Building automation systems tend to have a hierarchical network structure and typically implement all seven layers of the ISO/OSI reference model [3]. In field area networks employed in industrial automation, by contrast, there is little need for routing functionality and end-to-end control; therefore, typically only layers 1 (physical layer), 2 (data link layer, implicitly including the medium access control layer), and 7 (application layer, which also covers the user layer) are used.

This diversity of requirements imposed by different application domains (soft/hard real-time, safety criticality, network topology, etc.) has necessitated different solutions and protocols based on different operating principles, resulting in a plethora of networks developed for different application domains.
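The scheduling schemes mentioned above must be backed by a schedulability argument before a hard real-time guarantee can be given. As a self-contained illustration (the task set below is invented, not taken from any system described in this handbook), the classic Liu and Layland utilization bound gives a sufficient condition for rate-monotonic fixed-priority scheduling of periodic tasks; sporadic tasks can be folded in pessimistically by treating their minimum inter-arrival time as a period.

```python
# Rate-monotonic schedulability test (Liu & Layland, 1973) for periodic
# hard real-time tasks.  The bound is sufficient but not necessary: a task
# set above the bound may still be schedulable, but then a more precise
# analysis (e.g., response-time analysis) is needed.

def rm_schedulable(tasks):
    """tasks: list of (wcet, period) pairs in the same time unit.
    Returns True if total utilization is within the rate-monotonic bound."""
    n = len(tasks)
    utilization = sum(c / t for c, t in tasks)
    bound = n * (2 ** (1.0 / n) - 1)   # tends to ln 2 ~ 0.693 as n grows
    return utilization <= bound

# Example: a 10 ms control loop, a 50 ms monitoring task, and a sporadic
# alarm handler with a 100 ms minimum inter-arrival time.
tasks = [(2, 10), (5, 50), (10, 100)]
print(rm_schedulable(tasks))  # utilization 0.4 vs. bound ~0.78
```

Here the three tasks use 40% of the processor against a bound of roughly 78% for three tasks, so the set is guaranteed schedulable under rate-monotonic priorities.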
Some of these networks are overviewed in one of the subsequent sections. With the growing trend toward networking of embedded systems, and toward internetworking them with LANs, Wide Area Networks (WANs), and the Internet (for instance, there is a growing demand for remote access to process data on the factory floor), many of these systems may become exposed to potential security attacks, which may compromise their integrity and cause damage as a result. The limited resources of embedded nodes pose a considerable challenge for the implementation of effective security policies, which, in general, are resource demanding. These restrictions necessitate the deployment of lightweight security mechanisms. Vendor-tailored versions of standard security protocol suites, such as Secure Sockets Layer (SSL) and IP Security Protocol (IPSec), may still not be suitable due to their excessive demand for resources. Potential security solutions for such systems depend heavily on the specific device or system to be protected, the application domain, and the extent and architecture of internetworking. (The details of potential security measures are presented in two separate chapters in this book.)
1.2 Design Methods for Networked Embedded Systems

Design methods for networked embedded systems fall into the general category of system-level design. They involve two separate aspects, which are discussed briefly below. The first is network architecture design, in which communication protocols, interfaces, drivers, and computation nodes are selected and assembled. The second is system-on-chip design, in which the best hardware/software partition is selected, and an existing platform is customized, or a new chip is created, for the implementation of a computation or communication node. The two aspects share several similarities, but so far they have generally been addressed using ad hoc methodologies and tools, since attempts to create a unified electronic system-level design methodology have so far failed. When one considers the complete networked system, including several digital and analog parts, many more trade-offs can be exploited at the global level. However, this also means that the interaction between the digital portion of the design activity and the rest is much more complicated, especially in terms of the tools, formats, and standards with which one must interoperate and interface. In the case of network architecture design, tools such as OPNET and NS are used to identify communication bottlenecks, investigate the effect of parameters such as channel bit error rate, and analyze the impact of the choice of coding, medium access, and error correction mechanisms on overall system performance.
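The effect of the channel bit error rate on packet delivery, one of the parameters such tools sweep, can already be estimated with a back-of-the-envelope model before any detailed simulation: assuming independent bit errors and no error correction, a packet of n bits is corrupted with probability 1 - (1 - BER)^n. The frame sizes below are illustrative only.

```python
# Packet error rate under an independent-bit-error channel model with no
# forward error correction; a quick sanity check before detailed simulation.

def packet_error_rate(ber: float, bits: int) -> float:
    """Probability that at least one of `bits` bits is corrupted."""
    return 1.0 - (1.0 - ber) ** bits

# Illustrative frame sizes: a short fieldbus frame, a mid-size frame,
# and a full-size Ethernet frame, all at a bit error rate of 1e-4.
for bits in (64, 512, 12000):
    print(bits, round(packet_error_rate(1e-4, bits), 4))
```

The model makes the qualitative point behind small fieldbus packets: at the same bit error rate, a 64-bit frame is lost well under 1% of the time, while a full-size LAN frame is lost the majority of the time, so short frames buy both determinism and robustness.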
For wireless networks, tools such as Matlab and Simulink are also used in order to analyze the impact of detailed channel models, thanks to their ability to model digital and analog components, as well as physical elements, at a high level of abstraction. In all cases, the analysis is essentially functional; that is, it accounts for effects such as power consumption, computation time, and cost only in a very limited manner. This is the main limitation that will need to be addressed in the future if one wants to model and design, in an optimal manner, low-power networked embedded systems such as those envisioned for wireless sensor network applications.

At the system-on-chip architecture level, the first decision to be made is whether to use a platform instance or to design an Application-Specific Integrated Circuit (ASIC) from scratch. The first option builds on the availability of large IP libraries (processors, memories, and peripherals) from major silicon vendors. These IP libraries are guaranteed to work together and hence constitute what is termed a platform. A platform is a set of components, together with usage rules that ensure their correct and seamless interoperation. Platforms are used to speed up time-to-market by enabling the rapid implementation of complex architectures. Processors (and the software executing on them) provide the flexibility to adapt to different applications and customizations (e.g., localization and adherence to regional standards), while hardware IPs provide efficient implementations of commonly used functions. Configurable processors can be adapted to the requirements of specific applications via instruction extensions, and offer considerable performance and power advantages over fixed instruction-set architectures. Thus, a platform is a single abstract model that hides the details of a set of different possible implementations as clusters of lower-level components.
The platform, for example, a family of microprocessors, peripherals, and bus protocols, allows developers of application designs to operate without detailed knowledge of the implementation (e.g., the pipelining of the processor or the internal implementation of the UART). At the same time, it allows platform implementors to share design and fabrication costs among a broad range of potential users, broader than if each design were a one-of-a-kind type. Design methods that exploit the notion of platform generally start from a functional specification, which is then mapped onto an architecture (a platform instance) in order to derive performance information and explore the design space. Full exploitation of the notion of platform results in better reuse, by decoupling independent aspects that would otherwise tie, for example, a given functional specification to low-level implementation details. The guiding principle of separation of concerns distinguishes between:

1. Computation and communication. This separation is important because refinement of computation is generally done by hand, or by compilation and scheduling, while communication makes use of patterns.
2. Application and platform implementation, because they are often defined and designed independently by different groups or companies.
3. Behavior and performance, which should be kept separate because performance information can represent either nonfunctional requirements (e.g., maximum response time of an embedded controller) or the result of an implementation choice (e.g., the worst-case execution time of a task).

Nonfunctional constraint verification can be performed traditionally, by simulation and prototyping, or with static formal checks, such as schedulability analysis. Tool support for system-on-chip architectural design is, so far, mostly limited to simulation and interface generation.
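To make the schedulability-analysis idea concrete, the sketch below applies classical rate-monotonic response-time analysis to a hypothetical task set; the task parameters and the helper name `rm_response_time_ok` are illustrative, not taken from any particular tool.

```python
# Sketch of a static schedulability check for a hypothetical task set.
# Each task is (C, T): worst-case execution time and period; deadlines equal
# periods and priorities are rate-monotonic (shorter period = higher priority).
import math

def rm_response_time_ok(tasks):
    """Exact response-time analysis: fixed-point iteration per task."""
    tasks = sorted(tasks, key=lambda ct: ct[1])   # rate-monotonic priority order
    for i, (c_i, t_i) in enumerate(tasks):
        r = c_i
        while True:
            # Interference from all higher-priority tasks released before r.
            interference = sum(math.ceil(r / t_j) * c_j for c_j, t_j in tasks[:i])
            r_next = c_i + interference
            if r_next > t_i:
                return False                      # deadline miss
            if r_next == r:
                break                             # fixed point reached
            r = r_next
    return True

# Hypothetical ECU task set, times in milliseconds.
print(rm_response_time_ok([(1, 4), (2, 6), (3, 12)]))  # True: all deadlines met
```

Note that this task set has a utilization above the Liu–Layland bound, yet the exact analysis still proves it schedulable, which is precisely why static formal checks are valuable.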
The first category includes tools such as NC-SystemC from Cadence, ConvergenSC from CoWare, and SystemStudio from Synopsys. Simulators at the system-on-chip level provide abstractions for the main architectural components (processors, memories, busses, and hardware blocks) and permit quick instantiation of complete platform instances from template skeletons. Interface synthesis can take various forms, from the automated instantiation of templates offered by N2C from CoWare, to the automated consistent file generation for software and hardware offered by Beach Solutions. A key aspect of design problems in this space is compatibility with respect to specifications at the interface level (bus and networking standards), the instruction-set architecture level, and the Application Programming Interface (API) level. Assertion-based verification techniques can be used to ease the problem of verifying compliance with a digital protocol standard (e.g., for a bus).
Toward Networking of Embedded Systems
Let us consider an example of a design flow in the automotive domain, which can be considered a paradigm of any networked embedded system. Automotive electronic design starts, usually 5 to 10 years before the actual introduction of a product, when a car manufacturer defines the specifications for its future line of vehicles. It is now an accepted practice to use the notion of platform in this domain as well, so that the electronic portion (as well as the mechanical one, which is outside the scope of this discussion) is modularized and componentized, enabling sharing across different models. An ECU generally includes a microcontroller (8, 16, or 32 bits), memory (SRAM, DRAM, and Flash), some ASIC or FPGA for interfacing, one or more in-vehicle network interfaces (e.g., CAN [Controller Area Network] or FlexRay), and several sensor and actuator interfaces (analog/digital and digital/analog converters, pulse-width modulators, power transistors, display drivers, and so on). The system-level design activity is performed by a relatively small team of architects, who know the domain well (mechanics, electronics, and business), define the specifications for the electronic component suppliers, and interface with the teams that specify the mechanical portions (body and engine). These teams essentially use past experience to perform their job, and currently have serious problems forecasting the state of electronics ten years in advance. Control algorithms are defined in the next design phase, when the first engine models (generally described using Simulink, Matlab, and Stateflow) become available, as a specification for both the electronic design and the engine design. An important aspect of the overall flow is that these models are not frozen until much later, and hence both algorithm design and (often) ECU software design must cope with their changes.
Another characteristic is that they are parametric models, sometimes reused across multiple engine generations and classes, whose exact parameter values will be determined only when prototypes or actual products are available. Thus, control algorithms must consider both the allowable ranges and combinations of values for these parameters, and the capability to measure their values, directly or indirectly, from the behavior of the engine and vehicle. Finally, algorithms are often distributed over a network of cooperating ECUs, and thus deadlines and constraints generally span a number of electronic modules. While control design progresses, ECU hardware design can start, because rough computational and memory requirements, as well as interfacing standards, sensors, and actuators, are already known. At the end of both control design and hardware design, software implementation can start. As mentioned earlier, most of the software running on modern ECUs is automatically generated (model-based design). In the hardware implementation phase, the electronic subsystem supplier can use off-the-shelf components (such as memories), Application Specific Standard Products (ASSPs) (such as microcontrollers and standard bus interfaces), and even ASICs and FPGAs (typically for sensor and actuator signal conditioning and conversion). The final phase, called system integration, is again generally performed by the car manufacturer. It can be an extremely lengthy and expensive phase, because it requires the use of expensive detailed models of the controlled system (e.g., the engine, modeled with DSP-based multiprocessors) or even of actual car prototypes. The goal of integration is to ensure smooth subsystem communication (e.g., checking that there are no duplicate module identifiers and that there is enough bandwidth on every in-vehicle bus).
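Two of the integration checks just mentioned (identifier uniqueness and bus bandwidth budgeting) lend themselves to a simple automated sketch; the message table, frame sizes, and bit rate below are invented for illustration.

```python
# Sketch of two automated integration checks: uniqueness of message
# identifiers and bus bandwidth budgeting. All numbers are illustrative.

def check_unique_ids(messages):
    """Return the set of duplicated message identifiers (empty if none)."""
    seen, dupes = set(), set()
    for msg_id, _, _ in messages:
        (dupes if msg_id in seen else seen).add(msg_id)
    return dupes

def bus_utilization(messages, bit_rate):
    """Fraction of bus bandwidth consumed: sum of frame_bits / period."""
    return sum(bits / period for _, bits, period in messages) / bit_rate

# (identifier, frame size in bits including overhead, period in seconds)
msgs = [(0x100, 130, 0.010), (0x200, 130, 0.020), (0x300, 130, 0.100)]
print(check_unique_ids(msgs))                   # set(): no duplicate identifiers
print(f"{bus_utilization(msgs, 500_000):.1%}")  # utilization on a 500 kbit/s bus
```

In practice these checks run over the complete communication matrix of the vehicle, but the principle is the same: every identifier must be unique per bus, and the worst-case load must stay below the bus capacity.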
Simulation support in this domain is provided by companies such as Vast and Axys (now part of ARM), who sell both fast instruction-set simulators for the most commonly used processors in the networked embedded system domain, and network simulation models exploiting either proprietary simulation engines, for example, in the case of Virtio, or standard simulators (HDL [Hardware Description Language] or SystemC).
1.3 Networked Embedded Systems

1.3.1 Networked Embedded Systems in Industrial Automation

Although the origins of field area networks can be traced back as far as the end of the 1960s in the nuclear instrumentation domain (the CAMAC network [4]) and the beginning of the 1970s in avionics and aerospace
applications (the MIL-STD-1553 bus [5]), it was the industrial automation area that brought the main thrust of developments. The need for integration of heterogeneous systems, difficult at that time due to the lack of standards, resulted in two major initiatives which have had a lasting impact on the integration concepts and the architecture of the protocol stack of field area networks. These initiatives were the TOP (Technical and Office Protocol) [6] and MAP (Manufacturing Automation Protocol) [7] projects. The two projects exposed some pitfalls of full seven-layer stack implementations (the ISO/OSI model [3]). As a result, typically only layers 1 (physical layer), 2 (data link layer, implicitly including the medium access control layer), and 7 (application layer, which also covers the user layer) are used in field area networks [8]; this is also prescribed by the international fieldbus standard, IEC 61158 [9]. In IEC 61158, the functions of layers 3 and 4 are recommended to be placed in either layer 2 or layer 7, since network and transport layers are not required in the single-segment networks typical of process and industrial automation (the situation is different in building automation, for instance, where routing functionality and end-to-end control may be needed, arising from a hierarchical network structure); the functions of layers 5 and 6 are always covered in layer 7. The evolution of fieldbus technology, which began well over two decades ago, has resulted in a multitude of solutions reflecting the competing commercial interests of their developers and standardization bodies, both national and international: IEC [10], ISO [11], ISA [12], CENELEC [13], and CEN [14]. This is also reflected in IEC 61158 (adopted in 2000), which accommodates all national standards and the fieldbus systems championed by user organizations. Subsequently, implementation guidelines were compiled into communication profiles, IEC 61784-1 [15].
These communication profiles identify seven main systems (or communication profile families) known by brand names: Foundation Fieldbus (H1, HSE, H2), used in process and factory automation; ControlNet and EtherNet/IP, both used in factory automation; PROFIBUS (DP, PA), used in factory and process automation, respectively; PROFInet, used in factory automation; P-Net (RS 485, RS 232), used in factory automation and shipbuilding; WorldFIP, used in factory automation; INTERBUS, INTERBUS TCP/IP, and INTERBUS Subset, used in factory automation; and Swiftnet transport and Swiftnet full stack, used by aircraft manufacturers. The listed application areas are the dominant ones. Ethernet, the backbone technology for office networks, is increasingly being adopted for communication in factories and plants at the fieldbus level. The random, native CSMA/CD arbitration mechanism is being replaced by other solutions allowing for the deterministic behavior required in real-time communication to support soft and hard real-time deadlines, for the time synchronization of activities (required to control drives, for instance), and for the exchange of the small data records characteristic of monitoring and control actions. The emerging Real-Time Ethernet (RTE), Ethernet augmented with real-time extensions, under standardization by the IEC/SC65C committee, is a fieldbus technology which incorporates Ethernet for the lower two layers of the OSI model. There are already a number of implementations, which use one of three different approaches to meet real-time requirements. The first approach is based on retaining the TCP/UDP/IP protocol suite unchanged (subject to nondeterministic delays); all real-time modifications are enforced in the top layer.
Implementations in this category include Modbus/TCP [16] (defined by Schneider Electric and supported by Modbus-IDA [17]), EtherNet/IP [18] (defined by Rockwell and supported by the Open DeviceNet Vendor Association (ODVA) [19] and ControlNet International [20]), P-Net (on IP) [21] (proposed by the Danish P-Net national committee), and Vnet/IP [22] (developed by Yokogawa, Japan). In the second approach, the TCP/UDP/IP protocol suite is bypassed and the Ethernet functionality is accessed directly; in this case, RTE protocols use their own protocol stack in addition to the standard IP protocol stack. The implementations in this category include Ethernet Powerlink (EPL) [23] (defined by Bernecker + Rainer [B&R], and now supported by the Ethernet Powerlink Standardisation Group [24]), TCnet (Time-critical Control Network) [25] (a proposal from Toshiba), EPA (Ethernet for Plant Automation) [26] (a Chinese proposal), and PROFInet CBA (Component-Based Automation) [27] (defined by several manufacturers including Siemens, and supported by PROFIBUS International [28]). Finally, in the third approach, the Ethernet mechanism and infrastructure themselves are modified. The implementations include SERCOS III [29] (under development by SERCOS), EtherCAT [30] (defined by Beckhoff and supported by the EtherCAT Technology Group [31]), and PROFINET IO [32] (defined by several manufacturers including Siemens, and supported by PROFIBUS International).
The use of standard components such as protocol stacks, Ethernet controllers, bridges, etc., helps mitigate ownership and maintenance costs. Direct support for Internet technologies allows for the vertical integration of the various levels of the industrial enterprise hierarchy, including seamless integration between the automation and business logistic levels to exchange jobs and production (process) data; transparent data interfaces for all stages of the plant life cycle; Internet- and web-enabled remote diagnostics and maintenance; and electronic orders and transactions. In the case of industrial automation, the advent and use of networking has allowed for the horizontal and vertical integration of industrial enterprises.
1.3.2 Networked Embedded Systems in Building Automation

Another fast-growing application area for networked embedded systems is building automation [33]. Building automation systems aim at the control of the internal environment, as well as the immediate external environment, of a building or a building complex. At present, the focus of research and technology development is on commercial buildings (office buildings, exhibition centers, shopping complexes, etc.). In the future, this will also include industrial buildings, which pose substantial challenges to the development of effective monitoring and control solutions. The main services offered by building automation systems typically include: climate control, including heating, ventilation, and air conditioning; visual comfort, covering artificial lighting and the control of daylight; safety services such as fire alarms and emergency sound systems; security protection; control of utilities such as power, gas, and water supply; and internal transportation systems such as lifts and escalators. In terms of the quality-of-service requirements imposed on the field area networks, building automation systems differ considerably from their counterparts in industrial automation, for instance. There is seldom a need for hard real-time communication; the timing requirements are much more relaxed. Traffic volume in normal operation is low. Typical traffic is event driven, and mostly uses a peer-to-peer communication paradigm. Fault tolerance and network management are important aspects. As with industrial fieldbus systems, there are a number of bodies involved in the standardization of technologies for building automation, including the field area networks. The communication architecture supporting automation systems embedded in buildings typically has three levels: field, automation, and management.
The field level involves the operation of elements such as switches, motors, lighting cells, dry cells, etc. Peer-to-peer communication is perhaps most evident at that level; toggling a switch should activate a lighting cell(s), for instance. The automation level is typically used to evaluate new control strategies for the lower level in response to changes in the environment; a reduction in daylight intensity, an external temperature change, etc. LonWorks [34], BACnet [35], and EIB/KNX [36–39] are open system networks which can be used at more than one level of the communication architecture. A roundup of LonWorks is provided in the following, as a representative example of the specialized field area networks used in building automation. LonWorks (EIA-709), a trademark of Echelon Corp. [40], employs the LonTalk protocol, which implements all seven layers of the ISO/OSI reference model. The LonTalk protocol was published as a formal standard [41], and revised in 2002 [42]. In EIA-709, layer 2 supports various communication media such as twisted pair cables (78 Kbit/sec [EIA-709.3] or 1.25 Mbit/sec), power line communication (4 Kbit/sec, EIA-709.2), radio frequency channels, infrared connections, and fiber optic channels (1.25 Mbit/sec), as well as IP connections based on the EIA-852 protocol standard [43] in order to tunnel EIA-709 data packets through IP (intranet, Internet) networks. A p-persistent CSMA bus arbitration scheme is used on twisted pair cables. For other communication media, the EIA-709 protocol stack uses the arbitration scheme defined for that particular medium. The EIA-709 layer 3 supports a variety of different addressing schemes and advanced routing capabilities. The entire routable address space of a LonTalk network is referred to as the domain (Figure 1.1). A domain is restricted to 255 subnets; a subnet allows for up to 127 nodes. The total number of addressable nodes in a domain can thus reach 32385; up to 2^48 domains can be addressed.
Domain gateways can be built between logical domains in order to allow communication across domain boundaries. Groups can be formed in order to send a single data packet to a group of nodes using a multicast-addressed message.
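The address space arithmetic described above can be checked with a few lines; the bit packing shown is a simplified illustration, not the exact EIA-709 wire format.

```python
# Sketch of the EIA-709 logical address space arithmetic. Field widths follow
# the text (255 subnets per domain, 127 nodes per subnet); the bit packing is
# a simplified illustration, not the exact wire format.

MAX_SUBNETS, MAX_NODES = 255, 127

def pack_subnet_node(subnet, node):
    """Encode a logical subnet/node pair into a single integer."""
    assert 1 <= subnet <= MAX_SUBNETS and 1 <= node <= MAX_NODES
    return (subnet << 7) | node                # 8-bit subnet, 7-bit node field

def unpack_subnet_node(addr):
    return addr >> 7, addr & 0x7F

print(MAX_SUBNETS * MAX_NODES)                        # 32385 nodes per domain
print(unpack_subnet_node(pack_subnet_node(200, 55)))  # (200, 55)
```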
FIGURE 1.1 Addressing elements in EIA-709 networks. (From D. Loy, Fundamentals of LonWorks/EIA-709 networks: ANSI/EIA-709 protocol standard (LonTalk). In The Industrial Communication Technology Handbook, Zurawski, R. (Ed.), CRC Press, Boca Raton, FL, 2005. With permission.)
Routing is performed between different subnets only. An EIA-709 node can send a unicast-addressed message to exactly one node using either the unique 48-bit node identification (Node ID) address or the logical subnet/node address. A multicast-addressed message can be sent to either a group of nodes (group address), all nodes in a subnet, or all nodes in the entire domain (broadcast address). The EIA-709 layer 4 supports four types of services. The unacknowledged service transmits the data packet from the sender to the receiver. The unacknowledged repeated service transmits the same data packet a number of times; the number of retries is programmable. The acknowledged service transmits the data packet and waits for an acknowledgment from the receiver; if no acknowledgment is received, the transmitter sends the same data packet again, with a programmable number of retries. The request/response service sends a request message to the receiver; the receiver must respond with a response message, for instance, with statistics information. There is a provision for authentication of acknowledged transmissions, although it is not very efficient. Network nodes (which typically include a Neuron chip, RAM/Flash, a power source, a clock, a network transceiver, and an input/output interface connecting to sensors and actuators) can be based on the Echelon Neuron chip series manufactured by Motorola, Toshiba, and Cypress, or, more recently, on other platform-independent implementations such as the LoyTec LC3020 controller. The Neuron chip-based controllers are programmed with Echelon's Neuron C language, which is a derivative of ANSI C. Other controllers such as the LC3020 are programmed with standard ANSI C. The basic element of Neuron C is the Network Variable (NV), which can be propagated over the network. For instance, the SNVT_temp variable represents temperature in degrees Celsius; SNVT stands for Standard Network Variable Type. Network nodes communicate with each other by exchanging NVs.
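The retry behavior of the acknowledged service described above can be sketched as follows; the channel callback and function names are hypothetical stand-ins, not part of any LonTalk API.

```python
# Sketch of a send-with-retries loop in the spirit of the acknowledged
# service: retransmit until an acknowledgment arrives or the programmable
# retry count is exhausted. The channel callback is a toy stand-in.

def acknowledged_send(packet, channel, retries):
    """Return (delivered, attempts); channel(packet) is True when acked."""
    for attempt in range(1, retries + 2):      # first try plus `retries` retries
        if channel(packet):
            return True, attempt
    return False, retries + 1

# Toy channel that loses the first two transmissions, then acknowledges.
acks = iter([False, False, True])
print(acknowledged_send(b"NV update", lambda p: next(acks), retries=3))  # (True, 3)
```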
Another way to communicate between nodes is by using explicit messages. The Neuron C programs are used to schedule application events and to react to incoming data packets (receiving NVs) from the network interface. Depending on the network media and the network transceivers, a variety of network topologies are possible with LonWorks nodes, including bus, ring, star, and free topology.
As interoperability on all seven OSI layers does not guarantee interworkable products, the LonMark organization [44] has published interoperability guidelines for nodes that use the LonTalk protocol. A number of task groups within LonMark define functional profiles (subsets of all the possible protocol features) for analog input, analog output, temperature sensors, etc. The task groups focus on various types of applications such as home/utility, HVAC, lighting, etc. LonBuilder and NodeBuilder are development and integration tools offered by Echelon. Both tools allow developers to write Neuron C programs, compile and link them, and download the final application into the target node hardware. NodeBuilder supports debugging of one node at a time. LonBuilder, which supports simultaneous debugging of multiple nodes, has a built-in protocol analyzer and a network binder to create communication relationships between network nodes. Echelon's LNS (network operating system) provides tools that allow one to install, monitor, control, manage, and maintain control devices, and to transparently perform these services over any IP-based network, including the Internet.
1.3.3 Automotive Networked Embedded Systems

Similar trends appear in automotive electronic systems, where the ECUs are networked by means of one of the automotive-specific communication protocols for the purpose of controlling one of the vehicle functions; for instance, electronic engine control, antilock braking, active suspension, or telematics, to mention a few. In Reference 45, a number of functional domains have been identified for the deployment of automotive networked embedded systems. They include the powertrain domain, involving, in general, control of engine and transmission; the chassis domain, involving control of suspension, steering, braking, etc.; the body domain, involving control of wipers, lights, doors, windows, seats, mirrors, etc.; the telematics domain, involving, mostly, the integration of wireless communications, vehicle monitoring systems, and vehicle location systems; and the multimedia and Human Machine Interface (HMI) domains. The different domains impose varying constraints on the networked embedded systems in terms of performance, safety requirements, and Quality of Service (QoS). For instance, the powertrain and chassis domains mandate real-time control; typically, bounded delay is required, as well as fault-tolerant services. There are a number of reasons for the interest of the automotive industry in adopting mechatronic solutions, known by the generic name x-by-wire, which aim to replace mechanical, hydraulic, and pneumatic systems by electrical/electronic systems. The main factors seem to be economic in nature, the improved reliability of components, and the increased functionality to be achieved with a combination of embedded hardware and software. Steer-by-wire, brake-by-wire, and throttle-by-wire systems are representative examples of such systems.
It seems, however, that certain safety-critical systems such as steer-by-wire and brake-by-wire will be complemented with traditional mechanical/hydraulic backups, for safety reasons. The dependability of x-by-wire systems is one of the main requirements of, as well as constraints on, the adoption of this kind of system. In this context, a safety-critical x-by-wire system has to ensure that a system failure does not lead to a state in which human life, property, or environment are endangered, and that a single failure of one component does not lead to a failure of the whole x-by-wire system [46]. On the Safety Integrity Level (SIL) scale, it is required for x-by-wire systems that the probability of a failure of a safety-critical system does not exceed 10^-9 per hour per system. This figure corresponds to the SIL4 level. Another equally important requirement for x-by-wire systems is to observe the hard real-time constraints imposed by the system dynamics; the end-to-end response times must be bounded for safety-critical systems. A violation of this requirement may lead to performance degradation of the control system and other consequences. Not all automotive electronic systems are safety critical. For instance, systems to control seats, door locks, internal lights, etc., are not. The different performance, safety, and QoS requirements dictated by the various in-car application domains necessitate the adoption of different solutions, which, in turn, has given rise to a significant number of communication protocols for automotive applications. Time-triggered protocols based on TDMA (Time Division Multiple Access) medium access control are particularly well suited for safety-critical solutions, as they provide deterministic access to the medium. In this category, there are two protocols which, in principle, meet the requirements
of x-by-wire applications, namely TTP/C [47] and FlexRay [48] (FlexRay can support a combination of both time-triggered and event-triggered transmissions). The following discussion will focus mostly on TTP/C and FlexRay. TTP/C (Time-Triggered Protocol) is a fault-tolerant time-triggered protocol; one of the two protocols in the Time-Triggered Architecture (TTA) [49]. The other is the low-cost fieldbus protocol TTP/A [50]. In TTA, the nodes are connected by two replicated communication channels, forming a cluster. A TTA network may have two different interconnection topologies, namely bus and star. In the bus configuration, each node is connected to two replicated passive buses via bus guardians. The bus guardians are independent units that prevent the associated nodes from transmitting outside predetermined time slots by blocking the transmission path; a good example is the case of a controller with a faulty clock oscillator which attempts to transmit continuously. In the star topology, the guardians are integrated into the two replicated central star couplers. The guardians are required to be equipped with their own clocks, a distributed clock synchronization mechanism, and their own power supply. In addition, they should be located at a distance from the protected node to increase immunity to spatial proximity faults. To cope with internal physical faults, TTA employs partitioning of nodes into so-called Fault-Tolerant Units (FTUs), each of which is a collection of several stations performing the same computational functions. As each node is (statically) allocated a transmission slot in a TDMA round, failure of any node, or corruption of a frame, will not cause degradation of the service. In addition, data redundancy allows the correct data value to be ascertained by a voting process. TTP/C employs a synchronous TDMA medium access control scheme on replicated channels, which ensures fault-tolerant transmission with known delay and bounded jitter between the nodes of a cluster.
The use of replicated channels and redundant transmission allows for the masking of a temporary fault on one of the channels. The payload section of the message frame contains up to 240 bytes of data protected by a 24-bit CRC checksum. In TTP/C, the communication is organized into rounds. In a round, different slot sizes may be allocated to different stations; however, slots belonging to the same station are of the same size in successive rounds. Every node must send a message in every round. Another feature of TTP/C is fault-tolerant clock synchronization, which establishes a global time base without the need for a central time provider. Each node in the cluster contains the message schedule. Based on that information, a node computes the difference between the predetermined and actual arrival times of a correct message. These differences are averaged by a fault-tolerant algorithm, which allows for the adjustment of the local clock to keep it in synchrony with the clocks of the other nodes in the cluster. TTP/C provides a so-called membership service to inform every node about the state of every other node in the cluster; it is also used to implement the fault-tolerant clock synchronization mechanism. This service is based on a distributed agreement mechanism, which identifies nodes with failed links. A node with a transmission fault is excluded from the membership until restarted with a proper state of the protocol. Another important feature of TTP/C is a clique avoidance algorithm to detect and eliminate the formation of cliques in case the fault hypothesis is violated. In general, fault-tolerant operation based on FTUs cannot be maintained if the fault hypothesis is violated. In such a situation, TTA activates a Never-Give-Up (NGU) strategy [46]. The NGU strategy, specific to the application, is initiated by TTP/C in combination with the application, with the aim of continuing operation in a degraded mode.
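The fault-tolerant averaging step mentioned above can be illustrated with a minimal sketch; TTP/C's actual synchronization algorithm is more elaborate, so this shows only the core idea of trimming extreme measurements before averaging.

```python
# Sketch of a fault-tolerant average: drop the k smallest and k largest
# deviation measurements so that up to k faulty clocks cannot skew the
# correction term applied to the local clock.

def fault_tolerant_average(deviations, k=1):
    """Average the measured deviations after trimming the k extremes per side."""
    s = sorted(deviations)
    trimmed = s[k:len(s) - k]
    return sum(trimmed) / len(trimmed)

# Deviations (microseconds) between expected and observed frame arrival times;
# the +900 reading comes from a node with a badly drifting clock.
print(fault_tolerant_average([-3.0, -1.0, 2.0, 900.0], k=1))  # 0.5
```

Because the outlier is discarded rather than averaged in, a single faulty clock cannot drag the cluster's global time base away from the healthy majority.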
The TTA infrastructure and the TTP/A and TTP/C protocols have a long history, dating back to 1979 when the Maintainable Architecture for Real-Time Systems (MARS) project started at the Technical University of Berlin. Subsequently, the work was carried out at the Vienna University of Technology. The TTP/C protocol has been experimented with and considered for deployment for quite some time. However, to date, there have been no actual implementations of that protocol involving safety-critical systems in commercial automobiles or trucks. In 1995, a "proof of concept," organized jointly by the Vienna University of Technology and DaimlerChrysler, demonstrated a car equipped with a brake-by-wire system based on the time-triggered protocol. The TTA design methodology, which distinguishes between node design and architecture design, is supported by a comprehensive set of integrated tools from TTTech. A range of development and prototyping hardware is available from TTTech as well. Austriamicrosystems offers an automotive-certified TTP-C2 communication controller (AS8202NF).
FIGURE 1.2 FlexRay communication cycle. (From D. Millinger and R. Nossal, FlexRay Communication Technology. In The Industrial Communication Technology Handbook, Zurawski, R. (Ed.), CRC Press, Boca Raton, FL, 2005. With permission.)
FlexRay, which appears to be the frontrunner for future automotive safety-critical control applications, employs a modified TDMA medium access control scheme on a single or replicated channel. The payload section of a frame contains up to 254 bytes of data protected by a 24-bit CRC checksum. To cope with transient faults, FlexRay also allows for redundant data transmission over the same channel(s), with a time delay between transmissions. The FlexRay communication cycle comprises a network communication time and a network idle time (Figure 1.2). Two or more communication cycles can form an application cycle. The network communication time is a sequence of a static segment, a dynamic segment, and a symbol window. The static segment uses a TDMA MAC protocol and comprises static slots of fixed duration. Unlike in TTP/C, the static allocation of slots to a node (communication controller) applies to one channel only; the same slot may be used by another node on the other channel. Also, a node may possess several slots in a static segment. The dynamic segment uses an FTDMA (Flexible Time Division Multiple Access) MAC protocol, which allows for a priority- and demand-driven access pattern. The dynamic segment comprises so-called mini-slots, with each node allocated a certain number of mini-slots, which do not have to be consecutive. The mini-slots are of a fixed length, much shorter than static slots. As the length of a mini-slot is not sufficient to accommodate a frame (a mini-slot only defines a potential start time of a transmission in the dynamic segment), it has to be enlarged to accommodate the transmission of a frame. This in turn reduces the number of mini-slots in the remainder of the dynamic segment. A mini-slot remains silent if there is nothing to transmit. The nodes allocated mini-slots toward the end of the dynamic segment are less likely to get transmission time. This in turn enforces a priority scheme.
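The mini-slot arbitration just described can be sketched as follows; the slot counts and frame lengths are illustrative, and the model ignores channel details such as idle detection.

```python
# Sketch of FTDMA mini-slot arbitration in the dynamic segment. Senders are
# visited in mini-slot (priority) order; a transmission consumes several
# mini-slots, squeezing later senders out of the cycle. Sizes are illustrative.

def dynamic_segment(requests, total_minislots):
    """requests: (node, frame_len) pairs in priority order, with frame_len in
    mini-slots (0 if the node has nothing to send). Returns the nodes that
    get to transmit this communication cycle."""
    slot, sent = 0, []
    for node, frame_len in requests:
        if frame_len == 0:
            slot += 1                          # empty mini-slot elapses
        elif slot + frame_len <= total_minislots:
            sent.append(node)
            slot += frame_len                  # frame enlarges its mini-slot
        else:
            break                              # no room left in this cycle
    return sent

reqs = [("A", 4), ("B", 0), ("C", 5), ("D", 3)]
print(dynamic_segment(reqs, total_minislots=10))  # ['A', 'C']: D squeezed out
```

The example shows the priority effect: node D, holding the last mini-slots, loses its transmission opportunity once the higher-priority frames from A and C have consumed the segment.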
The symbol window is a time slot of fixed duration used for network management purposes. The network idle time is a protocol-specific time window in which no traffic is scheduled on the communication channel; it is used by the communication controllers for clock synchronization, in principle similar to that described for TTP/C. While the dynamic segment and symbol window are optional, the network idle time and a minimal static segment are mandatory parts of a communication cycle; a minimum of two static slots (a degraded static segment) is required, or four static slots for fault-tolerant clock synchronization. Overall, FlexRay allows three configurations: pure static; mixed, with both static and dynamic segments, whose bandwidth ratio depends on the application; and pure dynamic, where all bandwidth is allocated to dynamic communication. FlexRay supports a range of network topologies, offering maximum scalability and considerable flexibility in the arrangement of embedded electronic architectures in automotive applications. The supported configurations include bus, active star, active cascaded stars, and active stars with bus extension. FlexRay also uses bus guardians in the same way as TTP/C. The existing FlexRay communication controllers support communication bit rates of up to 10 Mbit/sec on two channels. The transceiver component of the communication controller also provides a set
of automotive-network-specific services. Two major services are alarm handling and wakeup control. In addition to the alarm information received in a frame, an ECU also receives the alarm symbol from the communication controller. This redundancy can be used to validate critical signals, for instance, an airbag fire command. The wakeup service is required where electronic components have a sleep mode to reduce power consumption. FlexRay is a joint effort of a consortium involving some of the leading car makers and technology providers, including BMW, Bosch, DaimlerChrysler, General Motors, Motorola, Philips, and Volkswagen, as well as Hyundai Kia Motors as a premium associate member with voting rights. DECOMSYS offers Designer Pro, a comprehensive set of tools to support the development process of FlexRay-based applications. The FlexRay protocol specification version 2.0 was released in 2004. Controllers are currently available from Freescale, and in the future from NEC. The latest controller version, MFR4200, implements protocol specification versions 1.0 and 1.1. Austriamicrosystems offers a high-speed automotive bus transceiver for FlexRay (AS8221). A dedicated physical layer for FlexRay is provided by Philips; it supports the topologies described above and a data rate of 10 Mbit/sec on one channel. Two versions of the bus driver will be available. Time-Triggered Controller Area Network (TTCAN) [51], which can support a combination of time-triggered and event-triggered transmissions, utilizes the physical and data-link layers of the CAN protocol. Since this protocol, as standardized, does not provide the necessary dependability services, it is unlikely to play any role in fault-tolerant communication in automotive applications. The TTP/C and FlexRay protocols belong to class D networks in the classification published by the Society of Automotive Engineers [52, 53].
Although the classification dates back to 1994, it is still a reasonable guideline for distinguishing protocols based on data transmission speed and the functions distributed over the network. It comprises four classes. Class A includes networks with a data rate below 10 Kbit/sec; representative protocols are Local Interconnect Network (LIN) [54] and TTP/A [50]. Class A networks are employed largely to implement body-domain functions. Class B networks operate within the range of 10 Kbit/sec to 125 Kbit/sec; representative protocols are J1850 [55], low-speed CAN [56], and VAN (Vehicle Area Network) [57]. Class C networks operate within the range of 125 Kbit/sec to 1 Mbit/sec; examples are high-speed CAN [58] and J1939 [59]. Networks in this class are used for the control of the powertrain and chassis domains. High-speed CAN, although used in the control of the powertrain and chassis domains, is not suitable for safety-critical applications, as it lacks the necessary fault-tolerant services. Class D (not formally defined as yet) includes networks with a data rate over 1 Mbit/sec. Networks supporting x-by-wire solutions, including TTP/C and FlexRay, fall into this class; MOST (Media Oriented System Transport) [60] and IDB-1394 [61], both for multimedia applications, also belong here. The cooperative development process of networked embedded automotive applications brings with it heterogeneity of software and hardware components. Even with the inevitable standardization of those components, interfaces, and even complete system architectures, support for the reuse of hardware and software components is limited, potentially making the design of networked embedded automotive applications labor-intensive, error-prone, and expensive. This necessitates the development of component-based design integration methodologies.
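The SAE speed classes described above can be encoded directly. In this sketch, boundary rates are assigned to the lower class, which is an assumption of the example rather than part of the SAE definition:

```python
def sae_class(bit_rate_bps):
    """SAE vehicle network class for a given data rate (bit/sec)."""
    if bit_rate_bps < 10_000:
        return "A"   # e.g., LIN, TTP/A -- body domain functions
    if bit_rate_bps <= 125_000:
        return "B"   # e.g., J1850, low-speed CAN, VAN
    if bit_rate_bps <= 1_000_000:
        return "C"   # e.g., high-speed CAN, J1939 -- powertrain/chassis
    return "D"       # e.g., TTP/C, FlexRay, MOST, IDB-1394 -- x-by-wire, multimedia

print(sae_class(9_600))        # A
print(sae_class(500_000))      # C (high-speed CAN territory)
print(sae_class(10_000_000))   # D (FlexRay)
```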
An interesting approach is based on platform-based design [62], discussed in this book with a view to automotive applications. Some industry standardization initiatives include: OSEK/VDX with its OSEKtime OS (OSEK/VDX Time-Triggered Operating System) [63]; OSEK/VDX Communication [64], which specifies a communication layer defining common software interfaces and common behavior for internal and external communications among application processes; OSEK/VDX FTCom (Fault-Tolerant Communication) [65], a proposal for a software layer providing services to facilitate the development of fault-tolerant applications on top of time-triggered networks; HIS (Herstellerinitiative Software) [66], with a broad range of goals including standardization of software modules, specification of process maturity levels, development of software tests, development of software tools, etc.; and ASAM (Association for Standardization of Automation and Measuring Systems) [67], which develops, amongst other projects, a standardized XML-based format for data exchange between tools from different vendors.
© 2006 by Taylor & Francis Group, LLC
Toward Networking of Embedded Systems
1-13
One of the main bottlenecks in the development of safety-critical systems is the software development process. The automotive industry clearly needs a software development process model and supporting tools suitable for the development of safety-critical software. At present, there are two potential candidates. MISRA (Motor Industry Software Reliability Association) [68] has published recommended practices for safe automotive software; these, although automotive specific, do not support x-by-wire. IEC 61508 [69] is an international standard for electrical, electronic, and programmable electronic safety-related systems; it is not automotive specific, but is broadly accepted in other industries.
1.3.4 Sensor Networks
Another trend in the networking of field devices has emerged recently: sensor networks, a further example of networked embedded systems. Here the “embedding” factor is not as evident as in other applications; this is particularly true for wireless and self-organizing networks, where the nodes may be embedded in an ecosystem or a battlefield, for instance. Although potential applications in the projected areas are still under discussion, wireless sensor/actuator networks are already being deployed by the manufacturing industry. The use of wireless links with field devices, such as sensors and actuators, allows for flexible installation and maintenance, supports the mobile operation required by mobile robots, and alleviates problems with cabling. To operate effectively in the industrial/factory-floor environment, a wireless communication system has to guarantee, among other things, high reliability, low and predictable data-transfer delay (typically less than 10 msec for real-time applications), support for a high number of sensors/actuators (over 100 in a cell of a few meters radius), and low power consumption. In industrial environments, the characteristic wireless-channel degradation artifacts can be compounded by the presence of electric motors or a variety of equipment causing electric discharges, which contribute to even greater levels of bit errors and packet losses. The problem can be partially alleviated either by designing robust and loss-tolerant applications and control algorithms, or by improving the channel quality; both are subjects of extensive research and development. In a wireless sensor/actuator network, stations may interact with each other on a peer-to-peer basis, and with a base station. To leverage low cost, small size, and low power consumption, standard Bluetooth (IEEE 802.15.1) 2.4 GHz radio transceivers [70, 71] may be used as the sensor/actuator communication hardware.
To meet the requirements for high reliability, low and predictable data-transfer delay, and support for a high number of sensors/actuators, custom optimized communication protocols may be required, as commercially available solutions such as IEEE 802.15.1, IEEE 802.15.4 [72], and the IEEE 802.11 variants [73–75] may not fulfill all the requirements. The base station may have its transceiver attached to a cable of a fieldbus, giving rise to a hybrid wireless-wireline fieldbus system [2]. A representative example of this kind of system is a wireless sensor/actuator network developed by ABB and deployed in a manufacturing environment [76]. The system, known as WISA (wireless sensor/actuator), has been implemented in a manufacturing cell to network proximity switches, which are among the most widely used position sensors in automated factories, controlling the positions of a variety of equipment including, for instance, robotic arms. The sensor/actuator communication hardware is based on a standard Bluetooth 2.4 GHz radio transceiver and low-power electronics that handle the wireless communication link. The sensors communicate with a wireless base station via antennas mounted in the cell. For the base station, a specialized RF front end was developed to provide collision-free air access by allocating a fixed TDMA time slot to each sensor/actuator. Frequency hopping (FH) is employed to counter both frequency-selective fading and interference effects, and operates in combination with automatic retransmission requests (ARQs). The parameters of this TDMA/FH scheme were chosen to satisfy the requirements of up to 120 sensors/actuators per base station. Each wireless node has a response, or cycle, time of 2 msec, making full use of the available radio band of 80 MHz width. The FH sequences are cell-specific and were chosen to have low cross-correlations, permitting parallel operation of many cells on the same factory floor with low self-interference.
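The figures quoted above imply a tight per-device slot budget. The following is back-of-envelope arithmetic only, not the actual WISA frame layout (which also interleaves frequency hops, ARQ retransmissions, and parallel channels):

```python
# Raw TDMA slot budget implied by the quoted WISA figures.
CYCLE_TIME_US = 2_000      # 2 msec response/cycle time per wireless node
DEVICES_PER_BASE = 120     # sensor/actuators served by one base station

# If every device gets one slot per cycle on a single shared timeline,
# each slot can last at most cycle_time / devices.
slot_budget_us = CYCLE_TIME_US / DEVICES_PER_BASE
print(round(slot_budget_us, 1))   # ~16.7 microseconds per device slot
```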
The base station can handle up to 120 wireless
sensor/actuators and is connected to the control system via a (wireline) field bus. To increase capacity, a number of base stations can operate in the same area. WISA provides wireless power supply to the sensors, based on magnetic coupling [77].
1.4 Concluding Remarks
This chapter has presented an overview of trends in the networking of embedded systems, their design, and selected application-domain-specific network technologies. Networked embedded systems appear in a variety of application domains, including automotive, train, aircraft, office building, and industrial automation. With the exception of building automation, the systems discussed in this chapter tend to be confined to a relatively small area and a limited number of nodes, as in the case of an industrial process, an automobile, or a truck. In building automation, networked embedded systems may take on truly large proportions in terms of area covered and number of nodes. For instance, in a LonTalk network, the total number of addressable nodes in a domain can reach 32,385, and up to 248 domains can be addressed. Wireless sensor/actuator networks, as well as hybrid wireless-wireline networks, have started evolving from concept to actual implementations, and are poised to have a major impact on industrial, home, and building automation — at least in these application domains, for a start. Networked embedded systems pose a multitude of challenges in their design (particularly for safety-critical applications), deployment, and maintenance. The majority of development environments and tools for specific networking technologies do not have firm foundations in computer science or software engineering models and practices, making the development process labor-intensive, error-prone, and expensive.
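The LonTalk node count quoted above follows from the protocol's addressing hierarchy. The subnet/node split used here (up to 255 subnets per domain, up to 127 nodes per subnet) is an assumption of this sketch drawn from the LonTalk addressing scheme, not stated in the text itself:

```python
# Reconstructing the LonTalk addressing figures quoted above.
SUBNETS_PER_DOMAIN = 255   # assumed subnet count per domain
NODES_PER_SUBNET = 127     # assumed node count per subnet
DOMAINS = 248              # addressable domains, as quoted in the text

nodes_per_domain = SUBNETS_PER_DOMAIN * NODES_PER_SUBNET
print(nodes_per_domain)               # 32385, matching the figure in the text
print(nodes_per_domain * DOMAINS)     # total nodes across all addressable domains
```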
References

[1] R. Zurawski (Ed.), The Industrial Communication Systems, Special Issue. Proceedings of the IEEE, 93, June 2005.
[2] J.-D. Decotignie, P. Dallemagne, and A. El-Hoiydi, Architectures for the Interconnection of Wireless and Wireline Fieldbusses. In Proceedings of the 4th IFAC Conference on Fieldbus Systems and Their Applications (FET '2001), Nancy, France, 2001.
[3] H. Zimmermann, OSI Reference Model: The ISO Model of Architecture for Open System Interconnection. IEEE Transactions on Communications, 28, 425–432, 1980.
[4] Costrell, CAMAC Instrumentation System — Introduction and General Description. IEEE Transactions on Nuclear Science, 18, 3–8, 1971.
[5] C.-A. Gifford, A Military Standard for Multiplex Data Bus. In Proceedings of the IEEE 1974 National Aerospace and Electronics Conference, May 13–15, 1974, Dayton, OH, USA, pp. 85–88.
[6] N. Collins, Boeing Architecture and TOP (Technical and Office Protocol). In Networking: A Large Organization Perspective, April 1986, Melbourne, FL, USA, pp. 49–54.
[7] H.A. Schutz, The Role of MAP in Factory Integration. IEEE Transactions on Industrial Electronics, 35, 6–12, 1988.
[8] P. Pleinevaux and J.-D. Decotignie, Time Critical Communication Networks: Field Buses. IEEE Network, 2, 55–63, 1988.
[9] International Electrotechnical Commission, Digital Data Communications for Measurement and Control — Fieldbus for Use in Industrial Control Systems, Part 1: Introduction. IEC 61158-1, IEC, 2003.
[10] International Electrotechnical Commission (IEC). www.iec.ch.
[11] International Organization for Standardization (ISO). www.iso.org.
[12] Instrumentation Society of America (ISA). www.isa.org.
[13] Comité Européen de Normalisation Electrotechnique (CENELEC). www.cenelec.org.
[14] European Committee for Standardization (CEN). www.cenorm.be.
[15] International Electrotechnical Commission, Digital Data Communications for Measurement and Control — Part 1: Profile Sets for Continuous and Discrete Manufacturing Relative to Fieldbus Use in Industrial Control Systems, IEC 61784-1, IEC, 2003.
[16] International Electrotechnical Commission, Real Time Ethernet Modbus-RTPS, Proposal for a Publicly Available Specification for Real-Time Ethernet, document IEC 65C/341/NP, IEC, 2004.
[17] www.modbus-ida.org.
[18] International Electrotechnical Commission, Real Time Ethernet: EtherNet/IP with Time Synchronization, Proposal for a Publicly Available Specification for Real-Time Ethernet, document IEC 65C/361/NP, IEC, 2004.
[19] www.odva.org.
[20] www.controlnet.org.
[21] International Electrotechnical Commission, Real Time Ethernet: P-NET on IP, Proposal for a Publicly Available Specification for Real-Time Ethernet, document IEC 65C/360/NP, IEC, 2004.
[22] International Electrotechnical Commission, Real Time Ethernet Vnet/IP, Proposal for a Publicly Available Specification for Real-Time Ethernet, document IEC 65C/352/NP, IEC, 2004.
[23] International Electrotechnical Commission, Real Time Ethernet EPL (ETHERNET Powerlink), Proposal for a Publicly Available Specification for Real-Time Ethernet, document IEC 65C/356a/NP, IEC, 2004.
[24] www.ethernet-powerlink.org.
[25] International Electrotechnical Commission, Real Time Ethernet TCnet (Time-Critical Control Network), Proposal for a Publicly Available Specification for Real-Time Ethernet, document IEC 65C/353/NP, IEC, 2004.
[26] International Electrotechnical Commission, Real Time Ethernet EPA (Ethernet for Plant Automation), Proposal for a Publicly Available Specification for Real-Time Ethernet, document IEC 65C/357/NP, IEC, 2004.
[27] J. Feld, PROFINET — Scalable Factory Communication for all Applications.
In Proceedings of the 2004 IEEE International Workshop on Factory Communication Systems, September 22–24, 2004, Vienna, Austria, pp. 33–38.
[28] www.profibus.org.
[29] International Electrotechnical Commission, Real Time Ethernet SERCOS III, Proposal for a Publicly Available Specification for Real-Time Ethernet, document IEC 65C/358/NP, IEC, 2004.
[30] International Electrotechnical Commission, Real Time Ethernet Control Automation Technology (ETHERCAT), Proposal for a Publicly Available Specification for Real-Time Ethernet, document IEC 65C/355/NP, IEC, 2004.
[31] www.ethercat.org.
[32] International Electrotechnical Commission, Real-Time Ethernet PROFINET IO, Proposal for a Publicly Available Specification for Real-Time Ethernet, document IEC 65C/359/NP, IEC, 2004.
[33] Deborah Snoonian, Smart Buildings. IEEE Spectrum, 40, 18–23, 2003.
[34] D. Loy, D. Dietrich, and H. Schweinzer, Open Control Networks, Kluwer, Dordrecht, 2004.
[35] Steven T. Bushby, BACnet: A Standard Communication Infrastructure for Intelligent Buildings. Automation in Construction, 6, 529–540, 1997.
[36] ENV 13154-2, Data Communication for HVAC Applications — Field Net — Part 2: Protocols, 1998.
[37] EIA/CEA 776.5, CEBus-EIB Router Communications Protocol — The EIB Communications Protocol, 1999.
[38] EN 50090-X, Home and Building Electronic Systems (HBES), 1994–2004.
[39] Konnex Association, Diegem, Belgium, KNX Specifications, V. 1.1, 2004.
[40] www.echelon.com.
[41] Control Network Protocol Specification, ANSI/EIA/CEA-709.1-A, 1999.
[42] Control Network Protocol Specification, EIA/CEA Std. 709.1, Rev. B, 2002.
[43] Tunneling Component Network Protocols Over Internet Protocol Channels, ANSI/EIA/CEA 852, 2002.
[44] www.lonmark.org.
[45] F. Simonot-Lion, In-Car Embedded Electronic Architectures: How to Ensure Their Safety. In Proceedings of the 5th IFAC International Conference on Fieldbus Systems and Their Applications (FeT'2003), July 2003, Aveiro, Portugal.
[46] X-by-Wire Project, Brite-EuRam III Program, X-By-Wire — Safety Related Fault Tolerant Systems in Vehicles, Final Report, 1998.
[47] TTTech Computertechnik GmbH, Time-Triggered Protocol TTP/C, High-Level Specification Document, Protocol Version 1.1, November 2003. www.tttech.com.
[48] FlexRay Consortium, FlexRay Communication System, Protocol Specification, Version 2.0, June 2004. www.flexray.com.
[49] H. Kopetz and G. Bauer, The Time-Triggered Architecture. Proceedings of the IEEE, 91, 112–126, 2003.
[50] H. Kopetz et al., Specification of the TTP/A Protocol, University of Technology, Vienna, 2002.
[51] International Standard Organization, ISO 11898-4, Road Vehicles — Controller Area Network (CAN) — Part 4: Time-Triggered Communication, ISO, 2000.
[52] Society of Automotive Engineers, J2056/1 Class C Application Requirements Classifications. In SAE Handbook, SAE, 1994.
[53] Society of Automotive Engineers, J2056/2 Survey of Known Protocols. In SAE Handbook, Vol. 2, SAE, 1994.
[54] Antal Rajnak, The LIN Standard. In The Industrial Communication Technology Handbook, CRC Press, Boca Raton, FL, 2005.
[55] Society of Automotive Engineers, Class B Data Communications Network Interface — SAE J1850 Standard — Rev. Nov. 96, 1996.
[56] International Standard Organization, ISO 11519-2, Road Vehicles — Low Speed Serial Data Communication — Part 2: Low Speed Controller Area Network, ISO, 1994.
[57] International Standard Organization, ISO 11519-3, Road Vehicles — Low Speed Serial Data Communication — Part 3: Vehicle Area Network (VAN), ISO, 1994.
[58] International Standard Organization, ISO 11898, Road Vehicles — Interchange of Digital Information — Controller Area Network for High-Speed Communication, ISO, 1994.
[59] SAE J1939 Standards Collection. www.sae.org.
[60] MOST Cooperation, MOST Specification Revision 2.3, August 2004. www.mostnet.de.
[61] www.idbforum.org.
[62] K. Keutzer, S. Malik, A.R. Newton, J. Rabaey, and A. Sangiovanni-Vincentelli, System Level Design: Orthogonalization of Concerns and Platform-Based Design. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 19(12), 1523–1543, 2000.
[63] OSEK Consortium, OSEK/VDX Operating System, Version 2.2.2, July 2004. www.osek-vdx.org.
[64] OSEK Consortium, OSEK/VDX Communication, Version 3.0.3, July 2004. www.osek-vdx.org.
[65] OSEK Consortium, OSEK/VDX Fault-Tolerant Communication, Version 1.0, July 2001. www.osek-vdx.org.
[66] www.automotive-his.de.
[67] www.asam.de.
[68] www.misra.org.uk.
[69] International Electrotechnical Commission, IEC 61508:2000, Parts 1–7, Functional Safety of Electrical/Electronic/Programmable Electronic Safety-Related Systems, 2000.
[70] Bluetooth Consortium, Specification of the Bluetooth System, 1999. www.bluetooth.org.
[71] Bluetooth Special Interest Group, Specification of the Bluetooth System, Version 1.1, December 1999.
[72] LAN/MAN Standards Committee, IEEE Standard for Information Technology — Telecommunications and Information Exchange between Systems — Local and Metropolitan Area Networks — Specific Requirements — Part 15.4: Wireless Medium Access Control (MAC) and Physical Layer
(PHY) Specifications for Low Rate Wireless Personal Area Networks (LR-WPANs), IEEE Computer Society, Washington, 2003.
[73] LAN/MAN Standards Committee of the IEEE Computer Society, IEEE Standard for Information Technology — Telecommunications and Information Exchange between Systems — Local and Metropolitan Networks — Specific Requirements — Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications: Higher Speed Physical Layer (PHY) Extension in the 2.4 GHz Band, 1999.
[74] LAN/MAN Standards Committee of the IEEE Computer Society, Information Technology — Telecommunications and Information Exchange between Systems — Local and Metropolitan Area Networks — Specific Requirements — Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications, 1999.
[75] Institute of Electrical and Electronics Engineers, Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications, Amendment 4: Further Higher Data Rate Extension in the 2.4 GHz Band, ANSI/IEEE Std 802.11, June 2003.
[76] Christoffer Apneseth, Dacfey Dzung, Snorre Kjesbu, Guntram Scheible, and Wolfgang Zimmermann, Introducing Wireless Proximity Switches. ABB Review, 4, 42–49, 2002. www.abb.com/review.
[77] Dacfey Dzung, Christoffer Apneseth, and Jan Endresen, A Wireless Sensor/Actuator Communication System for Real-Time Factory Applications, private communication. IEEE Transactions on Industrial Electronics (submitted).
2
Real-Time in Embedded Systems

Hans Hansson, Mikael Nolin, and Thomas Nolte
Mälardalen University

2.1  Introduction ................................. 2-1
2.2  Design of RTSs .............................. 2-3
     Reference Architecture • Models of Interaction • Execution Strategies • Component-Based Design • Tools for Design of RTSs
2.3  Real-Time Operating Systems ................. 2-8
     Typical Properties of RTOSs • Mechanisms for Real-Time • Commercial RTOSs
2.4  Real-Time Scheduling ........................ 2-11
     Introduction to Scheduling • Offline Schedulers • Online Schedulers
2.5  Real-Time Communications .................... 2-13
     Communication Techniques • Fieldbuses • Ethernet for Real-Time Communication • Wireless Communication
2.6  Analysis of RTSs ............................ 2-18
     Timing Properties • Methods for Timing Analysis • Example of Analysis • Trends and Tools
2.7  Component-Based Design of RTS ............... 2-25
     Timing Properties and CBD • Real-Time Operating Systems • Real-Time Scheduling
2.8  Testing and Debugging of RTSs ............... 2-29
2.9  Summary ..................................... 2-30
References ....................................... 2-31
In this chapter we will provide an introduction to issues, techniques, and trends in real-time systems (RTSs). We will specifically discuss design of RTSs, real-time operating systems (RTOSs), real-time scheduling, real-time communication, real-time analysis, as well as testing and debugging of RTSs. For each of these areas, state-of-the-art tools and standards are presented.
2.1 Introduction
Consider the airbag in the steering wheel of your car. After the detection of a crash (and only then), it should inflate just in time to softly catch your head and prevent it from hitting the steering wheel; not too early, since this would make the airbag deflate before it can catch you; nor too late, since the exploding
airbag could then injure you by blowing up in your face, and/or catch you too late to prevent your head from banging into the steering wheel. The computer-controlled airbag system is an example of a RTS. But RTSs come in many different flavors, including vehicles, telecommunication systems, industrial automation systems, household appliances, etc. There is no commonly agreed upon definition of what a RTS is, but the following characterization is (almost) universally accepted:

• RTSs are computer systems that physically interact with the real world.
• RTSs have requirements on the timing of these interactions.

Typically, the real-world interactions are via sensors and actuators, rather than the keyboard and screen of your standard PC. Real-time requirements typically express that an interaction should occur within a specified timing bound. It should be noted that this is quite different from requiring the interaction to be as fast as possible. Essentially all RTSs are embedded in products, and the vast majority of embedded computer systems are RTSs. RTSs are the dominant application of computer technology, as more than 99% of manufactured processors (more than 8 billion in 2000 [1]) are used in embedded systems. Returning to the airbag system, we note that in addition to being a RTS it is a safety-critical system, that is, a system that owing to severe risks of damage has strict Quality of Service (QoS) requirements, including requirements on functional behavior, robustness, reliability, and timeliness. A typical strict timing property could be that a certain response to an interaction must always occur within some prescribed time; for example, the charge in the airbag must detonate between 10 and 20 msec from the detection of a crash. Violating this must be avoided at any cost, since it would lead to something unacceptable, such as having to spend a couple of months in hospital.
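Note that the airbag requirement above is a window, not just an upper bound. A minimal sketch of the corresponding acceptance check, with the constants taken from the example in the text:

```python
# Detonation must occur between 10 and 20 msec after crash detection.
EARLIEST_MS, LATEST_MS = 10, 20

def detonation_in_window(t_ms):
    """True if detonation at time t_ms (msec after crash detection) is acceptable."""
    return EARLIEST_MS <= t_ms <= LATEST_MS

print(detonation_in_window(15))   # True
print(detonation_in_window(5))    # False: too early, the bag deflates before impact
print(detonation_in_window(25))   # False: too late, the bag fires into your face
```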
A system that is designed to meet strict timing requirements is often referred to as a hard RTS. In contrast, systems for which occasional timing failures are acceptable — possibly because this will not lead to something terrible — are termed soft RTSs. An illustrative comparison between hard and soft RTSs that highlights the difference between the extremes is shown in Table 2.1. A typical hard RTS could in this context be an engine control system, which must operate with µsec precision, and which will severely damage the engine if timing requirements are missed by more than a few msec. A typical soft RTS could be a banking system, for which timing is important, but where there are no strict deadlines and some variations in timing are acceptable. Unfortunately, it is impossible to build real systems that satisfy hard real-time requirements, since, owing to the imperfection of hardware (and designers), any system may break. The best that can be achieved is a system that, with very high probability, provides the intended behavior during a finite interval of time. However, on the conceptual level hard real-time makes sense, since it implies a certain amount of rigor in the way the system is designed; for example, it implies an obligation to prove that the strict timing requirements are met.

TABLE 2.1 Typical Characteristics of Hard and Soft RTSs [2]

Characteristic           Hard real-time    Soft real-time
Timing requirements      Hard              Soft
Pacing                   Environment       Computer
Peak-load performance    Predictable       Degraded
Error detection          System            User
Safety                   Critical          Noncritical
Redundancy               Active            Standby
Time granularity         Millisecond       Second
Data files               Small             Large
Data integrity           Short term        Long term
Since the early 1980s a substantial research effort has provided a sound theoretical foundation (e.g., [3,4]) and many practically useful results for the design of hard RTSs. Most notably, hard RTS scheduling has evolved into a mature discipline, using abstract but realistic models of tasks executing on single-CPU, multiprocessor, or distributed computer systems, together with associated methods for timing analysis. Such schedulability analysis, for example, the well-known rate-monotonic analysis [5–7], has also found significant use in some industrial segments. However, hard real-time scheduling is not the cure for all RTSs. Its main weakness is that it is based on analysis of the worst possible scenario. For safety-critical systems this is of course a must, but for other systems, where general customer satisfaction is the main criterion, it may be too costly to design the system for a worst-case scenario that may never occur during the system's lifetime. If we look at the other end of the spectrum, we find the best-effort approach, which is still the dominating approach in industry. The essence of this approach is to implement the system using some best practice, and then use measurements, testing, and tuning to make sure that the system is of sufficient quality. On the one hand, such a system will hopefully satisfy some soft real-time requirements; the weakness being that we do not know which ones. On the other hand, compared with the hard real-time approach, the system can be better optimized for the available resources. A further difference is that hard RTS methods are essentially applicable to static configurations only, whereas dynamic task creation and the like are less problematic to handle in best-effort systems. Having identified the weaknesses of the hard real-time and best-effort approaches, major efforts are now being put into more flexible techniques for soft RTSs.
These techniques provide analyzability (like hard real-time) together with flexibility and resource efficiency (like best-effort). The basis for the flexible techniques is often quantified QoS characteristics, typically related to nonfunctional aspects such as timeliness, robustness, dependability, and performance. To provide a specified QoS, some sort of resource management is needed. Such QoS management is handled either by the application, by the operating system (OS), by some middleware, or by a mix of the above. QoS management is often a flexible online mechanism that dynamically adapts the resource allocation to balance between conflicting QoS requirements.
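The rate-monotonic analysis mentioned earlier can be sketched in one classical form: the Liu and Layland utilization bound, under which n periodic tasks are guaranteed schedulable with rate-monotonic priorities if their total utilization does not exceed n(2^(1/n) − 1). The task sets below are invented for illustration; note the test is sufficient only, so a False result is inconclusive:

```python
def rm_schedulable(tasks):
    """tasks: list of (worst-case execution time, period) pairs.
    Sufficient schedulability test under rate-monotonic priorities."""
    n = len(tasks)
    utilization = sum(c / t for c, t in tasks)
    bound = n * (2 ** (1 / n) - 1)   # ~0.693 as n grows large
    return utilization <= bound

# Total utilization 0.65 is below the n = 3 bound of about 0.780.
print(rm_schedulable([(1, 4), (1, 5), (2, 10)]))   # True
# Total utilization 0.95 exceeds the bound, so no guarantee is given.
print(rm_schedulable([(2, 4), (1, 5), (1, 4)]))    # False
```

A task set that fails this test may still be schedulable; exact response-time analysis would then be needed to decide.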
2.2 Design of RTSs

The main issue in designing RTSs is timeliness, that is, that the system performs its operations at the proper points in time. Not considering timeliness at the design phase will make it virtually impossible to analyze and predict the timely behavior of the RTS. This section presents some important architectural issues for embedded RTSs, together with some supporting commercial tools.
2.2.1 Reference Architecture

A generic system architecture for an RTS is depicted in Figure 2.1. This architecture is a model of any computer-based system interacting with an external environment via sensors and actuators. Since our focus is on the RTS, we will look more closely at different organizations of that part of the generic architecture in Figure 2.1. The simplest RTS is a single processor, but in many cases the RTS is a distributed computer system consisting of a set of processors interconnected by a communications network. There can be several reasons for making an RTS distributed, including:

• The physical distribution of the application.
• Computational requirements that may not be conveniently met by a single CPU.
• The need for redundancy to meet availability, reliability, or other safety requirements.
• The desire to reduce the cabling in the system.
Figure 2.2 shows an example of a distributed RTS. In a modern car, like the one depicted in the figure, there are some 20 to 100 computer nodes (which in the automotive industry are called Electronic Control
© 2006 by Taylor & Francis Group, LLC
Embedded Systems Handbook
FIGURE 2.1 A generic RTS architecture: the RTS interacts with the environment via sensors and actuators.
FIGURE 2.2 Network infrastructure of the Volvo XC90.
Units [ECUs]) interconnected by one or more communication networks. The initial motivation for this type of electronic architecture in cars was the need to reduce the amount of cabling. However, the electronic architecture has also led to other significant improvements, including substantial pollution reduction and new safety mechanisms, such as computer-controlled Electronic Stabilization Programs (ESPs). The current development is toward making the most safety-critical vehicle functions, such as braking and steering, completely computer controlled. This is done by removing the mechanical connections (e.g., between steering wheel and front wheels, and between brake pedal and brakes), replacing them with computers and computer networks. Meeting the stringent safety requirements for such functions will require careful introduction of redundancy mechanisms in hardware and communication, as well as in software; that is, a safety-critical system architecture is needed (an example of such an architecture is TTA [8]).
2.2.2 Models of Interaction

In Section 2.2.1 we presented the physical organization of an RTS, but for an application programmer this is not the most important aspect of the system architecture. Actually, from an application programmer's
Real-Time in Embedded Systems
perspective, the system architecture is given more by the execution paradigm (execution strategy) and the interaction model used in the system. In this section we describe what an interaction model is and how it affects the real-time properties of a system; in Section 2.2.3 we discuss the execution strategies used in RTSs.

A model of interaction describes the rules by which components interact with each other (in this section we use the term component to denote any type of software unit, such as a task or a module). The interaction model can govern both the control flow and the data flow between system components. One of the most important design decisions, for all types of systems, is which interaction models to use (sadly, however, this decision is often implicit and hidden in the system's architectural description).

When designing RTSs, attention should be paid to the timing properties of the interaction models chosen. Some models have a more predictable and robust behavior with respect to timing than others. Examples of the more predictable models commonly used in RTS design are pipes-and-filters, publisher–subscriber, and blackboard. At the other end of the spectrum of interaction models there are models that increase the (timing) unpredictability of the system. These models should, if possible, be avoided when designing RTSs. The two most notable, and commonly used, are client–server and message boxes.

2.2.2.1 Pipes-and-Filters

In this model, both data and control flow are specified using the input and output ports of components. A component becomes eligible for execution when data has arrived on its input ports, and when the component finishes execution it produces output on its output ports. This model fits well for many types of control programs, and control laws are easily mapped to this interaction model. Hence, it has gained widespread use in the real-time community. The real-time properties of this model are also quite good.
Since both data and control flow unidirectionally through a series of components, the order of execution and the end-to-end timing delay usually become predictable. The model also provides a high degree of decoupling in time; that is, components can often execute without having to worry about timing delays caused by other components. Hence, it is usually straightforward to specify the compound timing behavior of a set of components.

2.2.2.2 Publisher–Subscriber

The publisher–subscriber model is similar to the pipes-and-filters model, but it usually decouples data and control flow. That is, a subscriber can usually choose among different forms of triggering its execution. If the subscriber chooses to be triggered on each new published value, the publisher–subscriber model takes on the form of the pipes-and-filters model. In addition, however, a subscriber could choose to ignore the timing of the published values and simply use the latest published value. Also, in the publisher–subscriber model the publisher is not necessarily aware of the identity, or even the existence, of its subscribers. This provides a higher degree of decoupling of components. Similar to the pipes-and-filters model, the publisher–subscriber model provides good timing properties. However, a prerequisite for analysis of systems using this model is that subscriber components make explicit which values they subscribe to (this is not mandated by the model itself). When using the publisher–subscriber model for embedded systems, however, it is the norm that subscription information is available (this information is used, for instance, to decide which values are to be published over a communications network, and to decide the receiving nodes of those values).

2.2.2.3 Blackboard

The blackboard model allows variables to be published on a globally available blackboard area. Thus, it resembles the use of global variables. The model allows any component to read or write values of variables in the blackboard.
Hence, the software engineering qualities of the blackboard model are questionable. Nevertheless, it is a commonly used model, and in some situations it
provides a pragmatic solution to problems that are difficult to address with more stringent interaction models. Software engineering aspects aside, the blackboard model does not introduce any extra elements of unpredictable timing. On the other hand, the flexibility of the model does not help engineers to achieve predictable systems. Since the model does not address the control flow, components can execute relatively undisturbed and decoupled from other components.

2.2.2.4 Client–Server

In the client–server model, a client asynchronously invokes the service of a server. The service invocation passes the control flow (plus any input data) to the server, and control stays at the server until it has completed the service. When the server is done, the control flow (and any return data) is returned to the client, which in turn resumes execution. The client–server model has inherently unpredictable timing. Since services are invoked asynchronously, it is very difficult to assess, a priori, the load on the server for a certain service invocation. Thus, it is difficult to estimate the delay of the service invocation and, in turn, the response time of the client. This matter is further complicated by the fact that many components behave both as clients and as servers (a server often uses other servers to implement its own services), leading to very complex and unanalyzable control-flow paths.

2.2.2.5 Message Boxes

A component can have a set of message boxes, and components communicate by posting messages in each other's message boxes. Messages are typically handled in First In First Out (FIFO) order, or in priority order (where the sender specifies a priority). Message passing does not change the flow of control for the sender. A component that tries to receive a message from an empty message box, however, blocks on that message box until a message arrives (often the receiver can specify a timeout to prevent indefinite blocking).
From a sender's point of view, the message box model has problems similar to those of the client–server model. The data sent by the sender (and the action that the sender expects the receiver to perform) may be delayed in an unpredictable way when the receiver is highly loaded. Also, the asynchronous nature of the message passing makes it difficult to foresee the load of a receiver at any particular moment. Furthermore, from the receiver's point of view, the reading of message boxes is unpredictable in the sense that the receiver may or may not block on the message box. Also, since message boxes often are of limited size, there is a risk that a highly loaded receiver loses some messages. Lost messages are another source of unpredictability.
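The message-loss behavior just described can be made concrete with a small C sketch of a bounded message box. The capacity and the nonblocking fetch are simplifications of what a real RTOS would provide (a real receive would block, or block with a timeout):

```c
#define MBOX_CAPACITY 4

/* A bounded FIFO message box. post() fails when the box is full,
   which is exactly the "lost message" case discussed above. fetch()
   returns 0 on an empty box instead of blocking. */
struct mbox {
    int buf[MBOX_CAPACITY];
    int head;    /* index of the oldest message */
    int count;   /* number of queued messages */
};

int mbox_post(struct mbox *m, int msg)
{
    if (m->count == MBOX_CAPACITY)
        return 0;                         /* receiver overloaded: message lost */
    m->buf[(m->head + m->count) % MBOX_CAPACITY] = msg;
    m->count++;
    return 1;
}

int mbox_fetch(struct mbox *m, int *msg)
{
    if (m->count == 0)
        return 0;                         /* empty: a real RTOS would block here */
    *msg = m->buf[m->head];
    m->head = (m->head + 1) % MBOX_CAPACITY;
    m->count--;
    return 1;
}
```

When a burst of five posts hits a box of capacity four, the fifth post fails and that message is silently lost; whether and when this happens depends entirely on how quickly the receiver drains the box, which is the source of unpredictability noted above.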
2.2.3 Execution Strategies

There are two main execution paradigms for RTSs: time-triggered and event-triggered. When using time-triggered execution, activities occur at predefined instants of time; for example, a specific sensor value is read exactly every 10 msec and, 2 msec later, the corresponding actuator receives an updated control parameter. In event-triggered execution, on the other hand, actions are triggered by event occurrences; for example, when the toxic fluid in a tank reaches a certain level an alarm will go off. It should be noted that the same functionality can typically be implemented in both paradigms; for example, a time-triggered implementation of the above alarm would be to periodically read the level-measuring sensor and activate the alarm when the read level exceeds the maximum allowed. If alarms are rare, the time-triggered version will have a much higher computational overhead than the event-triggered one. On the other hand, the periodic sensor readings will facilitate detection of a malfunctioning sensor. Time-triggered execution is used in many safety-critical systems with high dependability requirements (such as avionic control systems), whereas the majority of other systems are event-triggered. Dependability can also be guaranteed in the event-triggered paradigm, but owing to the observability
provided by the exact timing of time-triggered executions, most experts argue for using the time-triggered paradigm in ultra-dependable systems. The main arguments against time-triggered execution are its lack of flexibility and its requirement of pre-runtime schedule generation (which is a nontrivial and possibly time-consuming task). Time-triggered systems are mostly implemented with simple proprietary table-driven dispatchers [9] (see Section 2.4.2 for a discussion of table-driven execution), but complete commercial systems including design tools are also available [10,11]. For the event-triggered paradigm a large number of commercial tools and OSs are available (examples are given in Section 2.3.3). There are also examples of systems integrating the two execution paradigms, thereby aiming to get the best of both worlds: time-triggered dependability and event-triggered flexibility. One example is the Basement system [12] and its associated real-time kernel Rubus [13].

Since computations in time-triggered systems are statically allocated both in space (to a specific processor) and in time, some sort of configuration tool is often used. This tool assumes that the computations are packaged into schedulable units (corresponding to tasks or threads in an event-triggered system). Typically, for example in Basement, computations are control-flow based, in the sense that they are defined by sequences of schedulable units, each unit performing a computation based on its inputs and producing outputs to the next unit in the sequence. The system is configured by defining the sequences and their timing requirements. The configuration tool will then automatically (if possible)¹ generate a schedule that guarantees that all timing requirements are met. Event-triggered systems typically have richer and more complex Application Programming Interfaces (APIs), defined by the OS and middleware used; these will be elaborated on in Section 2.3.
2.2.4 Component-Based Design

Component-Based Design (CBD) of software systems is an interesting approach for software engineering in general, and for engineering of RTSs in particular. In CBD, a software component is used to encapsulate some functionality. That functionality is only accessed through the interface of the component. A system is composed by assembling a set of components and connecting their interfaces.

The reason CBD could prove extra useful for RTSs is the possibility to extend components with introspective interfaces. An introspective interface does not provide any functionality per se; rather, the interface can be used to retrieve information about extra-functional properties of the component. Extra-functional properties can include attributes such as memory consumption, execution times, task periods, etc. For RTSs, timing properties are of course of particular interest. Unlike the functional interfaces of components, the introspective interfaces can be available offline, that is, during the component assembly phase. This way, the timing attributes of the system components can be obtained at design time, and tools to analyze the timing behavior of the system can be used. If the introspective interfaces are also available online, they can be used in, for instance, admission control algorithms. An admission controller could query new components for their timing behavior and resource consumption before deciding to accept a new component into the system.

Unfortunately, many industry-standard software techniques are based on the client–server or the message-box models of interaction, which we deemed, in Section 2.2.2, unfit for RTSs. This is especially true for the most commonly used component models. For instance, the CORBA Component Model (CCM) [14], Microsoft's COM [15] and .NET [16] models, and JavaBeans [17] all have the client–server model as their core model.
Also, none of these component technologies allows the specification of extra-functional properties through introspective interfaces. Hence, from the real-time perspective, the biggest advantage of CBD is void for these technologies. However, there are numerous research projects addressing CBD for real-time and embedded systems (e.g., [18–21]). These projects address the issues left behind by the existing commercial technologies, such as timing predictability (using suitable computational models), support for offline analysis of

¹This scheduling problem is theoretically intractable, so the configuration tool has to rely on heuristics that work well in practice, but which do not guarantee to find a solution in all cases where one exists.
component assemblies, and better support for resource-constrained systems. Often, these projects strive to remove the considerable runtime flexibility provided by existing technologies. This runtime flexibility is judged to be the foremost contributor to unpredictability (the flexibility also adds to the runtime complexity, and prevents the use of CBD in resource-constrained systems).
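A minimal C sketch can illustrate the idea of an introspective interface alongside a functional one. The attribute set and the admission test below are illustrative assumptions, not taken from any of the cited component models:

```c
/* Extra-functional attributes exposed through a component's
   introspective interface. The attribute names are illustrative. */
struct rt_attrs {
    unsigned wcet_us;      /* worst-case execution time, microseconds */
    unsigned period_us;    /* activation period, microseconds */
    unsigned stack_bytes;  /* memory consumption */
};

struct component {
    void (*run)(void);                /* functional interface */
    const struct rt_attrs *attrs;     /* introspective interface */
};

/* A simple offline/admission check: accept an assembly only if the
   components' total CPU utilization stays within a budget (percent).
   An online admission controller could run the same test before
   accepting a new component into the system. */
int admit(const struct component *comps, int n, unsigned budget_pct)
{
    double util = 0.0;
    for (int i = 0; i < n; i++)
        util += (double)comps[i].attrs->wcet_us
              / (double)comps[i].attrs->period_us;
    return util * 100.0 <= (double)budget_pct;
}
```

The key point is that `admit()` never calls `run()`: the timing attributes are available without executing the component, which is what makes design-time analysis of an assembly possible.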
2.2.5 Tools for Design of RTSs

In industry the term real-time system is highly overloaded, and can mean anything from interactive systems to superfast systems or embedded systems. Consequently, it is not easy to judge which tools are suitable for developing RTSs (as we define real-time in this chapter). For instance, UML [22] is commonly used for software design. However, UML's focus is mainly on client–server solutions, and it has proven inapt for RTS design. As a consequence, UML-based tools that extend UML with constructs suitable for real-time programs have emerged. The two best-known products are Rational's Rose RealTime [23] and i-Logix' Rhapsody [24]. These tools provide UML support with the extension of real-time profiles. While giving real-time engineers access to suitable abstractions and computational models, these tools do not provide means to describe timing properties or requirements in a formal way; thus they do not allow automatic verification of timing requirements.

Telelogic provides programming and design support using the language SDL [25]. SDL was originally developed as a specification language for the telecom industry, and is as such highly suitable for describing complex reactive systems. However, its fundamental model of computation is the message-box model, which has an inherently unpredictable timing behavior. Nevertheless, for soft embedded RTSs, SDL can give very time- and space-efficient implementations.

For more resource-constrained hard RTSs, design tools are provided by, for example, Arcticus Systems [13], TTTech [10], and Vector [26]. These tools are instrumental during both system design and implementation, and also provide some timing analysis techniques that allow timing verification of the system (or parts of the system). However, these tools are based on proprietary formats and processes, and have as such reached a limited customer base (mainly within the automotive industry).
In the near future, UML2 will become an adopted standard [27]. UML2 has support for computational models suitable for RTSs. This support comes mainly in the form of ports that can have protocols associated with them. Ports are either provided or required, hence allowing type-matching of connections between components. UML2 also includes many of the concepts from Rose RealTime, Rhapsody, and SDL. Other future design techniques that are expected to have an impact on the design of RTSs include the EAST/EEA Architecture Description Language (EAST-ADL) [28]. The EAST-ADL, developed by the automotive industry, is a description language that will cover the complete development cycle of distributed, resource-constrained, safety-critical RTSs. Tools to support development with EAST-ADL (which is a UML2-compliant language) are expected to be provided by automotive tool vendors such as ETAS [29], Vector [30], and Siemens [31].
2.3 Real-Time Operating Systems

An RTOS provides services for resource access and resource sharing, very similar to a general-purpose OS. An RTOS, however, provides additional services suited for real-time development, and also supports the development process for embedded systems. Using a general-purpose OS when developing RTSs has several drawbacks:

• High resource utilization, for example, large RAM and ROM footprints and high internal CPU demand.
• Difficulty in accessing hardware and devices in a timely manner, for example, no application-level control over interrupts.
• Lack of services to allow timing-sensitive interactions between different processes.
2.3.1 Typical Properties of RTOSs

The state of practice in RTOSs is reflected in Reference 32. Not all OSs are RTOSs. An RTOS is typically multithreaded and preemptible, there has to be a notion of thread priority, predictable thread synchronization has to be supported, priority inheritance should be supported, and the OS behavior should be known [33]. This means that the interrupt latency, the worst-case execution time (WCET) of system calls, and the maximum time during which interrupts are masked must be known. A commercial RTOS is usually marketed as the runtime component of an embedded development platform. As a general rule of thumb one can say that RTOSs are:

• Suitable for resource-constrained environments. RTSs typically operate in such environments. Most RTOSs can be configured pre-runtime (e.g., at compile time) to include only a subset of the total functionality. Thus, the application developer can choose to leave out unused portions of the RTOS in order to save resources. RTOSs typically store much of their configuration in ROM. This is done mainly for two purposes: (1) to minimize the use of expensive RAM memory and (2) to minimize the risk that critical data is overwritten by an erroneous application.
• Giving the application programmer easy access to hardware features, such as interrupts and devices. Most often the RTOS gives the application programmer means to install Interrupt Service Routines at compile time and/or at runtime. This means that the RTOS leaves all interrupt handling to the application programmer, allowing fast, efficient, and predictable handling of interrupts. In general-purpose OSs, memory-mapped devices are usually protected from direct access using the Memory Management Unit (MMU) of the CPU, hence forcing all device accesses to go through the OS. RTOSs typically do not protect such devices, but allow the application to manipulate the devices directly. This gives faster and more efficient access to the devices.
(However, this efficiency comes at the price of an increased risk of erroneous use of the device.)
• Providing services that allow implementation of timing-sensitive code. An RTOS typically has many mechanisms to control the relative timing between different processes in the system. Most notably, an RTOS has a real-time process scheduler whose function is to make sure that the processes execute in the way the application programmer intended. We elaborate more on the issues of scheduling in Section 2.4. An RTOS also provides mechanisms to control the processes' relative performance when accessing shared resources. This can, for instance, be done by priority queues instead of the plain FIFO queues used in general-purpose OSs. Typically, an RTOS supports one or more real-time resource locking protocols, such as priority inheritance or priority ceiling (Section 2.3.2 discusses resource locking protocols further).
• Tailored to fit the embedded systems development process. RTSs are usually constructed in a host environment that is different from the target environment, so-called cross-platform development. Also, it is typical that the whole memory image, including both the RTOS and one or more applications, is created on the host platform and downloaded to the target platform. Hence, most RTOSs are delivered as source code modules or precompiled libraries that are statically linked with the applications at compile time.
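The direct device access mentioned in the second bullet typically boils down to reading and writing memory-mapped registers through `volatile` pointers, with no OS call in between. The following C fragment drives a hypothetical UART; the register layout and the polling protocol are invented for illustration (a real driver would follow the chip's datasheet and map the structure at the documented base address):

```c
#include <stdint.h>

/* Hypothetical memory-mapped UART register block. "volatile" tells
   the compiler every access has a side effect and must not be
   optimized away or reordered. */
struct uart_regs {
    volatile uint32_t status;   /* bit 0: transmitter ready */
    volatile uint32_t txdata;   /* write a byte here to send it */
};

#define UART_TX_READY 0x1u

/* Busy-wait until the device can accept a byte, then write it.
   No OS call is involved: fast and predictable, but unprotected. */
void uart_put(struct uart_regs *uart, uint8_t byte)
{
    while (!(uart->status & UART_TX_READY))
        ;                        /* spin until the transmitter is free */
    uart->txdata = byte;
}
```

On real hardware the driver would obtain the pointer with something like `(struct uart_regs *)BASE_ADDRESS` (the address being platform-specific); the code above can also be exercised against an ordinary structure in RAM acting as a fake device.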
2.3.2 Mechanisms for Real-Time

One of the most important functions of an RTOS is to arbitrate access to shared resources in such a way that the timing behavior of the system becomes predictable. The two most obvious resources the RTOS manages access to are:

• The CPU — that is, the RTOS should allow processes to execute in a predictable manner.
• Shared memory areas — that is, the RTOS should resolve contention for shared memory in a way that gives predictable timing.

The CPU access is arbitrated with a real-time scheduling policy. Section 2.4 describes real-time scheduling policies in more depth. Examples of scheduling policies that can be used in RTSs
are priority scheduling, deadline scheduling, and rate scheduling. Some of these policies directly use timing attributes (like deadline) of the tasks to make scheduling decisions, whereas other policies use scheduling parameters (like priority, rate, or bandwidth) that indirectly affect the timing of the tasks. A special form of scheduling, which is also very useful for RTSs, is table-driven (static) scheduling, described further in Section 2.4.2. To summarize, in table-driven scheduling all arbitration decisions have been made offline and the RTOS scheduler just follows a simple table. This gives very good timing predictability, albeit at the expense of system flexibility. The most important aspect of a real-time scheduling policy is that it should provide means to analyze, a priori, the timing behavior of the system, hence giving predictable timing behavior. Scheduling in general-purpose OSs normally emphasizes properties such as fairness, throughput, and guaranteed progress; these properties may be adequate in their own respect, but they are usually in conflict with the requirement that an RTOS should provide timing predictability.

Shared resources (such as memory areas, semaphores, and mutexes) are also arbitrated by the RTOS. When a task locks a shared resource, it will block all other tasks that subsequently try to lock the resource. In order to achieve predictable blocking times, special real-time resource locking protocols have been proposed ([34,35] provide more details about the protocols).

2.3.2.1 Priority Inheritance Protocol

The priority inheritance protocol (PIP) makes a low-priority task inherit the priority of any higher-priority task that becomes blocked on a resource locked by the lower-priority task. This is a simple and straightforward method to lower the blocking time.
However, it is computationally intractable to calculate the worst-case blocking time (which may be infinite, since the protocol does not prevent deadlocks). Hence, for hard RTSs, or when timing performance needs to be calculated a priori, the PIP is not adequate.

2.3.2.2 Priority Ceiling Inheritance Protocol

The priority ceiling protocol (PCP) associates with each resource a ceiling value that is equal to the highest priority of any task that may lock the resource. By clever use of the ceiling values of the resources, the RTOS scheduler will manipulate task priorities to avoid the problems of PIP. PCP guarantees freedom from deadlocks, and the worst-case blocking is relatively easy to calculate. However, the computational complexity of keeping track of ceiling values and task priorities gives PCP a high runtime overhead.

2.3.2.3 Immediate Ceiling Priority Inheritance Protocol

The immediate inheritance protocol (IIP) also associates with each resource a ceiling value that is equal to the highest priority of any task that may lock the resource. However, unlike in PCP, in IIP a task is immediately assigned the ceiling priority of the resource it is locking. IIP has the same real-time properties as PCP (including the same worst-case blocking time).² However, IIP is significantly easier to implement. It is, in fact, for single-node systems, easier to implement than any other resource locking protocol (including non-real-time protocols). In IIP no actual locks need to be implemented; it is enough for the RTOS to adjust the priority of the task that locks or releases a resource. IIP has other operational benefits; notably, it paves the way for letting multiple tasks use the same stack area. OSs based on IIP can be used to build systems with extremely small footprints [36,37].
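The remark that IIP needs no actual lock structures can be sketched in a few lines of C. This is a simplified single-node model (higher numbers mean higher priority), not any particular RTOS's implementation:

```c
/* Immediate ceiling priority protocol without any lock structure:
   "locking" a resource just raises the calling task's active priority
   to the resource ceiling; "unlocking" restores the saved priority.
   A task holding the resource can therefore never be preempted by any
   other task that might also want the resource. */
struct iip_task {
    int base_prio;     /* priority assigned at design time */
    int active_prio;   /* priority the scheduler currently uses */
};

struct iip_resource {
    int ceiling;       /* highest base priority of any possible user */
};

/* "Lock": returns the priority to restore on release. */
int iip_lock(struct iip_task *t, const struct iip_resource *r)
{
    int saved = t->active_prio;
    if (r->ceiling > t->active_prio)
        t->active_prio = r->ceiling;   /* inherit the ceiling at once */
    return saved;
}

void iip_unlock(struct iip_task *t, int saved_prio)
{
    t->active_prio = saved_prio;
}
```

Because the priority change happens immediately at the lock operation, no queue of blocked tasks ever forms on the resource, which is also why IIP admits the shared-stack implementations mentioned above.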
2.3.3 Commercial RTOSs

There is an abundance of commercial RTOSs. Most of them provide adequate mechanisms to enable the development of RTSs. Some examples are Tornado/VxWorks [38], LYNX [39], OSE [40], QNX [41], RT-Linux [42], and ThreadX [43]. However, the major problem with these OSs is the rich set of

²The average blocking time will, however, be higher in IIP than in PCP.
primitives provided. These systems provide both primitives that are suitable for RTSs and primitives that are unfit for RTSs (or that should be used with great care). For instance, they usually provide multiple resource locking protocols, some of which are suitable for real-time and some of which are not. This richness becomes a problem when these OSs are used by inexperienced engineers, when projects are large, and when project management does not provide clear design guidelines/rules. In these situations, it is very easy to use primitives that will contribute to the timing unpredictability of the developed system. Rather, an RTOS should help engineers and project managers by providing only mechanisms that help in designing predictable systems. However, there is an obvious conflict between the desire/need of RTOS manufacturers to provide rich interfaces and the stringency needed by designers of RTSs.

There is a smaller set of RTOSs that have been designed to resolve these problems, and at the same time to allow extremely lightweight implementations of predictable RTSs. The driving idea is to provide a small set of primitives that guides the engineers toward a good design of their system. Typical examples are the research RTOS Asterix [36] and the commercial RTOS SSX5 [37]. These systems provide a simplified task model, in which tasks cannot suspend themselves (e.g., there is no sleep() primitive) and tasks are restarted from their entry point on each invocation. The only resource locking protocol supported is IIP, and the scheduling policy is fixed-priority scheduling. These limitations make it possible to build an RTOS that is able to run, for example, ten tasks using less than 200 bytes of RAM, and at the same time give predictable timing behavior [44]. Other commercial systems that follow a similar principle of reducing the degrees of freedom, and hence promote stringent design of predictable RTSs, include Arcticus Systems' Rubus OS [13].
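The restricted task model just described can be sketched as a run-to-completion dispatcher in C. This is an illustrative model of the idea, not the actual Asterix or SSX5 implementation; the two example tasks and the shared log are made up for demonstration:

```c
/* Restricted task model: every task runs from its entry point to
   completion on each invocation and can never suspend itself. The
   dispatcher therefore only ever needs to pick the highest-priority
   released task; no per-task context is kept between invocations,
   which is what enables the tiny RAM footprints discussed above. */
struct rtc_task {
    void (*entry)(void);   /* restarted from here on every invocation */
    int prio;              /* fixed priority, higher = more urgent */
    int released;          /* set by a timer or interrupt */
};

/* Example tasks that record their execution order (for the demo). */
static int rtc_log[8];
static int rtc_log_len;
static void task_low(void)  { rtc_log[rtc_log_len++] = 1; }
static void task_high(void) { rtc_log[rtc_log_len++] = 2; }

/* Run all released tasks, highest priority first; returns how many
   ran. A real kernel would be re-entered on each timer interrupt. */
int rtc_dispatch(struct rtc_task *tasks, int n)
{
    int ran = 0;
    for (;;) {
        int best = -1;
        for (int i = 0; i < n; i++)
            if (tasks[i].released &&
                (best < 0 || tasks[i].prio > tasks[best].prio))
                best = i;
        if (best < 0)
            return ran;            /* nothing released: go idle */
        tasks[best].released = 0;
        tasks[best].entry();       /* runs to completion, no blocking */
        ran++;
    }
}
```

Because no task can block mid-execution, all tasks can share a single stack, and the "context" of a task is nothing more than its entry point and priority.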
Many of the commercial RTOSs provide standard APIs. The most important RTOS standards are RT-POSIX [45], OSEK [46], and APEX [47]. Here we deal only with POSIX, since it is the most widely adopted RTOS standard, but those interested in automotive and avionic systems should take a closer look at OSEK and APEX, respectively. The POSIX standard is based on Unix, and its goal is portability of applications at the source code level. The basic POSIX services include task and thread management, file system management, input and output, and event notification via signals. The POSIX real-time interface defines services facilitating concurrent programming and providing predictable timing behavior. Concurrent programming is supported by synchronization and communication mechanisms that allow predictability. Predictable timing behavior is supported by preemptive fixed-priority scheduling, time management with high resolution, and virtual memory management. Several restricted subsets of the standard, intended for different types of systems, have been defined, as well as specific language bindings, for example, for Ada [48].
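As an example of the POSIX real-time interface, a thread attribute object can be prepared for preemptive fixed-priority scheduling (the SCHED_FIFO policy) as follows. Note that actually creating a SCHED_FIFO thread typically requires elevated privileges, so this sketch only builds and verifies the attribute object:

```c
#include <pthread.h>
#include <sched.h>

/* Prepare thread attributes for fixed-priority preemptive scheduling
   as defined by the POSIX real-time extensions. Returns 0 on success,
   -1 on any failure. The priority must lie in the range reported by
   sched_get_priority_min/max for SCHED_FIFO (1..99 on Linux). */
int make_rt_attr(pthread_attr_t *attr, int prio)
{
    struct sched_param sp = { .sched_priority = prio };

    if (pthread_attr_init(attr) != 0)
        return -1;
    /* Do not inherit the creating thread's policy; use ours instead. */
    if (pthread_attr_setinheritsched(attr, PTHREAD_EXPLICIT_SCHED) != 0)
        return -1;
    if (pthread_attr_setschedpolicy(attr, SCHED_FIFO) != 0)
        return -1;
    if (pthread_attr_setschedparam(attr, &sp) != 0)
        return -1;
    return 0;
}
```

The attribute object would then be passed to `pthread_create()`; without sufficient privileges that call commonly fails with EPERM, which is why real-time threads are usually started by a privileged launcher or granted the capability explicitly.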
2.4 Real-Time Scheduling

Traditionally, real-time schedulers are divided into offline and online schedulers. Offline schedulers make all scheduling decisions before the system is executed. At runtime a simple dispatcher is used to activate tasks according to the offline-generated schedule. Online schedulers, on the other hand, decide during execution, based on various parameters, which task should execute at any given time. As a multitude of different schedulers has been developed in the research community, in this section we focus on the main categories of schedulers that are readily available in existing RTOSs.
2.4.1 Introduction to Scheduling An RTS consists of a set of real-time programs, which in turn consist of a set of tasks. These tasks are sequential pieces of code, executing on a platform with limited resources. The tasks have different timing
© 2006 by Taylor & Francis Group, LLC
Embedded Systems Handbook
properties, for example, execution times, periods, and deadlines. Several tasks can be allocated to a single processor. The scheduler decides, at each moment, which task to execute. An RTS can be preemptive or nonpreemptive. In a preemptive system, tasks can preempt each other, letting the task with the highest priority execute. In a nonpreemptive system, a task that has been allowed to start will execute until its completion. Tasks can be categorized as periodic, sporadic, or aperiodic. Periodic tasks execute with a specified time (the period) between task releases. Aperiodic tasks have no information saying when the task is to be released; usually, aperiodic tasks are triggered by interrupts. Similarly, sporadic tasks have no period, but in contrast with aperiodic tasks, sporadic tasks have a known minimum time between releases. Typically, tasks that perform measurements are periodic, collecting some value(s) every nth time unit. A sporadic task typically reacts to an event/interrupt that is known to have a minimum interarrival time, for example, an alarm or the emergency shutdown of a production robot. The minimum interarrival time can be constrained by physical laws, or it can be enforced by some hardware mechanism. If we do not know the minimum time between two consecutive events, we must classify the event-handling task as aperiodic. A real-time scheduler schedules the real-time tasks sharing the same resource (e.g., a CPU or a network link). The goal of the scheduler is to make sure that the timing requirements of these tasks are satisfied. The scheduler decides, based on the task timing properties, which task is to execute or to use the resource.
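The preemptive/nonpreemptive distinction can be illustrated with a small discrete-time simulation (a sketch only; the task names, parameters, and single-time-unit granularity are illustrative assumptions, not taken from any particular RTOS):

```python
def simulate(tasks, horizon, preemptive=True):
    """Discrete-time simulation of one CPU.

    tasks maps name -> (priority, release_time, execution_time); a larger
    priority number means higher priority.  Returns the execution trace,
    one entry per time unit (None = CPU idle).
    """
    remaining = {name: c for name, (_, _, c) in tasks.items()}
    trace, running = [], None
    for now in range(horizon):
        ready = [n for n, (_, r, _) in tasks.items()
                 if r <= now and remaining[n] > 0]
        if not ready:
            running = None
            trace.append(None)
            continue
        if preemptive or running not in ready:
            # Preemptive: always switch to the highest-priority ready task.
            # Nonpreemptive: a started task keeps the CPU until completion.
            running = max(ready, key=lambda n: tasks[n][0])
        trace.append(running)
        remaining[running] -= 1
    return trace
```

With a low-priority task released at time 0 and a high-priority task released at time 2, the preemptive run interrupts the low-priority task immediately, whereas the nonpreemptive run lets it finish first.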
2.4.2 Offline Schedulers Offline schedulers, or table-driven schedulers, work as follows: the scheduler creates a schedule (the table) before the system is started (offline). At runtime, a dispatcher follows the schedule, and makes sure that tasks only execute in their predetermined time slots (according to the schedule). Offline schedules are commonly used to implement the time-triggered execution paradigm (described in Section 2.2.3). By creating a schedule offline, complex timing constraints can be handled in a way that would be difficult to achieve online. The schedule that is created offline is the one used at runtime; therefore, the online behavior of table-driven schedulers is very deterministic. Because of this determinism, table-driven schedulers are more commonly used in applications that have very high safety-critical demands. However, since the schedule is created offline, the flexibility is very limited, in the sense that as soon as the system changes (due to, e.g., added functionality or a change of hardware), a new schedule has to be created and given to the dispatcher. Creating new schedules is nontrivial and sometimes very time consuming. There also exist combinations of the predictable table-driven schedulers and the more flexible priority-based schedulers, as well as methods to convert one policy to another [13,49,50].
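A table-driven dispatcher can be sketched as follows (the slot layout and task names are invented for illustration; a real dispatcher would invoke actual task entry points rather than return strings):

```python
# Static schedule for one major cycle of 8 slots: (start_slot, task) pairs,
# created offline and only read at runtime.
SCHEDULE = [(0, "sample"), (2, "control"), (5, "actuate"), (7, "idle")]
MAJOR_CYCLE = 8

def dispatch(slot):
    """Return the task active in the given slot.  The table repeats every
    major cycle, so the runtime behavior is fully deterministic."""
    current = None
    for start, task in SCHEDULE:
        if start <= slot % MAJOR_CYCLE:
            current = task
    return current
```

Note that any change to the task set requires regenerating the table offline, which is exactly the flexibility limitation discussed above.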
2.4.3 Online Schedulers Scheduling policies that make their scheduling decisions during runtime are classified as online schedulers. These schedulers make their scheduling decisions based on some task properties, for example, task priority. Schedulers that base their scheduling decisions on task priorities are called priority-based schedulers. 2.4.3.1 Priority-Based Schedulers With priority-based schedulers, flexibility is increased (compared with table-driven schedulers), since the schedule is created online, based on the constraints of the currently active tasks. Hence, priority-based schedulers can cope with changes in workload and with added functions, as long as the schedulability of the task set is not violated. However, the exact behavior of priority-based schedulers is harder to predict. Therefore, these schedulers are seldom used in the most safety-critical applications. Two common priority-based scheduling policies are Fixed-Priority Scheduling (FPS) and Earliest Deadline First (EDF). The difference between these scheduling policies is whether the priorities of the real-time tasks are fixed or can change during execution (i.e., whether they are dynamic). In FPS, priorities are assigned to the tasks before execution (offline). The task with the highest priority among all tasks that are available for execution is scheduled for execution. It can be proven that some
Real-Time in Embedded Systems
priority assignments are better than others. For instance, for a simple task model with strictly periodic, noninterfering tasks with deadlines equal to the period of the task, the Rate Monotonic (RM) priority assignment has been shown by Liu and Layland [5] to be optimal. In RM, the priority is assigned based on the period of the task: the shorter the period, the higher the priority assigned to the task. Using EDF, the task with the nearest (earliest) deadline among all available tasks is selected for execution. Therefore, the priority is not fixed; it changes with time. It has been shown that for simple task models EDF is an optimal dynamic priority scheme [5]. 2.4.3.2 Scheduling with Aperiodics In order for priority-based schedulers to cope with aperiodic tasks, different service methods have been presented. The objective of these service methods is to give a good average response time for aperiodic requests, while preserving the timing properties of periodic and sporadic tasks. These services are implemented using special server tasks. In the scheduling literature many types of servers are described. For FPS, for instance, the Sporadic Server (SS) was presented by Sprunt et al. [51]. The SS has a fixed priority chosen according to the RM policy. For EDF, the Dynamic Sporadic Server (DSS) [52,53] extends SS. Other EDF-based schedulers are the Constant Bandwidth Server (CBS), presented by Abeni and Buttazzo [54], and the Total Bandwidth Server (TBS) by Spuri and Buttazzo [52,55]. Each server is characterized partly by its unique mechanism for assigning deadlines, and partly by a set of variables used to configure the server. Examples of such variables are bandwidth, period, and capacity. In Section 2.6 we give examples of how the timing properties of FPS can be calculated.
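The two priority schemes can be contrasted in a few lines (a sketch; the task names, periods, and deadline values are arbitrary examples):

```python
def rate_monotonic_priorities(periods):
    """Fixed-priority (RM) assignment: the shorter the period, the higher
    the priority.  Here 0 denotes the highest priority."""
    by_period = sorted(periods, key=periods.get)
    return {task: prio for prio, task in enumerate(by_period)}

def edf_pick(absolute_deadlines):
    """EDF dispatching decision: among the ready tasks, run the one whose
    absolute deadline is nearest.  Priorities thus change over time as
    deadlines come and go."""
    return min(absolute_deadlines, key=absolute_deadlines.get)
```

RM decides everything offline from the periods; EDF makes the equivalent decision online, each time the ready set or the deadlines change.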
2.5 Real-Time Communications Real-time communication aims at providing timely and deterministic communication of data between distributed devices. In many cases, there are requirements to provide guarantees of the real-time properties of these transmissions. Real-time communication networks come in different types, ranging from small fieldbus-based control systems to large Ethernet/Internet distributed applications. There is also a growing interest in wireless solutions. In this section we give a brief introduction to communications in general and real-time communications in particular. We then provide an overview of the currently most popular real-time communication systems and protocols, both in industry and in academia.
2.5.1 Communication Techniques Common access mechanisms used in communication networks are CSMA/CD (Carrier Sense Multiple Access/Collision Detection), CSMA/CA (Carrier Sense Multiple Access/Collision Avoidance), TDMA (Time Division Multiple Access), tokens, central master, and mini slotting. These techniques are used in both real-time and non-real-time communication, and each of them has different timing characteristics. In CSMA/CD, collisions between messages are detected, causing the messages involved in the collision to be retransmitted. CSMA/CD is used, for example, in Ethernet. CSMA/CA, on the other hand, avoids collisions and is therefore more deterministic in its behavior than CSMA/CD. Hence, CSMA/CA is more suitable for hard real-time guarantees, whereas CSMA/CD can provide soft real-time guarantees. Examples of networks that implement CSMA/CA are the Controller Area Network (CAN) and ARINC 629. TDMA uses time to achieve exclusive usage of the network: messages are sent at predetermined instances in time. Hence, the behavior of TDMA-based networks is very deterministic, that is, very suitable for providing real-time guarantees. One example of a TDMA-based real-time network is TTP. An alternative way of eliminating collisions on the network is to use tokens. In token-based networks only the owner of the (unique within the network) token is allowed to send messages on the network.
Once the token holder is done, or has used its allotted time, the token is passed to another node. Tokens are used in, for example, Profibus. It is also possible to eliminate collisions by letting one node in the network be the master node. The master node controls the traffic on the network, deciding which messages are allowed to be sent, and when. This approach is used in, for example, LIN and TTP/A. Finally, mini slotting can also be used to eliminate collisions. With mini slotting, as soon as the network is idle, a node that would like to transmit a message has to wait for a time that is unique for that node before sending. If several competing nodes want to send messages, a node with a longer waiting time will see that another node has already started its transmission, and must then wait until the network becomes idle again. Hence, collisions are avoided. Mini slotting can be found in, for example, FlexRay and ARINC 629.
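The mini-slotting arbitration reduces to comparing the nodes' unique waiting times (a minimal sketch; node names and times are invented):

```python
def minislot_winner(waiting_time, contenders):
    """Among the nodes wanting to send when the bus goes idle, the node
    with the shortest (unique) waiting time starts transmitting first; the
    others observe the now-busy bus and back off until it is idle again."""
    return min(contenders, key=lambda node: waiting_time[node])
```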
2.5.2 Fieldbuses Fieldbuses are a family of factory communication networks that have evolved in response to the demand to reduce cabling costs in factory automation systems. By moving from a situation in which every controller has its own cables connecting its sensors (a parallel interface) to a system in which a set of controllers shares a bus (a serial interface), costs could be cut and flexibility increased. This evolution of technology was driven both by the growing number of cables in the system, as the number of sensors and actuators grew, and by controllers moving from being specialized units with their own microchips to sharing a microprocessor with other controllers. Fieldbuses were soon ready to handle the most demanding applications on the factory floor. Several fieldbus technologies, usually very specialized, were developed by different companies to meet the demands of their applications. Fieldbuses used in the automotive industry include CAN, TT-CAN, TTP, LIN, and FlexRay. In avionics, ARINC 629 is one of the frequently used communication standards. Profibus is widely used in automation and robotics, while in trains TCN and WorldFIP are very popular communication technologies. We will now present each of these fieldbuses in some more detail, outlining key features and specific properties. 2.5.2.1 Controller Area Network The Controller Area Network (CAN) [56] was standardized by the International Standardisation Organisation (ISO) [57] in 1993. Today CAN is a widely used fieldbus, mainly in automotive systems but also in other real-time applications, for example, medical equipment. CAN is an event-triggered broadcast bus designed to operate at speeds of up to 1 Mbps. CAN uses a fixed-priority based arbitration mechanism that can provide timing guarantees using FPS-type analysis [58,59]. An example of this analysis will be provided in Section 2.6.3.
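CAN's nondestructive arbitration works bit by bit on the message identifier: a dominant 0 overwrites a recessive 1 on the wired-AND bus, so the lowest identifier (i.e., the highest priority) always wins. A behavioral sketch of this mechanism (electrical details omitted; the example identifiers are arbitrary):

```python
def can_arbitration(identifiers, id_bits=11):
    """Bitwise CAN arbitration over standard 11-bit identifiers.

    In each bit time the wired-AND bus carries a dominant 0 if any
    contender sends 0; a node that sends a recessive 1 but reads back a
    dominant 0 loses and drops out.  The lowest identifier always wins.
    """
    contenders = set(identifiers)
    for bit in range(id_bits - 1, -1, -1):       # most significant bit first
        sent = {i: (i >> bit) & 1 for i in contenders}
        bus = min(sent.values())                 # wired-AND of all sent bits
        contenders = {i for i in contenders if sent[i] == bus}
    (winner,) = contenders                       # unique ids -> one survivor
    return winner
```

Losing nodes simply retry when the bus next goes idle, which is why the bus as a whole behaves like a priority-ordered queue of pending messages.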
CAN is a collision-avoidance broadcast bus, using deterministic collision resolution to control access to the bus (so-called CSMA/CA). The basis for the access mechanism is the electrical characteristics of the CAN bus, which allow sending nodes to detect collisions in a nondestructive way. By monitoring the resulting bus value during message arbitration, a node detects whether there are higher-priority messages competing for access to the bus. If this is the case, the node stops its message transmission and tries to retransmit the message as soon as the bus becomes idle again. Hence, the bus behaves like a priority-based queue. 2.5.2.2 Time-Triggered CAN Time-triggered communication on CAN (TT-CAN) [60] is a standardized session-layer extension to the original CAN. In TT-CAN, the exchange of messages is controlled by the progression of time, and all nodes follow a predefined static schedule. It is also possible to support original event-triggered CAN traffic together with the time-triggered traffic. This traffic is sent in dedicated arbitration windows, using the same arbitration mechanism as native CAN. The static schedule is based on a time-division (TDMA) scheme, where message exchanges may only occur during specific time slots or time windows. Synchronization of the nodes is done using either a
clock synchronization algorithm, or periodic messages from a master node. In the latter case, all nodes in the system synchronize with this message, which gives a reference point in the temporal domain for the static schedule of the message transactions; the master's view of time is referred to as the network's global time. TT-CAN adds a set of new features to the original CAN, and, as it is standardized, several semiconductor vendors are manufacturing TT-CAN compliant devices. 2.5.2.3 Flexible Time-Triggered CAN Flexible time-triggered communication on CAN (FTT-CAN) [61,62] provides a way to schedule CAN in a time-triggered fashion with support for event-triggered traffic as well. In FTT-CAN, time is partitioned into Elementary Cycles (ECs), which are initiated by a special message, the Trigger Message (TM). This message triggers the start of the EC and contains the schedule for the time-triggered traffic that shall be sent within this EC. The schedule is calculated and sent by a master node. FTT-CAN supports both periodic and aperiodic traffic by dividing the EC into two parts. In the first part, the asynchronous window, the aperiodic messages are sent; in the second part, the synchronous window, traffic is sent in a time-triggered fashion according to the schedule delivered by the TM. FTT-CAN is still mainly an academic communication protocol. 2.5.2.4 Time-Triggered Protocol The Time-Triggered Protocol Class C, TTP/C [10,63], is a TDMA-based communication network intended for truly hard real-time communication. TTP/C is available for network speeds of up to 25 Mbps. TTP/C is part of the Time-Triggered Architecture (TTA) by Kopetz [10,64], which is designed for safety-critical applications. TTP/C has support for fault tolerance, clock synchronization, membership services, fast error detection, and consistency checks. Several major automotive companies support this protocol.
For less hard RTSs (e.g., soft RTSs), there exists a scaled-down version of TTP/C called TTP/A [10]. 2.5.2.5 Local Interconnect Network The Local Interconnect Network (LIN) [65] was developed by the LIN Consortium (including Audi, BMW, Daimler–Chrysler, Motorola, Volvo, and VW) as a low-cost alternative for small networks. LIN is cheaper than, for example, CAN. LIN uses the UART/SCI interface hardware, and transmission speeds of up to 20 Kbps are possible. Among the nodes in the network, one node is the master node, responsible for the synchronization of the bus. The traffic is sent in a time-triggered fashion. 2.5.2.6 FlexRay FlexRay [66] was proposed in 1999 by several major automotive manufacturers, for example, Daimler–Chrysler and BMW, as a competitive next-generation fieldbus replacing CAN. FlexRay is a real-time communication network that provides both synchronous and asynchronous transmissions with network speeds of up to 10 Mbps. For the synchronous traffic FlexRay uses TDMA, providing deterministic data transmissions with a bounded delay. For the asynchronous traffic mini slotting is used. Compared with CAN, FlexRay is more suitable for the dependable application domain, including support for redundant transmission channels, bus guardians, and fast error detection and signaling. 2.5.2.7 ARINC 629 For avionic and aerospace communication systems, the ARINC 429 [67] standard and its newer successor ARINC 629 [67] are the most commonly used communication systems today. ARINC 629 supports both periodic and sporadic communication. The bus is scheduled in bus cycles, which in turn are divided into two parts. In the first part periodic traffic is sent, and in the second part the sporadic traffic is sent. The arbitration of messages is based on collision avoidance (i.e., CSMA/CA) using mini slotting. Network speeds are as high as 2 Mbps.
2.5.2.8 Profibus Profibus [68] is used in process automation and robotics. There are three different versions of Profibus: (1) Profibus-DP, optimized for speed and low cost, (2) Profibus-PA, designed for process automation, and (3) Profibus-FMS, a general-purpose version of Profibus. Profibus provides master–slave communication together with token mechanisms. Profibus is available with data rates up to 12 Mbps. 2.5.2.9 Train Communication Network The Train Communication Network (TCN) [69] is widely used in trains, and implements the IEC 61375 standard as well as the IEEE 1473 standard. TCN is composed of two networks: the Wire Train Bus (WTB) and the Multifunction Vehicle Bus (MVB). The WTB is the network used to connect the whole train, that is, all vehicles of the train. The network data rate is up to 1 Mbps. The MVB is the network used within one vehicle. Here the maximum data rate is 1.5 Mbps. Both the WTB and the MVB are scheduled in cycles called basic periods. Each basic period consists of a periodic phase and a sporadic phase. Hence, both periodic and sporadic traffic are supported. The difference between the WTB and the MVB (apart from the data rate) is the length of the basic periods (1 or 2 msec for the MVB and 25 msec for the WTB). 2.5.2.10 WorldFIP WorldFIP [70] is a very popular communication network in train control systems. WorldFIP is based on the Producer–Distributor–Consumers (PDC) communication model. Currently, network speeds are as high as 5 Mbps. The WorldFIP protocol defines an application layer that includes PDC and messaging services.
2.5.3 Ethernet for Real-Time Communication In parallel with the search for the holy grail of real-time communication, Ethernet has established itself as the de facto standard for non-real-time communication. Comparing networking solutions for automation networks and office networks, fieldbuses were the choice for the former, while Ethernet developed as the standard for office automation and, owing to its popularity, prices on networking solutions dropped. Ethernet was not originally developed for real-time communication, since the original intention with Ethernet was to maximize throughput (bandwidth). Nowadays, however, a big effort is being made to provide real-time communication using Ethernet. The biggest challenge is to provide real-time guarantees using standard Ethernet components. The reason why Ethernet is not very suitable for real-time communication is its handling of collisions on the network. Several proposals to minimize or eliminate the occurrence of collisions on Ethernet have been made. The following sections present some of these proposals. 2.5.3.1 TDMA A simple solution would be to eliminate the occurrence of collisions on the network. This has been explored by, for example, Kopetz et al. [71], using a TDMA protocol on top of Ethernet. 2.5.3.2 Usage of Tokens Another way to eliminate the occurrence of collisions is the usage of tokens. Token-based solutions [72,73] on Ethernet also eliminate collisions, but are not compatible with standard hardware. A token-based communication protocol is a way to provide real-time guarantees on most types of networks, because such protocols are deterministic in their behavior; however, a dedicated network is required, that is, all nodes sharing the network must obey the token protocol. Examples of token-based protocols are the Timed Token Protocol (TTP) [74] and the IEEE 802.5 Token Ring Protocol.
2.5.3.3 Modified Collision Resolution Algorithm A different approach is to modify the collision resolution algorithm [75,76]. Using standard Ethernet controllers, the modified collision resolution algorithm is nondeterministic. To make the modified collision resolution algorithm deterministic, a major modification of the Ethernet controllers is required [77]. 2.5.3.4 Virtual Time and Window Protocols Another solution for real-time communication using Ethernet is the Virtual Time CSMA (VTCSMA) [78–80] protocol, where packets are delayed in a deterministic way in order to eliminate the occurrence of collisions. Moreover, Window Protocols [81] use a global window (a synchronized time interval) that also removes collisions. The window protocol is more dynamic and somewhat more efficient in its behavior than the VTCSMA approach. 2.5.3.5 Master/Slave A fairly straightforward way of providing real-time traffic on Ethernet is to use a master/slave approach. As a part of the FTT framework [82], FTT Ethernet [83] is proposed as a master/multislave protocol. At the cost of some computational overhead at each node in the system, timely delivery of messages on Ethernet is provided. 2.5.3.6 Traffic Smoothing The most recent work, requiring no modifications to the hardware or networking topology (infrastructure), is the usage of traffic smoothing. Traffic smoothing can be used to eliminate bursts of traffic [84,85], which have a severe impact on the timely delivery of message packets on Ethernet. By keeping the network load below a given threshold, a probabilistic guarantee of message delivery can be provided. Hence, traffic smoothing could be a solution for soft RTSs. 2.5.3.7 Black Bursts Black burst [86] implements a collision avoidance protocol on Ethernet. When a station wants to transmit a message, the station waits until the network is idle, that is, until no traffic is being transmitted.
Then, to avoid collisions, the transmitting station starts jamming the network. Several transmitting stations might start jamming the network at the same time. However, each station uses a jamming signal of unique length, always allowing a unique station to win. Winning means that once the station's jamming signal is over, the network is idle, that is, no other stations are jamming the network. If this is the case, the message is transmitted. Otherwise, a losing station waits until the network is idle again, and the mechanism starts over. Hence, no message collisions will occur on the network. 2.5.3.8 Switches Finally, a completely different approach to achieving real-time communication using Ethernet is to change the infrastructure. One way of doing this is to construct the Ethernet using switches to separate collision domains. By using these switches, a collision-free network is provided. However, this requires new hardware supporting the IEEE 802.1p standard. Therefore it is not as attractive a solution for existing networks as, for example, traffic smoothing.
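The traffic smoothing described above is commonly realized with a credit-based regulator in each node's protocol stack. The sketch below shows a token-bucket variant (the rate and depth values are illustrative assumptions; a real smoother sits below the transport layer and works on actual frames):

```python
class TrafficSmoother:
    """Token-bucket regulator: credit accrues at `rate` bytes/sec up to a
    maximum of `depth` bytes.  A frame is released only if enough credit is
    available, which caps the node's average load on the shared network."""

    def __init__(self, rate, depth):
        self.rate, self.depth = rate, depth
        self.credit, self.last = depth, 0.0

    def try_send(self, now, frame_bytes):
        """Return True if the frame may enter the network at time `now`,
        False if it must be delayed (i.e., the burst is smoothed out)."""
        self.credit = min(self.depth,
                          self.credit + (now - self.last) * self.rate)
        self.last = now
        if frame_bytes <= self.credit:
            self.credit -= frame_bytes
            return True
        return False
```

Because the bound on offered load is statistical rather than absolute, this yields the probabilistic (soft real-time) delivery guarantee mentioned above, not a hard one.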
2.5.4 Wireless Communication There are no commercially available wireless communication protocols providing real-time guarantees.3 Two of the more commonly used wireless protocols today are IEEE 802.11 (WLAN) and Bluetooth. However, these protocols do not provide the temporal guarantees needed for hard real-time communication. Today, a big effort is being made (as with Ethernet) to provide real-time guarantees for wireless communication, possibly by using either WLAN or Bluetooth.
3 Bluetooth provides real-time guarantees limited to streaming voice traffic.
2.6 Analysis of RTSs The most important property to analyze in an RTS is its temporal behavior, that is, the timeliness of the system. The analysis should provide strong evidence that the system performs as intended at the correct time. This section will give an overview of the basic properties that are analyzed in an RTS. The section concludes with a presentation of trends and tools in the area of RTS analysis.
2.6.1 Timing Properties Timing analysis is a complex problem. Not only are the techniques used sometimes complicated, but the problem itself is also elusive: for instance, what is the meaning of the term “program execution time”? Is it the average time to execute the program, or the worst possible time, or does it mean some form of “normal” execution time? Under what conditions does a statement regarding program execution times apply? Is the program delayed by interrupts or higher-priority tasks? Does the time include waiting for shared resources? To straighten out some of these questions, and to be able to study some existing techniques for timing analysis, we categorize timing analysis into three major types. Each type has its own purpose, benefits, and limitations. The types are listed below. 2.6.1.1 Execution Time This refers to the execution time of a single task (or program, or function, or any other unit of single-threaded sequential code). The result of an execution-time analysis is the time (i.e., the number of clock cycles) the task takes to execute when executing undisturbed on a single CPU; that is, the result should not account for interrupts, preemption, background DMA transfers, DRAM refresh delays, or any other type of interfering background activity. At first glance, leaving out all types of interference from the execution-time analysis would seem to give unrealistic results. However, the purpose of execution-time analysis is not to deliver estimates of “real-world” timing when executing the task. Instead, its role is to find out how much computing resource is needed to execute the task. (Hence, background activities that are not related to the task should not be accounted for.) There are several different types of execution times that can be of interest: • Worst-case execution time (WCET). This is the worst possible execution time a task could exhibit, or equivalently, the maximum amount of computing resources required to execute the task.
The WCET should include any possible atypical task execution, such as exception handling or cleanup after abnormal task termination. • Best-case execution time (BCET). In some types of real-time analysis not only the WCET is used but, as we will describe later, also knowledge about the BCET of tasks. • Average execution time (AET). The AET can be useful in calculating throughput figures for a system. However, for most RTS analysis the AET is of less importance, simply because a reasonable approximation of the average case is easy to obtain during testing (where, typically, the average system behavior is studied). Also, knowing only the average, without other statistical parameters such as the standard deviation or the distribution function, makes statistical analysis difficult. For analysis purposes a more pessimistic metric, such as the 95th percentile, would be more useful. However, analytical techniques using statistical metrics of execution time are scarce and not very well developed. 2.6.1.2 Response Time The response time of a task is the time it takes from the invocation to the completion of the task; in other words, the time from when the task is first placed in the OS's ready queue to the time when it is removed from the running state and placed in the idle or sleeping state.
Typically, for analysis purposes it is assumed that a task does not voluntarily suspend itself during its execution. That is, the task may not call primitives such as sleep() or delay(). However, involuntary suspension, such as blocking on shared resources, is allowed; that is, primitives such as get_semaphore() and lock_database_tuple() are allowed. When a program voluntarily suspends itself, that program should be broken down into two (or more) analysis tasks. The response time is typically a system-level property, in that it includes interference from other, unrelated, tasks and parts of the system. The response time also includes delays caused by contention on shared resources. Hence, the response time is only meaningful when considering a complete system, or, in distributed systems, a complete node. 2.6.1.3 End-to-End Delay The described “execution time” and “response time” are useful concepts since they are relatively easy to understand and have well-defined scopes. However, when trying to establish the temporal correctness of a system, knowing the WCET and/or the response times of tasks is often not enough. Typically, the correctness criterion is stated using end-to-end latency timing requirements, for instance, an upper bound on the delay between the input of a signal and the output of a response. In a given implementation there may be a chain of events taking place between the input of a signal and the output of a response. For instance, one task may be in charge of reading the input and another task of generating the response, and the two tasks may have to exchange messages on a communications link before the response can be generated. The end-to-end timing denotes the timing of externally visible events. 2.6.1.4 Jitter
The term jitter is used as a metric for variability in time. For instance, the jitter in execution time of a task is the difference between the task's BCET and WCET. Similarly, the response-time jitter of a task is the difference between its best-case response time and its worst-case response time. Control algorithms often have requirements that the jitter of the output should be limited; hence, jitter is sometimes a metric equally important as the end-to-end delay. Input to the system can also have jitter. For instance, an interrupt that is expected to be periodic may have a jitter (owing to some imperfection in the process generating the interrupt). In this case the jitter value is used as a bound on the maximum deviation from the ideal period of the interrupt. Figure 2.3 illustrates the relation between the period and the jitter for this example. Note that jitter should not accumulate over time: for our example, even though two successive interrupts could arrive closer than one period apart, in the long run the average interrupt interarrival time will be that of the period. In the above list of types of time, we mentioned only the time to execute programs. However, in many RTSs other timing properties may also exist. These include delays on communication networks and on other resources, such as hard disk drives, which may cause delays and need to be analyzed. The introduced times can all be mapped onto different types of resources; for instance, the WCET of a task corresponds to the maximum size of a message to be transmitted, and the response time of a message is defined analogously to the response time of a task.
FIGURE 2.3 Jitter used as a bound on variability in periodicity. (Figure labels: Period, Jitter, Earliest time, Latest time.)
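The non-accumulating interpretation of input jitter illustrated in Figure 2.3 can be expressed directly: each arrival must lie within the jitter bound of its ideal periodic instant, even though consecutive gaps may vary (a sketch; the arrival times in the test are made-up numbers):

```python
def within_jitter_bound(arrivals, period, jitter):
    """True if every arrival k lies within +/- jitter of its ideal time
    k * period (measured from the first arrival).  Deviations are measured
    against the ideal periodic grid, so jitter cannot accumulate over
    time even if successive gaps differ from the period."""
    t0 = arrivals[0]
    return all(abs((t - t0) - k * period) <= jitter
               for k, t in enumerate(arrivals))
```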
2.6.2 Methods for Timing Analysis When analyzing hard RTSs it is essential that the estimates obtained during timing analysis are safe. An estimate is considered safe if it is guaranteed not to be an underestimation of the actual worst-case time. It is also important that the estimate is tight, meaning that the estimated time is close to the actual worst-case time. For the previously defined types of timings (Section 2.6.1), different methods are available; they are presented in the following sections. 2.6.2.1 Execution-Time Estimation For real-time tasks the WCET is the most important execution-time measure to obtain. Sadly, however, it is also often the most difficult measure to obtain. Methods to obtain the WCET of a task can be divided into two categories: (1) static analysis and (2) dynamic analysis. Dynamic analysis is essentially equivalent to testing (i.e., executing the task on the target hardware) and has all the drawbacks/problems that testing exhibits (such as being tedious and error prone). One major problem with dynamic analysis is that it does not produce safe results: the measured result can never exceed the true WCET, and it is very difficult to make sure that the estimated WCET really is the true WCET. Static analysis, on the other hand, can give guaranteed safe results. Static analysis is performed by analyzing the code (source and/or object code is used) and basically counting the number of clock cycles that the task may use to execute (in the worst possible case). Static analysis uses models of the hardware to predict the execution time of each instruction. Hence, for modern hardware it may be very difficult to produce static analyzers that give good results. One source of pessimism in the analysis (i.e., overestimation) is hardware caches: whenever an instruction or data item cannot be guaranteed to reside in the cache, a static analyzer must assume a cache miss.
Since modeling the exact state of caches (sometimes at multiple levels), branch predictors, etc. is very difficult and time consuming, few tools exist that give adequate results for advanced architectures. It is also difficult to perform the program-flow and data analysis that exactly calculates, for example, the number of times a loop iterates or the input parameters of procedures. Methods for good hardware and software modeling do exist in the research community; however, combining these methods into good-quality tools has proven tedious.

2.6.2.2 Schedulability Analysis

The goal of schedulability analysis is to determine whether or not a system is schedulable. A system is deemed schedulable if it is guaranteed that all task deadlines will always be met. For statically scheduled (table-driven) systems, response times are trivially given by the static schedule. However, for dynamically scheduled systems (such as fixed-priority or deadline scheduling) more advanced techniques have to be used. There are two main classes of schedulability analysis techniques: (1) response-time analysis and (2) utilization analysis. As the name suggests, a response-time analysis calculates a (safe) estimate of the worst-case response time of a task. That estimate can then be compared with the deadline of the task; if it does not exceed the deadline, the task is schedulable. Utilization analysis, in contrast, does not directly derive response times for tasks; rather, it gives a boolean result for each task telling whether or not the task is schedulable. This result is based on the fraction of CPU utilization of a relevant subset of the tasks, hence the term utilization analysis. Both analyses are based on similar types of task models. However, the task models used for analysis are typically not the task models provided by commercial RTOSs. This problem can be resolved by mapping one or more OS tasks onto one or more analysis tasks.
However, this mapping has to be performed manually and requires an understanding of the limitations of the analysis task model and the analysis technique used.
Real-Time in Embedded Systems
2.6.2.3 End-to-End Delay Estimation

The typical way to obtain an end-to-end delay estimate is to calculate the response time for each task/message in the end-to-end chain and sum these response times. When using a utilization-based analysis technique (in which no response times are calculated) one has to resort to using the task/message deadlines as safe upper bounds on the response times. However, when analyzing distributed RTSs, it may not be possible to calculate all response times in one pass. The reason is that delays on one node will lead to jitter on another node, and this jitter may in turn affect the response times on that node. Since jitter can propagate in several steps between nodes, in both directions, there may not exist a right order in which to analyze the nodes. (If A sends a message to B, and B sends a message to A, which node should one analyze first?) Solutions to this type of problem are called holistic schedulability analysis methods (since they consider the whole system). The standard method for holistic response-time analysis is to repeatedly calculate the response times for each node (and update the jitter values in the nodes affected by the node just analyzed) until the response times no longer change (i.e., a fix point is reached).

2.6.2.4 Jitter Estimation

To calculate the jitter one needs to perform not only a worst-case analysis (of, for instance, response time or end-to-end delay) but also a best-case analysis. However, even though best-case analysis techniques are often conceptually similar to worst-case analysis techniques, little attention has been paid to best-case analysis. One reason for not spending too much effort on best-case analysis is that it is quite easy to make a conservative estimate of the best case: the best-case time is never less than zero (0). Hence, many tools simply assume that the BCET (for instance) is zero, whereas great efforts are spent analyzing the WCET.
However, it is important to have tight estimates of the jitter, and to keep the jitter as low as possible. It has been shown that the number of execution paths a multitasking RTS can take increases dramatically as jitter increases [87]. Unless the number of possible execution paths is kept as low as possible, it becomes very difficult to achieve good coverage during testing.
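The holistic fix-point iteration sketched in Section 2.6.2.3 can be illustrated in a few lines. The following is a deliberately simplified toy model, not a method prescribed by this chapter: all task parameters are invented, the jitter-aware response-time equation R = J + w is used, and a receiver's release jitter is approximated by the sender's response-time variation (with the sender's best case approximated by its C alone).

```python
from math import ceil

def response_time(task, hp):
    # Fix-point iteration of w = C + sum over hp of ceil((w + J_j)/T_j) * C_j;
    # the response time is then R = J + w (jitter-aware analysis).
    w = task["C"]
    while True:
        w_new = task["C"] + sum(ceil((w + h["J"]) / h["T"]) * h["C"] for h in hp)
        if w_new == w:
            return task["J"] + w
        w = w_new

def holistic_analysis(nodes, links):
    # Repeatedly analyze every node and propagate jitter over the links
    # until no jitter value changes (i.e., a fix point is reached).
    changed = True
    while changed:
        changed = False
        for tasks in nodes:                     # tasks listed highest priority first
            for i, t in enumerate(tasks):
                t["R"] = response_time(t, tasks[:i])
        for sender, receiver in links:          # receiver inherits sender's variation
            jitter = sender["R"] - sender["C"]  # best case approximated by C alone
            if jitter != receiver["J"]:
                receiver["J"], changed = jitter, True

# Two nodes: a1/a2 on node 1; b on node 2 is released by a message from a2.
a1 = {"C": 2, "T": 10, "J": 0}
a2 = {"C": 4, "T": 20, "J": 0}
b  = {"C": 3, "T": 20, "J": 0}
holistic_analysis([[a1, a2], [b]], [(a2, b)])
```

In this toy run, a2's worst-case response time (6) exceeds its best case (4), so b inherits a release jitter of 2 and gets R = 5; a second pass over both nodes confirms that a fix point has been reached.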
2.6.3 Example of Analysis

In this section we give simple examples of schedulability analysis. We show a very simple example of how a set of tasks running on a single CPU can be analyzed, and we also give an example of how the response times for a set of messages sent on a CAN bus can be calculated.

2.6.3.1 Analysis of Tasks

This example is based on task models that are some 30 years old and is intended to give the reader a feeling for how these types of analysis work. Today's methods allow for far richer and more realistic task models, with a resulting increase in the complexity of the equations used (hence they are not suitable for our example). In the first example we analyze the small task set described in Table 2.2, where T, C, and D denote each task's period, WCET, and deadline, respectively. In this example T = D for all tasks, and priorities have been assigned in RM order, that is, the highest rate gives the highest priority.
TABLE 2.2 Example Task Set for Analysis

Task   T    C    D    Prio
X      30   10   30   High
Y      40   10   40   Medium
Z      52   12   52   Low
For the task set in Table 2.2 the original analysis techniques of Liu and Layland [5] and Joseph and Pandya [88] are applicable, and we can perform both utilization-based and response-time-based schedulability analysis. We start with the utilization-based analysis; for this task model Liu and Layland's result is that a task set of n tasks is schedulable if its total utilization, Utot, is bounded by the following equation:

    Utot ≤ n(2^(1/n) − 1)

Table 2.3 shows the utilization calculations performed for the schedulability analysis. For our example task set n = 3 and the bound is approximately 0.78. However, the utilization (Utot = Σ_{i=1..n} (Ci/Ti)) for our task set is 0.81, which exceeds the bound. Hence, the task set fails the RM test and cannot be deemed schedulable.

Joseph and Pandya's response-time analysis allows us to calculate the worst-case response time, Ri, for each task i in our example (Table 2.2). This is done using the following formula:

    Ri = Ci + Σ_{j∈hp(i)} ⌈Ri/Tj⌉ Cj    (2.1)

where hp(i) denotes the set of tasks with priority higher than that of task i. The observant reader may have noticed that equation 2.1 is not in closed form, in that Ri is not isolated on the left-hand side of the equality. As a matter of fact, Ri cannot be isolated on the left-hand side of the equality; instead, equation 2.1 has to be solved using fix-point iteration. This is done with the recursive formula in equation 2.2, starting with Ri^0 = 0 and terminating when a fix point has been reached (i.e., when Ri^(m+1) = Ri^m):

    Ri^(m+1) = Ci + Σ_{j∈hp(i)} ⌈Ri^m/Tj⌉ Cj    (2.2)
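As a sketch, the RM utilization test and the fix-point iteration of equation 2.2 can be implemented in a few lines of Python. The task data mirror Table 2.2; note that C = 12 is used for task Z, the value consistent with U_Z = 0.23 in Table 2.3 and R_Z = 52 in Table 2.4. Starting the iteration from Ci (rather than 0) converges to the same fix point.

```python
from math import ceil

# (name, T, C), listed in RM order: highest rate = highest priority
tasks = [("X", 30, 10), ("Y", 40, 10), ("Z", 52, 12)]

# Liu and Layland's utilization-based test
n = len(tasks)
u_tot = sum(C / T for _, T, C in tasks)
bound = n * (2 ** (1 / n) - 1)
rm_ok = u_tot <= bound          # fails here: 0.81 > 0.78

# Joseph and Pandya's response-time analysis, iterated as in equation 2.2
def response_time(i):
    r = tasks[i][2]             # start the iteration from C_i
    while True:
        r_new = tasks[i][2] + sum(ceil(r / T) * C for _, T, C in tasks[:i])
        if r_new == r:          # fix point reached
            return r
        r = r_new

responses = {name: response_time(i) for i, (name, _, _) in enumerate(tasks)}
```

This yields R = 10, 20, and 52 for X, Y, and Z, matching Table 2.4, even though the utilization test fails.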
For our example task set, Table 2.4 shows the results of calculating equation 2.1. From the table we can conclude that no deadlines will be missed and that the system is schedulable.

Remarks
As we could see for our example task set in Table 2.2, the utilization-based test could not deem the task set schedulable whereas the response-time-based test could. This situation is symptomatic of the relation between utilization-based and response-time-based schedulability tests. That is, response-time-based tests find more task sets schedulable than utilization-based tests.

TABLE 2.3 Result of RM Test

Task   T    C    D    Prio     U
X      30   10   30   High     0.33
Y      40   10   40   Medium   0.25
Z      52   12   52   Low      0.23
                      Total    0.81
                      Bound    0.78
TABLE 2.4 Result of Response-Time Analysis for Tasks

Task   T    C    D    Prio     R    R ≤ D
X      30   10   30   High     10   Yes
Y      40   10   40   Medium   20   Yes
Z      52   12   52   Low      52   Yes
TABLE 2.5 Example CAN-Message Set

Message   T      S   D     Id
X         350    8   300   00010
Y         500    6   400   00100
Z         1000   5   800   00110
TABLE 2.6 Result of Response-Time Analysis for CAN

Message   T      S   D     Id      Prio     C     w     R     R ≤ D
X         350    8   300   00010   High     130   130   260   Yes
Y         500    6   400   00100   Medium   111   260   371   Yes
Z         1000   5   800   00110   Low      102   612   714   Yes
However, as also shown by the example, the response-time-based test needs to perform more calculations than the utilization-based test. For this simple example the extra computational complexity of the response-time test is insignificant. However, when using modern task models (that are capable of modeling realistic systems) the computational complexity of response-time-based tests is significant. Unfortunately, for these advanced models, utilization-based tests are not always available.

2.6.3.2 Analysis of Messages

In our second example we show how to calculate the worst-case response times for a set of periodic messages sent over the CAN bus (CAN is described in Section 2.5.2). We use a response-time analysis technique similar to the one we used when we analyzed the task set in Table 2.2. In this example our message set is given in Table 2.5, where T, S, D, and Id denote each message's period, data size (in bytes), deadline, and CAN identifier, respectively. (The time unit used in this example is "bit time," that is, the time it takes to send one bit. For a 1 Mbit/sec CAN bus this means that 1 time unit is 10^-6 sec.)

Before we attack the problem of calculating response times we extend Table 2.5 with two columns. First, we need the priority of each message; in CAN this is given by the identifier: the lower the numerical value, the higher the priority. Second, we need to know the worst-case transmission time of each message. The transmission time is given partly by the message data size, but we also need to add time for the frame header and for any stuff bits.4 The formula to calculate the transmission time, Ci, for a message i containing Si bytes of payload data is given below:

    Ci = 8Si + 47 + ⌊(34 + 8Si − 1)/4⌋

In Table 2.6 the two columns Prio and C show the priority assignment and the transmission times for our example message set. Now we have all the data needed to perform the response-time analysis. However, since CAN is a nonpreemptive resource, the structure of the equation is slightly different from that of equation 2.1, which we used for the analysis of tasks. The response-time equation for CAN is given in equation 2.3:

    Ri = wi + Ci
    wi = Bi + Σ_{∀j∈hp(i)} ⌈(wi + 1)/Tj⌉ Cj    (2.3)

4 CAN adds stuff bits, if necessary, to avoid the two reserved bit patterns 000000 and 111111. These stuff bits are never seen by the CAN user but have to be accounted for in the timing analysis.
In equation 2.3, Bi denotes the blocking time originating from a lower-priority message already in transmission when message i enters arbitration (Bi ≤ 135, the transmission time of the largest possible message), and hp(i) denotes the set of messages with higher priority than message i. Note that (similar to equation 2.1) wi is not isolated on the left-hand side of the equation, and its value has to be calculated using fix-point iteration (compare with equation 2.2). Applying equation 2.3 we can now calculate the worst-case response times for our example messages. In Table 2.6 the two columns w and R show the results of the calculations, and the final column shows the schedulability verdict for each message. As we can see from Table 2.6, our example message set is schedulable, meaning that the messages will always be transmitted before their deadlines. Note that this analysis was made assuming that there will not be any retransmissions of broken messages. Normally, CAN automatically retransmits any message that has been broken owing to interference on the bus. To account for such automatic retransmissions an error model needs to be adopted and the response-time equation adjusted accordingly, see, for example, Reference 59.
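The fix-point iteration of equation 2.3 can be sketched as follows. The transmission times C are taken directly from Table 2.6 rather than recomputed, and the blocking term is set to 130 — the longest message in this particular set, which reproduces the w and R columns of the table (in general Bi ≤ 135, as noted above).

```python
from math import ceil

# (name, T, C), listed highest priority first (lowest CAN identifier first)
msgs = [("X", 350, 130), ("Y", 500, 111), ("Z", 1000, 102)]
B = 130   # blocking by the longest message in this set

def can_response_time(i):
    # Nonpreemptive analysis: iterate the queuing delay w to a fix point,
    # then add the message's own transmission time C_i.
    w = B
    while True:
        w_new = B + sum(ceil((w + 1) / T) * C for _, T, C in msgs[:i])
        if w_new == w:
            return w + msgs[i][2]
        w = w_new

results = {name: can_response_time(i) for i, (name, _, _) in enumerate(msgs)}
```

This reproduces R = 260, 371, and 714 bit times for X, Y, and Z, matching Table 2.6.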
2.6.4 Trends and Tools

As discussed earlier, and also illustrated by our example in Table 2.2, there is a mismatch between analytical task models and the task models provided by commonly used RTOSs. One of the basic problems is that there is no one-to-one mapping between analysis tasks and RTOS tasks. In fact, for many systems there is an N-to-N mapping between the task types. For instance, an interrupt handler may have to be modeled as several different analysis tasks (one analysis task for each type of interrupt it handles), and one OS task may have to be modeled as several analysis tasks (for instance, one analysis task per call to a sleep() primitive). Also, current schedulability analysis techniques cannot adequately model types of task synchronization other than locking/blocking on shared resources. Abstractions such as message queues are difficult to include in the schedulability analysis.5 Furthermore, tools to estimate the WCET are also scarce. Currently only two tools that give safe WCET estimates are commercially available [90,91]. These problems have led to a low penetration of schedulability analysis in industrial software-development processes. However, in isolated domains, such as real-time networks, some commercial tools based on real-time analysis do exist. For instance, Volcano [92,93] provides tools for the CAN bus that allow system designers to specify signals on an abstract level (giving signal attributes such as size, period, and deadline) and automatically derive a mapping of signals to CAN messages where all deadlines are guaranteed to be met. On the software side, tools provided by, for instance, TimeSys [94], Arcticus Systems [13], and TTTech [10] can provide system-development environments with timing analysis as an integrated part of the tool suite. However, all these tools require that the software-development process be under the complete control of the respective tool. This requirement has limited the use of these tools.
The widespread use of UML [22] in software design has led to some specialized UML products for real-time engineering [23,24]. However, these products, as of today, do not support timing analysis of the designed systems. There is, however, recent work within the OMG that specifies a profile "Schedulability, Performance, and Time" (SPT) [95], which allows specification of both timing properties and requirements in a standardized way. This will in turn lead to products that can analyze UML models conforming to the SPT profile. The SPT profile has, however, not been received without criticism. Critique has mainly come from researchers active in the timing-analysis field, claiming both that the profile is not precise enough and that some important concepts are missing. For instance, the Universidad de Cantabria has instead developed

5 Techniques to handle more advanced models include timed logic and model checking. However, the computational and conceptual complexity of these techniques has limited their industrial impact. There are, however, examples of commercial tools for this type of verification, see, for example, Reference [89].
the MAST–UML profile and an associated MAST tool for analyzing MAST–UML models [96,97]. MAST allows modeling of advanced timing properties and requirements, and the tool also provides state-of-the-art timing analysis techniques.
2.7 Component-Based Design of RTS

Component-based design (CBD) is a current trend in software engineering. In the desktop area, component technologies such as COM [15], .NET [16], and JavaBeans [17] have gained widespread use. These technologies give substantial benefits, in terms of reduced development time and software complexity, when designing complex and/or distributed systems. However, for RTSs these, and other, desktop-oriented component technologies do not suffice. As stated before, the main challenge of designing RTSs is the need to consider issues that do not typically apply to general-purpose computing systems. These issues include:

• Constraints on extra-functional properties, such as timing, QoS, and dependability.
• The need to statically predict (and verify) these extra-functional properties.
• Scarce resources, including processing power, memory, and communication bandwidth.

In the commercially available component technologies today, there is little or no support for these issues. Also on the academic scene, there are no readily available solutions that satisfactorily handle all of these issues. In the remainder of this chapter we discuss how these issues can be addressed in the context of CBD. In doing so, we also highlight the challenges in designing a CBD process and component technology for the development of RTSs.
2.7.1 Timing Properties and CBD

In general, for systems where timing is crucial there will necessarily be at least some global timing requirements that have to be met. If the system is built from components, this implies the need for timing parameters/properties of the components and some proof that the global timing requirements are met. In Section 2.6 we introduced the following four types of timing properties:

• Execution time
• Response time
• End-to-end delay
• Jitter
So, how are these related to the use of a CBD methodology?

2.7.1.1 Execution Time

For a component used in a real-time context, an execution-time measure will have to be derived. This is, as discussed in Section 2.6, not an easy or satisfactorily solved problem. Furthermore, since execution time is inherently dependent on the target hardware, and since reuse is the primary motivation for CBD, it is highly desirable that execution times for several targets be available (or, alternatively, that the execution time for new hardware platforms be automatically derivable). The nature of the applied component model may also make execution-time estimation more or less complex. Consider, for instance, a client–server-oriented component model, with a server component that provides services of different types, as illustrated in Figure 2.4(a). What does "execution time" mean for such a component? Clearly, a single execution time is not appropriate; rather, the analysis will require a set of execution times related to servicing different requests. On the other hand, for a simple port-based object component model [21], in which components are connected in sequence to form periodically executing transactions (illustrated in Figure 2.4[b]), it could be possible to use a single execution-time measure,
FIGURE 2.4 (a) A complex server component, providing multiple services to multiple users, and (b) a simple chain of components implementing a single thread of control.
FIGURE 2.5 Tasks and components: (a) one-to-one correspondence, (b) one-to-many correspondence, (c) many-to-one correspondence, (b + c) many-to-many correspondence, and (d) irregular correspondence.
corresponding to the execution time required for reading the values at the input ports, performing the computation, and writing values to the output ports.

2.7.1.2 Response Time

Response times denote the time from invocation to completion of tasks, and response-time analysis is the activity of statically deriving response-time estimates. The first question to ask from a CBD perspective is: what is the relation between a "task" and a "component"? This is obviously highly dependent on the component model used. As illustrated in Figure 2.5(a), there could be a one-to-one mapping between components and tasks, but in general, several components could be implemented in one task (Figure 2.5[b]) or one component could be implemented by several tasks (Figure 2.5[c]); hence, there is a many-to-many relation between components and tasks. In principle, there could even be a more irregular correspondence between components and tasks, as illustrated in Figure 2.5(d). Furthermore, in a distributed system there could be a many-to-many relation between components and processing nodes, making the situation even more complicated.

Once we have sorted out the relation between tasks and components, we can calculate the response times of tasks, given that we have an appropriate analysis method for the execution paradigm used and that relevant execution-time measures are available. However, relating these response times to components and to the application-level timing requirements may not be straightforward; this is an issue for the subsequent end-to-end analysis.

Another issue with respect to response times is how to handle communication delays in distributed systems. In essence there are two ways to model the communication, as depicted in Figure 2.6. In Figure 2.6(a) the network is abstracted away and the intercomponent communication is handled by the framework.
In this case, response-time analysis is made more complicated since it must account for different delays in intercomponent communication, depending on the physical location of components.
FIGURE 2.6 Components and communication delays: (a) communication delays can be part of the intercomponent communication properties, and (b) communication delays can be timing properties of components.
In Figure 2.6(b), on the other hand, the network is modeled as a component itself, and network delays can be modeled as delays in any other component (and intercomponent communication can be considered instantaneous). However, the choice of how to model network delays also has an impact on the software-engineering aspects of the component model. In Figure 2.6(a) the communication is completely hidden from the components (and the software engineers), giving optimizing tools many degrees of freedom with respect to component allocation, signal mapping, and scheduling-parameter selection. In Figure 2.6(b), in contrast, the communication is explicitly visible to the components (and the software engineers), putting a larger burden on the software engineers to manually optimize the system.

2.7.1.3 End-to-End Delay

End-to-end delays are application-level timing requirements relating the occurrence in time of one event to the occurrence of another event. As pointed out earlier, how to relate such requirements to the lower-level timing properties of components is highly dependent on both the component model and the timing analysis model. When designing RTSs using CBD, the component structure gives excellent information about the points of interaction between the RTS and its environment. Since end-to-end delays concern timing estimates of, and timing requirements on, such interactions, CBD gives a natural way of stating timing requirements in terms of signals received or generated. (In traditional RTS development, the reception and generation of signals is embedded in the code of tasks and is not externally visible, making it difficult to relate response times of tasks to end-to-end requirements.)

2.7.1.4 Jitter

Jitter is an important timing parameter that is related to execution time and that will affect response times and end-to-end delays. There may also be specific jitter requirements. Jitter has the same relation to CBD as does end-to-end delay.
2.7.1.5 Summary of Timing and CBD

As described earlier, there is no single solution for how to apply CBD to RTSs. In some cases timing analysis is made more complicated by CBD, for example, when using client–server-oriented component models, whereas in other cases CBD actually helps timing analysis, for example, by facilitating the identification of interfaces/events associated with end-to-end requirements. Further, the characteristics of the component model have a great impact on the analyzability of component-based RTSs. For instance, interaction patterns such as client–server do not map well to established analysis methods and make analysis difficult, whereas pipes-and-filters-based patterns (such as the port-based objects component model [21]) map very well to existing analysis methods and allow for tight analysis of timing behavior. The execution semantics of the component model also has an impact on analyzability. The execution semantics restricts how components can be mapped to tasks; for example, in the CORBA Component Model [14] each component is assumed to have its own thread of execution, making it difficult
to map multiple components to a single thread. On the other hand, the simple execution semantics of pipes-and-filters-based models allow for automatic mapping of multiple components to a single task, simplifying timing analysis and making better use of system resources.
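As an illustration of this last point, fusing a pipes-and-filter chain into one schedulable task is straightforward, since the components execute in a static order with no internal preemption points. The component names and numbers below are hypothetical, and this is only a sketch of what an automatic mapping tool might do.

```python
# Hypothetical port-based components in a chain: (name, WCET in microseconds)
chain = [("sample", 40), ("filter", 120), ("control", 200), ("actuate", 30)]

def fuse(chain, period, deadline):
    # A chain of port-based objects executed in static order maps to a
    # single task whose WCET is simply the sum of the component WCETs.
    return {
        "name": "+".join(name for name, _ in chain),
        "C": sum(wcet for _, wcet in chain),
        "T": period,
        "D": deadline,
    }

task = fuse(chain, period=1000, deadline=1000)
```

The fused task (C = 390, T = D = 1000) can then be fed directly into a response-time analysis such as the one in Section 2.6.3.1.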
2.7.2 Real-Time Operating Systems

There are two important aspects regarding CBD and RTOSs: (1) the RTOS may itself be component based, and (2) the RTOS may support or provide a framework for CBD.

2.7.2.1 Component-Based RTOSs

Most RTOSs allow for offline configuration, where the engineer can choose to include or exclude large parts of the functionality; for instance, which communication protocols to include is typically configurable. However, this type of configurability is not the same as the RTOS being component based (even though the unit of configuration is often referred to as a component in marketing material). For an RTOS to be component based it is required that the components conform to a component model, which is typically not the case in most configurable RTOSs. There has been some research on component-based RTOSs, for instance, the research RTOS VEST [18]. In VEST, schedulers, queue managers, and memory management are built up out of components. Furthermore, special emphasis has been put on predictability and analyzability. However, VEST is still at the research stage and has not been released to the public. Publicly available, however, is the eCos RTOS [98,99], which provides a component-based configuration tool. Using eCos components the RTOS can be configured by the user, and third-party extensions can be provided.

2.7.2.2 RTOSs that Support CBD

Looking at component models in general, and those intended for embedded systems in particular, we observe that they are all supported by some runtime executive or simple RTOS. Many component technologies provide frameworks that are independent of the underlying RTOS; hence, any RTOS can be used to support CBD through such an RTOS-independent framework. Examples include CORBA's ORB [100] and the framework for PECOS [20,101].
Other component technologies have a tighter coupling between the RTOS and the component framework, in that the RTOS explicitly supports the component model by providing the framework (or part of it). Such technologies include:

• Koala [19] is a component model and architectural description language from Philips. Koala provides high-level APIs to the computing and audio/video hardware. The computing layer provides a simple proprietary real-time kernel with priority-driven preemptive scheduling. Special techniques for thread sharing are used to limit the number of concurrent threads.
• The Chimera RTOS provides an execution framework for the port-based object component model [21], intended for the development of sensor-based control systems, specifically reconfigurable robotics applications. Chimera has multiprocessor support and handles both static and dynamic scheduling, the latter being EDF based.
• The Rubus RTOS supports a component model in which behaviors are defined by sequences of port-based objects [13]. The Rubus kernel supports predictable execution of statically scheduled periodic tasks (termed red tasks in Rubus) and dynamically fixed-priority preemptively scheduled tasks (termed blue tasks). In addition, support for handling interrupts is provided. In Rubus, support is provided for transforming sets of components into sequential chains of executable code; each such chain is implemented as a single task. Support is also provided for analysis of response times and end-to-end deadlines, based on execution-time measures that have to be provided, that is, execution-time analysis is not provided by the framework.
• The Time-Triggered Operating System (TTOS) is an adapted and extended version of the MARS OS [71]. Task scheduling in TTOS is based on an offline-generated scheduling table and relies on the global time base provided by the TTP/C communication system. All synchronization is handled
by the offline scheduling. TTOS, and in general the entire TTA, is (just as IEC 61131-3) well suited for the synchronous execution paradigm. In a synchronous execution the system is considered sequential, computing in each step (or cycle) a global output based on a global input. The effect of each step is defined by a set of transformation rules. Scheduling is done statically by compiling the set of rules into a sequential program that implements these rules and executes them in some statically defined order. A uniform timing bound for the execution of global steps is assumed. In this context, a component is a design-level entity. TTA defines a protocol for extending the synchronous-language paradigm to distributed platforms, allowing distributed components to interoperate, as long as they conform to the imposed timing requirements.
2.7.3 Real-Time Scheduling

Ideally, from a CBD perspective, the response time of a component should be independent of the environment in which it is executing (since this would facilitate reuse of the component). However, this is in most cases highly unrealistic, since:

1. The execution time of the task will be different in different target environments.
2. The response time additionally depends on the other tasks competing for the same resources (CPU, etc.) and on the scheduling method used to resolve the resource contention.

Rather than aiming for this unachievable ideal, a realistic ambition could be a component model and framework that allow for analysis of response times based on abstract models of components and their compositions. Time-triggered systems go one step toward the ideal solution, in that components can be temporally isolated from each other. While not having a major impact on the component model, time-triggered systems simplify the implementation of the component framework, since all synchronization between components is resolved offline. From a safety perspective the time-triggered paradigm also gives benefits, in that it reduces the number of possible execution scenarios (owing to the static order of execution of components and the lack of preemption). Also, in time-triggered component models it is possible to use the structure given by the component composition to synthesize scheduling parameters. For instance, in Rubus [13] and TTA [8] this is already done, by generating the static schedule using the components as schedulable entities. In theory, a similar approach could also be used for dynamically scheduled systems: using a scheduler/task configuration tool to automatically derive mappings of components to tasks and scheduling parameters (such as priorities or deadlines) for the tasks. However, this approach is still at the research stage.
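One simple instance of such scheduling-parameter synthesis is deadline-monotonic priority assignment, where a configuration tool derives task priorities directly from the deadlines attached to components. The task set below is hypothetical and the sketch only illustrates the idea.

```python
def assign_priorities(tasks):
    # Deadline-monotonic synthesis: the shorter the deadline, the higher
    # the priority (0 denotes the highest priority).
    ordered = sorted(tasks, key=lambda t: t["D"])
    return [dict(t, prio=p) for p, t in enumerate(ordered)]

tasks = [
    {"name": "logger", "D": 500},
    {"name": "control", "D": 20},
    {"name": "io", "D": 100},
]
prioritized = assign_priorities(tasks)
```

Here "control" gets priority 0, "io" priority 1, and "logger" priority 2; the resulting parameters could then be checked with a response-time analysis such as the one in Section 2.6.3.1.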
2.8 Testing and Debugging of RTSs

According to a study by NIST [102], up to 80% of the life-cycle cost of software is spent on testing and debugging. Despite this importance, there are few results on RTS testing and debugging. The main reason is that it is actually quite difficult to test and debug RTSs. Remember that RTSs are timing critical and that they interact with the real world. Since testing and debugging typically involve some instrumentation of the code, the timing behavior of the system will be different during testing/debugging compared with the deployed system. Hence, test cases that passed during testing may lead to failures in the deployed system, and tests that failed may not cause any problem at all in the deployed system. For debugging the situation is possibly even worse since, in addition to a similar effect when running the system in a debugger, entering a breakpoint will stop the execution for an unspecified time. The problem is that the controlled external process will continue to evolve (e.g., a car will not momentarily stop just because the execution of the controlling software is stopped). The result is a behavior of the debugged system that would not be possible in the real system. Also, it is often the case that the external process cannot be completely controlled, which
means that we cannot reproduce the observed behavior, making it difficult to use (cyclic) debugging to track down an error that caused a failure. The following are two possible solutions to these problems:

• To build a simulator that faithfully captures the functional as well as the timing behavior of both the RTS and the environment it is controlling. Since this is both time consuming and costly, this approach is feasible only in very special situations. Since such situations are rare, we will not consider this alternative further here.
• To record the RTS's behavior during testing or execution, and then, if a failure is detected, replay the execution in a controlled way. For this to work it is essential that the timing behavior be the same during testing as in the deployed system. This can be achieved either by using nonintrusive hardware recorders, or by leaving the software used for instrumentation in the deployed system. The latter comes at a cost in memory space and execution time, but gives the additional benefit that it also becomes possible to debug the deployed system in case of a failure [103].

An additional problem for most RTSs is that the system consists of several concurrently executing threads, as is also the case for the majority of non-RTSs. This concurrency in itself leads to problematic nondeterminism: owing to race conditions caused by slight variations in execution time, the exact preemption points will vary, causing unpredictability both in the number of possible scenarios and in which scenario will actually be executed in a specific situation. In conclusion, we note that testing and debugging of RTSs are difficult and challenging tasks. The following is a brief account of some of the few results on testing of RTSs reported in the literature:

• Thane and Hansson [87] proposed a method for deterministic testing of distributed RTSs.
The key element here is to identify the different execution orderings (serializations of the concurrent system) and treat each of these orderings as a sequential program. The main weakness of this approach is the potentially exponential blow-up of the number of execution orderings.
• For testing of temporal correctness, Tsai et al. [104] provide a monitoring technique that records runtime information. This information is then used to analyze whether the temporal constraints are violated.
• Schütz [105] has proposed a strategy for testing distributed RTSs. The strategy is tailored to the time-triggered MARS system [71].
• Zhu et al. [106] have proposed a framework for regression testing of real-time software in distributed systems. The framework is based on the regression testing process of Onoma et al. [107].

When it comes to RTS debugging, the most promising approach is record/replay [108–112], as mentioned earlier. Using record/replay, first a reference execution of the system is run and observed; second, a replay execution is performed based on the observations made during the reference execution. Observations are made by instrumenting the system, in order to extract information about the execution. In industrial practice, testing and debugging of multitasking RTSs is a time-consuming activity. At best, hardware emulators, for example, Reference 113, are used to get some level of observability without interfering with the observed system. More often, it is an ad hoc activity, using intrusive instrumentation of the code to observe test results or to track down intricate timing errors. However, some tools using the above record/replay method are now emerging on the market, for example, Reference 114.
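A minimal sketch of the record/replay idea in C is given below: during the reference execution, each significant event (e.g., an input sample or a preemption point) is logged with a timestamp, and during the replay execution the log is consumed in order so that the recorded scenario can be reproduced. The event log layout, its size, and the field names are assumptions made for this illustration; real tools such as those cited above record considerably more context.

```c
#include <stddef.h>

#define LOG_SIZE 256

typedef struct { int event_id; unsigned long ts; long data; } log_entry_t;

typedef struct {
    log_entry_t buf[LOG_SIZE];
    int n;       /* number of entries recorded               */
    int cursor;  /* next entry to hand out during the replay */
} event_log_t;

/* Reference execution: instrumentation points call this to log an event. */
int record_event(event_log_t *log, int id, unsigned long ts, long data)
{
    if (log->n >= LOG_SIZE)
        return -1;                     /* log full: recording stops */
    log_entry_t e = { id, ts, data };
    log->buf[log->n++] = e;
    return 0;
}

/* Replay execution: the recorded events are consumed in their original
 * order, so the debugger can force the same inputs and preemptions.   */
const log_entry_t *replay_next(event_log_t *log)
{
    if (log->cursor >= log->n)
        return NULL;                   /* end of the recording */
    return &log->buf[log->cursor++];
}
```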
2.9 Summary

This chapter has presented the most important issues, methods, and trends in the area of embedded RTSs. A wide range of topics has been covered, from the initial design of embedded RTSs to analysis and testing. Important issues discussed and presented include design tools, OSs, and major underlying mechanisms such
Real-Time in Embedded Systems
as architectures, models of interaction, real-time mechanisms, execution strategies, and scheduling. Moreover, communication, analysis, and testing techniques were presented. Over the years, academia has put effort into advancing the various techniques used to compose and design complex embedded RTSs. Standards bodies and industry are following at a slower pace, while also adopting and developing area-specific techniques. Today, we can see diverse techniques used in different application domains, such as automotive, aerospace, and rail. In the area of communications, an effort is being made in academia, and also in some parts of industry, toward using Ethernet. This is a step toward a common technique for several application domains. Different real-time demands have led to domain-specific OSs, architectures, and models of interaction. As many of these have several commonalities, there is a potential for standardization across several domains. However, as this takes time, we will most certainly stay with application-specific techniques for a while, and for specific domains with extreme demands on safety or low cost, specialized solutions will most likely be used in the future as well. Therefore, knowledge of the techniques used in and suitable for the various domains will remain important.
References

[1] Tom R. Halfhill. Embedded Market Breaks New Ground. Microprocessor Report, 17, 2000.
[2] H. Kopetz. Introduction. In Real-Time Systems: Introduction and Overview. Part XVIII of Lecture Notes from ESSES 2003 — European Summer School on Embedded Systems. Ylva Boivie, Hans Hansson, and Sang Lyul Min, Eds., Västerås, Sweden, September 2003.
[3] IEEE Computer Society. Technical Committee on Real-Time Systems Home Page. http://www.cs.bu.edu/pub/ieee-rts/.
[4] Kluwer. Real-Time Systems (Journal). http://www.wkap.nl/kapis/CGI-BIN/WORLD/journalhome.htm?0922-6443.
[5] C. Liu and J. Layland. Scheduling Algorithms for Multiprogramming in a Hard-Real-Time Environment. Journal of the ACM, 20:46–61, 1973.
[6] M.H. Klein, T. Ralya, B. Pollak, R. Obenza, and M.G. Harbour. A Practitioner's Handbook for Rate-Monotonic Analysis. Kluwer, Dordrecht, 1998.
[7] N.C. Audsley, A. Burns, R.I. Davis, K. Tindell, and A.J. Wellings. Fixed Priority Pre-Emptive Scheduling: An Historical Perspective. Real-Time Systems, 8:129–154, 1995.
[8] Hermann Kopetz and Günther Bauer. The Time-Triggered Architecture. Proceedings of the IEEE, Special Issue on Modeling and Design of Embedded Software, 91:112–126, 2003.
[9] J. Xu and D.L. Parnas. Scheduling Processes with Release Times, Deadlines, Precedence, and Exclusion Relations. IEEE Transactions on Software Engineering, 16:360–369, 1990.
[10] Time Triggered Technologies. http://www.tttech.com.
[11] H. Kopetz and G. Grünsteidl. TTP — A Protocol for Fault-Tolerant Real-Time Systems. IEEE Computer, 27(1):14–23, 1994.
[12] H. Hansson, H. Lawson, and M. Strömberg. BASEMENT: A Distributed Real-Time Architecture for Vehicle Applications. Real-Time Systems, 3:223–244, 1996.
[13] Arcticus Systems. The Rubus Operating System. http://www.arcticus.se.
[14] OMG. CORBA Component Model 3.0, June 2002. http://www.omg.org/technology/documents/formal/components.htm.
[15] Microsoft. Microsoft .COM Technologies. http://www.microsoft.com/com/.
[16] Microsoft. .NET Home Page. http://www.microsoft.com/net/.
[17] SUN Microsystems. Introducing Java Beans. http://developer.java.sun.com/developer/online/Training/Beans/Beans1/index.html.
[18] John A. Stankovic. VEST — A Toolset for Constructing and Analyzing Component-Based Embedded Systems. Lecture Notes in Computer Science, 2211:390–402, 2001.
[19] Rob van Ommering. The Koala Component Model. In Building Reliable Component-Based Software Systems. Artech House Publishers, July 2002, pp. 223–236.
[20] P.O. Müller, C.M. Stich, and C. Zeidler. Component-Based Embedded Systems. In Building Reliable Component-Based Software Systems. Artech House Publishers, 2002, pp. 303–323.
[21] D.B. Stewart, R.A. Volpe, and P.K. Khosla. Design of Dynamically Reconfigurable Real-Time Software Using Port-Based Objects. IEEE Transactions on Software Engineering, 23(12):759–776, 1997.
[22] OMG. Unified Modeling Language (UML), Version 1.5, 2003. http://www.omg.org/technology/documents/formal/uml.htm.
[23] Rational. Rational Rose RealTime. http://www.rational.com/products/rosert.
[24] I-Logix. Rhapsody. http://www.ilogix.com/products/rhapsody.
[25] TeleLogic. Telelogic Tau. http://www.telelogic.com/products/tau.
[26] Vector. DaVinci Tool Suite. http://www.vector-informatik.de/.
[27] OMG. Unified Modeling Language (UML), Version 2.0 (draft). OMG document ptc/03-09-15, September 2003.
[28] ITEA. EAST/EEA Project Site. http://www.east-eea.net.
[29] ETAS. http://en.etasgroup.com.
[30] Vector. http://www.vector-informatik.com.
[31] Siemens. http://www.siemensvdo.com.
[32] Comp.realtime FAQ. Available at http://www.faqs.org/faqs/realtime-computing/faq/.
[33] Roadmap — Adaptive Real-Time Systems for Quality of Service Management. ARTIST — Project IST-2001-34820, May 2003. http://www.artist-embedded.org/Roadmaps/.
[34] G.C. Buttazzo. Hard Real-Time Computing Systems. Kluwer Academic Publishers, Dordrecht, 1997.
[35] A. Burns and A. Wellings. Real-Time Systems and Programming Languages, 2nd ed. Addison-Wesley, Reading, MA, 1996.
[36] The Asterix Real-Time Kernel. http://www.mrtc.mdh.se/projects/asterix/.
[37] LiveDevices. Realogy Real-Time Architect, SSX5 Operating System, 1999. http://www.livedevices.com/realtime.shtml.
[38] Wind River Systems Inc. VxWorks Programmer's Guide. http://www.windriver.com/.
[39] Lynuxworks. http://www.lynuxworks.com.
[40] Enea OSE Systems. OSE. http://www.ose.com.
[41] QNX Software Systems. QNX Realtime OS. http://www.qnx.com.
[42] List of Real-Time Linux Variants. http://www.realtimelinuxfoundation.org/variants/variants.html.
[43] Express Logic. ThreadX. http://www.expresslogic.com.
[44] Northern Real-Time Applications. Total Time Predictability. Whitepaper on SSX5, 1998.
[45] IEEE. Standard for Information Technology — Standardized Application Environment Profile — POSIX Realtime Application Support (AEP). IEEE Standard P1003.13-1998, 1998.
[46] OSEK Group. OSEK/VDX Operating System Specification 2.2.1. http://www.osek-vdx.org/.
[47] Airlines Electronic Engineering Committee (AEEC). ARINC 653: Avionics Application Software Standard Interface (Draft 15), June 1996.
[48] ISO. Ada95 Reference Manual. ISO/IEC 8652:1995(E), 1995.
[49] G. Fohler, T. Lennvall, and R. Dobrin. A Component Based Real-Time Scheduling Architecture. In Architecting Dependable Systems, Vol. LNCS-2677. R. de Lemos, C. Gacek, and A. Romanovsky, Eds., Springer-Verlag, Heidelberg, 2003.
[50] J. Mäki-Turja and M. Sjödin. Combining Dynamic and Static Scheduling in Hard Real-Time Systems. Technical report MRTC no. 71, Mälardalen Real-Time Research Centre (MRTC), October 2002.
[51] B. Sprunt, L. Sha, and J.P. Lehoczky. Aperiodic Task Scheduling for Hard Real-Time Systems. Real-Time Systems, 1:27–60, 1989.
[52] M. Spuri and G.C. Buttazzo. Efficient Aperiodic Service under Earliest Deadline Scheduling. In Proceedings of the 15th IEEE Real-Time Systems Symposium (RTSS'94). IEEE Computer Society, San Juan, Puerto Rico, December 1994, pp. 2–11.
[53] M. Spuri and G.C. Buttazzo. Scheduling Aperiodic Tasks in Dynamic Priority Systems. Real-Time Systems, 10:179–210, 1996.
[54] L. Abeni and G. Buttazzo. Integrating Multimedia Applications in Hard Real-Time Systems. In Proceedings of the 19th IEEE Real-Time Systems Symposium (RTSS'98). IEEE Computer Society, Madrid, Spain, December 1998, pp. 4–13.
[55] M. Spuri, G.C. Buttazzo, and F. Sensini. Robust Aperiodic Scheduling under Dynamic Priority Systems. In Proceedings of the 16th IEEE Real-Time Systems Symposium (RTSS'95). IEEE Computer Society, Pisa, Italy, December 1995, pp. 210–219.
[56] CAN Specification 2.0, Part-A and Part-B. CAN in Automation (CiA), Am Weichselgarten 26, D-91058 Erlangen, 2002. http://www.can-cia.de.
[57] Road Vehicles — Interchange of Digital Information — Controller Area Network (CAN) for High Speed Communications. ISO/DIS 11898, February 1992.
[58] K.W. Tindell, A. Burns, and A.J. Wellings. Calculating Controller Area Network (CAN) Message Response Times. Control Engineering Practice, 3:1163–1169, 1995.
[59] K. Tindell, H. Hansson, and A. Wellings. Analysing Real-Time Communications: Controller Area Network (CAN). In Proceedings of the 15th IEEE Real-Time Systems Symposium (RTSS). IEEE Computer Society Press, December 1994, pp. 259–263.
[60] Road Vehicles — Controller Area Network (CAN) — Part 4: Time Triggered Communication. ISO/CD 11898-4.
[61] L. Almeida, J.A. Fonseca, and P. Fonseca. Flexible Time-Triggered Communication on a Controller Area Network. In Proceedings of the Work-In-Progress Session of the 19th IEEE Real-Time Systems Symposium (RTSS'98). IEEE Computer Society, Madrid, Spain, December 1998.
[62] L. Almeida, J.A. Fonseca, and P. Fonseca. A Flexible Time-Triggered Communication System Based on the Controller Area Network: Experimental Results. In Proceedings of the IFAC International Conference on Fieldbus Technology (FeT). Springer, 1999, pp. 342–350.
[63] TTTech Computertechnik AG. Specification of the TTP/C Protocol v0.5, July 1999.
[64] H. Kopetz. The Time-Triggered Model of Computation. In Proceedings of the 19th IEEE Real-Time Systems Symposium (RTSS'98). IEEE Computer Society, Madrid, Spain, December 1998, pp. 168–177.
[65] LIN. Local Interconnect Network. http://www.lin-subbus.de.
[66] R. Belschner, J. Berwanger, C. Ebner, H. Eisele, S. Fluhrer, T. Forest, T. Führer, F. Hartwich, B. Hedenetz, R. Hugel, A. Knapp, J. Krammer, A. Millsap, B. Müller, M. Peller, and A. Schedl. FlexRay — Requirements Specification, April 2002. http://www.flexray-group.com.
[67] ARINC/RTCA-SC-182/EUROCAE-WG-48. Minimal Operational Performance Standard for Avionics Computer Resources, 1999.
[68] PROFIBUS. PROFIBUS International. http://www.profibus.com.
[69] H. Kirrmann and P.A. Zuber. The IEC/IEEE Train Communication Network. IEEE Micro, 21:81–92, 2001.
[70] WorldFIP. WorldFIP Fieldbus. http://www.worldfip.org.
[71] H. Kopetz, A. Damm, C. Koza, and M. Mullozzani. Distributed Fault Tolerant Real-Time Systems: The MARS Approach. IEEE Micro, 9(1):25–40, 1989.
[72] C. Venkatramani and T. Chiueh. Supporting Real-Time Traffic on Ethernet. In Proceedings of the 15th IEEE Real-Time Systems Symposium (RTSS'94). IEEE Computer Society, San Juan, Puerto Rico, December 1994, pp. 282–286.
[73] D.W. Pritty, J.R. Malone, S.K. Banerjee, and N.L. Lawrie. A Real-Time Upgrade for Ethernet Based Factory Networking. In Proceedings of IECON'95. IEEE Industrial Electronics Society, 1995, pp. 1631–1637.
[74] N. Malcolm and W. Zhao. The Timed Token Protocol for Real-Time Communication. IEEE Computer, 27:35–41, 1994.
[75] K.K. Ramakrishnan and H. Yang. The Ethernet Capture Effect: Analysis and Solution. In Proceedings of the 19th IEEE Local Computer Networks Conference (LCNC'94), October 1994, pp. 228–240.
[76] M. Molle. A New Binary Logarithmic Arbitration Method for Ethernet. Technical report, TR CSRI-298, CRI, University of Toronto, Canada, 1994.
[77] G. Lann and N. Riviere. Real-Time Communications over Broadcast Networks: The CSMA/DCR and the DOD-CSMA/CD Protocols. Technical report, TR 1863, INRIA, 1993.
[78] M. Molle and L. Kleinrock. Virtual Time CSMA: Why Two Clocks are Better than One. IEEE Transactions on Communications, 33:919–933, 1985.
[79] W. Zhao and K. Ramamritham. A Virtual Time CSMA/CD Protocol for Hard Real-Time Communication. In Proceedings of the 7th IEEE Real-Time Systems Symposium (RTSS'86). IEEE Computer Society, New Orleans, LA, December 1986, pp. 120–127.
[80] M. El-Derini and M. El-Sakka. A Novel Protocol Under a Priority Time Constraint for Real-Time Communication Systems. In Proceedings of the 2nd IEEE Workshop on Future Trends of Distributed Computing Systems (FTDCS'90). IEEE Computer Society, Cairo, Egypt, September 1990, pp. 128–134.
[81] W. Zhao, J.A. Stankovic, and K. Ramamritham. A Window Protocol for Transmission of Time-Constrained Messages. IEEE Transactions on Computers, 39:1186–1203, 1990.
[82] L. Almeida, P. Pedreiras, and J.A. Fonseca. The FTT-CAN Protocol: Why and How? IEEE Transactions on Industrial Electronics, 49(6):1189–1201, 2002.
[83] P. Pedreiras, L. Almeida, and P. Gai. The FTT-Ethernet Protocol: Merging Flexibility, Timeliness and Efficiency. In Proceedings of the 14th Euromicro Conference on Real-Time Systems (ECRTS'02). IEEE Computer Society, Vienna, Austria, June 2002, pp. 152–160.
[84] S.K. Kweon, K.G. Shin, and G. Workman. Achieving Real-Time Communication over Ethernet with Adaptive Traffic Smoothing. In Proceedings of the Sixth IEEE Real-Time Technology and Applications Symposium (RTAS'00). IEEE Computer Society, Washington, DC, June 2000, pp. 90–100.
[85] A. Carpenzano, R. Caponetto, L. LoBello, and O. Mirabella. Fuzzy Traffic Smoothing: An Approach for Real-Time Communication over Ethernet Networks. In Proceedings of the Fourth IEEE International Workshop on Factory Communication Systems (WFCS'02). IEEE Industrial Electronics Society, Västerås, Sweden, August 2002, pp. 241–248.
[86] J.L. Sobrinho and A.S. Krishnakumar. EQuB — Ethernet Quality of Service Using Black Bursts. In Proceedings of the 23rd IEEE Annual Conference on Local Computer Networks (LCN'98). IEEE Computer Society, Lowell, MA, October 1998, pp. 286–296.
[87] H. Thane and H. Hansson. Towards Systematic Testing of Distributed Real-Time Systems. In Proceedings of the 20th IEEE Real-Time Systems Symposium (RTSS). December 1999, pp. 360–369.
[88] M. Joseph and P. Pandya. Finding Response Times in a Real-Time System. Computer Journal, 29:390–395, 1986.
[89] The Times Tool. http://www.docs.uu.se/docs/rtmv/times.
[90] AbsInt. http://www.absint.com.
[91] Bound-T Execution Time Analyzer. http://www.bound-t.com.
[92] L. Casparsson, A. Rajnak, K. Tindell, and P. Malmberg. Volcano — A Revolution in On-Board Communications. Volvo Technology Report, 1:9–19, 1998.
[93] Volcano Automotive Group. http://www.volcanoautomotive.com.
[94] TimeSys. TimeWiz — A Modeling and Simulation Tool. http://www.timesys.com/.
[95] OMG. UML Profile for Schedulability, Performance, and Time Specification. OMG document formal/2003-09-01, September 2003.
[96] J.L. Medina, M. González Harbour, and J.M. Drake. MAST Real-Time View: A Graphic UML Tool for Modeling Object-Oriented Real-Time Systems. In Proceedings of the 22nd IEEE Real-Time Systems Symposium (RTSS). IEEE Computer Society, December 2001, pp. 245–256.
[97] MAST home page. http://mast.unican.es/.
[98] A. Massa. Embedded Software Development with eCos. Prentice Hall, New York, November 2002, ISBN: 0130354732.
[99] eCos Home Page. http://sources.redhat.com/ecos.
[100] OMG. CORBA Home Page. http://www.omg.org/corba/.
[101] PECOS Project Web Site. http://www.pecos-project.org.
[102] U.S. Department of Commerce. The Economic Impacts of Inadequate Infrastructure for Software Testing. NIST report, May 2002.
[103] M. Ronsse, K. De Bosschere, M. Christiaens, J. Chassin de Kergommeaux, and D. Kranzlmüller. Record/Replay for Nondeterministic Program Executions. Communications of the ACM, 46:62–67, 2003.
[104] J.J.P. Tsai, K.Y. Fang, and Y.D. Bi. On Real-Time Software Testing and Debugging. In Proceedings of the 14th Annual International Computer Software and Application Conference. IEEE Computer Society, November 1990, pp. 512–518.
[105] W. Schütz. Fundamental Issues in Testing Distributed Real-Time Systems. Real-Time Systems, 7:129–157, 1994.
[106] H. Zhu, P. Hall, and J. May. Software Unit Test Coverage and Adequacy. ACM Computing Surveys, 29(4):366–427, 1997.
[107] K. Onoma, W.-T. Tsai, M. Poonawala, and H. Suganuma. Regression Testing in an Industrial Environment. Communications of the ACM, 41:81–86, 1998.
[108] J.D. Choi, B. Alpern, T. Ngo, M. Sridharan, and J. Vlissides. A Perturbation-Free Replay Platform for Cross-Optimized Multithreaded Applications. In Proceedings of the 15th International Parallel and Distributed Processing Symposium. IEEE Computer Society Press, Washington, April 2001.
[109] J. Mellor-Crummey and T. LeBlanc. A Software Instruction Counter. In Proceedings of the Third International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, April 1989, pp. 78–86.
[110] K.C. Tai, R. Carver, and E. Obaid. Debugging Concurrent Ada Programs by Deterministic Execution. IEEE Transactions on Software Engineering, 17:280–287, 1991.
[111] H. Thane and H. Hansson. Using Deterministic Replay for Debugging of Distributed Real-Time Systems. In Proceedings of the 12th Euromicro Conference on Real-Time Systems. IEEE Computer Society Press, Washington, June 2000, pp. 265–272.
[112] F. Zambonelli and R. Netzer. An Efficient Logging Algorithm for Incremental Replay of Message-Passing Applications. In Proceedings of the 13th International Parallel Processing Symposium and 10th Symposium on Parallel and Distributed Processing. IEEE, April 1999, pp. 392–398.
[113] Lauterbach. http://www.lauterbach.com.
[114] ZealCore. ZealCore Embedded Solutions AB. http://www.zealcore.com.
Design and Validation of Embedded Systems

3 Design of Embedded Systems
Luciano Lavagno and Claudio Passerone

4 Models of Embedded Computation
Axel Jantsch

5 Modeling Formalisms for Embedded System Design
Luís Gomes, João Paulo Barros, and Anikó Costa

6 System Validation
J.V. Kapitonova, A.A. Letichevsky, V.A. Volkov, and Thomas Weigert
3 Design of Embedded Systems

Luciano Lavagno
Cadence Berkeley Laboratories and Politecnico di Torino

Claudio Passerone
Politecnico di Torino

3.1 The Embedded System Revolution
3.2 Design of Embedded Systems
3.3 Functional Design
3.4 Function–Architecture and Hardware–Software Codesign
3.5 Hardware–Software Coverification and Hardware Simulation
3.6 Software Implementation
    Compilation, Debugging, and Memory Model • Real-Time Scheduling
3.7 Hardware Implementation
    Logic Synthesis and Equivalence Checking • Placement, Routing, and Extraction • Simulation, Formal Verification, and Test Pattern Generation
3.8 Conclusions
References
3.1 The Embedded System Revolution

The world of electronics has witnessed a dramatic growth of its applications in the last few decades. From telecommunications to entertainment, from automotive to banking, almost every aspect of our everyday life employs some kind of electronic component. In most cases, these components are computer-based systems, which are not, however, used or perceived as computers. For instance, they often do not have a keyboard or a display to interact with the user, and they do not run standard operating systems and applications. Sometimes, these systems constitute a self-contained product themselves (e.g., a mobile phone), but they are frequently embedded inside another system, for which they provide better functionality and performance (e.g., the engine control unit of a motor vehicle). We call these computer-based systems embedded systems. The huge success of embedded electronics has several causes. The main one, in our opinion, is that embedded systems bring the advantages of Moore's Law into everyday life, that is, an exponential increase in performance and functionality at an ever decreasing cost. This is possible because of the capabilities of integrated circuit technology and manufacturing, which allow one to build more and more complex devices, and because of the development of new design methodologies, which allows one to use those devices efficiently and cleverly. Traditional steel-based mechanical development, on the other hand, has reached
a plateau near the middle of the twentieth century, and thus it is no longer a significant source of innovation, unless coupled with electronic manufacturing technologies (microelectromechanical systems, MEMS) or embedded systems, as argued above. There are many examples of embedded systems in the real world. For instance, a modern car contains tens of electronic components (control units, sensors, and actuators) that perform very different tasks. The first embedded systems that appeared in cars were related to the control of mechanical aspects, such as control of the engine, the antilock brake system, and control of the suspension and transmission. Nowadays, however, cars also have a number of components that are not directly related to mechanical aspects, but rather to the use of the car as a vehicle for moving around, or to the communication needs of the passengers: navigation systems, digital audio and video players, and phones are just a few examples. Moreover, many of these embedded systems are connected together by a network, because they need to share information regarding the state of the car. Other examples come from the communication industry: a cellular phone is an embedded system whose environment is the mobile network. These are very sophisticated computers whose main task is to send and receive voice, but they are also currently used as personal digital assistants, for games, to send and receive images and multimedia messages, and to wirelessly browse the Internet. They have been so successful and pervasive that in just a decade they have become essential in our lives. Other kinds of embedded systems have significantly changed our lives as well: for instance, ATM and point-of-sale (POS) machines modified the way we make payments, and multimedia digital players changed how we listen to music and watch videos. We are just at the beginning of a revolution that will have an impact on every other industrial sector.
Special-purpose embedded systems will proliferate and will be found in almost any object that we use. They will be optimized for the application and show a natural user interface. They will be flexible, in order to adapt to a changing environment. Most of them will also be wireless, in order to follow us wherever we go and keep us constantly connected with the information we need and the people we care about. Even the role of computers will have to be reconsidered, as many of the applications for which they are used today will be performed by specially designed embedded systems. What are the consequences of this revolution for industry? Car manufacturers today need to acquire significant skills in hardware and software design, in addition to the mechanical skills that they already have in-house, or they must outsource their requirements to an external supplier. In either case, a broad variety of skills needs to be mastered, from designing software architectures for implementing the functionality, to modeling performance, because real-time aspects are extremely important in embedded systems, especially those related to safety-critical applications. Embedded system designers must also be able to architect and analyze the performance of networks, as well as validate the functionality that has been implemented over a particular architecture and the communication protocols that are used. A similar revolution has happened, or is about to happen, in other industrial and socioeconomic areas as well, such as entertainment, tourism, education, agriculture, and government. It is therefore clear that new, more efficient, and easier-to-use embedded electronics design methodologies need to be developed, in order to enable industry to make use of the available technology.
3.2 Design of Embedded Systems

Embedded systems are informally defined as a collection of programmable parts surrounded by Application-Specific Integrated Circuits (ASICs) and other standard components (Application-Specific Standard Parts, ASSPs) that interact continuously with an environment through sensors and actuators. The collection can physically be a set of chips on a board, or a set of modules on an integrated circuit. Software is used for features and flexibility, while dedicated hardware is used for increased performance and reduced power consumption. An example of the architecture of an embedded system is shown in Figure 3.1. The main programmable components are microprocessors and Digital Signal Processors (DSPs), which implement the software partition of the system. One can view reconfigurable components, especially
FIGURE 3.1 A reactive real-time embedded system architecture. (The block diagram shows a microprocessor/microcontroller, a DSP, an FPGA, a coprocessor, an IP block, memories and a dual-port memory, a bridge, and peripherals, connected by buses.)
if they can be reconfigured at runtime, as programmable components in this respect. They exhibit area, cost, performance, and power characteristics that are intermediate between dedicated hardware and processors. Custom and programmable hardware components, on the other hand, implement application-specific blocks and peripherals. All components are connected through standard and dedicated buses and networks, and data is stored in a set of memories. Often several smaller subsystems are networked together to control, for example, an entire car, or to constitute a cellular or wireless network. We can identify a set of typical characteristics that are commonly found in embedded systems. For instance, they are usually not very flexible and are designed to always perform the same task: if you buy an engine-control embedded system, you cannot use it to control the brakes of your car, or to play games. A PC, on the other hand, is much more flexible because it can perform several very different tasks. An embedded system is often part of a larger controlled system. Moreover, cost, reliability, and safety are often more important criteria than performance, because the customer may not even be aware of the presence of the embedded system, and so looks at other characteristics, such as the cost, the ease of use, or the lifetime of the product. Another common characteristic of many embedded systems is that they need to be designed in an extremely short time to meet their time to market. Only a few months should elapse from the conception of a consumer product to the first working prototypes. If these deadlines are not met, the result is a simultaneous increase in design costs and decrease in profits, because fewer items will be sold. Delays in the design cycle may thus make the difference between a successful product and an unsuccessful one.
In the current state of the art, embedded systems are designed with an ad hoc approach that is heavily based on earlier experience with similar products and on manual design. Often the design process requires several iterations to converge, because the system is not specified in a rigorous and unambiguous fashion, and the level of abstraction, detail, and design style is likely to differ between parts. As the complexity of embedded systems scales up, this approach is showing its limits, especially regarding design and verification time. New methodologies are being developed to cope with the increased complexity and enhance designers' productivity. In the past, a sequence of two steps has always been used to reach this goal: abstraction and clustering. Abstraction means describing an object (e.g., a logic gate made of metal oxide semiconductor [MOS] transistors) using a model that ignores some of the low-level details (e.g., the Boolean expression representing that logic gate). Clustering means connecting a set of models at the same level of abstraction to obtain a new object, which usually exhibits properties that are not present in the isolated models that constitute it. By successively applying these two steps, digital electronic design went from drawing layouts, to transistor schematics, to logic gate netlists, to register transfer level (RTL) descriptions, as shown in Figure 3.2.
© 2006 by Taylor & Francis Group, LLC
ZURA: “2824_C003” — 2005/6/21 — 20:01 — page 3 — #5
3-4
Embedded Systems Handbook
FIGURE 3.2 Abstraction and clustering levels in hardware design: transistor models (1970s), gate-level models (1980s), the register transfer level (1990s), and the system level (2000+), with software entering at the system level; each step up is an abstraction, each step down a clustering.
The notion of platform is key to the efficient use of abstraction and clustering. A platform is a single abstract model that hides the details of a set of different possible implementations as clusters of lower-level components. The platform, for example, a family of microprocessors, peripherals, and bus protocols, allows developers of designs at the higher level (generically called "applications" in the following) to operate without detailed knowledge of the implementation (e.g., the pipelining of the processor or the internal implementation of the Universal Asynchronous Receiver/Transmitter [UART]). At the same time, it allows platform implementors to share design and fabrication costs among a broad range of potential users, broader than if each design were a one-of-a-kind type. Today we are witnessing the appearance of a new, higher level of abstraction, as a response to the growing complexity of integrated circuits. Objects can be functional descriptions of complex behaviors, or architectural specifications of complete hardware platforms. They make use of formal high-level models that can be used to perform an early and fast validation of the final system implementation, although with less detail than a lower-level description. The relationship between an "application" and the elements of a platform is called a mapping. This exists, for example, between logic gates and geometric patterns of a layout, as well as between RTL statements and gates. At the system level, the mapping is between functional objects with their communication links, and platform elements with their communication paths. Mapping at the system level means associating a functional behavior (e.g., an FFT [fast Fourier transform] or a filter) with an architectural element that can implement that behavior (e.g., a CPU, a DSP, or a piece of dedicated hardware).
It can also associate a communication link (e.g., an abstract FIFO [first in first out]) with some communication services available in the architecture (e.g., a driver, a bus, and some interfaces). The mapping step may also need to specify parameters for these associations (e.g., the priority of a software task or the size of a FIFO) in order to describe them completely. The object that we obtain after mapping exhibits properties that were not directly exposed in the separate descriptions, such as the performance of the selected system implementation.
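As a concrete (and entirely hypothetical) sketch of the mapping concept just described, the snippet below binds functional blocks and communication links to architectural resources, with per-association parameters such as task priority and FIFO depth. All block, resource, and parameter names are invented for illustration; they do not come from any specific tool.

```python
# Hypothetical system-level mapping sketch: functional objects are bound
# to platform elements, with parameters completing each association.
function = {
    "blocks": ["fft", "filter", "control"],
    "links": [("fft", "filter"), ("filter", "control")],
}

architecture = {
    "dsp": {"kind": "processor"},
    "cpu": {"kind": "processor"},
    "bus": {"kind": "communication"},
}

# Each functional block is associated with a platform element, plus
# parameters (here, a software task priority).
mapping = {
    "fft":     {"on": "dsp", "priority": 1},
    "filter":  {"on": "dsp", "priority": 2},
    "control": {"on": "cpu", "priority": 1},
}
# Each abstract link is associated with a communication service and a
# parameter (here, the FIFO depth).
link_mapping = {link: {"via": "bus", "fifo_depth": 16}
                for link in function["links"]}

# A property not visible in either separate description: which links
# cross a resource boundary and therefore need a real bus transaction.
crossing = [l for l in function["links"]
            if mapping[l[0]]["on"] != mapping[l[1]]["on"]]
print(crossing)  # [('filter', 'control')]
```

The point of the sketch is that `crossing` is exactly the kind of property that emerges only after mapping: neither the functional model nor the architecture alone determines which communications are local and which must traverse the bus.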
© 2006 by Taylor & Francis Group, LLC
ZURA: “2824_C003” — 2005/6/21 — 20:01 — page 4 — #6
Design of Embedded Systems
3-5
Performance is not just timing, but any other quantity that can be defined to characterize an embedded system, either physical (area, power consumption, …) or logical (quality of service [QoS], fault tolerance, …). Since the system-level mapping operates on heterogeneous objects, it also allows one to cleanly separate different and orthogonal aspects, such as:

1. Computation and communication. This separation is important because refinement of computation is generally done by hand, or by compilation and scheduling, while refinement of communication makes use of patterns.

2. Application and platform implementation (also called functionality and architecture, e.g., in Reference 1), because they are often defined and designed independently by different groups or companies.

3. Behavior and performance, which should be kept separate because performance information can represent either nonfunctional requirements (e.g., the maximum response time of an embedded controller) or the result of an implementation choice (e.g., the worst-case execution time [WCET] of a task). Nonfunctional constraint verification can be performed traditionally, by simulation and prototyping, or with static formal checks, such as schedulability analysis.

All these separations result in better reuse, because they decouple independent aspects that would otherwise tie, for example, a given functional specification to low-level implementation details by modeling it as assembler or Verilog code. This in turn reduces design time, by increasing productivity and decreasing the time needed to verify the system. A schematic representation of a methodology that can be derived from these abstraction and clustering steps is shown in Figure 3.3.
At the functional level, a behavior for the system to be implemented is specified, designed, and analyzed, either through simulation or by proving that certain properties are satisfied (the algorithm always terminates, the computation performed satisfies a set of specifications, the complexity of the algorithm is polynomial, etc.). In parallel, a set of architectures is composed from a clustering of platform elements and selected as candidates for the implementation of the behavior. These components may come from an existing library, or may be specifications of components that will be designed later. Functional operations are then assigned to the various architecture components, and patterns provided by the architecture are selected for the defined communications. At this level we are able to verify the performance of the selected implementation, in much richer detail than at the pure functional
FIGURE 3.3 Design methodology for embedded systems: functions (built from behavioral libraries) and architectures (built from architecture libraries) are verified separately at the functional level, combined and performance-verified at the mapping level, and then refined, with verification of the refinements, down to the implementation level.
level. Different mappings to the same architecture, or mappings to different architectures, allow one to explore the design space and find the best solutions to important design challenges. These kinds of analyses let the designer identify and correct possible problems early in the design cycle, thus drastically reducing the time needed to explore the design space and weed out potentially catastrophic mistakes and bugs.

At this stage it is also very important to define the organization of the data storage units for the system. Various kinds of memory (e.g., ROM, SRAM, DRAM, Flash, …) have different performance and data persistence characteristics, and must be used judiciously to balance cost and performance. Mapping data structures to different memories, and even changing the organization and layout of arrays, can have a dramatic impact on whether an algorithm meets a given latency constraint, for example. In particular, a System-on-Chip designer can afford a very fine tuning of the number and sizes of embedded memories (especially SRAM, but now also Flash) to be connected to processors and dedicated hardware [2].

Finally, at the implementation level, the reverse transformation of abstraction and clustering occurs, that is, a lower-level specification of the embedded system is generated. This is obtained through a series of manual or automatic refinements and modifications that successively add more details, while checking their compliance with the higher-level requirements. This step need not directly generate a manufacturable final implementation; rather, it produces a new description that in turn constitutes the input for another (recursive) application of the same overall methodology at a lower level of abstraction (e.g., synthesis, placement, and routing for hardware, and compilation and linking for software).
Moreover, the results obtained by these refinements can be back-annotated to the higher level, to perform a better and more accurate verification.
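As a toy illustration of the memory-mapping trade-off discussed above, the sketch below estimates the latency impact of assigning data arrays to either a small fast on-chip SRAM or a large slow DRAM, using a simple greedy heuristic. All latencies, capacities, sizes, and access counts are invented numbers, not data from any real platform.

```python
# Illustrative data-to-memory assignment: frequently accessed arrays go
# into the fastest memory until it is full; the rest spill to DRAM.
memories = {          # cycles per access, capacity in bytes (made up)
    "sram": {"latency": 1,  "capacity": 64 * 1024},
    "dram": {"latency": 20, "capacity": 16 * 1024 * 1024},
}

arrays = {            # size in bytes, accesses per frame (made up)
    "coeffs":  {"size": 4 * 1024,   "accesses": 200_000},
    "samples": {"size": 48 * 1024,  "accesses": 150_000},
    "history": {"size": 512 * 1024, "accesses": 5_000},
}

def total_cycles(assignment):
    """Memory cycles per frame for a given array-to-memory assignment."""
    return sum(arrays[a]["accesses"] * memories[m]["latency"]
               for a, m in assignment.items())

def greedy_assignment():
    free = memories["sram"]["capacity"]
    assignment = {}
    # Consider arrays in decreasing order of access frequency.
    for a in sorted(arrays, key=lambda a: arrays[a]["accesses"],
                    reverse=True):
        if arrays[a]["size"] <= free:
            assignment[a] = "sram"
            free -= arrays[a]["size"]
        else:
            assignment[a] = "dram"
    return assignment

best = greedy_assignment()
print(best, total_cycles(best))
```

Even this crude model shows the leverage of memory mapping: keeping the two hot arrays in SRAM and spilling only the rarely touched history buffer costs 450,000 cycles per frame, whereas putting everything in DRAM would cost twenty times as much per access.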
3.3 Functional Design

As discussed in the previous section, system-level design of embedded electronics requires two distinct phases. In the first phase, functional and nonfunctional constraints are the key aspects. In the second phase, the available architectural platforms are taken into account, and detailed implementation can proceed after a mapping phase that defines the architectural component on which every functional model is implemented. This second phase requires a careful analysis of the trade-offs between algorithmic complexity, functional flexibility, and implementation costs. In this section we describe some of the tools that are used for requirements capture, focusing especially on those that permit executable specification. Such tools generally belong to two broad classes. The first class is represented, for example, by Simulink [3], MATRIXx [4], Ascet-SD [5], SPW [6], SCADE [7], and SystemStudio [8]. It includes block-level editors and libraries with which the designer composes data-dominated digital signal processing and embedded control systems. The libraries include simple blocks, such as multiplication, addition, and multiplexing, as well as more complex ones, such as FIR filters, FFTs, and so on. The second class is represented by tools such as Tau [9], StateMate [10], Esterel Studio [7], and StateFlow [3]. It is oriented to control-dominated embedded systems. In this case, the emphasis is placed on the decisions that must be taken by the embedded system in response to environment and user inputs, rather than on numerical computations. The notation is generally some form of Harel's Statecharts [11]. The Unified Modeling Language (UML), as standardized by the Object Management Group [12], is in a class by itself, since it has historically focused more on general-purpose software (e.g., enterprise and commercial software) than on embedded real-time software.
Only recently have some embedded aspects, such as performance and time, been incorporated in UML 2.0 [12,13], and emphasis has been placed on model-based software generation. However, tool support for UML 2.0 is still limited (Tau [9], Real Time Studio [14], and Rose RealTime [15] provide some), and UML-based hardware design is still in its infancy. Furthermore, the UML is a collection of notations, some of which (especially Statecharts) are supported by several of the tools listed above in the control-dominated class. Simulink, with its related tools and toolboxes from both Mathworks and third parties such as dSPACE [16], is the workhorse of modern model-based embedded system design. In model-based design,
a functional executable model is used for algorithm development. This is made easier in the case of Simulink by its tight integration with Matlab, the standard tool in DSP algorithm development. The same functional model, with added annotations such as bitwidths and execution priorities, is then used for algorithmic refinements, such as floating-point to fixed-point conversion and real-time task generation. Then automated software generators, such as Real-Time Workshop, Embedded Coder [3], and TargetLink [16], are used to generate task code and sometimes to customize a real-time operating system (RTOS) on which the tasks will run. Ascet-SD, for example, automatically generates a customization of the OSEK automotive RTOS [17] for the tasks that are generated from a functional model. In all these cases, a task is typically generated from a set of blocks that are executed at the same rate or triggered by the same event in the functional model. Task formation algorithms can use either direct user input (e.g., the execution rate of each block in discrete-time portions of a Simulink or Ascet-SD design) or static scheduling algorithms for dataflow models (e.g., based on relative block-to-block rate specifications in SPW or SystemStudio [18,19]). Simulink is also tightly integrated with StateFlow, a design tool for control-dominated applications, in order to ease the integration of decision-making and computation code. It also allows one to smoothly generate both hardware and software from the very same specification. This capability, as well as the integration with some sort of Statechart-based finite state machine (FSM) editor, is available in most tools of the first class above.
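The task formation rule described above, clustering blocks that share a rate or trigger into one task, can be sketched in a few lines. The block names, rates, and event names below are invented; real generators apply considerably more sophisticated analyses, but the grouping principle is the same.

```python
# Minimal sketch of task formation: blocks sharing the same trigger
# (periodic rate or event) are clustered into one generated task.
from collections import defaultdict

blocks = [
    ("read_sensor",  ("periodic", 0.001)),   # 1 ms rate (invented)
    ("filter",       ("periodic", 0.001)),
    ("update_pwm",   ("periodic", 0.010)),   # 10 ms rate
    ("on_can_frame", ("event", "CAN_RX")),   # event-triggered
]

def form_tasks(blocks):
    tasks = defaultdict(list)
    for name, trigger in blocks:
        tasks[trigger].append(name)          # one task per rate/event
    return dict(tasks)

tasks = form_tasks(blocks)
# Three tasks result: the two 1 ms blocks merge into one task, while
# the 10 ms block and the event-triggered block each get their own.
```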
The difference in market share can be attributed to the availability of Simulink "toolboxes" for numerous embedded system design tasks (from fixed-point optimization to FPGA [Field Programmable Gate Array]-based implementation) and to its widespread adoption in undergraduate university courses, which makes it well known to most of today's engineers. The second class of tools either plays an ancillary role in the design of embedded control systems (e.g., StateFlow and Esterel Studio), or is devoted to inherently control-dominated application areas, such as telecommunication protocols. In the latter market the clear dominator today is Tau. The underlying languages, such as the Specification and Description Language (SDL) and Message Sequence Charts, are standardized by the International Telecommunication Union (ITU). They are commonly used to describe protocol standards in a tool-independent way; modeling in SDL is thus quite natural in this application domain, since validation and refinement can proceed formally within a unified environment. Tau also has code generation capabilities, both for application code and for customizing the real-time kernels on which the FSM-generated code will run. The use of Tau for embedded code generation (model-based design) significantly predates that of Simulink-based code generators, mostly because of the highly complex nature of telecom protocols and the less demanding memory and computing power constraints of switches and other networking equipment. Tau has links to the requirements capture tool Doors [9], also from Telelogic, which allows one to trace dependencies between multiple requirements written in English and to connect them to the aspects of the embedded system design files that implement these requirements. The state of the art in such requirement tracing, however, is far from satisfactory, since Doors has no formal means to automatically check for violations. Similar capabilities are provided by Reqtify [20].
Techniques for automated functional constraint validation, starting from formal languages, are described in several books, for example, References 21 and 22. Deadline, latency, and throughput constraints are special kinds of nonfunctional requirements that have received extensive treatment in the real-time scheduling community. They are also covered in several books, for example, References 23–25. While model-based functional verification is quite attractive, due to its high abstraction level, it ignores cost and performance implications of algorithmic decisions. These are taken into account by the tools described in the next section.
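As an example of the static schedulability checks from the real-time scheduling literature mentioned above, the classic Liu and Layland utilization bound for rate-monotonic scheduling can be coded directly. The task parameters below are illustrative; note that the bound is a sufficient condition only, so a task set that fails it may still be schedulable.

```python
# Rate-monotonic schedulability test (Liu and Layland, 1973):
# n periodic tasks with deadlines equal to periods are schedulable
# under fixed-priority RM scheduling if total processor utilization
# does not exceed n * (2^(1/n) - 1). Sufficient, not necessary.

def rm_utilization_bound(n: int) -> float:
    return n * (2 ** (1.0 / n) - 1)

def rm_schedulable(tasks) -> bool:
    """tasks: list of (worst_case_execution_time, period) pairs."""
    utilization = sum(c / t for c, t in tasks)
    return utilization <= rm_utilization_bound(len(tasks))

tasks = [(1, 4), (1, 5), (2, 10)]   # illustrative WCET/period pairs, ms
print(rm_schedulable(tasks))        # True: U = 0.65 <= 0.7797...
```

For a single task the bound is 1.0 (any utilization up to full load is fine), and it decreases toward ln 2 ≈ 0.693 as the number of tasks grows.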
3.4 Function–Architecture and Hardware–Software Codesign

In this section, we describe some of the tools that are available to help embedded system designers optimally architect the implementation of the system, and choose the best solution for each functional
component. After these decisions have been made, detailed design can proceed using the languages, tools, and methods described in the following chapters of this book. This step of the design process, whose general structure has been outlined in Section 3.2 using the platform-based design paradigm, has received various names in the past. Early work [26,27] called it hardware–software codesign (or cosynthesis), because one of the key decisions at this level is which functionality is to be implemented in software and which in dedicated hardware, and how the two partitions of the design interact with minimum cost and maximum performance. Later on, people came to realize that hardware–software was too coarse a granularity, and that more implementation choices had to be taken into account. For example, one could trade off single versus multiple processors, general-purpose CPUs versus specialized DSPs and Application-Specific Instruction-set Processors (ASIPs), dedicated ASICs versus ASSPs (e.g., an MPEG coprocessor or an Ethernet Medium Access Controller), and standard cells versus FPGAs. Thus the term function–architecture codesign was coined [1], to refer to the more complex problem of partitioning a given functionality onto a heterogeneous architecture such as the one in Figure 3.1. The term system-level design also had some popularity in the industry [6,28], to indicate "the level of design above Register Transfer, at which software and hardware interact." Other terms, such as timed functional model, have also been used [29]. The key problems tackled by tools acting as a bridge between the system-level application and the architectural platform are:

1. How to model the performance impact of making mapping decisions from a virtually "implementation-independent" functional specification to an architectural model.

2. How to efficiently drive downstream code generation, synthesis, and validation tools, to avoid redoing the modeling effort from scratch at the RTL, C, or assembly code levels, respectively.

The notion of automated implementation generation from a high-level functional model is called "model-based design" in the software world. In both cases, the notion of an "implementation-independent" functional specification, which can be retargeted indifferently to hardware and software implementations, must be carefully evaluated and considered. Taken in its most literal terms, this idea has often been dismissed as a myth. However, current practice shows that it is already a reality, at least for some application domains (automotive electronics and telecommunication protocols). It is intuitively very appealing, since it can be considered a high-level application of the platform-based design principle, using a formal system-level platform. Such a platform, embodied in one of the several models of computation that are used in embedded system design, is a perfect candidate to maximize design reuse and to optimally exploit different implementation options. In particular, several of the tools mentioned in the previous section (e.g., Simulink, TargetLink, StateFlow, SPW, System Studio, Tau, Ascet-SD, StateMate, Esterel Studio) have code generation capabilities that are considered good enough for implementation, and not just for rapid prototyping and simulation acceleration. Moreover, several of them (e.g., Simulink, StateFlow, SPW, System Studio, StateMate, Esterel Studio) can generate both C for software implementation and synthesizable VHDL or Verilog for hardware implementation. Unfortunately, these code generation capabilities often require the laborious creation of implementation models for each target platform (e.g., software in C or assembler for a given DSP, synthesizable VHDL or macroblock netlist for ASIC or FPGA, etc.).
However, since these careful implementations are instances of the system-level platform mentioned above, their development cost can be shared among the multitude of designs performed using the tool. Most block diagram or Statechart-based code generators work in a syntax-directed fashion. A piece of C or synthesizable VHDL code is generated for each block and connection, or for each hierarchical state and transition. Thus the designer has tight control over the complexity of the generated software or hardware. While this is a convenient means of bringing manual optimization capabilities into the model-based design flow, it has a potentially significant disadvantage in terms of cost and performance (comparable to disabling optimizations in a C compiler). On the other hand, more recent tools, such as
Esterel Studio and System Studio, take a more radical approach to code generation, based on aggressive optimizations [30]. These optimizations, based on logic synthesis techniques even in the case of software implementation, destroy the original model structure, and thus make debugging and maintenance much harder. However, they can yield an order of magnitude improvement in cost (memory size) and performance (execution speed) over their syntax-directed counterparts [31].

Assuming that good automated code generation, or manual design, is available for each block in the functional model of the application, we are now faced with the function–architecture codesign problem. This essentially means tuning the functional decomposition, as well as the algorithms employed by the overall functional model and by each block within it, to the available architecture, and vice versa. Several design environments, for example:

• POLIS [1], COSYMA [26], Vulcan [27], COSMOS [32], and Roses [33] in the academic world
• Real Time Studio [14], Foresight [34], and CARDtools [35] in the commercial world

help the designer in this task by exploiting, in one way or another, the independence between the functional specification on one side, and hardware–software partitioning or architecture mapping choices on the other. The step of performance evaluation is performed in an abstract, approximate manner by the tools listed above. Some of them use estimators to evaluate the cost and performance of mapping a functional block to an architectural block. Others (e.g., POLIS) rely on cycle-approximate simulation to perform the same task in a manner that better reflects real-life effects, such as bursty resource occupation. Techniques for deriving both abstract static performance models (e.g., the WCET of a software task) and performance simulation models are discussed below. In all cases, both the cost of computation and that of communication must be taken into account.
This is because the best implementation, especially in the case of multimedia systems that manipulate large amounts of image and sound data, is often the one that reduces the amount of data transferred between multiple memory locations, rather than the one that finds the absolute best trade-off between software flexibility and hardware efficiency. In this area, the Atomium project at IMEC [2,36] has focused on finding the best memory architecture and schedule of memory transfers for data-dominated applications on mixed hardware–software platforms. By exploiting array access models based on polyhedra, it identifies the best reorganization of the inner loops of DSP kernels and the best embedded memory architecture. The goal is to reduce memory traffic due to register spills, and to maximize overall performance by accessing several memories in parallel (many DSPs offer this opportunity even in the embedded software domain). A very interesting aspect of Atomium, which distinguishes it from most other optimization tools for embedded systems, is its ability to return a set of Pareto-optimal solutions (i.e., solutions none of which is strictly better than another in every aspect of the cost function), rather than a single solution. This allows the designer to pick the best point based on the various aspects of cost and performance (e.g., silicon area versus power and performance), rather than being forced to "abstract" optimality into a single number. Performance analysis can be based on simulation, as mentioned above, or rely on automatically constructed models that reflect the WCET of pieces of software (e.g., RTOS tasks) running on an embedded processor. Such models, which must be both provably conservative and reasonably accurate, can be constructed by using an execution model called abstract interpretation [37].
This technique traverses the software code while building a symbolic model, often in the form of linear inequalities [38,39], which represents the requests that the software makes to the underlying hardware (e.g., code fetches, data loads and stores, code execution). A solution to those inequalities then represents the total "cost" of one execution of the given task. It can then be combined with processor, bus, cache, and main memory models that in turn compute the cost of each of these requests in terms of time (clock cycles) or energy. This finally results in a complete model for the cost of mapping that task to those architectural resources. Another technique for software performance analysis, which does not require detailed models of the hardware, uses an approximate compilation step from the functional model to an executable model
(rather than a set of inequalities as above) annotated with the same set of fetch, load, store, and execute requests. Then simulation is used, in a more traditional setting, to analyze the cost of implementing that functionality on a given processor, bus, cache, and memory configuration. Simulation is more effective than WCET analysis in handling multiprocessor implementations, in which bus conflicts and cache pollution can be difficult, if not utterly impossible, to predict statically in a manner that is not overly conservative. However, its success in identifying the true worst case depends on the designer's ability to provide the appropriate simulation scenarios. Coverage enhancement techniques from the hardware verification world [40,41] can be extended to help in this case as well. Similar abstract models can be constructed in the case of implementation as dedicated hardware, by using high-level synthesis techniques. Such techniques are not yet good enough to generate production-quality RTL code, but they can serve as reasonable estimators of area, timing, and energy costs for both ASIC and FPGA implementations [42–44]. SystemC [29] and SpecC [45,46], on the other hand, are more traditional modeling and simulation languages, for which the design flow is based on successive refinement rather than codesign or mapping. Finally, OPNET [47] and NS [48] are simulators with a rich modeling library specialized for wireline and wireless networking applications. They help the designer in the more abstract task of generic performance analysis, without the notion of function–architecture separation and codesign. Communication performance analysis, on the other hand, is generally not done using approximate compilation or WCET analysis techniques like those outlined above. Communication is generally implemented not by synthesis but by refinement, using patterns and "recipes," such as interrupt-based, DMA-based, and so on.
Thus several design environments and languages at the function–architecture level, such as POLIS, COSMOS, Roses, SystemC, and SpecC, as well as N2C [6], provide mechanisms to replace abstract communication, for example, FIFO-based or discrete-event-based, with detailed protocol stacks using buses, interrupt controllers, memories, drivers, and so on. These refinements can then be estimated either using a library-based approach (they are generally part of a library of implementation choices anyway), or sometimes using the approaches described above for computation. Their cost and performance can thus be combined in an overall system-level performance analysis. However, approximate performance analysis is often not good enough, and a more detailed simulation step is required. This can be achieved by using tools, such as Seamless [49], CoMET [50], MaxSim [51], and N2C [6]. They work at a lower abstraction level, by cosimulating software running on Instruction Set Simulators (ISSs) and hardware running in a Verilog or VHDL simulator. While the simulation is often slower than with more abstract models, and dramatically slower than with static estimators, the precision can now be at the cycle level. Thus it permits close investigation of detailed communication aspects, such as interrupt handling and cache behavior. These approaches are further discussed in the next section. The key advantage of using the mapping-based approach over the traditional design–evaluate–redesign one is the speed with which design space exploration can be performed. This is done by setting up experiments that change either mapping choices or parameters of the architecture (e.g., cache size, processor speed, or bus bandwidth). Key decisions, such as the number of processors and the organization of the bus hierarchy, can thus be based on quantitative application-dependent data, rather than on past experience. 
If mapping can then be used to drive synthesis, in addition to simulation and formal verification, the advantages in terms of time-to-market and reduction of design effort are even more significant. Model-based code generation, as mentioned in the previous section, is reasonably mature, especially for embedded software in application areas such as avionics, automotive electronics, and telecommunications. In these areas, considerations other than absolute minimum memory footprint and execution time (for example, safety, sheer complexity, and time-to-market) dominate the design criteria. At the very least, if some form of automated model-based synthesis is available, it can be used to rapidly generate FPGA- and processor-based prototypes of the embedded system. This significantly speeds up verification with respect to workstation-based simulation. It even permits some hardware-in-the-loop validation in cases (e.g., the notion of "driveability" of a car) where no formalization or simulation is possible, and a real physical experiment is required.
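The mapping-based design space exploration described above can be caricatured in a few lines: evaluate every mapping of a set of functional kernels onto a set of resources against a rough cost model, instead of redesigning and re-measuring each variant. The kernel names, cycle counts, and bus penalty below are invented estimates, not measurements.

```python
# Hypothetical design-space exploration sweep over mapping choices.
from itertools import product

cycles = {  # estimated cycles per invocation on each resource (made up)
    "fft":     {"cpu": 12_000, "dsp": 3_000},
    "viterbi": {"cpu": 20_000, "dsp": 6_000},
}
# Extra cycles when the two kernels sit on different resources and must
# communicate over the bus (invented estimate).
bus_penalty = 2_000

def estimate(mapping):
    load = {"cpu": 0, "dsp": 0}
    for kernel, resource in mapping.items():
        load[resource] += cycles[kernel][resource]
    comm = bus_penalty if len(set(mapping.values())) > 1 else 0
    # Resources run in parallel, so latency is the busiest resource
    # plus the communication penalty.
    return max(load.values()) + comm

candidates = [dict(zip(cycles, m))
              for m in product(["cpu", "dsp"], repeat=len(cycles))]
best = min(candidates, key=estimate)
print(best, estimate(best))
```

Sweeping all four mappings takes microseconds, whereas building and measuring four real implementations would take weeks; this speed difference is exactly why quantitative mapping-based exploration beats the design-evaluate-redesign loop.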
3.5 Hardware–Software Coverification and Hardware Simulation

Traditionally, the term "hardware–software codesign" has been identified with the ability to execute a simulation of the hardware and the software at the same time. We prefer to use the term "hardware–software coverification" for this task, and leave codesign for the synthesis- and mapping-oriented approaches outlined in the previous section. The area, in the form of simultaneously running an ISS and a Hardware Description Language (HDL) simulator while keeping the timing of the two synchronized, is not new [52]. In recent years, however, we have seen a number of approaches to speeding up the task, in order to tackle platforms with several processors and the need, for example, to boot an operating system in order to coverify a platform with a processor and its peripherals. Recent techniques have been devoted to the three main ways in which cosimulation speed can be increased.

Accelerate the hardware simulator. Coverification generally works at the "clock cycle accurate" level, meaning that both the hardware simulator and the ISS view time as a sequence of discrete clock cycles, ignoring finer aspects of timing (sometimes clock phases are considered, e.g., for DSP systems in which different memory banks are accessed in different phases of the same cycle). This allows one to speed up simulation with respect to traditional event-driven logic simulation, and yet retain enough precision to identify bottlenecks, such as interrupt service latency or bus arbitration overhead. Native-code hardware simulation (e.g., NCSim [28]) and emulation (e.g., QuickTurn [28] and Mentor Emulation [49]) can further speed up hardware simulation, at the expense of longer compilation times and much higher costs, respectively.

Accelerate the ISS. Compiled-code simulation has been a popular topic in this area as well [53].
The technique compiles a piece of assembler or C code for a target processor into object code that can be run on a host workstation. This code generally also contains annotations counting clock cycles by modeling the processor pipeline. The speed-up that can be achieved with this technique over a traditional ISS, which fetches, decodes, and executes each target instruction individually, is significant (at least one order of magnitude). Unfortunately this technique is not suitable for self-modifying code, such as that of a RTOS. This means that it is difficult to adapt to modern embedded software, which almost invariably runs under RTOS control, rather than on the bare CPU. However, hybrid techniques involving partial compilation on the fly are reportedly used by companies selling fast ISSs [50,51]. Accelerate the interface between the two simulators. This is the area where the earliest work has been performed. For example, Seamless [49] uses sophisticated filters to avoid sending requests for memory accesses over the CPU bus. This allows the bus to be used only for peripheral access, while memory data are provided to the processor directly by a “memory server,” which is a simulation filter sitting in between the ISS and the HDL simulator. The filter reduces stimulation of the HDL simulator, and thus can result in speed-ups of one or more orders of magnitude, when most of the bus traffic consists of filtered memory accesses. Of course, also precision of analysis drops, since, for example, it becomes harder to identify an overload in the processor bus due to a combination of memory and peripheral accesses, since no simulator component sees both. In the HDL domain, as mentioned above, progress in the levels of performance has been achieved essentially by raising the level of abstraction. 
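The interpretive-versus-compiled ISS distinction above can be sketched in a few lines. The three-instruction “ISA,” the one-cycle-per-instruction timing model, and the register names below are invented for illustration, not taken from any real simulator; a real compiled ISS would emit host machine code, not Python source.

```python
# Hypothetical mini-ISA: an interpretive ISS decodes every instruction on every
# execution, while compiled-code simulation translates the block once into host
# code carrying a cycle-count annotation.

PROGRAM = [("addi", "r0", 1), ("addi", "r1", 2), ("add", "r0", "r1")]

def interpret(program):
    """Interpretive ISS: fetch/decode/execute each instruction, every time."""
    regs = {"r0": 0, "r1": 0}
    cycles = 0
    for op, dst, src in program:       # decode cost paid on every execution
        if op == "addi":
            regs[dst] += src
        elif op == "add":
            regs[dst] += regs[src]
        cycles += 1                    # toy timing model: 1 cycle/instruction
    return regs, cycles

def compile_block(program):
    """Translate the block once into a host function; decode cost paid once."""
    lines, cycles = ["def block(regs):"], 0
    for op, dst, src in program:
        if op == "addi":
            lines.append(f"    regs['{dst}'] += {src}")
        else:
            lines.append(f"    regs['{dst}'] += regs['{src}']")
        cycles += 1
    lines.append(f"    return {cycles}")   # cycle-count annotation
    ns = {}
    exec("\n".join(lines), ns)
    return ns["block"]
```

Running the compiled block repeatedly amortizes the one-time translation cost, which is where the order-of-magnitude speed-up over interpretation comes from.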
A “cycle-based” simulator, that is, one that ignores the timing information within a clock cycle, can be dramatically faster than one that requires the use of a timing queue to manage time-tagged events, for two main reasons. The first is that the whole model can be evaluated at every simulation clock cycle, with no event scheduling. This also makes simulation much more parallelizable, while event-driven simulators do not map well onto a parallel machine due to the presence of the centralized timing queue. Of course, there is a penalty if most of the hardware is generally idle, since it has to be evaluated anyway, but clock gating techniques developed for low-power design can be applied here as well. The second is that the overhead of managing the time queue, which often accounts for 50% to 90% of event-driven simulation time, is completely eliminated.
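A minimal sketch of the cycle-based style, using a toy one-bit netlist (a toggle flip-flop with enable) invented for this example: every combinational function is evaluated at every cycle, in topological order, and there is no timing queue at all.

```python
# Cycle-based simulation sketch: evaluate ALL combinational gates every cycle
# (topological order), then update registers at the clock edge. No event queue.

COMB = [("d", lambda nets: nets["q"] ^ nets["en"])]   # next-state logic
REGS = [("q", "d")]                                   # (reg output, reg input)

def clock_cycle(nets):
    for out, fn in COMB:              # unconditional evaluation, every cycle
        nets[out] = fn(nets)
    for q, d in REGS:                 # register update at the clock edge
        nets[q] = nets[d]

def run(cycles, en=1):
    nets = {"en": en, "q": 0, "d": 0}
    for _ in range(cycles):
        clock_cycle(nets)
    return nets["q"]
```

With the enable high, the flip-flop toggles once per cycle; with it low, the register simply holds its value, yet is still evaluated each cycle, which is exactly the idle-hardware penalty mentioned above.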
Modern HDLs either are totally cycle-based (e.g., SystemC 1.0 [29]) or have a “synthesizable subset,” which is fully synchronous and thus fully compilable to cycle-based simulation. The same synthesizable subset, by the way, is also supported by hardware emulation techniques, for obvious reasons.

Another interesting area of cosimulation in embedded system design is analog–digital cosimulation. This is because such systems quite often include analog components (amplifiers, filters, A/D and D/A converters, demodulators, oscillators, phase locked loops [PLLs], etc.), and models of the environment quite often involve only continuous variables (distance, time, voltage, etc.). Simulink includes a component for simulating continuous-time models, employing a variety of numerical integration methods, which can be freely mixed with discrete-time sampled-data subsystems. This is very useful when modeling and simulating, for example, a control algorithm for automotive electronics, in which the engine dynamics are modeled with differential equations, while the controller is described as a set of blocks implementing a sampled-time subsystem. Simulink is still mostly used to drive software design, despite the availability of good toolkits implementing its models in reconfigurable hardware [54,55]. Simulators in the hardware design domain, on the other hand, generally use HDLs as their input languages. Analog extensions of both VHDL [56] and Verilog [57] are available. In both cases, one can represent quantities that satisfy either of Kirchhoff’s Laws (i.e., conserved over cycles or nodes). Thus one can easily build netlists of analog components interfacing with the digital portion, which is modeled using traditional Boolean or multivalued signals. The simulation environment then takes care of synchronizing the event-driven portion and the continuous-time portion.
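The Simulink-style mix of a continuous-time plant and a sampled-data controller can be sketched in a few lines. The first-order plant, the proportional gain, and the step sizes below are invented for illustration, and forward Euler stands in for the better integration methods a real tool would use.

```python
# Mixed continuous/discrete simulation sketch: a continuous plant
# dx/dt = -x + u integrated with small Euler steps, and a sampled
# proportional controller evaluated only at discrete ticks (every ts seconds).
# All gains and step sizes are illustrative assumptions.

def simulate(t_end=5.0, h=0.001, ts=0.05, setpoint=1.0, kp=4.0):
    x, u, t, next_sample = 0.0, 0.0, 0.0, 0.0
    while t < t_end:
        if t >= next_sample:          # discrete-time controller tick
            u = kp * (setpoint - x)   # proportional control, held until next tick
            next_sample += ts
        x += h * (-x + u)             # continuous-time plant, Euler step
        t += h
    return x
```

For these values the closed loop settles where x = kp·(setpoint − x), i.e., x = 0.8, illustrating how the integration step h and the sampling period ts live on very different time scales.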
A key problem here is to avoid causality errors, which arise when an event that happens later in “host workstation” time (because the simulator takes care of it later) has an effect on events that preceded it in “simulated time.” In this case, one of the simulators has to “roll back” in time, undoing any potential changes in the state of the simulation, and restart with the new information that something has happened in the past (generally the analog simulator does this, since it is easier to reverse time in that case). In this case too, as we have seen for hardware–software cosimulation, execution is much slower than in the pure event-driven or cycle-based case, due to the need to take small simulation steps in the analog part. There is only one case in which the performance of the interface between the two domains, or of the continuous-time simulator, is not problematic: when the continuous-time part is much slower in reality than the digital part. A classical example is automotive electronics, in which mechanical time constants are larger by several orders of magnitude than the clock period of a modern integrated circuit. Thus the performance of continuous-time electronics and mechanical cosimulation may not be the bottleneck, except in the case of extremely complex environment models with huge systems of differential equations (e.g., accurate combustion engine models). In that case, hardware emulation of the differential equation solver is the only option (e.g., see Reference 16).
3.6 Software Implementation

The next two sections provide an overview of traditional design flows for embedded hardware and software. They are meant as a general introduction to the topics described in the rest of the book, and as a source of references to standard design practice.

The software components of an embedded system are generally implemented using the traditional design–code–test–debug cycle, which is often represented using a V-shaped diagram to illustrate the fact that every implementation level of a complex software system must have a corresponding verification level (Figure 3.4). The parts of the V-cycle that relate to system design and partitioning have been described in the previous sections. Here we outline the tools that are available to the embedded software developer.
3.6.1 Compilation, Debugging, and Memory Model

Compilation of mathematical formulas into binary machine-executable code followed almost immediately the invention of the electronic computer. The first Fortran compiler dates back to 1954, and subroutines
[Figure 3.4 shows a V-shaped diagram: requirements, function and system analysis, system design partitioning, and SW design specification flow down the left branch to implementation at the bottom; SW integration, subsystem and communication testing, system validation, and the final product rise up the right branch.]
FIGURE 3.4 V-cycle for software implementation.
were introduced in 1958, resulting in the creation of the Fortran II language. Since then, languages have evolved a little, more structured programming methodologies have been developed, and compilers have improved quite a bit, but the basic method has remained the same. In particular the C language, originally designed by Dennis Ritchie between 1969 and 1972 and popularized by Kernighan and Ritchie [58], and used extensively for programming the Unix operating system, is now dominant in the embedded system world, having almost completely replaced the more flexible but much more cumbersome and less portable assembler. Its descendants Java and C++ are beginning to make some inroads, but are still viewed as requiring too much memory and computing power for widespread embedded use. Java, although originally designed for embedded applications [59,60], has a memory model based on garbage collection that still defies effective embedded real-time implementation [61].

The first compilation step from a high-level language is the conversion of the human-written or machine-generated code into an internal format, called an Abstract Syntax Tree [62], which is then translated into a representation that is closer to the final output (generally assembler code) and is suitable for a host of optimizations. This representation can take the form of a control/dataflow graph or a sequence of register transfers. The internal format is then mapped, generally via a graph-matching algorithm, to the set of available machine instructions, and written out to a file. A set of assembler files, in which references to data variables and to subroutine names are still based on symbolic labels, is then converted to an absolute binary file, in which all addresses are explicit. This phase is called assembly and loading.
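A toy version of the pipeline just described can make the steps concrete: an expression AST is lowered to a three-address intermediate form, and symbolic operands are then resolved against an address map. The intermediate “ISA” and the addresses below are invented for illustration.

```python
# Sketch of AST lowering and symbol resolution. A real compiler would also
# perform optimization and instruction selection between these two steps.

def lower(node, code):
    """Post-order walk of the AST; emits (op, dst, lhs, rhs) tuples."""
    if isinstance(node, str):
        return node                     # leaf: a named variable
    op, left, right = node
    a, b = lower(left, code), lower(right, code)
    dst = f"t{len(code)}"               # fresh temporary per instruction
    code.append((op, dst, a, b))
    return dst

def assemble(code, symtab):
    """Replace symbolic variable names with absolute addresses; temporaries
    (which a real compiler would map to registers) are left as-is here."""
    return [tuple(symtab.get(x, x) for x in instr) for instr in code]
```

Lowering ("+", ("*", "a", "b"), "c") yields a multiply into t0 followed by an add into t1; assembling against a symbol table then replaces a, b, and c with their addresses, mirroring the assembly-and-loading phase described above.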
Relocatable code generation techniques, which permit code and its data to be placed anywhere in memory without requiring recompilation, are now used in the embedded system domain as well, thanks to the availability of index registers and relative addressing modes in modern microprocessors.

Debuggers for embedded systems are far more critical than those for general-purpose programming, due to the more limited accessibility of the embedded CPU (often no file system, limited display and keyboard, etc.). They must be able to show several concurrent threads of control, as they interact with each other and with the underlying hardware. They must also be able to do so while minimally disrupting the normal operation of the system, since it often has to work in real time, interacting with its environment. Both hardware and operating system support are essential, and the main RTOS vendors, such as WindRiver, all provide powerful interactive multitask debuggers. Hardware support takes the form of breakpoint and watchpoint registers, which can be set to interrupt the CPU when a given address is used for fetching or
data load/store, without requiring one to change the code (which may be in ROM) or to continuously monitor data accesses, which would dramatically slow down execution.

A key difference between most embedded software and most general-purpose software is the memory model. In the latter case, memory is viewed as an essentially infinite uniform linear array, and the compiler provides a thin layer of abstraction on top of it, by means of arrays, pointers, and records (or structs). The operating system generally provides virtual memory capabilities, in the form of user functions to allocate and deallocate memory, and by swapping less frequently used pages of main memory to disk. This provides the illusion of a memory as large as the disk area allocated to paging, but with the same direct addressability characteristics as main memory. In embedded systems, however, memory is an expensive resource, in terms of both size and speed. Cost, power, and physical size constraints generally forbid the use of virtual memory, and performance constraints force the designer to always carefully lay out data in memory, and to match the characteristics of the various memory types (SRAM, DRAM, Flash, ROM) to those of the data and code. Scratchpads [63], that is, manually managed areas of small and fast memory, often on-chip SRAM, are still dominant in the embedded world. Caches are frowned upon in the real-time application domain, since the time at which a computation is performed often matters much more than the accuracy of its result. Despite a large body of research devoted to timing analysis of software code in the presence of caches (e.g., see References 64 and 65), cache performance must still be assumed to be worst-case, rather than average-case as in general-purpose and scientific computing, thus leading to poor performance at a high cost (large and power-hungry tag arrays).
However, compilers that traditionally focused on code optimizations for various underlying architectural features of the processor [66] now offer more and more support for memory-oriented optimizations, in terms of scheduling data transfers, sizing memories of various types, and allocating data to memory, sometimes moving it back and forth between fast but expensive and slow but cheap storage1 [2,63].
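One simple scratchpad-allocation policy in the spirit of [63] can be sketched as a greedy knapsack heuristic: place the data objects with the highest profiled accesses-per-byte into the small fast memory until it fills up. The object names, sizes, and access counts below are invented for illustration.

```python
# Greedy scratchpad allocation sketch: rank data objects by access density
# (profiled accesses per byte) and fill the scratchpad in that order.

def allocate_scratchpad(objects, capacity):
    """objects: dict name -> (size_bytes, access_count).
    Returns the names placed in scratchpad, in placement order."""
    ranked = sorted(objects,
                    key=lambda n: objects[n][1] / objects[n][0],
                    reverse=True)                  # hottest bytes first
    placed, used = [], 0
    for name in ranked:
        size, _ = objects[name]
        if used + size <= capacity:                # fits in remaining space
            placed.append(name)
            used += size
    return placed
```

With a 1 KB scratchpad, a hot 256-byte coefficient table and a 512-byte lookup table win over a large, rarely touched frame buffer, which stays in slow off-chip memory.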
3.6.2 Real-Time Scheduling

Another key difference with respect to general-purpose software is the real-time nature of most embedded software, due to its continual interaction with an environment that seldom can wait. In hard real-time applications, results produced after the deadline are totally useless. In soft real-time applications, on the other hand, a merit function measures the QOS, allowing one to evaluate trade-offs between missing various deadlines and degrading the precision or resolution with which computations are performed. While the former is often associated with safety-critical (e.g., automotive or avionics) applications and the latter with multimedia and telecommunication applications, algorithm design can make a difference even within the very same domain. Consider, for example, a frame decoding algorithm that generates its result at the end of each execution, and that is scheduled to execute every 1/50 of a second. If the CPU load does not allow it to complete each execution before the deadline, the algorithm will not produce any results, and will thus behave as a hard real-time application, without being life-threatening. A smarter algorithm or a smarter scheduler, on the other hand, would simply reduce the frame size or the frame rate whenever the CPU load due to other tasks increases, and thus produce a result that has lower quality, but is still viewable.

A huge amount of research, summarized in excellent books such as References 23–25, has been devoted to solving the problems introduced by real-time constraints on embedded software. Most of this work models the system (application, environment, and platform) in very abstract terms, as a set of tasks, each with a release time (when the task becomes ready), a deadline (by which the task must complete), and a WCET. In most cases tasks are periodic, that is, release times and deadlines of multiple instances of the same task are separated by a fixed period.
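The periodic task model just described admits well-known schedulability tests. The sketch below assumes deadline equal to period and rate-monotonic priorities (shortest period = highest priority, tasks listed in priority order); the task sets used are invented for illustration.

```python
# Classical fixed-priority schedulability analysis sketch: Liu and Layland's
# sufficient utilization bound n(2^(1/n)-1), and the exact response-time
# recurrence R = C_i + sum over higher-priority j of ceil(R/T_j)*C_j.
import math

def utilization_bound_ok(tasks):
    """Sufficient (not necessary) test. tasks: list of (wcet, period)."""
    n = len(tasks)
    return sum(c / t for c, t in tasks) <= n * (2 ** (1 / n) - 1)

def response_time(tasks, i):
    """Exact worst-case response time of task i (tasks sorted by priority).
    Returns None if the task cannot meet its deadline (= its period)."""
    c_i, t_i = tasks[i]
    r = c_i
    while True:
        r_next = c_i + sum(math.ceil(r / t_j) * c_j
                           for c_j, t_j in tasks[:i])   # interference
        if r_next > t_i:
            return None               # deadline miss
        if r_next == r:
            return r                  # fixed point reached: schedulable
        r = r_next
```

The set {(1, 4), (2, 6), (3, 10)} fails the utilization bound (U ≈ 0.88 > 0.78) yet the exact test shows every task meets its deadline, which is why the bound is only sufficient.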
1 While this may seem similar to virtual memory techniques, it is generally done explicitly, always keeping cost, power, and performance under tight control.

The job of the scheduler is to find an execution order such that each task can complete by its deadline, if one exists. The scheduler may or may not, depending on the underlying hardware and software platform (CPU, peripherals, and RTOS), be able to preempt an executing
task in order to execute another one. Generally the scheduler bases its preemption decision, and the choice of which task must be run next, on an integer rank assigned to each task, called its priority. Priorities may be assigned statically, at compile time, or dynamically, at runtime. The trade-off is between the use of precious CPU resources for runtime (also called online) priority assignment, based on an observation of the current execution conditions, and the waste of resources inherent in an a priori priority assignment. In general, a scheduling algorithm is also expected to be able to tell conservatively whether a set of tasks is unschedulable on a given platform, under a given set of modeling assumptions (e.g., availability of preemption, fixed or stochastic execution time, and so on). Unschedulability may occur, for example, because the CPU is not powerful enough and the WCETs are too long to satisfy some deadline. In this case the remedy could be the choice of a faster clock frequency, a change of CPU, the transfer of some functionality to a hardware coprocessor, or the relaxation of some of the constraints (periods, deadlines, etc.).

A key distinction in this domain is between time-triggered and event-triggered scheduling [67]. The former (also called Time-Division Multiple Access in telecommunications) relies on the fact that the start, preemption (if applicable), and end times of all instances of all tasks are decided a priori, based on worst-case analysis. The resulting system implementation is very predictable, easy to debug, and allows one to guarantee some service even under fault hypotheses [68]. The latter decides start and preemption times based on the actual time of occurrence of the release events, and possibly on the actual execution time (shorter than worst-case).
It is more efficient than time-triggering in terms of CPU utilization, especially when release and execution times are not known precisely but are subject to jitter. It is, however, more difficult to use in practice, because it requires some form of conservative a priori schedulability analysis, and the dynamic nature of event arrival makes troubleshooting much harder. Some of the models and languages listed above, such as synchronous languages and dataflow networks, lend themselves well to time-triggered implementations.

Some form of time-triggered scheduling is being used, or will most likely be used, for both CPUs and communication resources in safety-critical applications. This is already state of the art in avionics (“fly-by-wire,” as used, e.g., in the Boeing 777 and in all Airbus models), and it is being seriously considered for automotive applications (“X-by-wire,” where X can stand for brake, drive, or steer). Coupled with certified high-level language compilers and standardized code review and testing processes, it is considered to be the only mechanism able to comply with the rules imposed by various governmental certification agencies. Moving such control functions to embedded hardware and software, thus replacing older mechanical parts, is considered essential in order to both reduce costs and improve safety. Embedded electronic systems can continuously check for wear and faults in the sensors and the actuators, and thus warn drivers or maintenance teams.

The simple task-based model outlined above can also be modified in various ways in order to take into account:

• The cost of various housekeeping operations, such as recomputing priorities, swapping tasks in and out (also called “context switch”), accessing memory, and so on.
• The availability of multiple resources (processors).
• The fact that a task may need more than one resource (e.g., the CPU, a peripheral, a lock on a given part of memory), and possibly may have different priorities and different preemptability characteristics on each such resource (e.g., CPU access may be preemptable, while disk or serial line access may not).
• Data or control dependencies between tasks.

Most of these refinements of the initial model can be taken into account by appropriately modifying the basic parameters of a task set (release time, execution time, priority, and so on). The only exception is the extension to multiple concurrent CPUs, which makes the problem substantially more complex. We refer the interested reader to References 23–25 for more information about this subject. This sort of real-time schedulability analysis is currently replacing manual trial-and-error and extensive simulation as a means to ensure satisfaction of deadlines or of a given QOS requirement.
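The time-triggered approach discussed earlier precomputes all dispatch decisions offline. A minimal sketch of that offline step, assuming unit-length time slots, deadline equal to period, earliest-deadline ordering within each slot, and an invented task set:

```python
# Offline construction of a time-triggered dispatch table over the hyperperiod.
# At runtime, a trivial dispatcher would just replay this table cyclically.
import math

def build_schedule(tasks):
    """tasks: list of (wcet, period) in unit slots, deadline = period.
    Returns one task index per slot over the hyperperiod; None = idle."""
    hyper = math.lcm(*(t for _, t in tasks))
    remaining = [0] * len(tasks)          # work left for the current instance
    table = []
    for slot in range(hyper):
        for i, (c, t) in enumerate(tasks):
            if slot % t == 0:
                remaining[i] = c          # new instance released
        ready = [i for i in range(len(tasks)) if remaining[i] > 0]
        if ready:
            # run the ready task whose current deadline is earliest
            i = min(ready, key=lambda j: (slot // tasks[j][1] + 1) * tasks[j][1])
            remaining[i] -= 1
            table.append(i)
        else:
            table.append(None)            # idle slot, evaluated anyway
    return table
```

For tasks (1, 2) and (1, 4) the table over the hyperperiod of 4 slots is [0, 1, 0, idle]: entirely predictable, at the cost of the idle slot that an event-triggered scheduler could have reclaimed.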
3.7 Hardware Implementation

The modern hardware implementation process [69,70] in most cases starts from the so-called RTL. At this level of abstraction the required functionality of the circuit is modeled with the accuracy of a clock cycle, that is, it is known in which clock cycle each operation, such as an addition or a data transfer, occurs, but the actual delay of each operation, and hence the stabilization time of data on the inputs of the registers, is not known. At this level the number of registers and their bitwidths are also precisely known. The designer usually writes the model using an HDL, such as Verilog or VHDL, in which registers are represented using special kinds of “clock-triggered” assignments, and combinational logic operations are represented using the standard arithmetic, relational, and Boolean operators that are familiar to software programmers using high-level languages.

The target implementation generally is not in terms of individual transistors and wires, but uses the Boolean gate abstraction as a convenient hand-off point between logic designer and technology specialist. Such an abstraction can take the form of a standard cell, that is, an interconnection of transistors realized and well characterized on silicon, which implements a given Boolean function and exhibits a specific propagation delay from inputs to outputs, under given supply, temperature, and load conditions. It can also be a Combinational Logic Block (CLB) in an FPGA. The former, which is the basis of the modern ASIC design flow, is much more efficient than the latter,2 but it requires a very significant investment in terms of EDA3 tools, mask production costs, and engineer training. The advantage of ASICs over FPGAs in terms of area, power, and performance efficiency comes from two main factors.
The first one is the broader choice of basic gates: an average standard cell library includes about 100 to 500 gates, with both different logic functions and different drive strengths, while a given FPGA contains only one type of CLB. The second one is the use of static interconnection techniques, that is, wires and contact vias, versus the transistor-based dynamic interconnects of FPGAs. The much higher nonrecurrent engineering cost of ASICs comes first of all from the need to create at least one set of masks for each design (assuming it is correct the first time, that is, there is no need to respin), which can cost up to about $1 million for current technologies and is growing very fast, and from the long fabrication times, which can be up to several weeks. Design costs are also higher, again in the million dollar range, both due to the much greater flexibility, requiring skilled personnel and sophisticated implementation tools, and due to the very high cost of design failure, requiring sophisticated verification tools. Thus ASIC designs are the most economically viable solution only for very high volumes. The rising mask costs and manufacturing risks are making the FPGA option viable for larger and larger production counts as technology evolves. A third alternative, structured ASICs, has been proposed recently. It features fixed layout schemes, similar to FPGAs, but implements interconnect using contact vias. A comparison of the alternatives, for a given design complexity and varying production volumes, is shown in Figure 3.5 (the exact points at which each alternative is best are still subject to debate, and they are moving to the right over time).
2 The difference is about one order of magnitude in terms of area, power, and performance for the current fabrication technology, and the ratio is expected to remain constant over future technology generations.
3 The term EDA, which stands for Electronic Design Automation, is often used to distinguish this class of tools from the CAD tools used for mechanical and civil engineering design.

3.7.1 Logic Synthesis and Equivalence Checking

The semantics of HDLs and of languages such as C or Java are very different from each other. HDLs were born in the 1970s in order to model highly concurrent hardware systems, built using registers and Boolean gates. They, and the associated simulators that allow one to analyze the behavior of the modeled design in detail, are very efficient in handling fine-grained concurrency and synchronization, which is necessary when simulating huge Boolean netlists. However, they often lack constructs found in modern programming languages, such as recursive functions and complex data types (only recently introduced in Verilog), or objects, methods, and interfaces. An HDL model is essentially meant to be simulated under
[Figure 3.5 plots total cost against production volume for the three implementation styles: FPGA is cheapest in region A (roughly 1–10,000 units), structured ASIC (SA) in region B (roughly 10,000–100,000 units), and standard cell in region C (above 100,000 units).]
FIGURE 3.5 Comparison between ASIC, FPGA, and Structured ASIC production costs.
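The trade-off in Figure 3.5 follows from a simple linear cost model, total cost = NRE + unit cost × volume, so the cheapest option changes with volume. The dollar figures below are invented purely to reproduce the qualitative crossovers, not actual vendor data.

```python
# Back-of-the-envelope model behind Figure 3.5: fixed nonrecurrent engineering
# (NRE) cost plus a per-unit cost for each implementation style.

OPTIONS = {                       # name: (NRE cost in $, per-unit cost in $)
    "fpga":            (50_000,    40.0),
    "structured_asic": (300_000,   12.0),
    "standard_cell":   (1_500_000,  4.0),
}

def total_cost(name, volume):
    nre, unit = OPTIONS[name]
    return nre + unit * volume

def cheapest(volume):
    """Return the style with the lowest total cost at the given volume."""
    return min(OPTIONS, key=lambda name: total_cost(name, volume))
```

With these numbers the FPGA/structured-ASIC break-even sits near 9,000 units and the structured-ASIC/standard-cell break-even near 150,000 units, matching the three regions of the figure.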
a variety of timing models (generally at the register transfer or gate level, even though cosimulation with analog components or continuous-time models is also supported, e.g., in Verilog-AMS and AHDL). Synthesis from an HDL into an interconnection of registers and gates normally consists of two substeps. The first one, called RTL synthesis and module generation, transforms high-level operators, such as adders, multiplexers, and so on, into Boolean gates using an appropriate architecture (e.g., ripple carry or carry lookahead). The second one, called logic synthesis, optimizes the combinational logic resulting from the above step, under a variety of cost and performance constraints [71,72]. It is well known that, given a function to be implemented (e.g., 32-bit twos-complement addition), one can use the properties of Boolean algebra in order to find alternative implementations with different characteristics in terms of:

1. Area, for example, estimated as the number of gates, or as the number of gate inputs, or as the number of literals in the Boolean expression representing each gate function, or using a specific value for each gate selected from the standard cell library, or even considering an estimate of interconnect area. This sequence of cost functions increases estimation precision, but is more and more expensive to compute.
2. Delay, for example, estimated as the number of levels, or more precisely as a combination of levels and fanout of each gate, or even more precisely as a table that takes into account gate type, transistor size, input transition slope, output capacitance, and so on.
3. Power, for example, estimated as transition activity times capacitance times voltage squared, using the well-known equation valid for Complementary MOS (CMOS) transistors.

It is also well known that Pareto-optimal solutions to this problem generally exhibit an area-delay product that is approximately constant for a given function.
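The power estimate of item 3 above can be made concrete with the standard CMOS switching-power formula P = α·C·V²·f. The numeric values below (a 100k-gate block, 20% average activity, 10 fF switched per gate, 1.2 V supply, 200 MHz clock) are illustrative assumptions, not data for any real library.

```python
# Dynamic (switching) power of CMOS logic: activity factor alpha, total
# switched capacitance C, supply voltage Vdd squared, and clock frequency f.

def dynamic_power(alpha, cap_farads, vdd, freq_hz):
    """Average dynamic power in watts."""
    return alpha * cap_farads * vdd ** 2 * freq_hz

gates, c_per_gate = 100_000, 10e-15          # 10 fF switched per gate (assumed)
p = dynamic_power(alpha=0.2, cap_farads=gates * c_per_gate,
                  vdd=1.2, freq_hz=200e6)    # about 58 mW for these numbers
```

The quadratic dependence on Vdd is why supply-voltage scaling is the single most effective power-reduction knob available to the synthesis flow.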
Modern EDA tools, such as Design Compiler from Synopsys [8], RTL Compiler from Cadence [28], Leonardo Spectrum from Mentor Graphics [49], Synplify from Synplicity [73], Blast Create from Magma Design Automation [74], and others, perform this task efficiently for designs that today may include a few million gates. Their widespread adoption has enabled designers to tackle huge designs in a matter of months, which would have been unthinkable or extremely inefficient using either manual or purely block-based design techniques. Such logic synthesis systems take into account the required functionality, the target clock cycle, and the set of physical gates that are available for implementation (the standard-cell library or the CLB characteristics, e.g., number of inputs), as well as some estimates of the capacitance and resistance of interconnection wires,4 and generate efficient netlists of Boolean gates, which can be passed on to the following design steps.

4 Some such tools also include rough placement and routing steps, which will be described below, in order to increase the precision of such interconnect estimates for current deep submicron (DSM) technologies.
While synthesis is performed using precise algebraic identities, bugs can creep into any program. Thus, in order to avoid extremely costly respins due to an EDA tool bug, it is essential to verify that the functionality of the synthesized gate netlist is the same as that of the original RTL model. This verification step was traditionally performed using a multilevel HDL simulator, comparing responses to designer-written stimuli in both representations. However, multimillion-gate circuits would require too many very slow simulation steps (a large circuit today can be simulated at the speed of a handful of clock cycles per second). Formal verification is thus used to prove, using algorithms that are based on the same laws as the synthesis techniques, but which have been written by different people and thus hopefully have different bugs, that the responses of the two circuit models are indeed identical under all legal input sequences.

This verification, however, solves only half of the problem. One must also check that all combinational logic computations complete within the required clock cycle. This second check can be performed using timing simulators; however, complexity considerations suggest the use of a more static approach. Static Timing Analysis, based on worst-case longest-path search within combinational logic, is today a workhorse of any logic synthesis and verification framework. It can either be based on purely topological information, or consider only so-called true paths along which a transition can propagate [75], or even include the effects of crosstalk on path delay. Crosstalk may alter the delay of a “victim” wire, due to simultaneous transitions of temporally and spatially close “aggressor” wires, as analyzed by tools such as PrimeTime from Synopsys [8] and CeltIc from Cadence [28].
This kind of coupling of timing and geometry makes crosstalk-aware timing analysis very hard, and essentially contributes to the breaking of traditional boundaries between synthesis, placement, and routing. Tools performing these tasks are available from all major EDA vendors (e.g., Synopsys, Cadence) as well as from a host of startups. Synthesis has become more or less a commodity technology, while formal verification, even in its simplest form of equivalence checking, as well as in other emerging forms, such as property checking, which are described below, is still an emerging technology, for which disruptive innovation occurs mostly in smaller companies.
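Purely topological static timing analysis, the simplest of the variants described above, amounts to a longest-path computation over the gate-level DAG. A minimal sketch, with an invented four-node netlist and made-up gate delays (in nanoseconds):

```python
# Topological STA sketch: arrival times propagate through the combinational
# DAG; the latest arrival at an endpoint is the worst-case path delay.

def longest_paths(gates, delay):
    """gates: dict gate -> list of fanin gates (empty = primary input).
    delay: dict gate -> propagation delay. Returns gate -> arrival time."""
    arrival = {}
    def visit(g):                      # depth-first topological evaluation
        if g not in arrival:
            arrival[g] = delay[g] + max((visit(f) for f in gates[g]),
                                        default=0.0)
        return arrival[g]
    for g in gates:
        visit(g)
    return arrival

# Hypothetical netlist: in -> and1 -> xor1 -> out, with in also feeding xor1.
GATES = {"in": [], "and1": ["in"], "xor1": ["in", "and1"], "out": ["xor1"]}
DELAY = {"in": 0.0, "and1": 0.3, "xor1": 0.5, "out": 0.1}
```

Slack against a clock period is then simply the period minus the arrival time at each endpoint; a negative slack anywhere means the synthesized logic cannot run at that clock frequency.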
3.7.2 Placement, Routing, and Extraction

After synthesis (and sometimes during synthesis), gates are placed on silicon, either at fixed locations (the positions of CLBs) for FPGAs and Structured ASICs, or with a row-based organization for standard cell ASICs. Placement must avoid overlaps between cells, while at the same time satisfying clock cycle time constraints by avoiding excessively long wires on critical paths.5 Placement, especially for multimillion-gate circuits, is an extremely difficult problem that requires complex constrained combinatorial optimization. Modern algorithms [76] drastically simplify the model in order to ensure reasonable runtimes. For example, the quadratic placement model used in several modern EDA tools minimizes the sum of squares of net lengths. This permits very efficient derivation of the cost function and fast identification of a minimum-cost solution. However, this quadratic cost only approximately correlates with the true objective, which is the minimization of the clock period, due to parasitic capacitance. The true cost, first of all, also depends on the actual interconnect, which is designed only later by the routing step, and second, it depends on the maximum among a set of sums (one for each register-to-register path), rather than on the sum over all gate-to-gate interconnects. For this reason, modern placers alternate steps solved using fast but approximate algorithms with more precise analysis phases, often involving actual routing, in order to recompute the actual cost function at each step. Routing is the next step; it involves generating (or selecting from the available prelaid-out tracks in FPGAs) the metal and via geometries that will interconnect the placed cells. It is also extremely difficult in modern submicron technologies, not only due to the huge number of geometries involved (10 million gates can easily involve a billion wire segments and contacts), but also due to the complexity of modern

5 Power density has recently become a prime concern for placement as well, implying the need to avoid "hot spots" of very active cells, where power dissipation through the silicon substrate would be too difficult to manage.
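The quadratic objective described above can be made concrete with a toy one-dimensional placer. The sketch below is illustrative only (the net list, pad names, and the Gauss-Seidel solver are choices of this example, not of any particular EDA tool): minimizing the sum of squared two-pin net lengths puts each movable cell at the average of its neighbors' positions, which a simple relaxation sweep finds.

```python
# Toy 1-D quadratic placement (illustrative sketch, not a production placer).
# Minimizing the sum of squared two-pin net lengths makes each movable cell's
# optimal position the average of its neighbours; Gauss-Seidel sweeps converge
# to that optimum.

def quadratic_place(nets, fixed, movable, iters=500):
    """nets: list of (a, b) pin pairs; fixed: {name: x}; movable: list of names."""
    pos = dict(fixed)
    for m in movable:
        pos[m] = 0.0                      # arbitrary starting position
    neighbours = {m: [] for m in movable}
    for a, b in nets:
        if a in neighbours: neighbours[a].append(b)
        if b in neighbours: neighbours[b].append(a)
    for _ in range(iters):                # Gauss-Seidel relaxation
        for m in movable:
            ns = neighbours[m]
            pos[m] = sum(pos[n] for n in ns) / len(ns)
    return pos

# Chain: pad0(0) - c1 - c2 - c3 - pad1(12); the optimum spaces cells evenly.
p = quadratic_place([("pad0", "c1"), ("c1", "c2"), ("c2", "c3"), ("c3", "pad1")],
                    {"pad0": 0.0, "pad1": 12.0}, ["c1", "c2", "c3"])
print(round(p["c1"], 2), round(p["c2"], 2), round(p["c3"], 2))  # → 3.0 6.0 9.0
```

The true objective (the clock period) would replace the squared lengths with path delays, which is exactly why, as noted above, placers must interleave such fast approximate steps with detailed timing analysis.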
© 2006 by Taylor & Francis Group, LLC
ZURA: “2824_C003” — 2005/6/21 — 20:01 — page 18 — #20
Design of Embedded Systems
interconnect modeling. A wire used to be modeled, in CMOS technology, essentially as a parasitic capacitance. This (or a minor variation that also considers resistance) is still the model used by several commercial logic synthesis tools. However, nowadays a realistic model of a wire, to be used when estimating the cost of a placement or of a routing solution, must take into account:

• Realistic resistance and capacitance, for example, using the Elmore model [77], considering each wire segment separately, due to the very different resistance and capacitance characteristics of different metal layers.6
• Crosstalk noise due to capacitive coupling.7

This means that, exactly as in placement (and sometimes during placement), one needs to alternate between fast routing using approximate cost functions and detailed analysis steps that refine the value of the cost function. Again, all major EDA vendors offer solutions to the routing problem, which are generally tightly integrated with the placement tool, even though in principle the two perform separate functions. The reason for the tight coupling lies in the above-mentioned need for the placer to accurately estimate the detailed route taken by a given interconnect, rather than just approximate it with the square of the distance between its terminals. Exactly as in the case of synthesis, a verification step must be performed after placement and routing. It is required in order to verify that:

• All design rules are satisfied by the final layout.
• All and only the desired interconnects have been realized by placement and routing.

This step is done by extracting electrical and logic models from the layout masks, and comparing these models with the input netlist (already verified for equivalence with the RTL).
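The Elmore model [77] cited in the first bullet is easy to sketch for the special case of a wire that is a simple chain of RC segments: the delay to the far end is the sum, over each resistor, of its resistance times all the capacitance downstream of it. The function and component values below are illustrative, not taken from the chapter.

```python
# Elmore delay of a wire modelled as a chain of RC segments (a sketch of the
# model in [77]): each resistor r contributes r times the total capacitance
# that must be charged through it, i.e. everything downstream.

def elmore_delay_chain(rs, cs):
    """rs[i], cs[i]: resistance and capacitance of segment i, driver to load."""
    assert len(rs) == len(cs)
    delay = 0.0
    for i, r in enumerate(rs):
        downstream_c = sum(cs[i:])       # all capacitance charged through r
        delay += r * downstream_c
    return delay

# Two equal segments of 100 ohm and 1e-13 F each:
# 100*(1e-13 + 1e-13) + 100*(1e-13) = 3e-11 seconds.
print(elmore_delay_chain([100.0, 100.0], [1e-13, 1e-13]))
```

A real extractor handles RC trees rather than chains and uses per-layer sheet resistance and coupling capacitance, but the downstream-capacitance summation is the same idea.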
Note that within each standard cell, design rules are verified independently, since the ASIC designer, for reasons of intellectual property protection, generally does not see the actual layout of the standard cells, but only an external "envelope" of active (transistor) and interconnect areas, which is sufficient to perform this kind of verification. The layout of each cell is known and used only at the foundry, when the masks are finally produced.
3.7.3 Simulation, Formal Verification, and Test Pattern Generation

The steps mentioned above create a layout implementation from RTL, while checking simultaneously that no errors are introduced, either due to programming errors or due to manual modifications, and that performance and power constraints are satisfied. However, they ensure neither that the original RTL model satisfies the customer-defined requirements, nor that the circuit after manufacturing is free of flaws compromising its functionality or its performance. The former problem is tackled by simulation, prototyping, and formal verification. None of these techniques is sufficient to ensure that an ill-defined problem has a solution: customer needs are inherently nonformalizable.8 However, they help build confidence that the final product will satisfy the requirements. Simulation and prototyping are both trial-and-error procedures, similar to the compile–debug cycle used for software. Simulation is generally cheaper, since it only requires a general-purpose workstation (nowadays often a PC running Linux), while prototyping is faster (it is based on synthesizing the RTL model into one or several FPGAs). Cost and performance of these options differ by

6 Layers that are farther away from silicon are best for long-distance wires, due to the smaller substrate and mutual capacitance, as well as due to the smaller sheet resistance [78].
7 Inductance fortunately is not yet playing a significant role, and many doubt that it ever will, for digital integrated circuits.
8 For example, what is the definition of "a correct phone call"? Does this refer to not dropping the communication? To transferring exactly a certain number of voice samples per second? To setting up a communication path quickly? Since all these desirable characteristics have a cost, what is the maximum price various classes of customers are willing to pay for them, and what is the maximum degree of violation that can be admitted by each class?
Embedded Systems Handbook
several orders of magnitude. Prototyping on multi-FPGA platforms, such as those offered by Quickturn, is thus limited to the most expensive designs, such as microprocessors.9 Unfortunately, both simulation and prototyping suffer from a basic capacity problem. It is true that cost decreases exponentially and performance increases exponentially over technology generations for the simulation and prototyping platforms (CPUs and FPGAs). However, the complexity of the verification problem grows as a double or even triple exponential (approximately) with technology. The reason is that the number of potential states of a digital design grows exponentially with the number of memory-holding components (flip-flops and latches), and the complexity of the verification problem for a sequential entity (e.g., an FSM) grows even more than exponentially with its state space. For this reason, the number of input patterns required to prove, up to a given level of confidence, that a design is correct grows triply exponentially with each technology generation, while capacity and performance grow "only" as a single exponential. This is clearly an untenable situation, given that the number of engineers is finite, and the size of the verification teams is already much larger than that of the design teams. Formal verification, defined as proving semiautomatically that, under a set of assumptions, a given property holds for a design, is a means of alleviating at least the human aspect of the "verification complexity explosion" problem. Formal verification allows one to state a property, such as, for example, "this protocol never deadlocks" or "the value of this register is never overwritten before being read," using relatively simple mathematical formulas. Then one can automatically check that the property holds over all possible input sequences.
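The brute-force core of such a property check can be illustrated with explicit-state reachability over a toy transition system. The state names and the "never deadlocks" property below are invented for this sketch; real model checkers use symbolic techniques precisely because the explicit state space explodes as described above.

```python
# Tiny explicit-state reachability check: enumerate all reachable states and
# verify that a property (here: "no deadlock", i.e. every reachable state has
# a successor) holds in each. Returns a witness state when the property fails.

def check_no_deadlock(initial, transitions):
    """transitions: dict mapping each state to its list of successor states."""
    seen, frontier = {initial}, [initial]
    while frontier:
        s = frontier.pop()
        succs = transitions.get(s, [])
        if not succs:
            return False, s              # deadlock: no outgoing transition
        for t in succs:
            if t not in seen:
                seen.add(t)
                frontier.append(t)
    return True, None

ok, witness = check_no_deadlock("idle", {
    "idle": ["req"], "req": ["grant"], "grant": ["idle"]})
print(ok)           # → True
ok, witness = check_no_deadlock("idle", {"idle": ["req"], "req": []})
print(ok, witness)  # → False req
```

The witness state returned on failure plays the same role as the counterexample trace a model checker reports, which is what makes such tools usable for debugging.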
The problem, unfortunately, is inherently extremely complex (the triple exponential mentioned above affects this formulation as well). However, the complexity is now relegated to the automated portion of the flow; manual generation and checking of individual pattern sequences is no longer required. Several EDA companies on the market, such as Cadence, Mentor Graphics, and Synopsys, as well as several silicon vendors, such as Intel and IBM, currently offer or internally develop and use such tools. The key barriers to adoption are twofold:

1. The complexity of the task, as mentioned above, is just shifted. While a workstation costs much less than an engineer, exponential growth is never tenable in the long term, regardless of the constant factors. This means that significant human intervention is still required in order to keep the time required to check each individual property within acceptable limits. This involves both breaking properties into simpler subproperties and abstracting away aspects of the system that are not relevant for the property at hand. Abstraction, however, hides aspects of the real design from the automated prover, and thus implies the risk of "false positive" results, that is, of declaring a system correct even when it is not.
2. Specification of properties is much more difficult than identification of input patterns. A property must encompass a variety of possible scenarios and state explicitly all assumptions made (e.g., "there is no deadlock in the bus access protocol only if no master makes requests at every clock cycle"). The language in which properties are specified is often a form of mathematical logic, and thus is even less familiar than software languages to a typical design engineer.

However, significant progress is being made in this area every year by researchers, and adoption of such automated formal verification techniques in the specification verification domain is growing.
Testing a manufactured circuit to verify that it operates correctly according to the RTL model is a closely related problem. In principle, one would need to prove equivalent behavior under all possible input–output sequences, which is clearly impossible. In practice, test engineers either use a "naturally orthogonal" architecture, such as that of a microprocessor, in order to functionally test small sequences of instructions, or they decompose testing into that of combinational and sequential logic. Combinational logic testing is a relatively "easy" task, as compared to the formal verification described above. If one considers only Boolean functionality (i.e., delay is not tested), its complexity (assuming that no polynomial

9 Nowadays even microprocessors are mostly designed using a modified ASIC-like flow, except for memories, register files, and sometimes portions of the ALU, which are still designed by hand down to the polygon level, at least for leading edge CPUs.
algorithm exists for NP-complete problems) is just a single exponential in the number of combinational circuit inputs. While a priori there is no reason why testing only Boolean equivalence between the specification and the manufactured circuit should be enough to ensure correct functionality, empirically there is a significant amount of evidence that fully testing for a relatively small class of Boolean manufacturing faults, namely stuck-at faults, is sufficient to ensure satisfactory actual yield for ASICs. The stuck-at-fault model assumes that the only problem that can occur during manufacturing is that some gate inputs are fixed at logical 0 or 1. This may have been a physically realistic model in the early days of bipolar-based Transistor–Transistor Logic. However, in DSM CMOS a host of physical defects may short wires together, increase or decrease their resistance and capacitance, short a transistor gate to its source or drain, and so on. At the logic level, a combinational function may become sequential (even worse, it may exhibit dynamic behavior, that is, slowly change output values over time, without changing inputs), or it may become faster or slower. Still, full checking for stuck-at faults is an excellent way to ensure that none of these complex physical problems has occurred or will affect the operation of the circuit. For this reason, today testing is mostly accomplished by first of all reducing sequential testing to combinational testing, by using special memory elements, the so-called scan flip-flops and latches. Second, combinational test pattern generation is performed only at the Boolean level, using the above-mentioned stuck-at model. Test pattern generation is similar to equivalence checking, because it amounts to proving that two copies of the same circuit, one with and one without a given fault, are indeed not equivalent. The witness to this nonequivalence is the pattern to be applied to the circuit inputs to identify the fault.
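This equivalence-checking view of test pattern generation can be sketched by brute force on a tiny circuit. The gate structure and the stuck-at location below are invented for illustration; real ATPG tools use far cleverer search than exhaustive enumeration, but the notion of a witness pattern is the same.

```python
# Brute-force test-pattern generation for a single stuck-at fault: a test
# pattern is simply an input vector on which the fault-free and faulty copies
# of the circuit produce different outputs (the witness to non-equivalence).
from itertools import product

def good(a, b, c):
    n1 = a and b                 # internal net n1 = AND(a, b)
    return n1 or c               # output = OR(n1, c)

def faulty(a, b, c):
    n1 = False                   # net n1 stuck-at-0
    return n1 or c

def find_test_pattern(good_fn, faulty_fn, n_inputs):
    for vec in product([False, True], repeat=n_inputs):
        if good_fn(*vec) != faulty_fn(*vec):
            return vec           # witness pattern detecting the fault
    return None                  # fault is untestable (redundant logic)

print(find_test_pattern(good, faulty, 3))  # → (True, True, False)
```

The pattern found sets a = b = 1 to excite the fault site and c = 0 to propagate the difference to the output, exactly the excite-and-propagate reasoning a structural ATPG algorithm performs.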
The problem of actually applying the pattern to the physical fragment of combinational logic, and then observing its outputs to verify whether the fault is present, is solved by converting all or most of the registers of the sequential circuit into one (or a handful of) giant shift registers, each including several hundred thousand bits. The pattern (and several others, used to test several CLBs in parallel) is first loaded serially through the shift register. Then a multiplexer at the input of each flip-flop is switched, transforming the serial loading mode into parallel loading mode, using as register inputs the outputs of each CLB. Finally, serial conversion is performed again, and the outputs of the logic are checked for correctness by the test equipment. Figure 3.6 shows an example of this sort of arrangement, in which the flip-flop clock is also changed from normal operation (in which it can be gated) to test mode. The only drawback of this elegant solution, due to IBM engineers in the 1970s, is the additional time that the circuit needs to spend on very expensive testing machines, in order to shift patterns in and out through very long flip-flop chains. Test pattern generation for combinational circuits is a very well-established area of research, and again the reader is referred to one of many books in the area for a more extensive description [79]. Note that memories are not tested using this mechanism, both because it would be too expensive to convert each cell into a scan register, and because the stuck-at-fault model does not apply to this kind of circuit. Memories are tested using appropriate input–output pattern sequences, which are generated, applied, and verified on-chip, using either self-test software running on the embedded processor, or some
FIGURE 3.6 Two scan flip-flops with combinational logic.
form of Built-In Self-Test (BIST) logic circuitry. Modern RAM generators, which produce the layout directly in a given process based on the requested number of rows and columns, often also produce the BIST circuitry.
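The input–output pattern sequences such a memory BIST engine applies are typically march-style tests. The following sketch is a simplified textbook-style example, not the chapter's algorithm: a three-phase march over the array that reports the first failing address, with the memory accessed through caller-supplied read and write callbacks.

```python
# Sketch of a simple march-style memory self-test: write a background value,
# then march through the array in ascending and descending order, each time
# reading the expected value and writing its complement.

def march_test(mem_read, mem_write, size):
    for addr in range(size):             # M0: write 0 background
        mem_write(addr, 0)
    for addr in range(size):             # M1: ascending, read 0, write 1
        if mem_read(addr) != 0:
            return addr                  # fault detected at this address
        mem_write(addr, 1)
    for addr in reversed(range(size)):   # M2: descending, read 1, write 0
        if mem_read(addr) != 1:
            return addr
        mem_write(addr, 0)
    return None                          # no fault found

good = {}
print(march_test(lambda a: good.get(a), lambda a, v: good.__setitem__(a, v), 8))
# → None

bad = {}
def bad_write(a, v): bad[a] = 0 if a == 3 else v   # model cell 3 stuck-at-0
print(march_test(lambda a: bad.get(a), bad_write, 8))  # → 3
```

Because the test is a fixed address/data sequence, it is easy to generate on-chip with a small counter-based BIST engine or a software loop on the embedded processor, as the text notes.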
3.8 Conclusions

This chapter discussed several aspects of embedded system design, including both methodologies that allow one to make judicious algorithmic and architectural decisions, and tools supporting various steps of these methodologies. One must not forget, however, that embedded systems are often complex compositions of parts that have been implemented by various parties, and thus the task of physical board or chip integration can be as difficult as, and much more expensive than, the initial architectural decisions. In order to support the integration and system testing tasks, one must use formal models throughout the design process and, if possible, perform early evaluation of the difficulties of integration by virtual integration and rapid prototyping techniques. These allow one to find, or avoid completely, subtle bugs and inconsistencies earlier in the design cycle, and thus reduce overall design time and cost. Thus the flow and tools that we described in this chapter help not only with the initial design, but also with the final integration, because they are based on executable specifications of the whole system (including models of its environment), early virtual integration, and systematic (often automated) refinement toward implementation. The last part of the chapter summarized the main characteristics of the current hardware and software implementation flows. While complete coverage of this huge topic is beyond its scope, a lightweight introduction can hopefully serve to direct the interested reader, who has only a general electrical engineering or computer science background, toward the most appropriate sources of information.
References

[1] F. Balarin, E. Sentovich, M. Chiodo, P. Giusto, H. Hsieh, B. Tabbara, A. Jurecska, L. Lavagno, C. Passerone, K. Suzuki, and A. Sangiovanni-Vincentelli. Hardware–Software Co-design of Embedded Systems — The POLIS Approach. Kluwer Academic Publishers, Dordrecht, 1997.
[2] F. Catthoor, S. Wuytack, E. De Greef, F. Balasa, L. Nachtergaele, and A. Vandecapelle. Custom Memory Management Methodology: Exploration of Memory Organisation for Embedded Multimedia System Design. Kluwer Academic Publishers, Dordrecht, 1998.
[3] The MathWorks Simulink and StateFlow. http://www.mathworks.com.
[4] National Instruments MATRIXx. http://www.ni.com/matrixx/.
[5] ETAS Ascet-SD. http://www.etas.de.
[6] CoWare N2C, SPW, and LISATek. http://www.coware.com.
[7] Esterel Technologies Esterel Studio. http://www.esterel-technologies.com.
[8] Synopsys Design Compiler, SystemStudio, and PrimeTime. http://www.synopsys.com.
[9] Telelogic Tau and Doors. http://www.telelogic.com.
[10] I-Logix Statemate and Rhapsody. http://www.ilogix.com.
[11] D. Harel, H. Lachover, A. Naamad, A. Pnueli, M. Politi, R. Sherman, A. Shtull-Trauring, and M.B. Trakhtenbrot. STATEMATE: a working environment for the development of complex reactive systems. IEEE Transactions on Software Engineering, 16:403–414, 1990.
[12] The Object Management Group UML. http://www.omg.org/uml/.
[13] L. Lavagno, G. Martin, and B. Selic, Eds. UML for Real: Design of Embedded Real-Time Systems. Kluwer Academic Publishers, Dordrecht, 2003.
[14] Artisan Software Real Time Studio. http://www.artisansw.com/.
[15] IBM Rational Rose RealTime. http://www.rational.com/products/rosert/.
[16] dSPACE TargetLink and Prototyper. http://www.dspace.de.
[17] OSEK/VDX. http://www.osek-vdx.org/.
[18] E.A. Lee and D.G. Messerschmitt. Synchronous data flow. IEEE Proceedings, 75(9):1235–1245, 1987.
[19] J. Buck and R. Vaidyanathan. Heterogeneous modeling and simulation of embedded systems in El Greco. In Proceedings of the International Conference on Hardware Software Codesign, May 2000.
[20] TNI Valiosys Reqtify. http://www.tni-valiosys.com.
[21] R.P. Kurshan. Automata-Theoretic Verification of Coordinating Processes. Princeton University Press, Princeton, NJ, 1994.
[22] K. McMillan. Symbolic Model Checking. Kluwer Academic Publishers, Dordrecht, 1993.
[23] G. Buttazzo. Hard Real-Time Computing Systems: Predictable Scheduling Algorithms and Applications. Kluwer Academic Publishers, Dordrecht, 1997.
[24] H. Gomaa. Software Design Methods for Concurrent and Real-Time Systems. Addison-Wesley, Reading, MA, 1993.
[25] W.A. Halang and A.D. Stoyenko. Constructing Predictable Real Time Systems. Kluwer Academic Publishers, Dordrecht, 1991.
[26] R. Ernst, J. Henkel, and T. Benner. Hardware–software codesign for micro-controllers. IEEE Design and Test of Computers, 10:64–75, 1993.
[27] R.K. Gupta and G. De Micheli. Hardware–software cosynthesis for digital systems. IEEE Design and Test of Computers, 10:29–41, 1993.
[28] Cadence Design Systems CeltIC, RTL Compiler, and Quickturn. http://www.cadence.com.
[29] Open SystemC Initiative. http://www.systemc.org.
[30] G. Berry. The foundations of Esterel. In Plotkin, Stirling, and Tofte, Eds., Proof, Language and Interaction: Essays in Honour of Robin Milner. MIT Press, 2000.
[31] S.A. Edwards. Compiling Esterel into sequential code. In International Workshop on Hardware/Software Codesign. ACM Press, May 1999.
[32] T.B. Ismail, M. Abid, and A.A. Jerraya. COSMOS: a codesign approach for communicating systems. In International Workshop on Hardware/Software Codesign. ACM Press, 1994.
[33] W. Cesario, A. Baghdadi, L. Gauthier, D. Lyonnard, G. Nicolescu, Y. Paviot, S. Yoo, A.A. Jerraya, and M. Diaz-Nava. Component-based design approach for multicore SoCs. In Proceedings of the Design Automation Conference, June 2002.
[34] Foresight Systems. http://www.foresight-systems.com.
[35] CARDtools. http://www.cardtools.com.
[36] IMEC ATOMIUM. http://www.imec.be/design/atomium/.
[37] P. Cousot and R. Cousot. Abstract interpretation: a unified lattice model for static analysis of programs by construction of approximation of fixpoints. In Proceedings of the ACM Symposium on Principles of Programming Languages. ACM Press, 1977.
[38] AbsInt Worst-Case Execution Time Analyzers. http://www.absint.com.
[39] Y.T.S. Li and S. Malik. Performance analysis of embedded software using implicit path enumeration. In Proceedings of the Design Automation Conference, June 1995.
[40] 0-In Design Automation. http://www.0-in.com/.
[41] C. Norris Ip. Simulation coverage enhancement using test stimulus transformation. In Proceedings of the International Conference on Computer Aided Design, November 2000.
[42] Forte Design Systems Cynthesizer. http://www.forteds.com.
[43] Celoxica DK Design Suite. http://www.celoxica.com.
[44] K. Wakabayashi. Cyber: high level synthesis system from software into ASIC. In R. Camposano and W. Wolf, Eds., High Level VLSI Synthesis. Kluwer Academic Publishers, Dordrecht, 1991.
[45] D. Gajski, J. Zhu, and R. Domer. The SpecC Language. Kluwer Academic Publishers, Dordrecht, 1997.
[46] D. Gajski, J. Zhu, R. Domer, A. Gerstlauer, and S. Zhao. SpecC: Specification Language and Methodology. Kluwer Academic Publishers, Dordrecht, 2000.
[47] OPNET. http://www.opnet.com.
[48] Network Simulator NS-2. http://www.isi.edu/nsnam/ns/.
[49] Mentor Graphics Seamless and Emulation. http://www.mentor.com.
[50] VAST Systems CoMET. http://www.vastsystems.com/.
[51] Axys Design Automation MaxSim and MaxCore. http://www.axysdesign.com/.
[52] J. Rowson. Hardware/software co-simulation. In Proceedings of the Design Automation Conference, 1994, pp. 439–440.
[53] V. Zivojnovic and H. Meyr. Compiled HW/SW co-simulation. In Proceedings of the Design Automation Conference, 1996.
[54] Altera DSP Builder. http://www.altera.com.
[55] Xilinx System Generator. http://www.xilinx.com.
[56] IEEE. Standard 1076.1, VHDL-AMS. http://www.eda.org/vhdl-ams.
[57] OVI. Verilog-A standard. http://www.ovi.org.
[58] B. Kernighan and D. Ritchie. The C Programming Language. Prentice-Hall, New York, 1988.
[59] K. Arnold and J. Gosling. The Java Programming Language. Addison-Wesley, Reading, MA, 1996.
[60] Sun Microsystems, Inc. Embedded Java Specification. Available at http://java.sun.com, 1998.
[61] Real-Time for Java Expert Group. The Real-Time Specification for Java. Available at http://rtsj.dev.java.net/, 1998.
[62] A.V. Aho, J.E. Hopcroft, and J.D. Ullman. The Design and Analysis of Computer Algorithms. Addison-Wesley, Reading, MA, 1974.
[63] P. Panda, N. Dutt, and A. Nicolau. Efficient utilization of scratch-pad memory in embedded processor applications. In Proceedings of Design Automation and Test in Europe (DATE), February 1997.
[64] Y.T.S. Li, S. Malik, and A. Wolfe. Performance estimation of embedded software with instruction cache modeling. In Proceedings of the International Conference on Computer-Aided Design, November 1995.
[65] F. Mueller and D.B. Whalley. Fast instruction cache analysis via static cache simulation. In Proceedings of the 28th Annual Simulation Symposium, April 1995.
[66] P. Marwedel and G. Goossens, Eds. Code Generation for Embedded Processors. Kluwer Academic Publishers, Dordrecht, 1995.
[67] H. Kopetz. Should responsive systems be event-triggered or time-triggered? IEICE Transactions on Information and Systems, E76-D:1325–1332, 1993.
[68] H. Kopetz and G. Grunsteidl. TTP — A protocol for fault-tolerant real-time systems. IEEE Computer, 27:14–23, 1994.
[69] G. De Micheli. Synthesis and Optimization of Digital Circuits. McGraw-Hill, New York, 1994.
[70] J.M. Rabaey, A. Chandrakasan, and B. Nikolic. Digital Integrated Circuits, 2nd ed. Prentice-Hall, New York, 2003.
[71] S. Devadas, A. Ghosh, and K. Keutzer. Logic Synthesis. McGraw-Hill, New York, 1994.
[72] G.D. Hachtel and F. Somenzi. Logic Synthesis and Verification Algorithms. Kluwer Academic Publishers, Dordrecht, 1996.
[73] Synplicity Synplify. http://www.synplicity.com.
[74] Magma Design Automation Blast Create. http://www.magma-da.com.
[75] P. McGeer. On the Interaction of Functional and Timing Behavior of Combinational Logic Circuits. Ph.D. thesis, U.C. Berkeley, 1989.
[76] N.A. Sherwani. Algorithms for VLSI Physical Design Automation, 3rd ed. Kluwer Academic Publishers, Dordrecht, 1999.
[77] W.C. Elmore. The transient response of damped linear network with particular regard to wideband amplifiers. Journal of Applied Physics, 19:55–63, 1948.
[78] R.H.J.M. Otten and R.K. Brayton. Planning for performance. In Proceedings of the Design Automation Conference, June 1998.
[79] M. Abramovici, M.A. Breuer, and A.D. Friedman. Digital Systems Testing and Testable Design. Computer Science Press, Rockville, MD, 1990.
4 Models of Embedded Computation

Axel Jantsch
Royal Institute of Technology

4.1 Introduction
    Models of Sequential and Parallel Computation • Nonfunctional Properties • Heterogeneity • Component Interaction • Time • The Purpose of an MoC
4.2 The MoC Framework
    Processes and Signals • Signal Partitioning • Untimed MoCs • The Synchronous MoC • Discrete Timed MoCs
4.3 Integration of MoCs
    MoC Interfaces • Interface Refinement • MoC Refinement
4.4 Conclusion
References
4.1 Introduction

A model of computation (MoC) is an abstraction of a real computing device. Different computational models serve different objectives and purposes. Thus, they always suppress some properties and details that are irrelevant for the purpose at hand, and they focus on other properties that are essential. Consequently, MoCs have been evolving during the history of computing. In the early decades, between 1920 and 1950, the main focus was on the question: "What is computable?" The Turing machine and the lambda calculus are prominent examples of computational models developed to investigate that question.1 It turned out that several very different MoCs, such as the Turing machine, the lambda calculus, partial recursive functions, register machines, Markov algorithms, Post systems, etc. [1], are all equivalent in the sense that they all denote the same set of computable mathematical functions. Thus, today the so-called Church–Turing thesis is widely accepted:

Church–Turing thesis. If function f is effectively calculable, then f is Turing-computable. If function f is not Turing-computable, then f is not effectively calculable [1, p. 379].

It is the basis for our understanding today of what kind of problems can be solved by computers, and what kind of problems are principally beyond a computer's reach. A famous example of what cannot be solved by a computer is the halting problem for Turing machines. A practical consequence is that there cannot be an

1 The term "model of computation" came into use only much later, in the 1970s, but conceptually the computational models of today can certainly be traced back to the models developed in the 1930s.
algorithm that, given a function f and a C++ program P (or a program in any other sufficiently complex programming language), could determine whether P computes f. This illustrates the principal difficulty of programming language teachers in correcting exams, and of verification engineers in validating programs and circuits. Later the focus changed to the question: "What can be computed in reasonable time and with reasonable resources?," which spun off the theories of algorithmic complexity based on computational models exposing timing behavior in a particular but abstract way. This resulted in a hierarchy of complexity classes for algorithms according to their asymptotic complexity. The computation time (or other resources) for an algorithm is expressed as a function of some characteristic figure of the input, for example, the size of the input. For instance, we can state that the function f(n) = 2n, for natural numbers n, can be computed in p(n) time steps by any computer for some polynomial function p(n). In contrast, the function g(n) = n! cannot be computed in p(n) time steps on any sequential computer for any polynomial function p(n) and arbitrary n. With growing n the time steps required to compute g(n) grow faster than can be expressed by any polynomial function. This notion of asymptotic complexity allows us to express properties about algorithms in general, disregarding details of the algorithms and the computer architecture. This comes at the cost of accuracy. We may only know that there exists some polynomial function p(n) for every computer, but we do not know p(n) since it may be very different for different computers. To be more accurate one needs to take into account more details of the computer architecture. As a matter of fact the complexity theories rest on the assumption that one kind of computational model, or machine abstraction, can simulate another one with a bounded and well-defined overhead.
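The gap between polynomial and factorial growth used in the example above is easy to check numerically. The cutoff polynomial n**3 below is an arbitrary choice for illustration; any fixed polynomial is eventually overtaken.

```python
# Factorial growth overtakes any fixed polynomial; here we find the smallest
# n at which n! exceeds the (arbitrarily chosen) polynomial n**3.
import math

n = 1
while math.factorial(n) <= n ** 3:
    n += 1
print(n)  # smallest n with n! > n^3
```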
This simulation capability has been expressed in the thesis given below:

Invariance thesis. "Reasonable" machines can simulate each other with a polynomially bounded overhead in time and a constant overhead in space [2].

This thesis establishes an equivalence between different machine models and makes results for a particular machine more generally useful. However, some machines are equipped with considerably more resources and cannot be simulated by a conventional Turing machine according to the invariance thesis. Parallel machines have been the subject of a huge research effort, and the question of how parallel resources increase the computational power of a machine has led to a refinement of computational models and an accuracy increase for estimating computation time. The fundamental relation between sequential and parallel machines has been captured by the following thesis.

Parallel computation thesis. Whatever can be solved in polynomially bounded space on a reasonable sequential machine model can be solved in polynomially bounded time on a reasonable parallel machine, and vice versa [2].

Parallel computers prompted researchers to refine computational models to include the delay of communication and memory access, which we review briefly in Section 4.1.1. Embedded systems require a further evolution of computational models due to new design and analysis objectives and constraints. The term "embedded" triggers two important associations. First, an embedded component is squeezed into a bigger system, which implies constraints on size, the form factor, weight, power consumption, cost, etc. Second, it is surrounded by real-world components, which implies timing constraints and interfaces to various communication links, sensors, and actuators. As a consequence, the computational models that are used and are useful in embedded system design are different from those in general-purpose sequential and parallel computing.
The difference comes from the nonfunctional requirements and constraints and from the heterogeneity.
4.1.1 Models of Sequential and Parallel Computation

Arguably, general-purpose sequential computing had for a long time a privileged position, in that it had a single, very simple, and effective MoC. Based on the von Neumann machine, the
random access machine (RAM) model [3] is a sufficiently general model to express all important algorithms and reflects the salient nonfunctional characteristics of practical computing engines. Thus, it can be used to analyze performance properties of algorithms in a way that is independent of hardware architecture and implementation. This favorable situation for sequential computing has been eroded over the years as processor architectures and memory hierarchies became ever more complex and deviated from the ideal RAM model. The parallel computation community has been searching in vain for a similarly simple and effective model [4]. Without a universal model of parallel computation, the foundations for the development of portable and efficient parallel applications and architectures were lacking. Consequently, parallel computing has not gained as wide acceptance as sequential computing and is still confined to niche markets and applications. The parallel random access machine (PRAM) [5] is perhaps the most popular model of parallel computation and closest to its sequential counterpart with respect to simplicity. A number of processors execute in a lock-step way, that is, synchronized after each cycle governed by a global clock, and access global, shared memory simultaneously within one cycle. The PRAM model's main virtue is its simplicity, but it poorly captures the costs associated with computing. Although the RAM model has a similar cost model, there is a significant difference. In the RAM model the costs (execution time, program size) are in fact well reflected and grow linearly with the size of the program and the length of the execution path. This correlation is in principle correct for all sequential processors. The PRAM model does not exhibit this simple correlation because in most parallel computers the cost of memory access, communication, and synchronization can be vastly different depending on which memory location is accessed and which processors communicate.
Thus, the developer of parallel algorithms does not have sufficient information from the PRAM model alone to develop efficient algorithms. He or she has to consult the specific cost models of the target machine. Many PRAM variants have been developed to reflect real costs more faithfully. Some model memory access more realistically. The exclusive read–exclusive write (EREW) and the concurrent read–exclusive write (CREW) models [5] serialize access to a given memory location by different processors but still maintain the unit cost model for memory access. The local memory PRAM (LPRAM) model [6] introduces a notion of memory hierarchy while the queued read–queued write (QRQW) PRAM [7] models the latency and contention of memory access. A host of other PRAM variants have factored in the cost of synchronization, communication latency, and bandwidth. Other models of parallel computation, many of which are not directly derived from the PRAM machine, focus on memory. In these models either the distributed nature of memory is the main concern [8] or the various cost factors of the memory hierarchy are captured [6,9,10]. An introductory survey of models of parallel computation has been written by Maggs et al. [4].
4.1.2 Nonfunctional Properties

A main difference between sequential and parallel computation comes from the role of time. In sequential computing, time is solely a performance issue, which is moreover captured fairly well by the simple and elegant RAM model. In parallel computing the execution time can only be captured by complex cost functions that depend heavily on various details of the parallel computer. In addition, changes in the relative timing of different processors and of the communication network can alter the overall functional behavior. To counter this danger, different parts of a parallel program must be synchronized properly. In embedded systems the situation is even more delicate if real-time deadlines have to be observed. A system that responds slightly too late may be as unacceptable as a system that responds with incorrect data. Even worse, it is entirely context dependent whether it is better to respond slightly too late, incorrectly, or not at all. For instance, when transmitting a video stream, incorrect data arriving on time may be preferable to correct data arriving too late. Moreover, it may be better not to send data that would arrive too late, to save resources. On the other hand, control signals to drive the engine or the brakes in a car must always arrive, and a tiny delay may be preferable to no signal at all. These observations lead to the distinction of different kinds of real-time systems, for example, hard versus soft real-time systems, depending on the requirements on the timing.
© 2006 by Taylor & Francis Group, LLC
Since most embedded systems interact with real-world objects they are subject to some kind of real-time requirements. Thus, time is an integral part of the functional behavior and in many cases cannot be abstracted away completely. So it should not come as a surprise that MoCs have been developed to allow the modeling of time in an abstract way that meets the application requirements while avoiding the unnecessary burden of overly detailed timing. We will discuss some of these models below. In fact, the timing abstraction of different MoCs is a main organizing principle in this chapter. Designing for low power is a high priority for most, if not all, embedded systems. However, power has been treated in a limited way in computational models because of the difficulty of abstracting the power consumption from the details of architecture and implementation. For very large-scale integration (VLSI) circuits, computational models have been developed to derive lower and upper bounds with respect to complexity measures that usually include both circuit area and computation time for a given behavior. AT^2 has been found to be a relevant and interesting complexity measure, where A is the circuit area and T is the computation time, either in clock cycles or in physical time. These models have also been used to derive bounds on the energy consumption, usually by assuming that the consumed energy is proportional to the state changes of the switching elements. Such analysis shows, for instance, that AT^2-optimal circuits, that is, circuits which are optimal up to a constant factor with respect to the AT^2 measure for a given Boolean function, utilize their resources to a high degree, which means that on average a constant fraction of the chip changes state.
Intuitively this is obvious: if large parts of a circuit are not active over a long period (do not change state), the circuit can presumably be improved by making it either smaller or faster, thus utilizing its resources to a higher degree on average. Or, concluding the other way round, an AT^2-optimal circuit is also optimal with respect to energy consumption for computing a given Boolean function. One can spread out the consumed energy over a larger area or a longer time period, but one cannot decrease the asymptotic energy consumption for computing a given function. Note that all these results are asymptotic complexity measures with respect to a particular size metric of the computation, for example, the length in bits of the input parameter of the function. For a detailed survey of this theory see Reference 11. These models have several limitations. They make assumptions about the technology. For instance, in different technologies the correlation between state switching and energy consumption is different. In n-channel metal oxide semiconductor (NMOS) technologies the energy consumption is more correlated with the number of switching elements. The same is true for complementary metal oxide semiconductor (CMOS) technologies if leakage power dominates the overall energy consumption. Also, they provide asymptotic complexity measures for very regular and systematic implementation styles and technologies under a number of assumptions and constraints. However, they do not expose relevant properties of complex modern microprocessors, VLIW (very long instruction word) processors, DSPs (digital signal processors), FPGAs (field programmable gate arrays), or ASIC (application specific integrated circuit) designs in a way useful for system-level design decisions. And we are again back at our original question about what exactly the purpose of a computational model is and how general or how specific it should be.
In principle, there are two alternatives to integrate nonfunctional properties such as power, reliability, and also time in an MoC:

• First, we can include these properties in the computational model and associate every functional operation with a specific quantity of that property. For example, an add operation takes 2 nsec and consumes 60 pW. During simulation or some other analysis we can calculate the overall delay and power consumption.

• Second, we can allocate abstract budgets for all parts of the design. For instance, in synchronous design styles, we divide the time axis into slots or cycles and assign every part of the design to exactly one slot. Later on during implementation, we have to find the physical time duration of each slot, which determines the clock frequency. We can optimize for high clock frequency by identifying the critical path and optimizing that design part aggressively. Alternatively, we can move some of the functionality from the slot with the critical part to a neighboring slot, thus balancing the different slots. This budget approach can also be used for managing power consumption, noise, and other properties.
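As a rough illustration of the first alternative, the following sketch attaches cost annotations to individual operations and accumulates them during evaluation. All figures and names here are invented for illustration; they are not taken from the chapter's framework or any real technology library.

```python
# Sketch of the first alternative: every functional operation carries
# illustrative, made-up cost annotations (delay in ns, energy in pJ);
# an analysis pass accumulates them alongside the computed result.

from dataclasses import dataclass

@dataclass
class Cost:
    delay_ns: float = 0.0
    energy_pj: float = 0.0

class AnnotatedOp:
    def __init__(self, fn, delay_ns, energy_pj):
        self.fn, self.delay_ns, self.energy_pj = fn, delay_ns, energy_pj

    def __call__(self, cost, *args):
        # Accumulate the nonfunctional properties as a side record
        cost.delay_ns += self.delay_ns
        cost.energy_pj += self.energy_pj
        return self.fn(*args)

# Hypothetical cost figures, chosen only for the example
add = AnnotatedOp(lambda a, b: a + b, delay_ns=2.0, energy_pj=0.06)
mul = AnnotatedOp(lambda a, b: a * b, delay_ns=5.0, energy_pj=0.25)

cost = Cost()
result = add(cost, mul(cost, 3, 4), 5)   # computes 3*4 + 5
```

The obvious drawback, discussed below, is that such detailed annotations tie the model to one implementation.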
The first approach suffers from inefficient modeling and simulation when all implementation details are included in a model. Also, it cannot be applied to abstract models since these implementation details are not available there. Recall that a main idea of computational models is that they should be abstract and general enough to support analysis of a large variety of architectures. The inclusion of detailed timing and power consumption data would obstruct this objective. Even the approach of starting out with an abstract model and later back-annotating the detailed data from realistic architectural or implementation models does not help, because the abstract model does not allow us to draw concrete conclusions and the detailed, back-annotated model is valid only for a specific architecture. The second approach with abstract budgets is slightly more appealing to us. On the assumption that all implementations will be able to meet the budgeted constraints, we can draw general conclusions about performance or power consumption on an abstract level, valid for a large number of different architectures. One drawback is that we do not know exactly for which class of architectures our analysis is valid, since it is hard to predict which implementations will in the end be able to meet the budget constraints. Another complication is that we do not know the exact physical size of these budgets, which may indeed differ between architectures and implementations. For instance, an ASIC implementation of a given architecture may be able to meet a cycle constraint of 1 nsec and run at 1 GHz clock frequency, while an FPGA implementation of exactly the same algorithms requires a cycle budget of 5 nsec. But still, the abstract budget approach is promising because it divides the overall problem into more manageable pieces. At the higher level we make assumptions about abstract budgets and analyze a system based on these assumptions.
Our analysis will then be valid for all architectures and implementations that meet the stated assumptions. At the lower level we have to ensure and verify that these assumptions are indeed met.
4.1.3 Heterogeneity

Another salient feature of many embedded systems is heterogeneity. It comes from various environmental constraints on the interfaces, from heterogeneous applications, and from the need to find different tradeoffs among performance, cost, power consumption, and flexibility for different parts of the system. Consequently, we see analog and mixed-signal parts, digital signal processing parts, image and video processing parts, control parts, and user interfaces coexisting in the same system or even on the same VLSI device. We also see irregular architectures with microprocessors, DSPs, VLIWs, custom hardware coprocessors, memories, and FPGAs connected via a number of different segmented and hierarchical interconnection schemes. It is a formidable task to develop a uniform MoC that exposes all relevant properties while nicely suppressing irrelevant details. Heterogeneous MoCs are one way to address heterogeneity at the application, architecture, and implementation level. Different computational models are connected and integrated into a hierarchical, heterogeneous MoC that represents the entire system. Many different approaches have been taken to either connect two different computational models or provide a general framework to integrate a number of different models. It turns out that issues of communication, synchronization, and time representation pose the most formidable challenges. The reason is that the communication, and, in particular, the synchronization semantics between different MoC domains correlates the time representation of the two domains. As we will see below, connecting a timed MoC with an untimed model leads to the import of a time structure from the timed to the untimed model, resulting in a heterogeneous, timed MoC. Thus the integration cannot stop superficially at the interfaces, leaving the interior of the two computational domains unaffected.
Due to the inherent heterogeneity of embedded systems, different MoCs will continue to be used and thus different MoC domains will coexist within the same system. There are two main possible relations, one due to refinement and the other due to partitioning. A more abstract MoC can be refined into a more detailed model. In our framework, time is the natural parameter that determines the abstraction level of a model. The untimed MoC is more abstract than the synchronous MoC, which in turn is more abstract than the timed MoC. It is in fact common practice that a signal processing algorithm is first
modeled as an untimed dataflow algorithm, which is then refined into a synchronous circuit description, which in turn is mapped onto a technology-dependent netlist of fully timed gates. However, this is not a natural flow for all applications. Control-dominated systems or subsystems require some notion of time already at the system level, and sensor and actuator subsystems may require a continuous time model right from the start. Thus, different subsystems should be modeled with different MoCs.
4.1.4 Component Interaction

A troubling issue in complex, heterogeneous systems is unexpected behavior of the system due to subtle and complex interactions of different MoC parts. Eker et al. [12] call this phenomenon emergent behavior. Some examples shall illustrate this important point:

Priority inversion. Threads in a real-time operating system may use two different mechanisms of resource allocation [12]. One is based on priority and preemption to schedule the threads. The second is based on monitors. Both are well defined and predictable in isolation. For instance, priority- and preemption-based scheduling means that a higher-priority thread cannot be blocked by a lower-priority thread. However, if the two threads also use a monitor lock, the lower-priority thread may block the higher-priority thread via the monitor for an indefinite amount of time.

Performance inversion. Assume there are four CPUs on a bus. CPU1 sends data to CPU2, and CPU3 sends data to CPU4 over the bus [13]. We would expect that the overall system performance improves when we replace one CPU with a faster processor, or at least that the system performance does not decrease. However, replacing CPU1 with a faster processor may mean that data is sent from CPU1 to CPU2 with a higher frequency, at least for a limited amount of time. This means that the bus is more loaded by this traffic, which may slow down the communication from CPU3 to CPU4. If this communication performance has a direct influence on the system performance, we will see decreased overall system performance.

Over synchronization. Assume that the upper and lower branches in Figure 4.1 have no mutual functional dependence, as the dataflow arrows indicate. Assume further that process B is blocked when it tries to send data to C1 or D1 but the receiver is not ready to accept the data. Then a delay or deadlock in branch D will propagate back through process B to both A and the entire C branch.
These examples are not limited to situations in which different MoCs interact. They show that, when separate, seemingly unrelated subsystems interact via a nonobvious mechanism, which is often a shared resource, the effects can be hard to analyze. When the different subsystems are modeled in different MoCs the problem is even more pronounced due to differing communication semantics, synchronization mechanisms, and time representations.
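The priority inversion scenario above can be reproduced with a small deterministic scheduler simulation. This is purely an illustrative toy model, not a real RTOS: the task set, priorities, and work amounts are invented. The high-priority task H needs a lock held by the low-priority task L, so the unrelated medium-priority task M delays H for its entire run.

```python
# A deterministic toy scheduler illustrating priority inversion.
# At each tick the highest-priority task that is ready and not
# blocked on the lock gets to run for one unit of work.

def simulate(ticks=10):
    # task table: priority, arrival time, remaining work, lock usage
    tasks = {
        "L": {"prio": 1, "arrive": 0, "work": 4, "uses_lock": True},
        "H": {"prio": 3, "arrive": 1, "work": 2, "uses_lock": True},
        "M": {"prio": 2, "arrive": 2, "work": 5, "uses_lock": False},
    }
    lock_holder, trace = None, []
    for t in range(ticks):
        ready = [n for n, tk in tasks.items()
                 if tk["arrive"] <= t and tk["work"] > 0]
        # a task needing the lock is blocked while another task holds it
        runnable = [n for n in ready
                    if not tasks[n]["uses_lock"] or lock_holder in (None, n)]
        if not runnable:
            trace.append("-")
            continue
        n = max(runnable, key=lambda m: tasks[m]["prio"])
        if tasks[n]["uses_lock"]:
            lock_holder = n
        tasks[n]["work"] -= 1
        if tasks[n]["work"] == 0 and lock_holder == n:
            lock_holder = None               # release on completion
        trace.append(n)
    return trace

trace = simulate(12)
# H arrives at tick 1 but does not run until tick 9: it waits for the
# lock held by L, while M preempts L and keeps the lock from being freed.
```

Each of the three mechanisms (preemptive priorities, the lock) is predictable in isolation; the inversion only emerges from their combination, which is the point of the example.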
4.1.5 Time

The treatment of time will serve for us as the most important dimension to distinguish MoCs. We can identify at least four levels of accuracy: continuous time, discrete time, clocked time, and causality. In the sequel, we only cover the last three levels. When time is not modeled explicitly, events are only partially ordered with respect to their causal dependences. In one approach, taken for instance in deterministic dataflow networks [14,15], the system
FIGURE 4.1 Over synchronization between functionally independent subsystems. (Process A feeds process B, which feeds the two functionally independent branches C1–C2–C3 and D1–D2–D3.)
behavior is independent of delays and timing behavior of computation elements and communication channels. These models are robust with respect to time variations in that any implementation, no matter how slow or fast, will exhibit the same behavior as the model. Alternatively, different delays may affect the system's behavior, and we obtain an inherently nondeterministic model, since time behavior that is not modeled explicitly is allowed to influence the observable behavior. This approach has been taken both in the context of dataflow models [16–19] and process algebras [20,21]. In this chapter we follow the deterministic approach, which can be generalized to approximate nondeterministic behavior by means of stochastic processes as shown in Reference 22. To exploit the very regular timing of some applications, synchronous dataflow (SDF) [23] has been developed. Every process consumes and emits a statically fixed number of events in each evaluation cycle. The evaluation cycle is the reference time. The regularity of the application is translated into a restriction of the model, which in turn allows efficient analysis and synthesis techniques that are not applicable to more general models. Scheduling, buffer size optimization, and synthesis have been successfully developed for SDF. One facet related to the representation of time is the dichotomy of dataflow-dominated and control-flow-dominated applications. Dataflow-dominated applications tend to have events that occur at very regular intervals. Thus, explicit representation of time is not necessary and in fact often inefficient. In contrast, control-dominated applications deal with events occurring at very irregular time instants. Consequently, explicit representation of time is a necessity because the timing of events cannot be inferred. Difficulties arise in systems that contain both elements.
Unfortunately, these kinds of systems are becoming more common as the average system complexity steadily increases. As a consequence, several attempts to integrate dataflow and control-dominated modeling concepts have emerged. In the synchronous piggybacked dataflow model [24], control events are transported on dataflow streams to represent a global state without breaking the locality principle of dataflow models. The composite signal flow [25] distinguishes between control and dataflow processes and puts significant effort into maintaining the frame-oriented processing that is so common in dataflow and signal processing applications for efficiency reasons. However, conflicts occur when irregular control events must be synchronized with dataflow events inside frames. The composite signal flow addresses this problem by allowing an approximation of the synchronization and defines conditions under which approximations are safe and do not lead to erroneous behavior. Time is divided into time slots or clock cycles by various synchronous models. According to the perfect synchrony assumption [26,27] neither communication nor computation takes any noticeable time, and the time slots or evaluation cycles are completely determined by the arrival of input events. This assumption is useful because designers and tools can concentrate solely on the functionality of the system without mixing this activity with timing considerations. Optimization of performance can be done in a separate step by means of static timing analysis and local retiming techniques. Even though timing does not appear explicitly in synchronous models, the behavior is not independent of time. The model constrains all implementations such that they must be fast enough to process input events properly and to complete an evaluation cycle before the next events arrive. When no events occur in an evaluation cycle, a special token called the absent event is used to communicate the advance of time.
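The slot-per-event discipline of the perfect synchrony assumption, including absent events, can be sketched as follows. This is an illustrative sketch in Python: None stands in for the absent event ⊥, and the constructor name map_sy is our own invention, not notation from the chapter.

```python
# Sketch of the perfect-synchrony discipline: a synchronous process
# consumes exactly one event per slot and emits exactly one event per
# slot. None plays the role of the absent event, so every slot is
# filled and the advance of time is visible even without data.

ABSENT = None

def map_sy(f):
    """Instantiate a synchronous process from a slot-wise function f."""
    def process(signal):
        # one output event per input event; absent slots stay absent
        return [ABSENT if e is ABSENT else f(e) for e in signal]
    return process

scale = map_sy(lambda x: 2 * x)
out = scale([1, ABSENT, 3, ABSENT, 5])
```

Note that the output signal has exactly as many slots as the input signal; the implementation is merely constrained to finish each slot before the next events arrive.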
In our framework we use the same technique in Sections 4.2.4 and 4.2.5 for both the synchronous MoC and the fully timed MoC. Discrete timed models use a discrete set, usually integers or natural numbers, to assign a time stamp to each event. Many discrete event models fall into this category [28–30] as well as most popular hardware description languages, such as VHDL and Verilog. Timing behavior can be modeled most accurately, which makes it the most general model we consider here and makes it applicable to problems such as detailed performance simulation where synchronous and untimed models cannot be used. The price for this is the intimate dependence of functional behavior on timing details and significantly higher computation costs for analysis, simulation, and synthesis problems. Discrete timed models may be nondeterministic, as mainly used in performance analysis and simulation (see e.g., Reference 30), or deterministic, as more desirable for hardware description languages such as VHDL.
The integration of these different timing models into a single framework is a difficult task. Many attempts have been made on a practical level with a concrete design task, mostly simulation, in mind [31–35]. On a conceptual level Lee and Sangiovanni-Vincentelli [36] have proposed a tagged time model in which every event is assigned a time tag. Depending on the tag domain we obtain different MoCs. If the tag domain is a partially ordered set, the result is an untimed model according to our definition. Discrete, totally ordered sets lead to timed MoCs and continuous sets result in continuous time MoCs. There are two main differences between the tagged time model and our proposed framework. First, in the tagged time model processes do not know how much time has progressed when no events are received, since global time is only communicated via the time stamps of ordinary events. For instance, a process cannot trigger a time-out if it has not received events for a particular amount of time. Our timed model in Section 4.2.5 does not use time tags but absent events to globally order events. Since absent events are communicated between processes whenever no other event occurs, processes are always informed about the advance of global time. We chose this approach because it better resembles the situation in design languages, such as VHDL, C, or SDL (Specification and Description Language), where processes can always experience time-outs. Second, one of our main motivations was the separation of communication and synchronization issues from the computation part of processes. Hence, we strictly distinguish between process interfaces and process functionality. Only the interfaces determine to which MoC a process belongs, while the core functionality is independent of the MoC. This feature is absent from the tagged time model. This separation of concerns has been inspired by the concept of firing cycles in dataflow process networks [37].
Our mechanism for consuming and emitting events based on signal partitionings as described in Sections 4.2.2 and 4.2.3.1 is only slightly more general than the firing rules described by Lee [37] but it allows a useful definition of process signatures based on the way processes consume and emit events.
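The time-out capability that absent events enable, as discussed above, might be sketched like this. The threshold, the "TIMEOUT" marker, and the watchdog name are all illustrative assumptions, not part of the chapter's formal framework; None again stands in for the absent event.

```python
# Sketch: because absent events are communicated whenever nothing else
# occurs, a process can observe the advance of global time and raise a
# time-out after a given number of consecutive empty slots. A tagged
# model without such events could not express this locally.

ABSENT = None

def watchdog(signal, limit=3):
    out, idle = [], 0
    for e in signal:
        # count consecutive absent slots; any real event resets the count
        idle = idle + 1 if e is ABSENT else 0
        out.append("TIMEOUT" if idle == limit else e)
    return out

trace = watchdog([7, ABSENT, ABSENT, ABSENT, 9])
```

The process stays a function on signals: one output slot per input slot, with the time-out expressed purely in terms of observed absent events.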
4.1.6 The Purpose of an MoC

As mentioned several times, the purpose of a computational model determines how it is designed, what properties it exposes, and what properties it suppresses. We argue that MoCs for embedded systems should not address principal questions of computability or feasibility, but should rather aid the design and validation of concrete systems. How this is accomplished best remains a subject of debate, but for this chapter we assume that an MoC should support the following properties:

Implementation independence. An abstract model should not expose too many details of a possible implementation, for example, which kind of processor is used, how many parallel resources are available, what kind of hardware implementation technology is used, details of the memory architecture, etc. Since an MoC is a machine abstraction, it should, by definition, avoid unnecessary machine details. Practically speaking, the benefits of an abstract model include that analysis and processing are faster and more efficient, that analysis results are relevant for a larger set of implementations, and that the same abstract model can be directed to different architectures and implementations. On the downside we note diminished analysis accuracy and a lack of knowledge of the target architecture that could be exploited for modeling and design. Hence, the right abstraction level is a fine line that also changes over time. While many embedded system designers could long safely assume a purely sequential implementation, current and future computational models should avoid such an assumption. Resource sharing and scheduling strategies are becoming more complex, and an MoC should thus either allow the explicit modeling of such a strategy or restrict the implementations to follow a particular, well-defined strategy.

Composability. Since many parts and components are typically developed independently and integrated into a system, it is important to avoid unexpected interferences.
Thus, some kind of composability property [38] is desirable. One step in this direction is to have a deterministic computational model, such as Kahn process networks, that guarantees a particular behavior independent of the timing of individual activities and independent of the amount of available resources in general.
This is of course only a first step since, as argued earlier, time behavior is often an integral part of the functional behavior. Thus, resource sharing strategies, which greatly influence timing, will still have a major impact on the system behavior even for fully deterministic models. We can reconcile good system composability with shared resources by allocating a minimum but guaranteed amount of resources for each subsystem or task. For instance, two tasks get a fixed share of the communication bandwidth of a bus. This approach allows for ideal composability but has to be based on worst-case behavior. It is very conservative and hence does not utilize resources efficiently. We can relax this approach by allocating abstract resource budgets as part of the computational model. Then we require the implementation to provide the requested resources, and at the same time to minimize the abstract budgets and thus the required resources. As an example, consider two tasks that have a particular communication need per abstract time slot, where the communication need may differ between slots. The implementation has to fulfill the communication requirements of all tasks by providing the necessary bandwidth in each time slot, tuning the length of the individual time slots, or moving communication from one slot to another. These optimizations will also have to consider global timing and resource constraints. In any case, in the abstract model we can deal with abstract budgets and assume that they will be provided by any valid implementation.

Analyzability. A general tradeoff exists between the expressiveness of a model and its analyzability. By restricting models in clever ways, one can apply powerful and efficient analysis and synthesis methods. For instance, the SDF model allows all actors only a constant number of input and output tokens in each activation cycle.
While this restricts the expressiveness of the model, it allows static schedules to be computed efficiently when they exist. For general dataflow graphs this may not be possible because it could be impossible to ensure that the amount of input and output is always constant for all actors, even if it is in a particular case. Since SDF covers a fairly large and important application domain, it has become a very useful MoC. The key is to understand what the important properties are (finding static schedules, finding memory bounds, finding maximum delays, etc.) and to devise an MoC that allows these properties to be handled efficiently without restricting the modeling power too much. In the following sections we discuss a framework to study different MoCs. The idea is to use different types of process constructors to instantiate processes of different MoCs. Thus, one type of process constructor yields only untimed processes, while another type results in timed processes. The elements for process construction are simple functions and are in principle independent of a particular MoC. However, the independence is not complete since some MoCs put specific constraints on the functions. But still the separation of the process interfaces from the internal process behavior is fairly far-reaching. The interfaces determine the time representation, synchronization, and communication, hence the MoC. In this chapter we will not elaborate all interesting and desirable properties of computational models. Rather we will use the framework to introduce four different MoCs that differ only in their timing abstraction. Since time plays a very prominent role in embedded systems, we focus on this aspect and show how different time abstractions can serve different purposes and needs. Another defining aspect of embedded systems is heterogeneity, which we address by allowing different MoCs to coexist in a model. The common framework makes this integration semantically clean and simple.
We study two particular aspects of this coexistence, namely the interfaces between two different MoCs and the refinement of one MoC into another. Other central issues of embedded systems, such as power consumption, global analysis and optimization, are not covered, mostly because they are not very well understood in this context and few advanced proposals exist on how to deal with them from an MoC perspective.
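To make the earlier analyzability point about SDF concrete, the following sketch solves the SDF balance equations r[src] · produced = r[dst] · consumed for the smallest integer repetition vector, the first step in static scheduling. The actor names and token rates are invented, and a connected graph is assumed; this is an illustration of the technique, not code from the chapter.

```python
# Solving the SDF balance equations for the smallest integer
# repetition vector: pick one actor, give it rate 1, propagate
# rational rates across all edges, then scale to integers.

from fractions import Fraction
from math import lcm

def repetition_vector(edges):
    # edges: list of (src, dst, produced, consumed); graph assumed connected
    actors = {a for e in edges for a in (e[0], e[1])}
    rate = {next(iter(actors)): Fraction(1)}
    changed = True
    while changed:                       # propagate rates over edges
        changed = False
        for src, dst, p, c in edges:
            if src in rate and dst not in rate:
                rate[dst] = rate[src] * p / c
                changed = True
            elif dst in rate and src not in rate:
                rate[src] = rate[dst] * c / p
                changed = True
    # scale to the smallest all-integer solution
    denom = lcm(*(r.denominator for r in rate.values()))
    return {a: int(r * denom) for a, r in rate.items()}

# A produces 2 tokens per firing that B consumes 3 at a time;
# B and C are matched one-to-one.
r = repetition_vector([("A", "B", 2, 3), ("B", "C", 1, 1)])
```

For inconsistent rate assignments no such vector exists (the propagation would contradict itself), which is exactly the kind of property the restricted model lets a tool check statically.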
4.2 The MoC Framework

In the remainder of this chapter we discuss a framework that accommodates MoCs with different timing abstractions. It is based on the process constructor, a mechanism to instantiate processes. A process constructor takes one or more pure functions as arguments and creates a process. The functions represent
the process behavior and have no notion of time or concurrency. They simply take arguments and produce results. The process constructor is responsible for establishing communication with other processes. It defines the time representation and the communication and synchronization semantics. A set of process constructors determines a particular MoC. This leads to a systematic and clean separation of computation and communication. A function that defines the computation of a process can in principle be used to instantiate processes in different computational models. However, a computational model may put constraints on functions. For instance, the synchronous MoC requires a function to take exactly one event on each input and produce exactly one event for each output. The untimed MoC has no similar requirement. After some preliminary definitions in this section, we introduce the untimed processes, give a formal definition of an MoC, and define the untimed MoC (Section 4.2.3), the perfectly synchronous and the clocked synchronous MoC (Section 4.2.4), and the discrete time MoC (Section 4.2.5). Based on this we introduce interfaces between MoCs and present an interface refinement procedure in the next section. Furthermore, we discuss the refinement from an untimed MoC to a synchronous MoC and to a timed MoC.
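A process constructor along these lines might be sketched as follows. The constructor name mealyU and its argument order are illustrative assumptions, not the chapter's formal definition; the point is that the supplied functions are pure and carry no notion of time or communication, while the constructor supplies the interface discipline.

```python
# Sketch of a process constructor: it takes pure functions (next-state
# and output, neither aware of signals or time) plus an initial state,
# and returns a process, i.e., a function from signals to signals.

def mealyU(next_state, output, initial_state):
    def process(signal):
        state, out = initial_state, []
        for event in signal:
            out.append(output(state, event))   # emit for current state
            state = next_state(state, event)   # then update the state
        return out
    return process

# The same pure functions could in principle be handed to a synchronous
# or timed constructor; only the interface semantics would change.
accumulate = mealyU(lambda s, e: s + e, lambda s, e: s + e, 0)
out = accumulate([1, 2, 3, 4])   # running sums
```

The returned process is still a function on signals: identical input signals always yield identical output signals, even though the process carries internal state.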
4.2.1 Processes and Signals

Processes communicate with each other by writing to and reading from signals. Given is a set of values V, which represents the data communicated over the signals. Events, which are the basic elements of signals, are or contain values. We distinguish among three different kinds of events. Untimed events Ė are just values without further information, Ė = V. Synchronous events Ē include a pseudo-value ⊥ in addition to the normal values, hence Ē = V ∪ {⊥}. Timed events Ê are identical to synchronous events, Ê = Ē. However, since it is often useful to distinguish them, we use different symbols. Intuitively, timed events occur at much finer granularity than synchronous events and they would usually represent physical time units, such as a nanosecond. In contrast, synchronous events represent abstract time slots or clock cycles. This model of events and time can only accommodate discrete time models. Continuous time would require a different representation of time and events. We use the symbols ė, ē, and ê to denote individual untimed, synchronous, and timed events, respectively. We use E = Ė ∪ Ē ∪ Ê and e ∈ E to denote any kind of event. Signals are sequences of events. Sequences are ordered and we use subscripts as in ei to denote the ith event in a signal. For example, a signal may be written as ⟨e0, e1, e2⟩. In general, signals can be finite or infinite sequences of events and S is the set of all signals. We also distinguish among three kinds of signals: Ṡ, S̄, and Ŝ denote the untimed, synchronous, and timed signal sets, respectively, and ṡ, s̄, and ŝ designate individual untimed, synchronous, and timed signals.
⟨⟩ is the empty signal and ⊕ concatenates two signals. Concatenation is associative and has the empty signal as its neutral element: s1 ⊕ (s2 ⊕ s3) = (s1 ⊕ s2) ⊕ s3, ⟨⟩ ⊕ s = s ⊕ ⟨⟩ = s. To keep the notation simple we often treat individual events as one-event sequences, for example, we may write e ⊕ s to denote ⟨e⟩ ⊕ s. We use angle brackets “⟨” and “⟩” not only to denote ordered sets or sequences of events, but also to denote sequences of signals if we impose an order on a set of signals. #s gives the length of signal s. Infinite signals have infinite length and #⟨⟩ = 0. [ ] is an index operation to extract an event at a particular position from a signal. For example, s[2] = e2 if s = ⟨e1, e2, e3⟩. Processes are defined as functions on signals, p : S → S. Processes are functions in the sense that for a given input signal we always get the same output signal, that is, s = s′ ⇒ p(s) = p(s′). Note that this still allows processes to have an internal state. Thus, a process does not necessarily react identically to the same event applied at different times. But it will
© 2006 by Taylor & Francis Group, LLC
Models of Embedded Computation
s = ⟨r0, r1, …⟩ = ⟨⟨e0, e1, e2⟩, ⟨e3, e4, e5⟩, …⟩,  π(ν, s) = ⟨ri⟩ with ν(i) = 3 for all i  (input of p)
s′ = ⟨r′0, r′1, …⟩ = ⟨⟨e′0, e′1⟩, ⟨e′2, e′3⟩, …⟩,  π(ν′, s′) = ⟨r′i⟩ with ν′(i) = 2 for all i  (output of p)
FIGURE 4.2 The input signal of process p is partitioned into an infinite sequence of subsignals, each of which contains three events, while the output signal is partitioned into subsignals of length 2.
produce the same, possibly infinite, output signal when confronted with identical, possibly infinite, input signals provided it starts with the same initial state.
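To make this process notion concrete, the following small sketch shows a process as a pure function on signals that nevertheless carries internal state (Python is used purely for illustration; the list encoding of signals and all names are our own):

```python
# A signal is modeled as a Python list of events; a process is a
# function from signals to signals.  State is threaded explicitly,
# so the process is still a pure function: the same input signal
# always yields the same output signal.

def running_sum(signal, initial_state=0):
    """A process with internal state: emits the running sum of its input."""
    state = initial_state
    out = []
    for event in signal:
        state = state + event      # next-state computation
        out.append(state)          # output encoding
    return out

s = [1, 2, 3]
assert running_sum(s) == [1, 3, 6]
# Determinism: s == s' implies p(s) == p(s').
assert running_sum(s) == running_sum([1, 2, 3])
# The same event value can produce different outputs at different
# positions because the internal state differs, as the text describes.
```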
4.2.2 Signal Partitioning

We shall use the partitioning of signals into subsequences to define the portions of a signal that are consumed or emitted by a process in each evaluation cycle. A partition π(ν, s) of a signal s defines an ordered set of signals, ⟨ri⟩, which, when concatenated together, form “almost” the original signal s. The function ν : N0 → N0 defines the lengths of all elements in the partition. ν(0) = #r0 gives the length of the first element in the partition, ν(1) = #r1 gives the length of the second element, etc.

Example 4.1
Let s1 = ⟨1, 2, 3, 4, 5, 6, 7, 8, 9, 10⟩ and ν1(0) = ν1(1) = 3, ν1(2) = 4. Then we get the partition π(ν1, s1) = ⟨⟨1, 2, 3⟩, ⟨4, 5, 6⟩, ⟨7, 8, 9, 10⟩⟩. Let s2 = ⟨1, 2, 3, …⟩ be the infinite signal with ascending integers. Let ν2(i) = 2 for all i ≥ 0. The resulting partition is infinite: π(ν2, s2) = ⟨⟨1, 2⟩, ⟨3, 4⟩, …⟩.

The function ν(i) defines the length of the subsignals ri. If it is constant for all i we usually omit the argument and write ν. Figure 4.2 illustrates a process with an input signal s and an output signal s′. s is partitioned into subsignals of length 3 and s′ into subsignals of length 2.
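The partitioning function can be sketched as follows (a Python illustration for finite signals; dropping an incomplete tail, so that the subsignals form “almost” the original signal, is our reading of the definition):

```python
def partition(nu, s):
    """Partition signal s into subsignals; nu(i) gives the length of
    the i-th subsignal, as in pi(nu, s)."""
    result, i, pos = [], 0, 0
    while pos < len(s):
        n = nu(i)
        if pos + n > len(s):        # an incomplete tail is dropped:
            break                   # the partition forms "almost" s
        result.append(s[pos:pos + n])
        i, pos = i + 1, pos + n
    return result

# Example 4.1: s1 partitioned with nu1 = 3, 3, 4
s1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
nu1 = lambda i: [3, 3, 4][i]
assert partition(nu1, s1) == [[1, 2, 3], [4, 5, 6], [7, 8, 9, 10]]
```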
4.2.3 Untimed MoCs

4.2.3.1 Process Constructors

Our aim is to relate functions of events to processes, which are functions of signals. Therefore we introduce process constructors that can be considered as higher-order functions. They take functions on events as arguments and return processes. We define only a few basic process constructors that can be used to compose more complex processes and process networks. All untimed process constructors and processes operate exclusively on untimed signals. Processes with an arbitrary number of input and output signals are cumbersome to deal with in a formal way. To avoid this inconvenience we mostly deal with processes with one input and one output. To handle arbitrary processes as well, we introduce “zip” and “unzip” processes that merge two input signals into one and split one input signal into two output signals, respectively. These processes, together with appropriate process composition, allow us to express arbitrary behavior. Processes instantiated with the mealyU constructor resemble Mealy state machines in that they have a next state function and an output encoding function that depend on both the input and the current state.

Definition 4.1 Let V be an arbitrary set of values, let g, f : (V × Ṡ) → Ṡ be the next state and output encoding functions, let γ : V → N be a function defining the input partitioning, and let w0 ∈ V be an initial state. mealyU is a process constructor which, given γ, f, g, and w0 as arguments, instantiates a process p : Ṡ → Ṡ.
Embedded Systems Handbook
The function γ determines the number of events consumed by the process in the current evaluation cycle. γ is dependent on the current state. p repeatedly applies g on the current state and the input events to compute the next state. Further, it applies f repeatedly on the current state and the input events to compute the output events. Processes instantiated by mealyU are general state machines with one input and one output. To create processes with arbitrary inputs and outputs, we also use the following constructors:

• zipU instantiates a process with two inputs and one output. In every evaluation cycle this process takes one event from the left input and one event from the right input and packs them into an event pair that is emitted at the output.

• unzipU instantiates a process with one input and two outputs. In every evaluation cycle this process takes one event from the input. It requires it to be an event pair. The first event of this pair is emitted to the left output, the second event of the event pair is emitted to the right output.

For truly general process networks we would in fact need more complex zip processes, but for the purpose of this chapter the simple constructors are sufficient and we refer the reader to Reference 39 for details.

4.2.3.2 Composition Operators

We consider only three basic composition operators, namely sequential composition, parallel composition, and feedback. We give the definitions only for processes with one or two input and output signals, because the generalization to arbitrary numbers of inputs and outputs is straightforward.

Definition 4.2 Let p1, p2 : Ṡ → Ṡ be two processes with one input and one output each, and let s1, s2 ∈ Ṡ be two signals. Their parallel composition, denoted as p1 ∥ p2, is defined as follows: (p1 ∥ p2)(⟨s1, s2⟩) = ⟨p1(s1), p2(s2)⟩.

Since processes are functions we can easily define sequential composition in terms of functional composition.
Definition 4.3 Let again p1, p2 : Ṡ → Ṡ be two processes and let s ∈ Ṡ be a signal. The sequential composition, denoted as p2 ◦ p1, is defined as follows: (p2 ◦ p1)(s) = p2(p1(s)).

Definition 4.4 Given a process p : (S × S) → (S × S) with two input signals and two output signals, we define the process µp : S → S by the equation (µp)(s1) = s2
where p(s1 , s3 ) = (s2 , s3 ).
The behavior of the process µp is defined by the least fixed point semantics based on the prefix order of signals. The µ operator gives feedback loops (Figure 4.3) a well-defined semantics. Moreover, the value of the feedback signal can be constructed by repeatedly simulating the process network, starting with the empty signal, until the values on all feedback signals stabilize and do not change any more [39]. Now we are in a position to define precisely what we mean by an MoC.

Definition 4.5 An MoC is a 2-tuple MoC = (C, O), where C is a set of process constructors, each of which, when given constructor specific parameters, instantiates a process. O is a set of process composition operators, each of which, when given processes as arguments, instantiates a new process.
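A minimal sketch of the mealyU constructor and sequential composition, under the same list encoding of signals as above (Python illustration; all helper names are our own, and the argument order γ, f, g, w0 follows Definition 4.1):

```python
def mealy_u(gamma, f, g, w0):
    """Sketch of the mealyU constructor: returns a process (a function
    on signals).  gamma(state) gives the number of events consumed per
    evaluation cycle, f computes the output events, g the next state."""
    def process(signal):
        state, pos, out = w0, 0, []
        while pos < len(signal):
            n = gamma(state)
            if pos + n > len(signal):
                break                       # not enough events left
            events = signal[pos:pos + n]
            out.extend(f(state, events))    # output encoding
            state = g(state, events)        # next state
            pos += n
        return out
    return process

# A process consuming 2 events per cycle and emitting their sum,
# and a process doubling every event:
pairwise_sum = mealy_u(lambda w: 2, lambda w, es: [sum(es)], lambda w, es: w, 0)
double = mealy_u(lambda w: 1, lambda w, es: [2 * es[0]], lambda w, es: w, 0)

# Sequential composition: (double o pairwise_sum)(s) = double(pairwise_sum(s))
composed = lambda s: double(pairwise_sum(s))
assert pairwise_sum([1, 2, 3, 4]) == [3, 7]
assert composed([1, 2, 3, 4]) == [6, 14]
```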
FIGURE 4.3 Feedback composition of a process: the second output s3 of p is fed back to its second input, yielding µp with input s1 and output s2.
Definition 4.6 The untimed MoC is defined as untimed MoC = (C, O), where

C = {mealyU, zipU, unzipU}
O = {∥, ◦, µ}

In other words, a process or a process network belongs to the untimed MoC domain iff all its processes and process compositions are constructed either by one of the named process constructors or by one of the composition operators. We call such processes U-MoC processes. Because the process interface is separated from the functionality of the process, interesting transformations can be done. For instance, a process can be mechanically transformed into a process that consumes and produces a multiple of the number of events of the original process. Processes can be easily merged into more complex processes. Moreover, there may be the opportunity to move functionality from one process to another. For more details on this kind of transformation see Reference 39.
4.2.4 The Synchronous MoC

The synchronous languages StateCharts [40], Esterel [41], Signal [42], Argos, Lustre [43], and some others have been developed on the basis of the perfect synchrony assumption.

Perfect synchrony hypothesis. Neither computation nor communication takes time.

Timing is entirely determined by the arrival of input events because the system processes input samples in zero time and then waits until the next input arrives. If the implementation of the system is fast enough to process all inputs before the next sample arrives, it will behave exactly as the specification in the synchronous language.

4.2.4.1 Process Constructors

Formally, we develop synchronous processes as a special case of untimed processes. This will allow us later to easily connect different domains. Synchronous processes have two specific characteristics. First, all synchronous processes consume and produce exactly one event on each input or output in each evaluation cycle, that is, the input and output signal partitionings are always the constant 1. Second, in addition to the value set V, events can carry the special value ⊥, which denotes the absence of an event; this is the way we defined synchronous events Ē and signals S̄ in Section 4.2.1. Both the processes and their contained functions must be able to deal with these events. All synchronous process constructors and processes operate exclusively on synchronous signals.

Definition 4.7 Let V be an arbitrary set of values, Ē = V ∪ {⊥}, let g, f : (Ē × S̄) → S̄, and let w0 ∈ V be an initial state. mealyS is a process constructor which, given f, g, and w0 as arguments, instantiates a process p : S̄ → S̄. p repeatedly applies g on the current state and the input event to compute the next state. Further it
applies f repeatedly on the current state and the input event to compute the output event. p consumes exactly one input event in each evaluation cycle and emits exactly one output event. We only require that g and f are defined for absent input events and that the output signal partitioning is the constant 1. When we merge two signals into one, we have to decide how to represent the absence of an event in one input signal in the compound signal. We choose to use the ⊥ symbol for this purpose also, which has the consequence that ⊥ also appears in tuples together with normal values. Thus, it is essentially used for two different purposes. Having clarified this, the definition of zipS and unzipS is straightforward. zipS-based processes pack two events from the two inputs into an event pair at the output, while unzipS performs the inverse operation.

4.2.4.2 The Perfectly Synchronous MoC

Again, we can now make precise what we mean by the synchronous MoC.

Definition 4.8 The synchronous MoC is defined as synchronous MoC = (C, O), where

C = {mealyS, zipS, unzipS}
O = {∥, ◦, µS}

In other words, a process or a process network belongs to the synchronous MoC domain iff all its processes and process compositions are constructed either by one of the named process constructors or by one of the composition operators. We call such processes S-MoC processes. Note that we do not use the same feedback operator for the synchronous MoC. µS defines the semantics of the feedback loop based on the Scott order of the values in Ē. It is also based on a fixed point semantics, but it is resolved for each event and not over a complete signal. We have adopted µS to be consistent with the zero-delay feedback loop semantics of most synchronous languages. For our purpose here this is not significant and we do not need to go into more details. For precise definitions and a thorough motivation the reader is referred to Reference 39.
Merging of processes and other related transformations are very simple in the synchronous MoC because all processes have essentially identical interfaces. For instance, the merge of two mealyS-based processes can be formulated as follows:

mealyS(g1, f1, v0) ◦ mealyS(g2, f2, w0) = mealyS(g, f, (v0, w0))

where

g((v, w), ē) = (g1(v, f2(w, ē)), g2(w, ē))
f((v, w), ē) = f1(v, f2(w, ē))
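This merge rule can be checked numerically. The sketch below (Python illustration; ABSENT stands for ⊥, and the argument order mealyS(g, f, w0) follows the formula above, while all other names are our own) builds both the sequential composition and the merged machine and compares them:

```python
ABSENT = None  # stands for the bottom symbol (absent event)

def mealy_s(g, f, w0):
    """mealyS sketch: one event in, one event out per evaluation cycle."""
    def process(signal):
        state, out = w0, []
        for e in signal:
            out.append(f(state, e))    # output encoding
            state = g(state, e)        # next state
        return out
    return process

# Two concrete synchronous processes; their functions handle ABSENT:
g2 = lambda w, e: w if e is ABSENT else w + e       # accumulator
f2 = lambda w, e: ABSENT if e is ABSENT else w + e
g1 = lambda v, e: v if e is ABSENT else max(v, e)   # running maximum
f1 = lambda v, e: v if e is ABSENT else max(v, e)

p1, p2 = mealy_s(g1, f1, 0), mealy_s(g2, f2, 0)
composed = lambda s: p1(p2(s))         # p1 o p2

# The merged machine, built exactly as in the merge formula:
g = lambda vw, e: (g1(vw[0], f2(vw[1], e)), g2(vw[1], e))
f = lambda vw, e: f1(vw[0], f2(vw[1], e))
merged = mealy_s(g, f, (0, 0))

s = [3, ABSENT, 1, 5]
assert composed(s) == merged(s) == [3, 3, 4, 9]
```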
4.2.4.3 The Clocked Synchronous MoC

It is useful to define a variant of the perfectly synchronous MoC, the clocked synchronous MoC, which is based on the following hypothesis.

Clocked synchronous hypothesis. There is a global clock signal controlling the start of each computation in the system. Communication takes no time and computation takes one clock cycle.

First, we define a delay process Δ that delays all inputs by one evaluation cycle:

Δ = mealyS(f, g, ⊥) where g(w, ē) = ē, f(w, ē) = w

Based on this delay process we define the constructors for the clocked synchronous model.
Definition 4.9

mealyCS(g, f, w0) = mealyS(g, f, w0) ◦ Δ
zipCS()(s̄1, s̄2) = zipS()(Δ(s̄1), Δ(s̄2))      (4.1)
unzipCS() = unzipS() ◦ Δ
Thus, elementary processes are composed of a combinatorial function and a delay function that essentially represents a latch at the inputs.

Definition 4.10 The clocked synchronous MoC is defined as clocked synchronous MoC = (C, O), where

C = {mealyCS, zipCS, unzipCS}
O = {∥, ◦, µ}

In other words, a process or a process network belongs to the clocked synchronous MoC domain iff all its processes and process compositions are constructed either by one of the named process constructors or by one of the composition operators. We call such processes CS-MoC processes.
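The delay process Δ and a mealyCS-style constructor can be sketched as follows (Python illustration; ABSENT stands for ⊥, and the helper names and argument order are our own):

```python
ABSENT = None  # stands for the bottom symbol (absent event)

def mealy_s(g, f, w0):
    """mealyS sketch: one event in, one event out per cycle."""
    def process(signal):
        state, out = w0, []
        for e in signal:
            out.append(f(state, e))
            state = g(state, e)
        return out
    return process

# The delay process Delta: g stores the input, f emits the previously
# stored event; the initial state is the absent event.
delta = mealy_s(lambda w, e: e, lambda w, e: w, ABSENT)

# mealyCS = combinatorial mealyS part after Delta: every input is
# latched for one clock cycle before the combinatorial part sees it.
def mealy_cs(g, f, w0):
    p = mealy_s(g, f, w0)
    return lambda signal: p(delta(signal))

inc = mealy_cs(lambda w, e: w,
               lambda w, e: ABSENT if e is ABSENT else e + 1, 0)
assert delta([1, 2, 3]) == [ABSENT, 1, 2]
assert inc([1, 2, 3]) == [ABSENT, 2, 3]
```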
4.2.5 Discrete Timed MoCs

Timed processes are a blend of untimed and synchronous processes in that they can consume and produce more than one event per cycle and they also deal with absent events. In addition, they have to comply with the constraint that output events cannot occur before the input events of the same evaluation cycle. This is achieved by enforcing an equal number of input and output events for each evaluation cycle, and by prepending an initial sequence of absent events. Since the signals also represent the progression of time, the prefix of absent events at the outputs corresponds to an initial delay of the process in reacting to the inputs. Moreover, the partitioning of input and output signals corresponds to the duration of each evaluation cycle.

Definition 4.11 mealyT is a process constructor which, given γ, f, g, and w0 as arguments, instantiates a process p : Ŝ → Ŝ.

Again, γ is a function of the current state and determines the number of input events consumed in a particular evaluation cycle. Function g computes the next state and f computes the output events, with the constraint that the output events do not occur earlier than the input events on which they depend. This constraint is necessary because in the timed MoC each event corresponds to a time stamp and we have a global total order of time, relating all events in all signals to each other. To avoid causality flaws every process has to abide by this constraint. Similarly, zipT-based processes consume events from their two inputs and pack them into tuples of events emitted at the output. unzipT performs the inverse operation. Both also have to comply with the causality constraint. Again, we can now make precise what we mean by the timed MoC.

Definition 4.12 The timed MoC is defined as timed MoC = (C, O), where

C = {mealyT, zipT, unzipT}
O = {∥, ◦, µ}

In other words, a process or a process network belongs to the timed MoC domain iff all its processes and process compositions are constructed either by one of the named process constructors or by one of the composition operators. We call such processes T-MoC processes.
Merging, other transformations, as well as analysis of timed process networks are more complicated than for synchronous or untimed MoCs, because the timing may interfere with the pure functional behavior. However, we can further restrict the functions used in constructing the processes, to more or less separate behavior from timing also in the timed MoC. To illustrate this we discuss a few variants of the Mealy process constructor.

mealyPT. In mealyPT(γ, f, g, w0)-based processes the functions f and g are not exposed to absent events and they are only defined on untimed sequences. The interface of the process strips off all absent events of the input signal, hands over the result to f and g, and inserts absent events at the output as appropriate to provide proper timing for the output signal. The function γ, which may depend on the process state as usual, defines how many events are consumed. Essentially, it represents a timer and determines when the input should be checked the next time.

mealyST. In mealyST(γ, f, g, w0)-based processes γ determines the number of nonabsent events that should be handed over to f and g for processing. Again, f and g never see or produce absent events, and the process interface is responsible for providing them with the appropriate input data and for synchronization and timing issues on inputs and outputs. Unlike mealyPT processes, functions f and g in mealyST processes have no influence on when they are invoked. They only control how many nonabsent events have appeared before their invocation. f and g in mealyPT processes, on the other hand, determine the time instant of their next invocation independent of the number of nonabsent events.

mealyTT. A combination of these two process constructors is mealyTT, which allows us to control the number of nonabsent input events and a maximum time period, after which the process is activated in any case, independent of the number of nonabsent input events received. This allows us to model processes that wait for input events but can set internal timers to provide time-outs.

These examples illustrate that process constructors and MoCs can be defined which allow us to precisely define to which extent communication issues are separated from the purely functional behavior of the processes. Obviously, a stricter separation greatly facilitates verification and synthesis but may restrict expressiveness.
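As an illustration of how such an interface separates timing from function, here is a sketch in the spirit of mealyST (Python; the output timing policy, padding with absent events until a result is ready, is our own simplification, not the book's definition):

```python
ABSENT = None  # stands for the bottom symbol (absent event)

def mealy_st(gamma, f, g, w0):
    """mealyST-style sketch: the interface strips absent events before
    handing them to f and g; gamma(state) gives the number of nonabsent
    events to collect before f and g are invoked."""
    def process(signal):
        state, buf, out = w0, [], []
        for e in signal:
            if e is not ABSENT:
                buf.append(e)          # interface collects real events
            if len(buf) == gamma(state):
                out.extend(f(state, buf))   # f never sees ABSENT
                state = g(state, buf)       # g never sees ABSENT
                buf = []
            else:
                out.append(ABSENT)     # nothing ready: pad with absent
        return out
    return process

# Sums every 2 nonabsent events, regardless of how absents interleave:
summer = mealy_st(lambda w: 2, lambda w, es: [sum(es)],
                  lambda w, es: w, 0)
assert summer([1, ABSENT, 2, 3, 4]) == [ABSENT, ABSENT, 3, ABSENT, 7]
```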
4.3 Integration of MoCs

4.3.1 MoC Interfaces

Interfaces between different MoCs determine the relation of the time structure in the different domains and they influence the way a domain is triggered to evaluate inputs and produce outputs. If an MoC domain is time triggered, the time signal is made available through the interface. Other domains are triggered when input data is available. Again, the input data appears through the interfaces. We introduce a few simple interfaces for the MoCs of the previous sections, in order to be able to discuss concrete examples.

Definition 4.13 A stripS2U process constructor takes no arguments and instantiates a process p : S̄ → Ṡ, which takes a synchronous signal as input and generates an untimed signal as output. It reproduces all data from the input in the output in the same order, with the exception of the absent event, which is translated into the value 0.

Definition 4.14 An insertU2S process constructor takes no arguments and instantiates a process p : Ṡ → S̄, which takes an untimed signal as input and generates a synchronous signal as output. It reproduces all data from the input in the output in the same order without any change.

These interface processes between the synchronous and the untimed MoCs are very simple. However, they establish a strict and explicit time relation between two connected domains. Connecting processes from different MoCs also requires a proper semantic basis, which we provide by defining a hierarchical MoC.
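These two interface constructors are simple enough to sketch directly (Python illustration; ABSENT stands for ⊥, and the ⊥-to-0 translation follows Definition 4.13):

```python
ABSENT = None  # stands for the bottom symbol (absent event)

def strip_s2u(s_signal):
    """stripS2U sketch: synchronous -> untimed; per Definition 4.13 the
    absent event is translated into the value 0."""
    return [0 if e is ABSENT else e for e in s_signal]

def insert_u2s(u_signal):
    """insertU2S sketch: untimed -> synchronous; data passes through
    unchanged, one event per time slot."""
    return list(u_signal)

assert strip_s2u([5, ABSENT, 7]) == [5, 0, 7]
assert insert_u2s([5, 0, 7]) == [5, 0, 7]
```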
Definition 4.15 A hierarchical model of computation (HMoC) is a 3-tuple HMoC = (M, C, O), where M is a set of HMoCs or simple MoCs, each capable of instantiating processes or process networks; C is a set of process constructors; O is a set of process composition operators that governs the process composition at the highest hierarchy level but not inside process networks instantiated by any of the HMoCs of M.

In the following examples and discussion we will use a specific but rather simple HMoC.

Definition 4.16 H = (M, C, O) with

M = {U-MoC, S-MoC}
C = {stripS2U, insertU2S}
O = {∥, ◦, µ}

Example 4.2
As an example, consider the equalizer system of Figure 4.4 [39]. The control part consists of two synchronous MoC processes, and the dataflow part, modeled as untimed MoC processes, filters and analyzes an audio stream. Depending on the analysis results of the Analyzer process, the Distortion control will modify the filter parameters. The Button control also takes user input into account to steer the filter. The purpose of Analyzer and Distortion control is to avoid dangerously strong signals that could jeopardize the loudspeakers. Control and dataflow parts are connected via two interface processes. The dataflow processes can be developed and verified separately in the untimed MoC domain, but as soon as they are connected to the synchronous MoC control part, the time structure of the synchronous MoC domain gets imposed on all the untimed MoC processes. With the simple interfaces of Figure 4.4, the Filter process consumes 4096 data tokens from the primary input and 1 token from the stripS2U process, and it emits 4096 tokens in every synchronous MoC time slot. Similarly, the activity of the Analyzer is precisely defined for every synchronous MoC time slot. Also, the activities of the two control processes are related precisely to the activities of the dataflow processes in every time slot. Moreover, the timing of the two primary inputs and the primary outputs is now related timewise. Their timing must be consistent because the timing of the primary input data determines the timing of the entire system. For example, if the input signal to
[Figure 4.4: an S-MoC domain containing the Button control and Distortion control processes is connected through stripS2U and insertU2S interface processes to a U-MoC domain containing the Filter and Analyzer processes; the control and interface processes consume and produce 1 token per evaluation cycle, while the Filter and Analyzer consume and produce 4096 tokens.]
FIGURE 4.4 A digital equalizer consisting of a dataflow part and control. The numbers annotating process inputs and outputs denote the number of tokens consumed and produced in each evaluation cycle. (From A. Jantsch. Modeling Embedded Systems and SoCs. Morgan Kaufmann Publishers, San Francisco, CA, 2004. With permission.)
the Button control process assumes that each time slot has the same time duration, the 4096 data samples of the Filter input in each evaluation cycle must correspond to the same constant time period. It is the responsibility of the domain interfaces to correctly relate the timing of the different domains to each other. The time relations established by all interfaces must be consistent with each other and with the timing of the primary inputs. For instance, if the stripS2U process takes 1 token as input and emits 1 token as output in each evaluation cycle, the insertU2S process cannot take 1 token as input and produce 2 tokens as output. The interfaces in Figure 4.4 are very simple and lead to a strict coupling between the two MoC domains. Could more sophisticated or nondeterministic interfaces avoid this coupling effect? The answer is no, because even if the input and output tokens of the interfaces vary from evaluation cycle to evaluation cycle in complex or nondeterministic ways, we still have a very precise timing relation in each and every time slot. Since in every evaluation cycle all interface processes must consume and produce a particular number of tokens, this determines the time relation in that particular cycle. Even though this relation may vary from cycle to cycle, it is still well defined for all cycles and hence for the entire execution of the system. The possibly nondeterministic communication delay between MoC domains, as well as between any other processes, can be modeled, but this should not be confused with establishing a time relation between two MoC domains.
4.3.2 Interface Refinement

In order to show this difference and to illustrate how abstract interfaces can be gradually refined to accommodate channel delay information and detailed protocols, we propose an interface refinement procedure, given below:

1. Add a time interface. When we connect two different MoC domains, we always have to define the time relation between the two. This is the case even if the two domains are of the same type, for example, both are synchronous MoC domains, because the basic time unit may or may not be identical in the two domains. In our MoC framework the occurrence of events also represents time in both the synchronous MoC and timed MoC domains. Thus, setting the time relation means determining the number of events in one domain that correspond to one event in the other domain. For example, in Figure 4.4 the interfaces establish a one-to-one relation, while the interface in Figure 4.5 represents a 3/2 relation.
[Figure 4.5: two MoC domains A and B containing processes P and Q; in the refined version an interface process I1 connects them, consuming 3 tokens on the MoC A side for every 2 tokens it produces on the MoC B side.]
FIGURE 4.5 Determining the time relation between two MoC domains. (From A. Jantsch. Modeling Embedded Systems and SoCs. Morgan Kaufmann Publishers, San Francisco, CA, 2004. With permission.)
In other frameworks the establishing of a time relation will take a different form. For instance, if languages such as SystemC or VHDL are used, the time of the different domains has to be related to the common time base of the simulator.

2. Refine the protocol. When the time relation between the two domains is established, we have to provide a protocol that is able to communicate over the final interface at that point. The two domains may represent different clocking regimes on the same chip, or one may end up as software while the other is implemented as hardware, or both may be implemented as software on different chips or cores, etc. Depending on the final implementations we have to develop a protocol fulfilling the requirements of the interface, such as buffering and error control. In our example in Figure 4.6 we have selected a simple handshake protocol with limited buffering capability. Note, however, that this assumes that for every three events arriving from MoC A there are only two useful events to be delivered to MoC B. The interface processes I1 and I2, and the protocol processes P1, P2, Q1, and Q2, must be designed carefully to avoid both losing data and deadlock.

3. Model the channel delay. In order to have a realistic channel behavior, the delay can be modeled deterministically or stochastically. In Figure 4.7 we have added a stochastic delay varying between 2 and 5 MoC B cycles. The protocol will require more buffering to accommodate the varying delays. To dimension the buffers correctly we have to identify the average and the worst-case behavior that we should be able to handle.

The refinement procedure proposed here is consistent with and complementary to other techniques proposed, for example, in the context of SystemC [44]. We only want to emphasize here that the time relation between domains has to be separated from channel delay and protocol design. Often these issues
FIGURE 4.6 A simple handshake protocol. (From A. Jantsch. Modeling Embedded Systems and SoCs. Morgan Kaufmann Publishers, San Francisco, CA, 2004. With permission.)
FIGURE 4.7 The channel delay can vary between 2 and 5 cycles measured in MoC B cycles. (From A. Jantsch. Modeling Embedded Systems and SoCs. Morgan Kaufmann Publishers, San Francisco, CA, 2004. With permission.)
are not separated clearly, making interface design more complicated than necessary. More details about this procedure and the example can be found in Reference 39.
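The time-relation step of this procedure can be illustrated with a toy rate-converting interface like the 3/2 interface of Figure 4.5 (Python sketch; which events count as "useful" is a hypothetical choice here, since the text leaves it to the protocol design):

```python
def rate_interface(signal, n_in=3, n_out=2):
    """Toy sketch of an interface with an n_in-to-n_out time relation:
    for every n_in events arriving from MoC A, only n_out useful events
    are delivered to MoC B.  Which events are 'useful' is a protocol
    decision; here we simply keep the first n_out of each group,
    a hypothetical choice, not prescribed by the text."""
    out = []
    for i in range(0, len(signal) - n_in + 1, n_in):
        out.extend(signal[i:i + n_out])
    return out

# Two complete groups of three MoC A events yield two pairs for MoC B:
assert rate_interface([1, 2, 3, 4, 5, 6]) == [1, 2, 4, 5]
```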
4.3.3 MoC Refinement

The three introduced MoCs represent three time abstractions and, naturally, design often starts with higher time abstractions and gradually leads to lower abstractions. It is not always appropriate to start with an untimed MoC, because when timing properties are an inherent and crucial part of the functionality, a synchronous model is more appropriate to start with. But if we start with an untimed model, we need to map it onto an architecture with concrete timing properties. Frequently, resource sharing makes the consideration of time functionally relevant, because of deadlock problems and complicated interaction patterns. All three phenomena discussed in Section 4.1.4, priority inversion, performance inversion, and over-synchronization, emerged due to resource sharing.

Example 4.3
We discuss therefore an example of MoC refinement from the untimed through the synchronous to the timed MoC, which is driven by resource sharing. In Figure 4.8 we have two untimed MoC process pairs, which are functionally independent of each other. At this level, under the assumption of infinite buffers and unlimited resources, we can analyze and develop the core functionality embodied by the process internal functions f and g. In the first refinement step, shown in Figure 4.9, we introduce finite buffers between the processes. Bn,2 and Bm,2 represent buffers of size n and m, respectively. Since the untimed MoC assumes implicitly infinite buffers between two communicating processes, there is no point in modeling finite buffers in the untimed MoC domain. We just would not see any effect. In the synchronous MoC domain, however, we can analyze
FIGURE 4.8 Two independent process pairs: P1 = mealyU(1, fP1, gP1, wP1) feeding Q1 = mealyU(1, fQ1, gQ1, wQ1), and R1 = mealyU(1, fR1, gR1, wR1) feeding S1 = mealyU(1, fS1, gS1, wS1).
FIGURE 4.9 Two independent process pairs with explicit buffers: P2 = mealyS(fP2, gP2, wP2) feeds the buffer Bn,2 = mealyS(fBn, gBn, wBn), which feeds Q2 = mealyS(fQ2, gQ2, wQ2); R2 = mealyS(fR2, gR2, wR2) feeds the buffer Bm,2 = mealyS(fBm, gBm, wBm), which feeds S2 = mealyS(fS2, gS2, wS2).
the consequences of finite buffers. The processes need to be refined. Processes P2 and R2 have to be able to handle full buffers, while processes Q2 and S2 have to handle empty buffers. In the untimed MoC, processes always block on empty input buffers. This behavior can also be modeled easily in synchronous MoC processes. In addition, more complicated behavior such as time-outs can be modeled and analyzed. To find the minimum buffer sizes while avoiding deadlock and ensuring the original system behavior is by itself a challenging task. Basten and Hoogerbrugge [45] propose a technique to address this. More frequently, the buffer minimization problem is formulated as part of the process scheduling problem [46,47].

The communication infrastructure is typically shared among many communicating actors. In Figure 4.10 we map the communication links onto one bus, represented as process I3. It contains an arbiter that resolves conflicts when both processes Bn,3 and Bm,3 try to access the bus at the same time. It also implements a bus access protocol that has to be followed by connecting processes. The synchronous MoC model in Figure 4.10 is cycle true, and the effect of bus sharing on system behavior and performance can be analyzed. A model checker can prove the soundness and fairness of the arbitration algorithm, and performance requirements on the individual processes can be derived to achieve a desirable system performance. Sometimes it is a feasible option to synthesize the model of Figure 4.10 directly into a hardware or software implementation, provided we can use standard templates for the process interfaces. Alternatively, we can refine the model into a fully timed model. However, we still have various options depending on what exactly we would like to model and analyze.
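The full/empty handling that the refined processes must implement can be sketched with a toy bounded buffer (Python illustration; the reject-on-full and absent-on-empty conventions are our own modeling choices, not the book's definitions):

```python
ABSENT = None  # stands for the bottom symbol (absent event)

class BoundedBuffer:
    """Sketch of a finite buffer process between a producer and a
    consumer in the synchronous domain.  Each cycle it may accept one
    event and emit one event; writes to a full buffer are rejected
    (returned to the caller) so the producer can model blocking, and
    an empty buffer emits the absent event so the consumer blocks."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []

    def cycle(self, incoming, consumer_ready):
        rejected = ABSENT
        if incoming is not ABSENT:
            if len(self.items) < self.capacity:
                self.items.append(incoming)
            else:
                rejected = incoming        # full: producer must retry
        emitted = ABSENT
        if consumer_ready and self.items:
            emitted = self.items.pop(0)    # FIFO order
        return emitted, rejected

buf = BoundedBuffer(1)
assert buf.cycle(10, consumer_ready=False) == (ABSENT, ABSENT)
assert buf.cycle(20, consumer_ready=False) == (ABSENT, 20)   # full
assert buf.cycle(ABSENT, consumer_ready=True) == (10, ABSENT)
```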
For each process we can decide how much of the timing and synchronization details should be handled explicitly by the process itself and how much can be handled implicitly by the process interfaces. For instance, in Section 4.2.5 we introduced the constructors mealyST and mealyPT. The first provides a process interface that strips off all absent events at the input and inserts absent events at the output as needed. The internal functions deal only with the functional events and have no access to timing information. This means that an untimed mealyU process can be directly refined into a timed mealyST process with exactly the same functions f and g. Alternatively, the constructor mealyPT provides an interface that invokes the internal functions at regular time intervals. If this interval corresponds to a synchronous time slot, a synchronous MoC process can easily be mapped onto a mealyPT type of process, with the only difference that the functions in a mealyPT process may receive several nonabsent events in each cycle. In both cases the processes experience a notion of time based on cycles.

In Figure 4.11 we have chosen to refine processes P, Q, R, and S into mealyST-based processes, to keep them as similar as possible to the original untimed processes. Thus, the original f and g functions can be used without major modification. The process interfaces are responsible for collecting the inputs, presenting them to the f and g functions, and emitting properly synchronized output. The buffer and the bus processes, however, have been mapped onto mealyPT processes. The constants λ and λ/2 represent the cycle times of the processes. Process Bm,4 operates with half the cycle time of
FIGURE 4.10 Two independent process pairs with explicit buffers. (Process definitions: P3 = mealyS(f3^P, g3^P, w3^P); Q3 = mealyS(f3^Q, g3^Q, w3^Q); Bn,3 = mealyS:2:1(f3^Bn, g3^Bn, w3^Bn); R3 = mealyS(f3^R, g3^R, w3^R); S3 = mealyS(f3^S, g3^S, w3^S); Bm,3 = mealyS:2:1(f3^Bm, g3^Bm, w3^Bm); I3 = mealyS:4:2(f3^I, g3^I, w3^I).)
Process definitions for Figure 4.11: P4 = mealyST(1, f4^P, g4^P, w4^P); Q4 = mealyST(1, f4^Q, g4^Q, w4^Q); Bn,4 = mealyPT:2:1(λ, f4^Bn, g4^Bn, w4^Bn); R4 = mealyST(1, f4^R, g4^R, w4^R); S4 = mealyST(1, f4^S, g4^S, w4^S); Bm,4 = mealyPT:2:1(λ/2, f4^Bm, g4^Bm, w4^Bm); I4 = mealyPT:4:2(λ, f4^I, g4^I, w4^I).
FIGURE 4.11 All processes are refined into the timed MoC but with different synchronization interfaces.
the other processes, which illustrates that the modeling accuracy can be arbitrarily selected. We can also choose other process constructors and hence interfaces if desirable. For instance, some processes can be mapped onto mealyT-type processes in a further refinement step to expose them to even more timing information.
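To make the interface idea concrete, here is a toy sketch, not from the handbook (the constructors there are defined formally, not in Python), of how a mealyST-style wrapper can strip absent events at the input and reinsert them at the output, so that untimed functions f and g are reused unchanged:

```python
ABSENT = None  # the absent event of the timed MoC

# Untimed next-state and output functions, unaware of timing or absent events.
f = lambda state, x: state + x          # next state: accumulate inputs
g = lambda state, x: state + x          # output: emit the running sum

def mealy_st(f, g, state, timed_input):
    """Timed process wrapping untimed f and g: absent events are stripped
    at the input, and an absent event is emitted whenever the input is
    absent, preserving the timing structure of the signal."""
    timed_output = []
    for ev in timed_input:
        if ev is ABSENT:
            timed_output.append(ABSENT)   # keep the time slot, emit absent
        else:
            timed_output.append(g(state, ev))
            state = f(state, ev)
    return timed_output

out = mealy_st(f, g, 0, [1, ABSENT, 2, ABSENT, 3])
print(out)  # [1, None, 3, None, 6]
```

Note how the functional behavior is exactly that of the untimed process applied to the stream [1, 2, 3]; only the interface deals with time slots.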
4.4 Conclusion

We have tried to motivate why MoCs for embedded systems should be different from the many computational models developed in the past. The purpose of a model of embedded computation should be to support the analysis and design of concrete systems. Thus, it needs to deal with the salient and critical features of embedded systems in a systematic way. These features include real-time requirements, power consumption, architecture heterogeneity, application heterogeneity, and real-world interaction.

We have proposed a framework to study different MoCs that allows us to appropriately capture some, but unfortunately not all, of these features. In particular, power consumption and other nonfunctional properties are not covered. Time is the central focus of the framework, but continuous time models are not included, in spite of their relevance for the sensors and actuators in embedded systems. Despite these deficiencies, we hope that we were able to argue well for a few important points:

• Different computational models should and will continue to coexist, for a variety of technical and nontechnical reasons.
• Using the "right" computational model in a design, and for a particular design task, can greatly facilitate the design process and improve the quality of the result. What the "right" model is depends on the purpose and objectives of the design task.
• Time is of central importance, and computational models with different timing abstractions should be used during system development.
From an MoC perspective, several important issues are open research topics and should be addressed urgently to improve the design process for embedded systems:

• We need to identify efficient ways to capture a few important nonfunctional properties in MoCs. At least power and energy consumption, and perhaps signal noise issues, should be attended to.
• The effective integration of different MoCs will require (1) the systematic manipulation and refinement of MoC interfaces and interdomain protocols; (2) the cross-domain analysis of functionality, performance, and power consumption; and (3) global optimization and synthesis, including the migration of tasks and processes across MoC domain boundaries.
• In order to make the benefits and the potential of well-defined MoCs available in practical design work, we need to project MoCs into design languages, such as VHDL, Verilog, SystemC, and C++. This should be done by properly subsetting a language and by developing pragmatics that restrict the use of the language. If accompanied by tools that enforce the restrictions and exploit the properties of the underlying MoC, this will be accepted quickly by designers.

In the future we foresee a continuous and steady further development of MoCs to match future theoretical objectives and practical design purposes. But we also hope that they become better accepted as practically useful devices for supporting the design process, just like design languages, tools, and methodologies.
References

[1] Ralph Gregory Taylor. Models of Computation and Formal Language. Oxford University Press, New York, 1998.
[2] Peter van Emde Boas. Machine models and simulation. In J. van Leeuwen, Ed., Handbook of Theoretical Computer Science, Vol. A: Algorithms and Complexity. Elsevier Science Publishers B.V., Amsterdam, 1990, chap. 1, pp. 1–66.
[3] S. Cook and R. Reckhow. Time bounded random access machines. Journal of Computer and System Sciences, 7:354–375, 1973.
[4] B.M. Maggs, L.R. Matheson, and R.E. Tarjan. Models of parallel computation: a survey and synthesis. In Proceedings of the 28th Hawaii International Conference on System Sciences (HICSS), Vol. 2, 1995, pp. 61–70.
[5] S. Fortune and J. Wyllie. Parallelism in random access machines. In Proceedings of the 10th Annual Symposium on Theory of Computing, San Diego, CA, 1978.
[6] Alok Aggarwal, Ashok K. Chandra, and Marc Snir. Communication complexity of PRAMs. Theoretical Computer Science, 71:3–28, 1990.
[7] Phillip B. Gibbons, Yossi Matias, and Vijaya Ramachandran. The QRQW PRAM: accounting for contention in parallel algorithms. In Proceedings of the 5th Annual ACM-SIAM Symposium on Discrete Algorithms, Arlington, VA, January 1994, pp. 638–648.
[8] Eli Upfal. Efficient schemes for parallel communication. Journal of the ACM, 31:507–517, 1984.
[9] A. Aggarwal, B. Alpern, A.K. Chandra, and M. Snir. A model for hierarchical memory. In Proceedings of the 19th Annual ACM Symposium on Theory of Computing, May 1987, pp. 305–314.
[10] Bowen Alpern, Larry Carter, Ephraim Feig, and Ted Selker. The uniform memory hierarchy model of computation. Algorithmica, 12:72–109, 1994.
[11] Thomas Lengauer. VLSI theory. In J. van Leeuwen, Ed., Handbook of Theoretical Computer Science, Vol. A: Algorithms and Complexity, 2nd ed. Elsevier Science Publishers, Amsterdam, 1990, chap. 16, pp. 835–868.
[12] Johan Eker, Jörn W. Janneck, Edward A. Lee, Jie Liu, Xiaojun Liu, Jozsef Ludvig, Stephen Neuendorffer, Sonia Sachs, and Yuhong Xiong. Taming heterogeneity: the Ptolemy approach. Proceedings of the IEEE, 91:127–144, 2003.
[13] Rolf Ernst. MPSOC performance modeling and analysis. Paper presented at the 3rd International Seminar on Application-Specific Multi-Processor SoC, Chamonix, France, 2003.
[14] Gilles Kahn. The semantics of a simple language for parallel programming. In Proceedings of the IFIP Congress 74. North-Holland, Amsterdam, 1974.
[15] Edward A. Lee and T.M. Parks. Dataflow process networks. Proceedings of the IEEE, 83:773–801, 1995.
[16] Jarvis Dean Brock. A Formal Model for Non-Deterministic Dataflow Computation. Ph.D. thesis, Massachusetts Institute of Technology, Cambridge, MA, 1983.
[17] J. Dean Brock and William B. Ackerman. Scenarios: a model of nondeterminate computation. In J. Diaz and I. Ramos, Eds., Formalization of Programming Concepts, Vol. 107 of Lecture Notes in Computer Science. Springer-Verlag, Heidelberg, 1981, pp. 252–259.
[18] Paul R. Kosinski. A straightforward denotational semantics for nondeterminate data flow programs. In Proceedings of the 5th ACM Symposium on Principles of Programming Languages, 1978, pp. 214–219.
[19] David Park. The 'fairness' problem and nondeterministic computing networks. In J.W. de Bakker and J. van Leeuwen, Eds., Foundations of Computer Science IV, Part 2: Semantics and Logic. Mathematical Centre Tracts, Vol. 159. Amsterdam, The Netherlands, 1983, pp. 133–161.
[20] Robin Milner. Communication and Concurrency. International Series in Computer Science. Prentice Hall, New York, 1989.
[21] C.A.R. Hoare. Communicating sequential processes. Communications of the ACM, 21:666–676, 1978.
[22] Axel Jantsch, Ingo Sander, and Wenbiao Wu. The usage of stochastic processes in embedded system specifications. In Proceedings of the Ninth International Symposium on Hardware/Software Codesign, April 2001.
[23] Edward Ashford Lee and David G. Messerschmitt. Static scheduling of synchronous data flow programs for digital signal processing. IEEE Transactions on Computers, C-36:24–35, 1987.
[24] Chanik Park, Jaewoong Jung, and Soonhoi Ha. Extended synchronous dataflow for efficient DSP system prototyping. Design Automation for Embedded Systems, 6:295–322, 2002.
[25] Axel Jantsch and Per Bjuréus. Composite signal flow: a computational model combining events, sampled streams, and vectors. In Proceedings of the Design and Test Europe Conference (DATE), 2000.
[26] Nicolas Halbwachs. Synchronous programming of reactive systems. In Proceedings of Computer Aided Verification (CAV), 2000.
[27] Albert Benveniste and Gérard Berry. The synchronous approach to reactive and real-time systems. Proceedings of the IEEE, 79:1270–1282, 1991.
[28] Frank L. Severance. System Modeling and Simulation. John Wiley & Sons, New York, 2001.
[29] Averill M. Law and W. David Kelton. Simulation Modeling and Analysis, 3rd ed., Industrial Engineering Series. McGraw-Hill, New York, 2000.
[30] Christos G. Cassandras. Discrete Event Systems. Aksen Associates, Boston, MA, 1993.
[31] Per Bjuréus and Axel Jantsch. Modeling of mixed control and dataflow systems in MASCOT. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 9:690–704, 2001.
[32] Peeter Ellervee, Shashi Kumar, Axel Jantsch, Bengt Svantesson, Thomas Meincke, and Ahmed Hemani. IRSYD: an internal representation for heterogeneous embedded systems. In Proceedings of the 16th NORCHIP Conference, 1998.
[33] P. Le Marrec, C.A. Valderrama, F. Hessel, A.A. Jerraya, M. Attia, and O. Cayrol. Hardware, software and mechanical cosimulation for automotive applications. In Proceedings of the Ninth International Workshop on Rapid System Prototyping, 1998, pp. 202–206.
[34] Ahmed A. Jerraya and K. O'Brien. Solar: an intermediate format for system-level modeling and synthesis. In Jerzy Rozenblit and Klaus Buchenrieder, Eds., Codesign: Computer-Aided Software/Hardware Engineering. IEEE Press, Piscataway, NJ, 1995, chap. 7, pp. 145–175.
[35] Edward A. Lee and David G. Messerschmitt. An Overview of the Ptolemy Project. Report, Department of Electrical Engineering and Computer Science, University of California, Berkeley, January 1993.
[36] Edward A. Lee and Alberto Sangiovanni-Vincentelli. A framework for comparing models of computation. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 17:1217–1229, 1998.
[37] Edward A. Lee. A Denotational Semantics for Dataflow with Firing. Technical Report UCB/ERL M97/3, Department of Electrical Engineering and Computer Science, University of California, Berkeley, January 1997.
[38] Axel Jantsch and Hannu Tenhunen. Will networks on chip close the productivity gap? In Axel Jantsch and Hannu Tenhunen, Eds., Networks on Chip. Kluwer Academic Publishers, Dordrecht, 2003, chap. 1, pp. 3–18.
[39] Axel Jantsch. Modeling Embedded Systems and SoCs — Concurrency and Time in Models of Computation. Systems on Silicon. Morgan Kaufmann Publishers, San Francisco, CA, 2003.
[40] D. Harel. Statecharts: a visual formalism for complex systems. Science of Computer Programming, 8:231–274, 1987.
[41] G. Berry, P. Couronne, and G. Gonthier. Synchronous programming of reactive systems: an introduction to Esterel. In Kazuhiro Fuchi and M. Nivat, Eds., Programming of Future Generation Computers. Elsevier, New York, 1988, pp. 35–55.
[42] Paul le Guernic, Thierry Gautier, Michel le Borgne, and Claude le Maire. Programming real-time applications with SIGNAL. Proceedings of the IEEE, 79:1321–1336, 1991.
[43] N. Halbwachs, P. Caspi, P. Raymond, and D. Pilaud. The synchronous data flow programming language LUSTRE. Proceedings of the IEEE, 79:1305–1320, 1991.
[44] Thorsten Grötker, Stan Liao, Grant Martin, and Stuart Swan. System Design with SystemC. Kluwer Academic Publishers, Dordrecht, 2002.
[45] Twan Basten and Jan Hoogerbrugge. Efficient execution of process networks. In Alan Chalmers, Majid Mirmehdi, and Henk Muller, Eds., Communicating Process Architectures. IOS Press, Amsterdam, 2001.
[46] Sundararajan Sriram and Shuvra S. Bhattacharyya. Embedded Multiprocessors: Scheduling and Synchronization. Marcel Dekker, New York, 2000.
[47] Shuvra S. Bhattacharyya, Praveen K. Murthy, and Edward A. Lee. Software Synthesis from Dataflow Graphs. Kluwer Academic Publishers, Dordrecht, 1996.
5 Modeling Formalisms for Embedded System Design
Luís Gomes Universidade Nova de Lisboa and UNINOVA
João Paulo Barros Instituto Politécnico de Beja and UNINOVA
Anikó Costa Universidade Nova de Lisboa and UNINOVA
5.1 Introduction
5.2 Notions of Time
5.3 Communication Support
5.4 Common Modeling Formalisms
    Finite State Machines • Finite State Machines with Datapath • Statecharts and Hierarchical/Concurrent Finite State Machines • Program-State Machines • Codesign Finite State Machines • Specification and Description Language • Message Sequence Charts • Petri Nets • Discrete Event • Synchronous/Reactive Models • Dataflow Models
5.5 Conclusions
References
5.1 Introduction

The importance of the system specification phase is directly proportional to the respective system complexity. Embedded systems have become more and more complex, not only due to the increasing system dimension, but also due to the interactions among the different system design aspects. These include, among others, correctness, platform heterogeneity, performance, power consumption, cost, and time-to-market. Therefore, a multitude of modeling formalisms have been applied to embedded system design. Typically, these formalisms strive for a maximum of preciseness, as they rely on a mathematical (formal) model.

Modeling formalisms are often referred to as models of computation (MoC) [1–5]. An MoC is composed of a notation and of the rules for computation of the behavior: the notation is the syntax of the model, while the rules for computation define the model semantics. The usage of formal models in embedded system design allows (at least) one of the following [2]:

• Unambiguously capture the required system functionality.
• Verification of the correctness of the functional specification with respect to its desired properties.
• Support for synthesis onto a specific architecture and communication resources.
• Use of different tools based on the same model (supporting communication among the teams involved in designing, producing, and maintaining the system).

It has to be stressed that model-based verification of properties is a subject of major importance in embedded system design (and in system design in general), since it allows verifying model correctness even if the system does not (physically) exist, or if it is difficult, dangerous, or costly to analyze the system directly. The construction of a system model brings several advantages, as it forces a more complete comprehension of the system and allows the comparison of distinct approaches. Hence, it becomes easier to identify desired and undesired system properties, as the requirements become more precise and complete.

Most modeling formalisms for embedded system design are based on a particular diagrammatic (or graphical) language. Despite known arguments against diagrammatic languages (e.g., Reference 6), they are presently widely acknowledged as extremely useful and popular for software development, and also for embedded system development in general. The history of the Unified Modeling Language (UML), the Specification and Description Language (SDL), and Message Sequence Charts (MSCs) certainly proves it. Even though diagrammatic languages are often seen as inherently less precise than textual languages, this is certainly not true (see, e.g., References 7 and 8).

These diagrammatic representations are usually graph-based. Finite state machines (FSMs), in their different forms (Moore, Mealy) and extensions (hierarchical and concurrent, Statecharts, etc.), are a well-known example. The same is true for dataflows and Petri nets. These formalisms offer a variety of semantics for the modeling of time, communication, and concurrency. Besides distinct graphical syntaxes and semantics, different formalisms also have different analysis and verification capabilities.
The plethora of MoCs ready to be used by embedded system designers means that choosing the "best" formalism is a very difficult task for the modeler. Different embedded systems can, and often do, emphasize different aspects, namely the reactive nature associated with their behavior, real-time constraints, or data processing capabilities. The same happens with the available MoCs. For example, some MoCs for embedded systems are control dominated (data processing and computation are minimal), emphasizing the reactiveness of the system's behavior. Others emphasize data processing, containing complex data transformations, normally described by dataflows. Reactive control systems are in the first group and digital signal-processing applications are in the second: digital signal-processing applications emphasize the usage of dataflow models, whereas FSMs explicitly emphasize reactiveness.

Unfortunately, other aspects must also be considered when producing the model for a system; for example, the need to model specific notions of time, or different modes of communication among components, may further complicate the search for the "right" MoC. Moreover, in some embedded system designs heterogeneity in terms of the implementation platforms has to be faced, and it is not possible to find a unique formalism to model the whole system. In those situations, the goal is to decompose the system's model into submodels and to pick the right formalism for each submodel; at the end, however, the designer has to be able to integrate all those models in a coherent way [9].

Several formalisms allow the modeler to partition the system's model and describe it as a collection of communicating modules (components). In this sense, the modeling of behavior and the communication among components are often interdependent. Yet, separating behavior and communication is a sound attitude, as it allows handling system design complexity and the reusability of components.
In fact, it is very difficult to reuse components if behavior and communication are intertwined, as behavior then depends on the communication mechanisms with the other components of the design [2]. Modeling formalisms for embedded system design have been widely studied, and several reviews and textbooks about MoCs can be found in the literature [1–5]. This chapter surveys several modeling formalisms for embedded system design, taking Reference 5 as the main reference and expanding it to encompass a set of additional modeling formalisms widely used by embedded system designers in several application areas. The following sections address aspects of time representation and communication support. Afterwards, several selected modeling formalisms are presented.
5.2 Notions of Time

Embedded systems are often characterized as real-time systems. Thus the notion of time is extremely important in many of the modeling formalisms for embedded system design. Generally speaking, we may identify three approaches to time modeling:

1. Continuous time and differential equations.
2. Discrete time and difference equations.
3. Discrete events.

The first approach (see Figure 5.1[a]) uses differential equations to model continuous time functions. This approach is mostly used for the modeling of specific interface components, where the continuous nature of signal evolution is present, such as in the modeling of analog circuits and of physical systems in a broad sense.

In the second approach (Figure 5.1[b]), it is assumed that time is discrete; in this sense, difference equations replace differential equations. A global clock (the tick) defines the specific points in time where signals have values. For some applications involving heterogeneous components, it is also useful to consider multirate difference equations (which means that several clock signals are available). Digital signal processing is one of the main application areas.

In the third approach, a signal is seen as a sequence of events (see Figure 5.1[c]). This concept of events can be associated with the evolution of physical signals, as presented in Figure 5.2 for a Boolean signal; there, event x+ is associated with the rising edge of the signal x, while event x− is generated at all falling edges of signal x. The extension to other useful types of signals is straightforward, namely for signals that can hold multivalued, enumerated, or integer values.

Each event has a value and a time tag. The events are processed in chronological order, based on a predefined precedence. If the time tags are totally ordered [10], we are in the presence of a timed system: for any distinct t1 and t2, either t1 < t2 or t2 < t1 (this is called a total order).
It is possible to define an associated metric, for instance f(t1, t2) = |t1 − t2|. If the metric is a continuum, we have a continuous time system. A discrete-event system is a timed system where the time tags are totally ordered.
FIGURE 5.1 Time representations. (From Luís Gomes and João Paulo Barros, Models of Computation for Embedded Systems. In The Industrial Information Technology Handbook, Richard Zurawski Ed., Section VI — Real Time and Embedded Systems, chapter 83, CRC Press, Boca Raton, FL, 2005. With permission.)
FIGURE 5.2 From signals to events and conditions.
Two events are synchronous if they have the same time tag attached (they occur simultaneously). Similarly, two signals are synchronous if for each event in one signal, there is a synchronous event in the other signal, and vice versa. A system is synchronous if every signal in the system is synchronous with every other signal in the system. In this sense, a discrete-time system is a synchronous discrete-event system.

Totally ordered events are used by digital hardware simulators, namely the ones associated with the VHDL and Verilog hardware description languages. Any two events are either simultaneous, which means that they have the same time tag, or one of them precedes the other. Events are considered partially ordered when the order does not include all the events in the system. When tags are partially ordered instead of totally ordered, the system is untimed. This means that we can build several event sequences that do not contain all the system events; the missing events are included in other completely ordered event sequences. It is known [11] that a total order of the events cannot be maintained in distributed systems, where a partial order is sufficient to analyze system behavior. Partial orders have also been used to analyze Petri nets [12].

An asynchronous system is a system in which no two events can have the same tag [1]. The system is asynchronous interleaved if tags are totally ordered, and asynchronous concurrent if tags are partially ordered. As time is intrinsically continuous, real systems are asynchronous by nature. Yet, synchronicity is a very convenient abstraction, allowing efficient and robust implementations through the use of a reference "clock" signal.
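The total order on time tags is exactly what an event-driven simulator kernel exploits. The following is a minimal illustrative sketch (not an actual VHDL or Verilog kernel) of a discrete-event queue that processes events in time-tag order, using a sequence number to order simultaneous events deterministically:

```python
import heapq

# Each event is a (time_tag, seq, signal, value) tuple; the heap pops events
# in time-tag order, realizing the total order a timed discrete-event system
# needs. `seq` breaks ties deterministically for simultaneous events.
class EventQueue:
    def __init__(self):
        self._heap = []
        self._seq = 0

    def schedule(self, tag, signal, value):
        heapq.heappush(self._heap, (tag, self._seq, signal, value))
        self._seq += 1

    def run(self):
        trace = []
        while self._heap:
            tag, _, signal, value = heapq.heappop(self._heap)
            trace.append((tag, signal, value))
        return trace

q = EventQueue()
q.schedule(5.0, "x", 1)
q.schedule(2.0, "y", 0)
q.schedule(5.0, "z", 1)   # simultaneous with the first event (same tag)
trace = q.run()
print(trace)  # [(2.0, 'y', 0), (5.0, 'x', 1), (5.0, 'z', 1)]
```

Events scheduled out of order are still delivered in time-tag order; the two events tagged 5.0 are simultaneous in the model and are merely serialized by the kernel.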
5.3 Communication Support

The complexity of embedded systems usually motivates their decomposition into several interacting components. These can be more or less independent (for example, they can execute with true concurrency or in an interleaved way), but probably all will have to communicate with some other components. Therefore, communication is of utmost importance. It can be classified as implicit or explicit [3]:

• Implicit communication generally requires totally ordered tag events, normally associated with physical time. In order to support this form of communication it is necessary to have a physically shared signal (for instance, a clock signal), whose availability may be difficult or unfeasible in a large number of embedded system applications.
• Explicit communication imposes an order on the events: the sender process guarantees that all the receiver processes are informed about some part of its internal state.

The following models of communication are normally considered:

• Handshake, using a synchronization mechanism; all intervening components are blocked, waiting for conclusion.
• Message passing, using a send–receive pattern where the receiver waits for the message.
• Shared variables; blocking is decided by the control part of the memory where the shared variable is stored.

These communication modes are supported by a set of communication primitives (or by some combination of them), namely [3]:

• Unsynchronized: producer and consumer(s) are not synchronized; there are no guarantees that the producer does not overwrite previously produced data, or that the consumer(s) will get all produced data.
• Read-modify-write: this is the common way to access shared data structures from different processes in software; access to the data structure is locked during a data access (either read–write or read-modify-write), making it an atomic action (indivisible, and thus uninterruptible).
• Unbounded FIFO (first in, first out) buffered: the producer generates a sequence of data tokens and the consumer gets those tokens using a FIFO discipline.
• Bounded FIFO buffered: as in the previous case, but the buffer size is limited, so the difference between writes and reads is bounded by some value. This means that writes can block if the buffer is full.
• Petri net places: producers generate sequences of data tokens and consumers read those tokens.
• Rendezvous: the writing process and the reading process must simultaneously be at the point where the write and the read occur.
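As an illustration, the bounded FIFO primitive maps directly onto a blocking queue; in this sketch (illustrative, using Python's standard library) the producer blocks whenever the buffer already holds its maximum number of tokens:

```python
import queue
import threading

# Bounded FIFO between one producer and one consumer: the producer blocks
# once the buffer holds `maxsize` tokens, bounding the write/read difference.
buf = queue.Queue(maxsize=2)
consumed = []

def producer():
    for token in range(5):
        buf.put(token)               # blocks while the FIFO is full

def consumer():
    for _ in range(5):
        consumed.append(buf.get())   # blocks while the FIFO is empty

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
print(consumed)  # [0, 1, 2, 3, 4] -- FIFO order is preserved
```

With maxsize removed (an unbounded FIFO), the producer never blocks; with maxsize=1 and a handshake on every token, the scheme degenerates toward a rendezvous.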
5.4 Common Modeling Formalisms

Most modeling formalisms are control dominated or data dominated. However, as already mentioned, embedded systems are composed of a mixture of reactive behavior, control functions, and data processing, especially those targeted at networking and multimedia applications. In the following sections, a set of selected formalisms is presented, taking FSMs as the starting point, since they have proved adequate for the modeling of low- to medium-complexity control-dominated systems. We can find in the literature numerous proposals extending FSMs in several directions. Each extension tries to overcome one or more intrinsic FSM shortcomings, from the inability to model concurrency and the associated state-space explosion problem, to data processing modeling and the absence of hierarchical structuring mechanisms (supporting specification at different levels of abstraction). After the control-dominated formalisms (emphasizing the reactive nature of embedded systems), dataflow-dominated formalisms will be presented.
5.4.1 Finite State Machines

Finite state machines are common computational models that have been used by system designers for decades. FSMs can be represented in different ways, from graphical representations (like state diagrams and flowcharts) to textual representations. In this chapter, state diagrams are used. The modeling attitude is based on the characterization of the system in terms of the global states that the system can exhibit, and in terms of the conditions that can cause a change in those states (transitions between states).

A basic FSM consists of a finite set of states S (with a specified initial state, sis), a set of input signals I, a set of output signals O, an output function f, and a next-state function h. The output and next-state functions map the cross-product of S and I into O and S, respectively (f: S × I → O, h: S × I → S). Two basic models can be considered for output modeling: the Moore-type machine [13], also called state-based FSM, where outputs are associated with state activation (and where the output function f maps states S directly into outputs O), and the Mealy-type machine [14], also called transition-based FSM, where outputs are associated with transitions between states. It is important to note that both models have the same modeling capabilities. This basic FSM model can be restricted or extended to accommodate different needs (specific modeling capabilities or target architectures), as analyzed in some of the following sections.

Figure 5.3 illustrates a basic notation for a state diagram. Circles or ellipses represent states; transitions between states are represented by directed arcs. Each arc has an attached expression, potentially containing a reference to the input event and/or to an external condition that will cause the change of state.
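The definitions above can be written down directly. The following minimal sketch (illustrative, not from the chapter) implements a transition-based (Mealy) FSM, with h as the next-state function and f as the output function:

```python
# Minimal Mealy-type FSM: h computes the next state and f the output,
# both from the current state and input (f: S x I -> O, h: S x I -> S).
class MealyFSM:
    def __init__(self, h, f, initial_state):
        self.h = h          # next-state function h: S x I -> S
        self.f = f          # output function    f: S x I -> O
        self.state = initial_state

    def step(self, inp):
        out = self.f(self.state, inp)
        self.state = self.h(self.state, inp)
        return out

# Example: a two-state machine that emits "tick" on every second 'a' input.
h = lambda s, i: (s + 1) % 2 if i == "a" else s
f = lambda s, i: "tick" if (s == 1 and i == "a") else None

fsm = MealyFSM(h, f, initial_state=0)
outputs = [fsm.step(i) for i in ["a", "a", "b", "a", "a"]]
print(outputs)  # [None, 'tick', None, None, 'tick']
```

A Moore-type machine is obtained by making f ignore the input (f: S → O); as noted above, the two variants have the same modeling power.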
Outputs can be modeled as Moore-type output actions (associated with states, such as z in state S2), or as Mealy-type output events (associated with transitions, such as x in the presented transition expression).

FSMs are a control-dominated MoC, and so intrinsically adequate to model the reactive component of an embedded system. We will introduce a running example, adapted from Reference 15, and we will start by using an FSM to model the system controller. The system to be modeled is the controller of an electric car installed in an industrial plant. The electric car has to carry goods from one point to another, and come back. The controller receives commands from the operator, namely actuation of the key "GO" to start the movement from the home position, and actuation of the key "BACK" to force the car to return to the home position after
© 2006 by Taylor & Francis Group, LLC
5-6
Embedded Systems Handbook
FIGURE 5.3 State diagram basic notation (states S0, S1, S2; a transition arc carries an expression such as a(C)/x).
FIGURE 5.4 Electric car plant running example: (a) external view of the controller, with inputs GO and BACK from the operator, A and B from the plant, and outputs M and DIR to the motor; (b) layout of the plant.
FIGURE 5.5 State diagram models of an electric car plant controller: (a) condition-based model; (b) event-based model. (From Luís Gomes and João Paulo Barros, Models of Computation for Embedded Systems. In The Industrial Information Technology Handbook, Richard Zurawski, Ed., Section VI — Real Time and Embedded Systems, chapter 83, CRC Press, Boca Raton, FL, 2005. With permission.)
the end position has been reached. After receiving a command, the car motor is activated until the initial or the final position is reached. Two sensors, A and B, detect that the home and end positions, respectively, have been reached. Figure 5.4(a) represents the external view of the controller in terms of inputs and outputs, and Figure 5.4(b) illustrates the layout of the plant. Figure 5.5(a) and (b) present two possible (and equivalent) models for the control of this system. The first relies on the evaluation of external conditions (signal values are explicitly checked), while the second relies on external events (obtained through preprocessing of the external signals). Clearly, using events produces a lighter model, with fewer arcs and inscriptions (it is assumed in this representation that the event associated with a signal is generated when the signal changes from "0" to "1").
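The event-based model can be sketched in code; the following is an illustrative Python rendering (not from the original text) of the four-state, event-driven controller of Figure 5.5(b), with Moore outputs M and DIR attached to states:

```python
# Sketch of the event-based electric car controller: events GO, B,
# BACK, A trigger transitions; outputs M and DIR are Moore-type,
# attached to states. State and signal names follow the text.
STATES = {
    # state: (M, DIR); DIR is None while the motor is stopped
    "S0": (0, None),       # at home position, motor off
    "S1": (1, "right"),    # moving toward the end position
    "S2": (0, None),       # at end position, waiting for BACK
    "S3": (1, "left"),     # returning to home position
}
TRANSITIONS = {("S0", "GO"): "S1", ("S1", "B"): "S2",
               ("S2", "BACK"): "S3", ("S3", "A"): "S0"}

def step(state, event):
    """Advance on one event; events with no transition are ignored."""
    state = TRANSITIONS.get((state, event), state)
    m, direction = STATES[state]
    return state, m, direction
```

For example, a GO event in S0 moves the machine to S1 and activates the motor to the right; a GO event in S2 is simply ignored.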
FIGURE 5.6 FSM implementation reference model (inputs and the current state feed the next-state function and the output functions; state variables hold the current state; Mealy outputs depend on state and inputs, Moore outputs on state alone).
From the point of view of the implementation model, it is common to decompose the system into a set of functions that compute the next state and the outputs, and a set of state variables, as presented in Figure 5.6. From the execution semantics point of view, one of two reference approaches can be chosen (corresponding to different MoCs) [2]: 1. Synchronous FSMs. 2. Asynchronous FSMs. In synchronous FSMs, both computation and communication happen instantaneously at discrete-time instants (under the control of clock ticks). In this sense, from the point of view of active state changes, each transition arc expression is implicitly "ANDed" with a rising (or falling) edge event of the clock signal. Referring to Figure 5.6, the clock signal will be connected to the "State variables" block (for hardware implementations, a register will be used to implement this block, while for software implementations the clock will be used to trigger the execution cycle). One strong argument for using synchronous FSMs is their implementation robustness, especially when using synchronous hardware. However, when heterogeneous implementations are foreseen, some difficulties or inefficiencies may arise (namely synchronous clock signal distribution). For distributed heterogeneous systems, it is also of interest to consider a globally asynchronous locally synchronous approach (GALS systems), where the interaction between components is asynchronous, although the implementation of each component is synchronous. So, within a synchronous implementation island (a component), it is possible to rely on robust compilation techniques, either to optimally map FSMs into Boolean and sequential circuits (hardware) or into software code (supported by specific tools). In asynchronous FSMs, process behavior is similar to that of synchronous FSMs, but without dependency on a clock tick. An asynchronous system is a system in which two events cannot have the same time tag.
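The synchronous execution cycle just described can be sketched as a tick-driven loop; this is an illustrative Python sketch (not from the original text), where `next_state` and `output` stand for the h and f blocks of Figure 5.6:

```python
# Sketch of synchronous FSM execution: on every clock tick the inputs
# are sampled and the next-state and output functions are evaluated;
# between ticks nothing changes. One list element = one tick.

def run_synchronous(next_state, output, initial_state, input_trace):
    """Run one step per clock tick over a trace of sampled inputs."""
    state, outputs = initial_state, []
    for sampled_inputs in input_trace:       # each item is one tick
        outputs.append(output(state, sampled_inputs))
        state = next_state(state, sampled_inputs)
    return state, outputs

# A toy toggler: the state flips whenever the sampled input is 1,
# and the (Moore) output is the current state.
toggle_next = lambda s, i: 1 - s if i else s
toggle_out  = lambda s, i: s
```

In an asynchronous setting there is no such shared tick: each machine would consume its events independently, one transition at a time (interleaving).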
In this sense, two asynchronous FSMs never execute a transition at the same time (asynchronous interleaving). For heterogeneous architectures or for multirate specifications, implementation can be easier than in the synchronous case. The difficulties come from the need to synchronize communicating transitions and to assure that they occur at the same instant, which is essential for a correct implementation of rendezvous on a distributed architecture. FSMs have well-known strengths and weaknesses. Among the strengths, we should mention that they are simple and intuitive to understand, and that they benefit from the availability of robust compilation tools. These are some of the reasons why designers have extensively used them in the past, and continue to use them. Unfortunately, several weaknesses prevent their usage for complex systems modeling. Namely, FSMs do not provide data processing capabilities, support for concurrency modeling, (practical) support
FIGURE 5.7 Control and datapath decomposition (external control inputs and outputs connect to the control part; external data inputs and outputs connect to the datapath; datapath control inputs and outputs link the two parts).
for data memory, or hierarchical constructs. Several of the modeling formalisms to be presented try to overcome one, some, or all of these weaknesses.
5.4.2 Finite State Machines with Datapath One common extension to FSMs, intended to cope with the lack of support for data memory and data processing capabilities, is the Finite State Machine with Datapath (FSMD) [16]. For instance, to model an 8-bit variable with 256 possible values through an FSM, it is necessary to use 256 states; the model loses its expressiveness and the designer can no longer manage the specification. An FSMD adds to a basic FSM a set of variables and redefines the next-state and output functions. So, an FSMD consists of a finite set of states S (with a specified initial state, sis), a set of input signals I, a set of output signals O, a set of variables V, an output function f, and a next-state function h. The next-state function h maps a crossproduct of S, I, and V into S (h: S × I × V → S). The output function f maps current states to outputs and variables (f : S → O + V). As defined, the output function f only supports Moore-type outputs; it can easily be extended to accommodate Mealy-type outputs as well. From the implementation point of view, an FSMD model is decomposed as presented in Figure 5.7, where the control part can be represented by a simple FSM model, and the datapath part can be characterized through a register transfer architecture. So, the datapath is decomposed into a set of variables to store operands and results, and a set of processing blocks to perform computations on those values. It should be stressed that this is the common reference architecture for single-purpose processor and simple microprocessor designs. As a simple example, Figure 5.8 presents the decomposition associated with the modeling of a multiplier of two numbers, A and B, producing the result C through successive additions.
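The 8-bit example can be made concrete with a small sketch (not from the original text): a plain FSM would need one state per counter value, whereas an FSMD keeps a single control state plus a variable that the next-state function may read; Python and all names here are illustrative.

```python
# Sketch contrasting FSM and FSMD for an 8-bit counter (the 256-state
# example in the text). The FSMD keeps one control state plus a
# datapath variable `count`; the next-state function reads the
# variable (h: S x I x V -> S) and the step updates it.

def fsmd_counter_step(state, count, inp):
    """One FSMD step; state is 'IDLE' or 'COUNT', count is 0..255."""
    if state == "IDLE":
        # a 'start' input clears the variable and enters counting
        return ("COUNT", 0) if inp == "start" else ("IDLE", count)
    # COUNT: increment the 8-bit variable; return to IDLE on wrap
    count = (count + 1) % 256
    return ("IDLE", count) if count == 0 else ("COUNT", count)
```

Two control states and one variable replace the 257 explicit states a pure FSM would need.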
Figure 5.8(a) presents the top-level decomposition and interconnections of the control and data blocks, while Figure 5.8(c) presents a simple FSM to model the control part and Figure 5.8(d) shows the register transfer architecture to support the required computations (the left-hand side is responsible for counting B times, while the right-hand side is responsible for the successive additions of A into C).
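The control/datapath split of the multiplier can be sketched behaviorally; the following Python sketch (illustrative, not from the original text) uses the state and register names of Figure 5.8, with the control FSM issuing datapath commands:

```python
# Sketch of the Figure 5.8 multiplier: C = A * B computed by
# successive additions. The while-loop plays the control FSM; the
# `regs` dictionary stands for the datapath registers (RA, RB hold
# the operands, CB counts iterations, RC accumulates the result).

def multiply(a, b):
    regs = {"RA": 0, "RB": 0, "CB": 0, "RC": 0}
    state = "S1"
    while state != "S4":                 # S4: done (STOP)
        if state == "S1":                # LOAD_A, LOAD_B, clear counter/result
            regs.update(RA=a, RB=b, CB=0, RC=0)
            state = "S2"
        elif state == "S2":              # decision: added A into C, B times?
            state = "S4" if regs["CB"] == regs["RB"] else "S3"
        elif state == "S3":              # LOAD_C (accumulate A), INC_B (count)
            regs["RC"] += regs["RA"]
            regs["CB"] += 1
            state = "S2"
    return regs["RC"]
```

The control part only sequences register operations; all arithmetic lives in the datapath, mirroring the decomposition in the figure.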
5.4.3 Statecharts and Hierarchical/Concurrent Finite State Machines A second common extension to FSMs tries to cope with the lack of support for concurrency and hierarchical structuring mechanisms of the model (still emphasizing the reactive part of the model).
FIGURE 5.8 Decomposition of a multiplier into control and datapath: (a) top-level view with control part and datapath, connected by control signals (LOAD_A, LOAD_B, CLEAR_B, INC_B, CLEAR_C, LOAD_C, STOP); (b) the computation C = A × B as C = A + A + ... + A, B times; (c) control FSM with states S0 to S4; (d) register transfer datapath with registers RA, RB, CB, RC, an adder, and a comparator producing STOP.
Several formalisms can be included in the group of hierarchical/concurrent finite state machines (HCFSMs), all of them including mechanisms for concurrency and hierarchy support, but having different execution semantics. Among them, Statecharts [7,17] are the best-known modeling formalism providing a MoC to specify complex reactive systems. One main advantage of Statecharts over FSMs is the structuring of the specification, which improves legibility and eases system maintenance. These characteristics were key points supporting its adoption as one of the specification formalisms within the UML [18–20]. Statecharts are based on state diagrams, plus the notions of hierarchy, parallelism, and communication between parallel components. Statecharts were informally defined in [17] as "Statecharts = state-diagrams + depth + orthogonality + broadcast-communication." The depth concept encapsulates the multilevel hierarchical structuring mechanism and is supported by the XOR refinement mechanism, while the orthogonality concept allows concurrency modeling and is supported by the AND refinement mechanism. Unfortunately, the semantics of the broadcast-communication mechanism is not the same in all Statecharts variants, as it was defined in different ways by several authors. This fact had a strong impact on possible Statecharts operational semantics, as discussed later in this section. Statecharts define three types of state instances: the set (implementing the AND refinement mechanism), the cluster (implementing the XOR refinement mechanism), and the simple state. The cluster supports the hierarchy concept, through encapsulation of state machines. The set supports the concurrency concept, through parallel execution of clusters. Figure 5.9 illustrates the usage of the cluster mechanism, adopting a bottom-up approach. Starting with the SYS_C model, the state diagram composed of states C and D and the associated arcs can be encapsulated by the state A, as represented in the SYS_B model.
This provides a top-level view of the model, composed only of the states A and B, complemented by the inner level if one wants further details about the system behavior, as represented in the SYS_A model. In this sense, the designer can describe the system at different levels of abstraction. The designer is free to follow a top-down or
FIGURE 5.9 Usage of XOR refinement in Statecharts basic models (models SYS_A, SYS_B, and SYS_C).
FIGURE 5.10 Usage of AND refinement in Statecharts basic models. (From Luís Gomes and João Paulo Barros, Models of Computation for Embedded Systems. In The Industrial Information Technology Handbook, Richard Zurawski, Ed., Section VI — Real Time and Embedded Systems, chapter 83, CRC Press, Boca Raton, FL, 2005. With permission.)
a bottom-up approach while producing the system's model by applying the hierarchical decomposition constructs available through the XOR refinement mechanism. Figure 5.10 presents a simple model containing a set A composed of three AND components (B, C, and D); whenever A is activated or deactivated, the associated components B, C, and D will also be activated or deactivated. Besides the main characteristics referred to above, the Statecharts formalism presents some interesting features, such as: • The default state, which defines the state that takes control when a transition reaches a cluster state. In Figure 5.9, SYS_B model, the system initial state is state B, and, after the occurrence of w, states A and C will become active.
• The notion of history, simple or deep, which can be associated with cluster state instances. When the system enters a cluster with the history property, the state that will be active upon entrance is the one that was active upon the last exit from that cluster. In the case of the first entrance into the cluster, the active state will be the default one. This is the case for the cluster C of Figure 5.10, which holds the "H" attribute inside a circle. The history property can also be deep history, meaning that all the clusters inside a cluster with the deep-history property also have that property; this is the case for cluster B in Figure 5.10, which holds the "H∗" attribute inside a circle. • The activation expressions present in transitions, which can also have special events, special conditions, and special actions. Special events include the entered(state) event and the exited(state) event, which are generated whenever the system enters or exits some state, respectively. One special condition is in(state), which indicates whether the system is currently in the specified state. Examples of special actions are clear_history(state) and clear_deep_history(state), which initialize a cluster's history state to the default state. So far we have only characterized syntactic issues of the Statecharts formalism. Its semantic characterization includes, among other things, the characterization of the step algorithm and the definition of the criteria for solving conflicts (which set of triggering transitions to choose among a set of conflicting transitions), in order to obtain a deterministic behavior. The step algorithm is of fundamental importance, as it dictates the way the system evolves from one global state to another. The triggering of a finite set of transitions causes the evolution of the system. In Reference 21, the triggering of this set of transitions is called a microstep (conceptually, it can be seen as equivalent to the delta concept used in discrete-event formalisms).
Transitions fired in a microstep can generate internal events that, in turn, can fire other transitions. Thus, a step is defined as a finite sequence of microsteps. This means that, after all the microsteps have taken place, there will be no more active transitions and the system will remain in a stable state. External events are not considered during a step, so one must assume that a system can completely react to an external event before the occurrence of the next external event. The initial Statecharts proposal left their semantics partially open [22]. As a consequence, different semantics associated with the handling of broadcast events have been considered possible. Statecharts semantics proposals are discussed in Reference 23. Yet, as pointed out by Harel and Naamad [22], that survey does not mention the semantics implemented in the STATEMATE tool (in the following citation the original reference citations were replaced by the citation numbering used in this chapter): "the survey [23] does not mention the STATEMATE implementation of Statecharts or the semantics adopted for it at all, although this semantics is different from the ones surveyed therein (and was developed earlier than all of them except for [22])." Even so, the most common semantics consider that a generated event will only be valid during the next microstep (the more common choice) or for the remaining microsteps within the same step (persistent events). Coming back to the example of the electric car controller, if one considers controlling two or three of those cars using the same controller, and tries to use a state diagram to model the whole system, he or she will face the well-known problem of state explosion, due to the orthogonal composition of state diagrams: if we have two independent state diagrams with N1 and N2 states, the composed state diagram will contain N1 × N2 states, as for each state of machine 1, machine 2 may be in any of its states.
This illustrates the already presented weakness resulting from the FSM's lack of support for concurrency modeling. Now consider that we want to keep the cars synchronized at both ends, which means that we will wait for all cars at each end before launching them in the opposite direction. The associated FSM model is presented in Figure 5.11, where the state explosion phenomenon is observed, along with the lack of intuitive legibility of the resulting model. Figure 5.12 presents a Statecharts model (TWO_CARS) enabling a compact modeling of this system. The system is composed of an AND set (TWO_CARS) of two XOR clusters (CAR1 and CAR2), each modeling one of the cars. Note also that the rounded rectangle representation for states takes advantage of the available area, allowing a more readable graphical notation (compared with the circle representation of common state diagrams). Additional constraints associated with global start and global
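The way the orthogonal (AND) composition avoids enumerating the product states can be sketched in code; the following Python sketch (illustrative, not from the original text) runs two copies of the four-state car machine side by side, with the [in(...)] guards of Figure 5.12 on the synchronized GO/BACK transitions:

```python
# Sketch of the Figure 5.12 Statecharts model: two orthogonal car
# machines share one transition table; GO and BACK fire only when
# both cars are ready (the [in(S0x)] / [in(S2x)] guards), while the
# per-car sensor events A and B affect only the indexed car.

CAR_NEXT = {("S0", "GO"): "S1", ("S1", "B"): "S2",
            ("S2", "BACK"): "S3", ("S3", "A"): "S0"}

def broadcast(states, event):
    """Deliver an event to the pair of car machines."""
    new = list(states)
    if event in ("GO", "BACK"):
        # synchronized: both cars must be in the waiting state
        src = "S0" if event == "GO" else "S2"
        if states[0] == src and states[1] == src:
            new = [CAR_NEXT[(s, event)] for s in states]
    else:
        kind, i = event                 # e.g. ("B", 0): sensor B of car 0
        if (states[i], kind) in CAR_NEXT:
            new[i] = CAR_NEXT[(states[i], kind)]
    return tuple(new)
```

Each machine keeps its own four states, so the sixteen product states are never written down explicitly; the guards enforce the wait-for-both behavior.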
FIGURE 5.11 An FSM model for the two-car controller system (product states S0 to S7 over the signals GO, BACK, A1, A2, B1, B2, M1, M2, DIR1, DIR2).
FIGURE 5.12 Simple Statecharts model for the electric cars example (AND set TWO_CARS containing the XOR clusters CAR1 and CAR2, with guards such as GO[in(S02)] and BACK[in(S21)] on the synchronized transitions).
return are explicitly modeled through specific arc inscriptions; for instance, the arc expression leaving state S01 is augmented with the condition [in(S02)], imposing that CAR1 only starts its travel if CAR2 is also in its original position (an identical constraint applies to CAR2). The reader should refer to References 7, 17, and 21 for further details concerning the Statecharts formalism.
5.4.4 Program-State Machines A third extension to FSMs is the program-state machine (PSM) model, which allows the use of sequential program code to define a state action. A PSM includes the hierarchy and concurrency extensions of
FIGURE 5.13 Program-state machine example (leaf, sequential, and concurrent program-states PS1 to PS6; a leaf state is defined by sequential code).

5.4.5 Codesign Finite State Machines

FIGURE 5.14 Simple CFSM and associated SHIFT representation.
From the implementation point of view, a system described by a set of CFSMs can be decomposed into a set of CFSMs to be implemented in software, a set of CFSMs to be implemented in hardware, and a communication interface between them. Any high-level language with precise semantics based on extended FSMs can be used to model individual CFSMs; currently, Esterel is directly supported. The CFSM model is strongly connected with the POLIS approach [27], and can be represented using a textual notation named SHIFT (Software–Hardware Intermediate FormaT), which extends BLIF-MV [28], in turn a multivalued version and extension of the Berkeley Logic Interchange Format (BLIF). A CFSM representation in SHIFT is composed of a set of input signals, a set of output signals, a set of states or feedback signals, a set of initial values for output and state signals, and a transition relation describing the reactive behavior. Figure 5.14 presents a simple CFSM and the associated SHIFT textual representation.
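The five SHIFT components listed above can be sketched as a data structure plus a row-matching reaction function; this Python sketch is illustrative only (the concrete signal names and table rows are invented, loosely in the style of BLIF-MV tables with "-" as a don't-care):

```python
# Hedged sketch of a SHIFT-style CFSM description: inputs, outputs,
# state/feedback signals, initial values, and a transition relation
# listed row by row. "-" matches any value, as in don't-care entries.

cfsm = {
    "inputs":  ["a"],
    "outputs": ["x"],
    "states":  ["st"],
    "init":    {"st": 0, "x": 0},
    # rows: (a, st) -> (st', x)
    "trans":   [(("0", "0"), ("0", "0")),
                (("1", "0"), ("1", "1")),
                (("-", "1"), ("0", "0"))],
}

def react(machine, a, st):
    """Return (next state, output) from the first matching row."""
    for (ra, rst), (nst, out) in machine["trans"]:
        if ra in ("-", str(a)) and rst in ("-", str(st)):
            return int(nst), int(out)
    return st, 0          # no row matched: hold state, default output
```

The reactive behavior of the whole CFSM network would then come from exchanging events between such tables through the software/hardware communication interface.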
5.4.6 Specification and Description Language Specification and Description Language (SDL) is a modeling language standardized by the ITU (International Telecommunication Union) [29]. Basically, an SDL model is a set of FSMs running in parallel. These machines have their own input queues and communicate using discrete signals. The signals can be asynchronous or synchronous; the latter correspond to synchronous remote procedure calls. All signals can have associated parameters to interchange and synchronize information between SDL processes, and also between these and the environment. SDL has been designed for the development of concurrent, reactive, real-time, distributed, and heterogeneous systems. Yet, by and large, the major application area for SDL is telecommunications. It is the language usually chosen to define and standardize communication protocols. The specifications of many communications systems use SDL. These include the GSM second-generation mobile telephony system, the UMTS third-generation mobile telephony system, the ETSI HiperLAN 2 broadband radio access network, and the IEEE 802.11 wireless Ethernet local area network [30]. The language has been evolving since the first ITU Z.100 Recommendation in 1980, with updates in 1984, 1988, 1992, 1996, and 1999. The 1992 version added some object-oriented concepts. These were significantly expanded in the latest version (SDL-2000) toward a better integration with the UML [31]. This trend should continue as SDL is increasingly being used together with UML. An article by Rick Reed [32] provides a quite comprehensive overview of SDL's history up to 1999. More recent developments, and all other types of SDL-related information, are available at the SDL Forum Society Internet site [33]. In particular, SDL-2000 [29] breaks compatibility with SDL-96 and adds numerous significant modifications. Here, we restrict ourselves to the main fundamental concepts, thus avoiding the main incompatibility issues.
SDL has four structuring concepts: system, blocks, processes, and procedures. Besides these structuring concepts, SDL has two communication-related constructs: 1. Channels (operate at system level, connecting blocks). 2. Signal routes (operate inside blocks, connecting processes). The system is the entry point of an SDL specification. Its main components are blocks and channels. Channels connect blocks with each other and also with the environment. Communication is only possible along the specified channels. These can be unidirectional or bidirectional. By default, channels are free of errors and preserve the order of the transmitted signals. These properties can be modified through the definition of channel substructures. A block is a static structure defined as a set of interrelated blocks or as a set of interrelated processes. In SDL-2000, system, blocks, and processes were harmonized as agents. The state machines are addressed by the name of the enclosing agent. Agents may include their own state machine, other agents, and shared data declarations. In SDL-2000, blocks can contain other blocks together with processes. Yet, processes still cannot contain blocks or systems. The main difference between blocks and processes is in the implicit concurrency type: blocks are truly concurrent, while agents inside processes have interleaved semantics. Each process has its own signal queue and usually only contains a state machine. In contrast, blocks are usually used as structuring diagrams showing other blocks and processes. The system is the top-level block that interfaces with the environment. Figure 5.15 shows an SDL block named controller. This block, together with the process car in Figure 5.16, models the example system of one electric car controller, already presented in Figure 5.4. The controller is defined as an SDL block.
As it is the only block, it can be seen as the system (more precisely, the system contains a single block named controller). The numbers on the top-right corner specify the page number being shown and the total number of pages in the current level. The block defines the signals inside a text element (a rectangle with a folded top-right corner). Note that the two output signals M and DIR carry data (an Integer parameter). These are used to assign values to the outputs. Besides the text element, the controller block contains a single process named car. This process is connected to the environment through two input channels (passage and gb) and two output channels
FIGURE 5.15 A controller for one electric car (block controller declaring signals A, B, GO, BACK, M(Integer), and DIR(Integer), with process car connected through channels passage, gb, motor, and direction).
FIGURE 5.16 The process car (states S0 to S3; declarations right = 0, left = 1; outputs M and DIR emitted on the GO, B, BACK, and A signals).
(motor and direction). The signals handled by the channels are specified as signal lists (inside square brackets). For example, the passage channel handles two signals: A and B. Notice the graphical similarity between the process car notation and the controller's external view in terms of inputs and outputs, as in Figure 5.4(a). Figure 5.16 shows the car process. It starts with a start symbol in the top-left corner and immediately proceeds to state S0. The process waits for the reception of signal GO on its input queue and emits two output signals: one assigns 1 to M and the other assigns right (defined as the value zero) to the DIR output signal. It stops in state S1. Afterward, the process waits for the B signal, assigns zero to the M output, and arrives at state S2. It then proceeds according to the presented example until reaching the initial state S0. We now present a generalized version for N car processes, considering N = 3. Figure 5.17 shows a block controller able to model three cars, which synchronize on the GO and BACK signals. Now, the process car carries the notation car(3,3), which specifies that three instances are created on startup and a maximum of three can exist. Additionally, we define another process named Sync. This process forces the synchronization of the three parallel car processes on the GO and BACK events. A car process (see Figure 5.18) starts by registering itself with the Sync process by passing its process identifier (PId) in the signal carPId through channel register. When each car process is ready to receive the GOSync or the BACKSync signals from process Sync (which means that the car is ready at its home or end position), it sends a GOok or a BACKok signal through channel ok. Afterwards, each car process can receive the respective GOSync or BACKSync signal from process Sync through channel gbSync. The remaining signals are now indexed by the respective process identifiers (a PId type parameter).
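The core SDL process semantics (a state machine consuming its own FIFO input queue and emitting signals) can be sketched as follows; this Python rendering of the process car of Figure 5.16 is illustrative, not part of the SDL specification:

```python
# Sketch of the SDL process `car` of Figure 5.16: the process takes
# signals one at a time from its own FIFO input queue and emits
# output signals (M and DIR carry an Integer parameter, as declared
# in the block). In SDL's default semantics, a signal with no
# matching transition in the current state is implicitly consumed
# and discarded; that is modeled here by simply dropping it.
from collections import deque

RIGHT, LEFT = 0, 1

def run_car(queue):
    state, emitted = "S0", []
    while queue:
        sig = queue.popleft()                 # consume from the FIFO
        if state == "S0" and sig == "GO":
            emitted += [("M", 1), ("DIR", RIGHT)]; state = "S1"
        elif state == "S1" and sig == "B":
            emitted += [("M", 0)]; state = "S2"
        elif state == "S2" and sig == "BACK":
            emitted += [("M", 1), ("DIR", LEFT)]; state = "S3"
        elif state == "S3" and sig == "A":
            emitted += [("M", 0)]; state = "S0"
    return state, emitted
```

Feeding the queue GO, B, BACK, A drives the process around one full cycle and back to S0, emitting the motor and direction assignments along the way.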
This allows a simple way to model an arbitrary number of car processes. In the presented example, the initial, and also the maximum, number of car process instances is three. The Sync process (see Figure 5.19) starts by registering the N process car instances: it receives N carPId signals carrying the pIds, each sent by a different car process. To this end, the process defines an Array data structure named cars, which acts as a list. When all the N car processes have sent their GOok signal, the Sync process, after reception of the GO signal, sends a GOSync signal to each of the pIds in the cars Array. The same happens for the BACKok and BACKSync signals. Differently from
FIGURE 5.17 A controller for three cars synchronizing in the GO and BACK events (block controller with process car(3,3) and process Sync, connected through channels passage, gb, motor, direction, register, ok, and gbSync).
the previous processes, this Sync process uses tasks: the rectangles with language instructions, for example, c := c + 1. It also uses decision nodes. These are lozenges with an associated condition (c = N) and two possible outcomes (true and false). A process can also invoke procedures. A procedure is executed sequentially while the caller process waits for it to terminate. When the procedure terminates, the process continues from the symbol following the procedure invocation. This mimics what happens in a typical procedural programming language (e.g., PASCAL). Accordingly, SDL procedures can also return a value and have parameters passed by value as well as by reference. Specification and Description Language is usually used in combination with other languages, namely MSCs [34], ASN.1 (Abstract Syntax Notation One), and TTCN (Testing and Test Control Notation). ASN.1 is an international standard for describing the structure of data exchanged between communicating systems (ITU-T X.680 to X.683 | ISO/IEC 8824-1 to 4). TTCN is a language used to write detailed test specifications. The latest version of the language, TTCN version 3 (TTCN-3), is standardized by ETSI (ES 201 873 series) and the ITU-T (Z.140 series). The use of the object model notation of SDL-2000 in combination with MSCs, traditional SDL state models, and ASN.1 is powerful and covers most system engineering aspects. These languages have been studied in the same group within the ITU: • ITU Z.105 (11/99) and Z.107 (11/99) standards define the use of SDL with ASN.1. • ITU Z.109 (11/99) standard defines a UML profile for SDL. • ITU Z.120 (11/99) standard defines MSCs.
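The counting logic of the Sync process (the task c := c + 1 and the decision c = N) is essentially a barrier. The following Python sketch (illustrative, not from the original text) captures just that logic, with the names taken from the figures and SDL's scheduling details omitted:

```python
# Sketch of the barrier logic of the Sync process (Figure 5.19):
# count GOok signals with the task c := c + 1; when the decision
# c = N becomes true, reset c and broadcast GOSync to every
# registered car pId. The BACKok/BACKSync side is symmetric.

def sync_barrier(cars, ok_signals, n):
    """Return the GOSync broadcasts once all n cars have reported GOok."""
    c = 0
    for _ in ok_signals:
        c += 1                         # task: c := c + 1
        if c == n:                     # decision: c = N
            c = 0                      # task: c := 0
            return [("GOSync", pid) for pid in cars]
    return []                          # still waiting for stragglers
```

Only when the third GOok arrives does the broadcast happen; with fewer acknowledgments the barrier keeps waiting, which is exactly the synchronization the block in Figure 5.17 enforces.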
FIGURE 5.18 The new process car (registering via carPId(self), exchanging GOok/GOSync and BACKok/BACKSync, and indexing the A, B, M, and DIR signals by process identifier).
5.4.7 Message Sequence Charts Message Sequence Chart (MSC) is a scenario language standardized by the ITU [34]. Although originally invented in the telecommunications domain, nowadays it is widely used in embedded, real-time, and distributed systems, among others. It is not tailored to any particular application domain. Before its approval at the ITU meeting in Geneva in 1992, the MSC recommendation was used merely as an informal language. After being approved as a standard, MSC usage has increased in numerous application areas. After the first revision in 1996, the object-oriented community showed an increasing interest in the MSC standard, namely for use case formalization [35]. Basically, MSC is a graphical and textual language used for the description and specification of interactions between system components. It is used in the early stages of system specification to represent system behavior. In the present chapter, we focus on the MSC graphical representation. It is a simple, two-dimensional diagram giving an overview of the system behavior. The behavior is characterized through messages exchanged between the system entities. Usually, these entities are called instances or processes and can be any part of the system: a subsystem, a component, a process, an object, or a user, among others. The entities are represented by vertical lines and the messages between them by oriented arrows from the sender to the receiver.
FIGURE 5.19 Sync process (states S0 to S4; declarations N = 3, c = 0, and the cars Array; tasks such as c := c + 1 and decisions on c = N).

FIGURE 5.20 An elementary MSC (instances A, B, and C exchanging messages m_ab, m_ac, m_b, and m_c).
Figure 5.20 shows an example of an elementary MSC in graphical form. Here, we have three entities, A, B, and C, represented by vertical lines (axes). Each axis starts in the "instance head symbol" (a nonfilled rectangle) and ends in the "instance end symbol" (a filled rectangle). The messages exchanged between the instances are represented by oriented labeled arrows (e.g., A sends the message m_ab to B). Time has no explicit representation in MSC diagrams, but we can assume that time flows along the vertical axis, in the top-down direction. Yet, it is not possible to represent the exact time when events occur or messages arrive; it is only possible to establish a partial order between messages. Within basic MSC models, we can also represent conditions and timers. Conditions are states that must be reached before the "execution" can continue. They are represented as informal text in a hexagonal box.
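The partial order an MSC defines can be made explicit: events on one instance axis are totally ordered top-down, and each message orders its send before its receive; everything else follows by transitivity. The following Python sketch (illustrative, not from the original text) encodes Figure 5.20 this way, with `!m`/`?m` denoting the send and receive of message m:

```python
# Sketch of the MSC partial order: per-axis order plus send-before-
# receive for each message, closed transitively. Pairs not related
# by the closure are unordered, as the text states.

# per-instance event sequences, top to bottom (Figure 5.20)
axes = {"A": ["!m_ab", "!m_ac"],
        "B": ["?m_ab", "!m_b"],
        "C": ["?m_ac", "!m_c"]}
messages = [("!m_ab", "?m_ab"), ("!m_ac", "?m_ac")]

def precedes(e1, e2):
    """True if e1 must happen before e2 under the MSC partial order."""
    edges = set(messages)
    for seq in axes.values():           # each axis orders its own events
        edges |= {(a, b) for i, a in enumerate(seq) for b in seq[i + 1:]}
    changed = True                      # transitive closure by fixpoint
    while changed:
        new = {(a, d) for a, b in edges for c, d in edges if b == c}
        changed = not new <= edges
        edges |= new
    return (e1, e2) in edges
```

So the send of m_ab precedes the send of m_b (via B's axis), while the sends of m_b and m_c remain unordered: exactly the "partial order, not exact time" reading of the diagram.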
Modeling Formalisms
FIGURE 5.21 Using a synchronizing condition in an MSC model.
FIGURE 5.22 Using timers in an MSC model.
It is also possible to define a shared condition for synchronization purposes. This means that two or more processes must reach a specific state before continuing with further procedures (see Figure 5.21). On the other hand, three special events can be associated with timers: timer set, timer reset, and time-out. In the graphical representation, the timer set symbol has the form of an hourglass connected to the instance axis. The time-out is a message arrow attached to the hourglass symbol and pointing to the instance axis. The timer reset is represented by a cross symbol connected to the instance axis by a line (see Figure 5.22). Message Sequence Chart also supports structured design. This means that, beyond the representation of simple scenarios using basic MSCs, it is possible to represent more complex specifications by means of high-level message sequence charts (HMSCs). HMSCs are used for the hierarchical top-down specification of scenarios as a directed graph. An HMSC model has a start node and an end node, and the vertices can represent other “lower-level” HMSCs, basic MSCs (“MSC references”), or conditions. All conditions at the HMSC level are considered global. Each “MSC reference” refers to a specific MSC, which defines it. Figure 5.23 presents a simple HMSC model with associated basic MSCs. A possible scenario description for our electric car controller running example, presented earlier, is the following: Considering that the car is at its home position, the operator activates the car motor in order to move it to the end of the line. Detection of car presence by sensor_B means that the car reached the
FIGURE 5.23 An HMSC model: (a) top-level model, (b) associated inner level MSCs.
FIGURE 5.24 An MSC model for the electric car controller system.
end of the line, and the car must be stopped. After that, the operator can activate the motor in the reverse direction, in order to take the car back to its home position. When the car's presence is detected by sensor_A, the car must be stopped, because it is again at its home position. This scenario can be represented graphically by the MSC of Figure 5.24. When we want to consider more than one car in the system, it is necessary to replicate instances and messages. In order to activate each car, it is necessary to send a message to each of them. The basic MSC for the two electric cars controller system is presented in Figure 5.25. Alternatively, we could also benefit from the HMSC representation.
FIGURE 5.25 An MSC model for two electric cars controller system.
FIGURE 5.26 Transition t1 firing: (a) before and (b) after.
5.4.8 Petri Nets
Carl Adam Petri proposed Petri nets in 1962 [36]. Petri nets can be viewed as a generalization of a state machine where several states can be active and transitions can start at a set of “states” and end in another set of “states.” More formally, Petri nets are directed graphs with two distinct sets of nodes: places, drawn as circles, are the passive entities; and transitions, drawn as bars, rectangles, or squares, are the active entities. Places and transitions are connected by arcs in an alternating way: places can only be connected to transitions, and transitions can only be connected to places. Places can contain tokens, also called marks, drawn as small filled circles inside the places. Figure 5.26 shows two nets, each with one transition (t1), five places (p1 to p5), and five arcs. The nets have distinct markings: net (b) corresponds to the marking of net (a) after the transition firing. A transition can only fire if all its input places contain at least one token. From the transition's point of view, one token is taken from each input place and one token is put in each output place; these destructions and creations of tokens form an atomic operation, in the sense that it cannot be decomposed. Notice that a transition can also have only input arcs (sink transition) or only output arcs (source transition). More precisely, this Petri net model is a place/transition net [12,37]. Place/transition nets are the most well-known and best-studied class of Petri nets; in fact, this class is sometimes just called Petri nets. Generalized Petri nets simply add the possibility of weights on arcs, with the expected change in semantics: each arc now takes or deposits the specified number of tokens.
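The firing rule can be stated operationally. This minimal Python sketch models a generalized net (arc weights default to 1 in the basic case); the five-place topology below is an illustrative assumption in the spirit of Figure 5.26, not taken from the chapter:

```python
def enabled(marking, pre):
    """pre: {place: arc_weight} of a transition's input arcs."""
    return all(marking.get(p, 0) >= w for p, w in pre.items())

def fire(marking, pre, post):
    """Atomically consume `pre` tokens and produce `post` tokens."""
    if not enabled(marking, pre):
        raise ValueError("transition not enabled")
    m = dict(marking)
    for p, w in pre.items():       # destroy tokens in input places
        m[p] -= w
    for p, w in post.items():      # create tokens in output places
        m[p] = m.get(p, 0) + w
    return m

# One transition, five places, five arcs (illustrative topology).
m0 = {"p1": 1, "p2": 1}
m1 = fire(m0, pre={"p1": 1, "p2": 1}, post={"p3": 1, "p4": 1, "p5": 1})
assert m1 == {"p1": 0, "p2": 0, "p3": 1, "p4": 1, "p5": 1}
```

Because `fire` returns a new marking in one step, there is no observable intermediate state between consumption and production, mirroring the atomicity of the firing rule.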
FIGURE 5.27 Petri net models for our electric cars example. (From Luís Gomes and João Paulo Barros, Models of Computation for Embedded Systems. In The Industrial Information Technology Handbook, Richard Zurawski Ed., Section VI — Real Time and Embedded Systems, chapter 83, CRC Press, Boca Raton, FL, 2005. With permission.)
Coming back to our electric cars example, Figure 5.27(a) presents a Petri net model suitable for modeling a system with three electric cars, maintaining the constraint of synchronized starts and returns (to improve legibility, annotations associated with outputs are omitted, but they can be associated with different places or transitions, as in state diagrams and Statecharts). It should be noted that the transitions capture the dynamics of the model, while the places hold the current state of each car. Figure 5.27(b) uses a generalized Petri net model (arcs have weights associated with them), enabling a more compact representation of the system. When thinking about introducing more cars into our system (the scalability problem), it is clear that, with this last model, it is only necessary to change the initial marking of the net: the left-most place should contain as many tokens as there are electric cars in the system. As even this simple example shows, Petri nets give special emphasis neither to states, as state diagrams do, nor to actions, as dataflow models do: states and actions are both “first-class citizens.” In other words, Petri nets provide a pure dual interpretation of the modeled world: each entity is either passive or active. As already stated, passive entities (e.g., states, resources, buffers, channels, conditions) are modeled by places; active entities (e.g., actions, events, signals, execution of procedures) are modeled as transitions. Another fundamental characteristic of Petri nets is their locality of cause and effect. More concretely, this means transitions fire based only on their vicinity, namely their input places and, possibly (for some classes of Petri nets), their output places. Two other fundamental aspects of Petri nets are their concurrent and distributed execution, and the instantaneous firing of transitions.
The concurrent and distributed execution results from the locality of cause and effect: all transitions with no input or output places in common fire independently. This implies the model is intrinsically distributed, as all transitions with disjoint localities do not depend on other transitions. The instantaneous firing means that each transition fires instantaneously: there are no intermediate states between token destruction in input places and token creation in output places. Petri nets are well known for their inherent simplicity, due to the small number of elementary concepts. Yet, the simplicity of Petri nets is a two-edged sword [38]: on one hand, it avoids the addition of further complications in the model resulting from language complexity; on the other hand, it invariably implies that some desired feature is missing. As such, this basic type of Petri net has been modified in numerous ways, to the point that Petri net is really a general name for a large group of different definitions and extensions, although all claim some close relation to the seminal work by Carl Petri [36]. A very readable and informal discussion about the existence of many different Petri net classes can be found in the paper by Desel and Juhás [39]. This situation has its origins in the fact, easily confirmed by experience, that conventional Petri nets, also known as low-level Petri nets, are sometimes difficult to use in practice, due not only to the problem of rapid model growth, but also to the lack of more specific constructions. Namely, time modeling, interfaces with the environment, and structuring mechanisms are frequent causes for extensions. While the former two are specific to some types of systems, structuring mechanisms are useful for even the simplest system model.
It is important to note that, contrary to state machines, the number of places grows linearly with the size of the system model, as not all states have to be represented explicitly: they can result from the combination of several already existing “states” (modeled by places). This implies that Petri net models “scale better” than state machines (including Statecharts). Even so, model growth is still a significant problem for Petri net models, due to their graphical nature and low-level concepts. To minimize this, several abstraction and composition mechanisms exist. These can be classified into two groups:
1. Refinement/abstraction mechanisms.
2. Composition mechanisms.
Refinement/abstraction constructs correspond, roughly, to the programming language concepts of macros or procedure invocation [53]. Intuitively, macro expansion is an “internal” composition where we get nets inside nets through the use of nodes (places or transitions). The use of places and transitions as encapsulated Petri nets goes back to the works of Carl Adam Petri [36] and was extensively presented in Reference 15. Such nodes are usually named macros (or macronodes) [40]. These constructs support hierarchical specification of systems, in the traditional sense of module decomposition/aggregation and a “divide and conquer” attitude. Figure 5.28 presents a two-level model of a typical producer–consumer system. At the top level, both producer and consumer are modeled by transitions (associated with the dynamics of the system), while the storage is associated with a place (the static part of the system). It has to be noted that the executable model is obtained through the “flat” model, the one resulting from the merging of the different subsystems, as also presented in Figure 5.28. Several proposals have even added object-oriented features to Petri nets (see Reference 41 for an up-to-date detailed survey of some approaches).
On the other hand, a common compositional mechanism is the folding of nets found in high-level nets. High-level nets can be seen as an internal composition made possible by structural symmetries inside a low-level net model. This is achieved by the use of high-level tokens. These tokens are no longer indistinguishable, but have data structures attached to them. Figure 5.29 shows the high-level Petri net model associated with our three electric cars system. Tokens present in the left-most place (the initial marking) now contain integer values, which enable the identification of a specific car's status inside the model.
FIGURE 5.28 Hierarchical decomposition of a producer–consumer Petri net model.
FIGURE 5.29 High-level Petri net models for our electric cars example. (From Luís Gomes and João Paulo Barros, Models of Computation for Embedded Systems. In The Industrial Information Technology Handbook, Richard Zurawski Ed., Section VI — Real Time and Embedded Systems, chapter 83, CRC Press, Boca Raton, FL, 2005. With permission.)
FIGURE 5.30 Handling of simultaneous events in a discrete event system. (From Luís Gomes and João Paulo Barros, Models of Computation for Embedded Systems. In The Industrial Information Technology Handbook, Richard Zurawski Ed., Section VI — Real Time and Embedded Systems, chapter 83, CRC Press, Boca Raton, FL, 2005. With permission.)
As tokens can carry data, it is necessary to add inscriptions to the arcs, in order to select the right tokens to involve in a specific transition firing. Taking transition GO of Figure 5.29 as an example, we need three tokens to enable the transition to fire, obtained through an adequate binding of the variables i, j, and k with data from the tokens present in the incoming place. On transition A[i], the inscription on the incoming and outgoing arcs ([i]) simply means that no transformation is performed on the token attributes. For some classes of high-level nets, it is also possible to add, within those arc inscriptions, references to functions that transform token attributes. This is the case for Colored Petri nets [42]. In this way, data processing capabilities can also be embedded in the model.
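A sketch of how such a binding-driven firing might work, in the spirit of Figure 5.29 (the data structures and function names are assumptions for illustration, not a Colored Petri net API): transition GO needs a binding of i, j, and k to three distinct tokens in its input place.

```python
from itertools import permutations

def fire_go(place):
    """Bind variables i, j, k to three distinct tokens of `place`
    (a list used as a multiset) and move them to the output place."""
    for i, j, k in permutations(place, 3):
        # any binding of three distinct tokens enables GO
        remaining = list(place)
        for tok in (i, j, k):
            remaining.remove(tok)
        return remaining, [i, j, k]    # (input place after, output place)
    raise ValueError("GO not enabled: fewer than three tokens")

home = [1, 2, 3]                       # initial marking: one token per car
home, moving = fire_go(home)
assert home == [] and sorted(moving) == [1, 2, 3]
```

With fewer than three cars in the input place no binding exists, so the transition is simply not enabled; with more cars, any three of them could be chosen, reflecting the nondeterminism of the binding.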
5.4.9 Discrete Event
In a discrete event MoC, events are sorted by a time stamp stating when they occurred and are analyzed in chronological order. A discrete event simulator usually maintains a global event queue. The operational semantics of a discrete event system consists of the sequential firing of processes according to the chronological time stamps of the queued events. Each queued event is removed from the queue, and the processes that have it as an input are fired. New events produced by these firings are added to the event queue, according to their time tags. Like most digital hardware simulation engines, hardware description languages, such as VHDL and Verilog, use a discrete-event-based simulation approach. Simultaneous events, especially those originated by zero-time delays, can cause problems, due to the ambiguity resulting from equal time tags (such events cannot be ordered). Figure 5.30 illustrates one simple situation.
Considering that X, Y, and Z represent some processing modules/processes/components, and that module Y has zero-time delay, it will produce output events with the same time stamp as its input events. If module X produces an event with time stamp t (Figure 5.30[a]), both modules Y and Z will receive that event; there is ambiguity about whether Y or Z should be executed first. If Z is executed first, the event presented at its input will be consumed; afterwards, Y will be executed and will produce an event that will be consumed in turn by Z. On the other hand, if Y is executed first, an event will be produced with the same time stamp, because module Y has zero-time delay (Figure 5.30[b]). At this point, the execution of Z can be accomplished in two ways: (1) taking both events in the same invocation step, or (2) taking one of the events first and the other afterwards (it is also not clear which should be processed first). In order to enable a coherent system simulation (producing the same result whether Y or Z is executed first), the concept of delta delay is introduced. Events with the same time stamp are ordered according to their delta values. Now, the event produced by module Y will have a time stamp of t + Δ (Figure 5.30[c]), resulting in an execution of Z first consuming the event originated by X, with time stamp t, and afterwards (in the next delta step) consuming the event originated by Y, with time stamp t + Δ (Figure 5.30[d]). If a different zero-time delay process receives the event with time stamp t + Δ, it will generate a further event with time stamp t + 2Δ. All events with the same time tag and delta tag will be consumed at the same time. This means a two-level model of time was introduced: on top of the totally ordered events, each time instant can be further decomposed into (totally ordered) delta steps applicable to generated events. However, from the simulated-time point of view reported to the user, no delta information is included.
In this sense, delta steps do not relate to time as perceived by the observer. They are simply a mechanism to compute the behavior of the system at a point in time. Similar mechanisms can also be used for synchronous/reactive models, which are analyzed in the next section.
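A minimal discrete-event kernel with delta steps might look as follows; the process names X, Y, and Z mirror Figure 5.30, but the API itself is an illustrative assumption. Events are ordered by the pair (time, delta), and a zero-delay process schedules its output at the same time but at delta + 1:

```python
import heapq

class Simulator:
    def __init__(self):
        self.queue, self.log = [], []

    def post(self, time, delta, target, value):
        # the heap orders events by (time, delta), then by target name
        heapq.heappush(self.queue, (time, delta, target, value))

    def run(self, handlers):
        while self.queue:
            time, delta, target, value = heapq.heappop(self.queue)
            self.log.append((time, delta, target))
            handlers[target](self, time, delta, value)

sim = Simulator()
handlers = {
    # Y is zero-delay: it re-emits at the same time, next delta step
    "Y": lambda s, t, d, v: s.post(t, d + 1, "Z", v),
    # Z just consumes its input
    "Z": lambda s, t, d, v: None,
}
# X emits one event at t = 5 to both Y and Z, as in Figure 5.30(a)
sim.post(5, 0, "Y", "ev")
sim.post(5, 0, "Z", "ev")
sim.run(handlers)
# Z sees X's event at (5, 0) and Y's derived event at (5, 1)
assert sim.log == [(5, 0, "Y"), (5, 0, "Z"), (5, 1, "Z")]
```

Because the delta tag breaks the tie between simultaneous events, the result is the same regardless of which ready process the kernel happens to pop first at a given (time, delta) pair.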
5.4.10 Synchronous/Reactive Models
Reactive systems are those that interact continuously with their environment; the name was coined by David Harel and Amir Pnueli in the early 1980s. In the synchronous MoC, all events are synchronous: events in one signal have tags equal to those of events in other signals, and the tags are totally ordered. In this computational model, the output responses are produced simultaneously with the input stimuli. Conceptually, synchrony means that reactions take no time, so there is no observable time delay in the reaction (outputs become available as inputs become available). Unlike the discrete event model, all signals have events at all clock ticks, which simplifies the simulator, since no sorting is needed. Simulators that exploit this characterization are called cycle based (as opposed to the so-called discrete-event simulators). A cycle is the processing of all the events at a given clock tick. Cycle-based simulation is excellent for clocked synchronous circuits. Some applications use a multirate cycle-based model, where every nth event in one signal aligns with the events in another signal. Synchronous languages [43,44], such as Esterel [45], Lustre [46], and Signal [47], use the synchronous/reactive MoC. The Statecharts graphical formalism belongs to the same language family and is sometimes referred to as quasi-synchronous; Statecharts are presented in a separate section. Esterel is a language for describing a collection of interacting synchronous FSMs, supporting the description of concurrent and sequential statements:
• S1;S2 represents the sequential execution of S1 followed by S2.
• S1||S2 represents the parallel execution of S1 and S2.
• [S1||S2] executes until both S1 and S2 terminate.
The synchronized modules communicate through signals, using an instantaneous broadcast mechanism. As an introductory example, we consider an FSM that implements the following behavior [48]: “Emit an output O as soon as two inputs A and B have occurred.
Reset this behavior each time the input R occurs.”
FIGURE 5.31 A simple Mealy state machine.
Figure 5.31 presents the associated state diagram. The respective Esterel representation is presented below:

module SMSM:
input A, B, R;
output O;
loop
  [ await A || await B ];
  emit O
each R
end module
Waiting for the occurrences of A and B is accomplished through “[await A || await B].” The “emit O” statement emits the signal O and terminates at the instant it starts. So, O is emitted exactly at the instant where the last of A and B occurs. The handling of the reset condition is accomplished by the loop “loop p each R.” When the loop starts, its body p runs freely until the occurrence of R. At that time, the body is aborted and is immediately restarted. If the body terminates before the occurrence of R, one simply waits for R to restart the body. It has to be noted that Esterel code grows linearly with the size of the specification (while FSM complexity grows exponentially). For instance, if we want to wait for the occurrence of three events, A, B, and C, the code change is minimal: “[await A || await B || await C].” Esterel is a control-flow-oriented language. On the other hand, Lustre is a language that adopts a data-oriented flavor. It follows a synchronous dataflow approach and supports a multirate clocking mechanism. In addition to the ability to translate these languages into finite state descriptions, it is possible to compile them directly into hardware (execution of these languages in software is made possible through simulation of the generated circuit).
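The module's behavior can be mirrored by a per-reaction step function. The Python sketch below is an illustrative encoding (not generated Esterel code): it processes one synchronous tick at a time, emits O in the very reaction where the second of A and B arrives, and gives R preemptive priority, as “each R” prescribes:

```python
class AwaitAB:
    def __init__(self):
        self.seen = set()   # which of A, B have occurred since last reset
        self.done = False   # body terminated, waiting for R

    def tick(self, inputs):
        """inputs: subset of {"A", "B", "R"}; returns emitted outputs."""
        if "R" in inputs:               # preemption: restart the body
            self.seen, self.done = set(), False
            return set()
        if self.done:                   # body finished; only R matters
            return set()
        self.seen |= inputs & {"A", "B"}
        if self.seen == {"A", "B"}:
            self.done = True
            return {"O"}                # emitted in the same reaction
        return set()

m = AwaitAB()
assert m.tick({"A"}) == set()
assert m.tick({"B"}) == {"O"}           # O at the instant B completes the pair
assert m.tick({"A", "B"}) == set()      # already done until reset
assert m.tick({"R"}) == set()
assert m.tick({"A", "B"}) == {"O"}      # simultaneous A and B after reset
```

Note how the "output in the same reaction as the input" property of the synchronous MoC shows up directly: `tick` returns the emitted set for the very inputs it was given.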
5.4.11 Dataflow Models
Dataflow models are graphs where nodes represent operations (also called actors) and arcs represent datapaths (also called channels). These datapaths carry totally ordered sequences of data items (also called tokens). Dataflow is a distributed modeling formalism, in the sense that there is no single locus of control. Yet, the conventional dataflow model is not suitable for representing the control part of a system (beyond what is implied by the graph structure). Regarding time handling, dataflow systems are asynchronous and concurrent. Dataflow graphs have been extensively used in modeling data-dominated systems, namely digital signal-processing applications and applications dealing with streams of periodic/regular data samples. Computationally intensive systems, carrying out complex data transformations, can be conveniently represented by a directed graph where the nodes represent computations (to be coded in a suitable programming language) and the arcs represent the order in which the computations are performed. This is the case for signal-processing algorithms, for example, encoding/decoding, filtering, convolution, compression, etc., which are expressed as block diagrams and coded in a specific programming language, such as C. Several dataflow graph models have been described in the literature, each with its specific semantics, namely:
• Kahn Process Networks [49].
• Dataflow Process Networks [50].
• Synchronous dataflow graphs [51].
In Kahn Process Networks, and also in Dataflow Process Networks (which constitute a special case of Kahn Process Networks), processes communicate by passing data tokens through unidirectional unbounded FIFO channels. A process can fire according to a set of firing rules. During each atomic firing, the actor consumes tokens from input channels, executes some behavior specified in a selected programming language, and produces tokens that are put on output channels.
Writing to a channel is nonblocking, whereas reading from a channel blocks the process until there is sufficient data in it. Kahn Process Networks have guaranteed deterministic behavior: for a certain sequence of inputs, there is only one possible sequence of outputs. This is independent of the computation and communication durations, and also of the actors' firing order. Dataflow process analysis can be based on graph inspection alone. The possibility of choosing the actors' firing order allows very efficient implementations. Hence, signal-processing applications frequently have as a goal the scheduling of the actor firings at compile time. This results in an interleaved implementation of the concurrent model, represented by a list of processes to be executed, allowing its implementation on a single-processor architecture. Accordingly, for multiprocessor architectures, a per-processor list is obtained. Kahn Process Networks cannot be scheduled statically, as their firing rules do not allow us to build, at compile time, a firing sequence such that the system does not block under any circumstances. For this modeling formalism, one must use dynamic scheduling (with the associated implementation overhead). Another disadvantage of Kahn Process Networks is associated with the unbounded FIFO buffers and the potential growth of memory needs. Therefore, Kahn Process Networks cannot be efficiently implemented without considering some limitations. Synchronous dataflow graphs are a kind of Kahn Process Network with additional restrictions, namely:
• At each firing, a process consumes and produces a fixed number of tokens on its incoming and outgoing channels, respectively.
• For a process to fire, it must have at least as many tokens on its input channels as it has to consume.
• Arcs are marked with the number of tokens produced or consumed.
Figure 5.32 presents the basic firing rule.
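The Kahn-style discipline described above — blocking reads, nonblocking writes to unbounded FIFOs, and deterministic outputs under any fair schedule — can be sketched with a Python generator (the channel contents and the `scale` process are illustrative, not from the chapter):

```python
from collections import deque

def scale(inp, out, k):
    """A Kahn-style process: repeatedly read one token, write k * token.
    The blocking read is modeled by suspending until data is available."""
    while True:
        while not inp:            # blocking read: no data yet
            yield                 # suspend; the scheduler will resume us
        out.append(k * inp.popleft())   # nonblocking write
        yield

a, b = deque([1, 2, 3]), deque()  # unbounded FIFO channels
proc = scale(a, b, 10)
for _ in range(10):               # any fair schedule yields the same output
    next(proc)
assert list(b) == [10, 20, 30]
```

However the resumptions are interleaved with other processes, the output sequence on `b` depends only on the input sequence on `a`, which is the determinism property noted above.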
Figure 5.33 shows a simple dataflow model and an associated static scheduling list. This schedule is a list of firings that can be repeated indefinitely. One cycle through the schedule should return the graph to its original state (by state, we mean the number of tokens on every arc).
FIGURE 5.32 Basic firing rule.
FIGURE 5.33 Simple dataflow model and associated static schedule list. (From Luís Gomes and João Paulo Barros, Models of Computation for Embedded Systems. In The Industrial Information Technology Handbook, Richard Zurawski Ed., Section VI — Real Time and Embedded Systems, chapter 83, CRC Press, Boca Raton, FL, 2005. With permission.)
This static schedule can be determined through analysis of the graph, in order to find firing sequences that produce and consume, on each arc, the same number of tokens. One balance equation is built per arc, considering that

outgoing_weight × origin_node = incoming_weight × destination_node

For the referred example, the following balance equations can be obtained (one per arc):

2a − 4b = 0
b − 2c = 0
2c − d = 0
2b − 2d = 0
2d − a = 0

The smallest positive integer solution of this set of equations is (a = 4, b = 2, c = 1, d = 2), which gives the firing vector associated with the static scheduling list. In other cases, the balance equations may admit only the trivial all-zero solution, implying that no static schedule is possible.
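This procedure can be sketched in Python (the tuple encoding of arcs is an assumption for illustration): propagate relative rates from a reference actor along the arcs, scale to the smallest integer vector, and check that every balance equation holds.

```python
import math
from fractions import Fraction

def repetition_vector(arcs):
    """arcs: list of (producer, prod_rate, consumer, cons_rate) tuples,
    assumed to form a connected, consistent SDF graph."""
    rates = {arcs[0][0]: Fraction(1)}     # pick a reference actor
    todo = list(arcs)
    while todo:
        for arc in list(todo):
            src, p, dst, c = arc
            if src in rates:              # rate(dst) = rate(src) * p / c
                rates.setdefault(dst, rates[src] * p / c)
                todo.remove(arc)
            elif dst in rates:
                rates.setdefault(src, rates[dst] * c / p)
                todo.remove(arc)
    # scale the relative rates to the smallest positive integer vector
    scale = math.lcm(*(r.denominator for r in rates.values()))
    reps = {n: int(r * scale) for n, r in rates.items()}
    # every balance equation must hold, otherwise no static schedule exists
    assert all(reps[s] * p == reps[d] * c for s, p, d, c in arcs)
    return reps

# Arcs of the example: A -2/4-> B, B -1/2-> C, C -2/1-> D,
# B -2/2-> D, D -2/1-> A
g = [("A", 2, "B", 4), ("B", 1, "C", 2), ("C", 2, "D", 1),
     ("B", 2, "D", 2), ("D", 2, "A", 1)]
assert repetition_vector(g) == {"A": 4, "B": 2, "C": 1, "D": 2}
```

The resulting vector (4, 2, 1, 2) matches the solution of the balance equations above; an inconsistent graph would trip the final assertion, signaling that no static schedule exists.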
FIGURE 5.34 Dynamic dataflow actors and dataflow firings. (From Luís Gomes and João Paulo Barros, Models of Computation for Embedded Systems. In The Industrial Information Technology Handbook, Richard Zurawski Ed., Section VI — Real Time and Embedded Systems, chapter 83, CRC Press, Boca Raton, FL, 2005. With permission.)
Many variants of dataflow models have been explored. Especially interesting among them is an extension to synchronous dataflow networks allowing dynamic dataflow actors [52]; dynamic actors are actors with at least one conditional input or output port. The canonical dynamic actors are “switch” and “select,” enabling the testing of tokens on one specific incoming arc, and the consumption and production of tokens in a data-dependent way. A “switch” actor, for example, has one regular incoming arc, one control incoming arc, and two outgoing arcs; whenever tokens are present on the incoming arcs, the value carried by the token on the control arc dynamically selects the active outgoing arc: the actor firing produces one single token on the selected outgoing arc. Figure 5.34 illustrates dynamic dataflow actor behavior before and after firing. It has to be noted that control ports are never conditional ports. More complex dynamic actors can be used, namely dynamic actors with integer control (instead of Boolean control), originating CASE and ENDCASE actors (generalizations of the SWITCH and SELECT actors, respectively). Aggregation of these actors following specific patterns is commonly applied. For instance, integer-controlled dataflow graphs can result from cascading a CASE actor, a set of mutually exclusive processing nodes, and an ENDCASE actor. The control ports of the CASE and ENDCASE actors receive the same value; only one of the processing nodes is executed at a time, depending on the value presented at the control port. Most simulation environments targeted at signal processing, such as Matlab-Simulink or Khoros (for image processing), use dataflow models.
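The two canonical actors can be sketched as single-firing functions (a deliberately minimal encoding; the queue layout and names are assumptions for illustration): SWITCH routes one data token to the T or F branch according to a Boolean control token, and SELECT does the converse.

```python
def switch(data, ctrl):
    """Consume one data token and one control token; route the data
    token to the T or F output according to the control value."""
    t_out, f_out = [], []
    (t_out if ctrl else f_out).append(data)
    return t_out, f_out

def select(t_in, f_in, ctrl):
    """Consume the control token and one token from the branch
    it selects; the other branch is left untouched."""
    return t_in.pop(0) if ctrl else f_in.pop(0)

t, f = switch("tok", True)
assert t == ["tok"] and f == []          # token routed to the T branch
assert select(["x"], ["y"], False) == "y"
```

This is what makes these actors dynamic: unlike SDF actors, the number of tokens moved per port is decided at run time by the control token, which is why graphs containing them cannot, in general, be scheduled statically.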
5.5 Conclusions
As system complexity increases, the modeling formalism used for system specification becomes more important. It is through the usage of a formal MoC that the designer can simulate and analyze the behavior
of the system, and anticipate, impose, or avoid several properties and features. This chapter presented a set of selected modeling formalisms, also called MoCs, addressing embedded system design. Due to the different types of embedded systems applications, some of the formalisms underscore the reactive (control) part of the system, while others emphasize data processing modeling capabilities. The final goal, and the ultimate challenge to the designer, is to choose the right formalism for the system at hand, allowing the design of more robust and reliable systems, obtained in less development time, with easier maintainability and lower cost.
References [1] L. Lavagno, A. Sangiovanni-Vincentelli, and E. Sentovich, Models of Computation for Embedded System Design. NATO ASI Proceedings on System Synthesis, Il Ciocco (Italy), 1998. [2] M. Sgroi, L. Lavagno, and A. Sangiovanni-Vincentelli, Formal Models for Embedded System Design. IEEE Design and Test of Computers, 17: 14–17, 2000. [3] Stephen Edwards, Luciano Lavagno, Edward A. Lee, and A. Sangiovanni-Vincentelli, Design of Embedded Systems: Formal Models, Validation, and Synthesis. Proceedings of the IEEE, 85: 366–390, 1997. [4] Frank Vahid and Tony Givargis, Embedded System Design — A Unified Hardware/Software Introduction. John Wiley & Sons, Inc., New York, 2002. [5] Luís Gomes and João Paulo Barros, Models of Computation for Embedded Systems. In The Industrial Information Technology Handbook, Richard Zurawski, Ed., Section VI — Real Time and Embedded Systems, chapter 83, CRC Press, Boca Raton, FL, 2005. [6] Edsger W. Dijkstra, On the Economy of Doing Mathematics. In Mathematics of Program Construction. Second International Conference, Oxford, UK. Lecture Notes in Computer Science, Vol. 669. R.S. Bird, C.C. Morgan, and J.C.P. Woodcock, Eds., Springer-Verlag, Heidelberg, 1993, pp. 2–10. [7] David Harel, On Visual Formalisms. Communications of the ACM, 31: 514–530, 1988. [8] D. Harel and B. Rumpe, Meaningful Modeling: What’s the Semantics of “Semantics”?, Computer 37:10. IEEE Press, October 2004, pp. 64–72. [9] J. Buck, S. Ha, E.A. Lee, and D.G. Messerschmitt, Ptolemy: A Framework for Simulating and Prototyping Heterogeneous Systems. International Journal of Computer Simulation, Special Issue on Simulation Software Development, 4: 155–182, 1994. [10] Edward A. Lee and A. Sangiovanni-Vincentelli, Comparing Models of Computation. Proceedings of ICCAD’, 234–241, 1996. [11] L. Lamport, Time, Clocks, and the Ordering of Events in a Distributed System. Communications of the ACM, 21: 558–565, 1978. [12] Wolfgang Reisig, Petri Nets: An Introduction. 
Springer-Verlag, New York, 1985. [13] E.F. Moore, Gedanken-Experiments on Sequential Machines, Automata Studies. Princeton University Press, Princeton, NJ, 1956. [14] G.H. Mealy, A Method for Synthesizing Sequential Circuits. Bell System Technical Journal, 34: 1045–1079, 1955. [15] Manuel Silva, Las Redes de Petri: en la Automática y la Informática. Editorial AC, Madrid, 1985. [16] Daniel Gajski, Nikil Dutt, Allen Wu, and Steve Lin, High-Level Synthesis — Introduction to Chip and System Design. Kluwer Academic Publishers, Dordrecht, 1992. [17] David Harel, Statecharts: A Visual Formalism for Complex Systems. Science of Computer Programming, 8: 231–274, 1987. [18] Bruce Powel Douglass, Doing Hard Time — Developing Real-Time Systems with UML, Objects, Frameworks, and Patterns. Object Technology Series, Addison-Wesley, Reading, MA, 1999.
© 2006 by Taylor & Francis Group, LLC
Modeling Formalisms
5-33
[19] Bruce Powel Douglass, Real-Time UML — Developing Efficient Objects for Embedded Systems. Object Technology Series, Addison-Wesley, Reading, MA, 1998. [20] Grady Booch, James Rumbaugh, and Ivar Jacobson, The Unified Modeling Language User Guide. Object Technology Series, Addison-Wesley, Reading, MA, 1999. [21] David Harel and Michal Politi, Modeling Reactive Systems with Statecharts — The STATEMATE Approach. McGraw-Hill, New York, 1998. [22] David Harel and Amnon Naamad, The Statemate Semantics of Statecharts. ACM Transactions on Software Engineering and Methodology (TOSEM), 5: 293–333, 1996. [23] Michael von der Beeck, A Comparison of Statecharts Variants. In Formal Techniques in RealTime and Fault-Tolerant Systems. Lecture Notes in Computer Science, Vol. 863, Hans Langmaack, Willem P. de Roever, and Jan Vytopil, Eds., Springer-Verlag, Heidelberg, 1994, pp. 128–148. [24] F. Vahid, S. Narayan, and D.D. Gajski, Speccharts: A VHDL Frontend for Embedded Systems. IEEE Transactions on Computer-Aided Design, 14: 694–706, 1995. [25] http://www.specc.org, 2004. [26] M. Chiodo, P. Giusto, H. Hsieh, A. Jurecska, L. Lavagno, and A. Sangiovanni-Vincentelli, Hardware/Software Codesign of Embedded Systems. IEEE Micro, 14: 26–36, 1994. [27] F. Balarin, M. Chiodo, P. Giusto, H. Hsieh, A. Jurecska, L. Lavagno, C. Passerone, A. Sangiovanni-Vincentelli, E. Sentovich, K. Suzuki, and B. Tabbara, Hardware–Software Co-Design of Embedded Systems — The POLIS Approach. Kluwer Academic Publishers, Dordrecht, 1997. [28] R.K. Brayton, M. Chiodo, R. Hojati, T. Kam, K. Kodandapani, R.P. Kurshan, S. Malik, A. Sangiovanni-Vincentelli, E.M. Sentovich, T. Shiple, and H.Y. Wang, BLIF-MV: An Interchange Format for Design Verification and Synthesis. Technical report UCB/ERL M91/97, U.C. Berkeley, November 1991. [29] ITU-T, Z.100 Specification and Description Language (SDL). ITU, 2002. 
[30] Laurent Doldi, Validation of Communications Systems with SDL: The Art of SDL Simulation and Reachability Analysis. John Wiley & Sons, New York, 2003. [31] OMG, UML™ Resource Page, http://www.uml.org/, 2004. [32] Rick Reed, Notes on SDL-2000 for the New Millennium. Computer Networks, 35: 709–720, 2001. [33] SDL Forum Society, http://www.sdl-forum.org/, 2004. [34] ITU-T, Z.120 Message Sequence Chart (MSC). ITU, 1999. [35] E. Rudolph, J. Grabowski, and P. Graubmann, Tutorial on Message Sequence Charts (MSC’96). In Tutorials of the First Joint International Conference on Formal Description Techniques for Distributed Systems and Communication Protocols, and Protocol Specification, Testing, and Verification (FORTE/PSTV’96). FORTE’96. Kaiserslautern, Germany, October 1996. [36] Carl Adam Petri, Kommunikation mit Automaten. Ph.D. thesis, University of Bonn, Bonn, West Germany, 1962. [37] Jörg Desel and Wolfgang Reisig, Place/Transition Petri Nets. In Lectures on Petri Nets I: Basic Models. Lecture Notes in Computer Science, Vol. 1491. Springer-Verlag, 1998, pp. 122–173. [38] Eike Best, Design Methods Based on Nets: Esprit Basic Research Action {DEMON}. In Lecture Notes in Computer Science; Advances in Petri Nets 1989, Vol. 424, G. Rozenberg, Ed., Springer-Verlag, Berlin, Germany, 1990, pp. 487–506. [39] J. Desel and G. Juhas, What is a Petri Net? Informal Answers for the Informed Reader. In Unifying Petri Nets, Lecture Notes in Computer Science, Vol. 2128, H. Ehrig, G. Juhas, J. Padberg, and G. Rozenberg, Eds., Springer-Verlag, Heidelberg, 2001, pp. 122–173. [40] Luís Gomes and João-Paulo Barros, On Structuring Mechanisms for Petri Nets Based System Design. IEEE Conference on Emerging Technologies and Factory Automation Proceedings (ETFA’ 2003) Lisbon, Portugal, September 16–19, 2003. [41] Gul Agha, Fiorella de Cindio, and Grzegorz Rozenberg, Eds., Concurrent Object-Oriented Programming and Petri Nets, Advances in Petri Nets. Lecture Notes in Computer Science, Vol. 
2001. Springer-Verlag, Heidelberg, 2001.
© 2006 by Taylor & Francis Group, LLC
5-34
Embedded Systems Handbook
[42] Kurt Jensen, Coloured Petri Nets. Basic Concepts, Analysis Methods and Practical Use, Vol. 1–3. Monographs in Theoretical Computer Science. An EATCS Series, Springer-Verlag, Berlin, Germany, 1992–1997. [43] N. Halbwachs, Synchronous Programming for Reactive Systems. Kluwer Academic Publishers, Dordrecht, 1993. [44] A. Benveniste and G. Berry, The Synchronous Approach to Reactive and Real-Time Systems. Proceedings of the IEEE, 79: 1270–1282, 1991. [45] F. Boussinot and R. De Simone, The ESTEREL Language. Proceedings of the IEEE, 79: 1293–1304, 1991. [46] N. Halbwachs, P. Caspi, P. Raymond, and D. Pilaud, The Synchronous Data Flow Programming Language LUSTRE. Proceedings of the IEEE, 79: 1305–1319, 1991. [47] A. Benveniste and P. Le Guernic, Hybrid Dynamical Systems Theory and the SIGNAL Language. IEEE Transactions on Automatic Control, 35: 525–546, 1990. [48] G. Berry, The Esterel v5 Language Primer, 2000. Available from http://www-sop.inria.fr/esterel.org. [49] G. Kahn, The Semantic of a Simple Language for Parallel Programming. In Proceedings of the IFIP Congress 74. North-Holland Publishing Co., Amsterdam, 1974. [50] E.A. Lee and T.M. Parks, Dataflow Process Networks. Proceedings of the IEEE, 83: 773–801, 1995. [51] E.A. Lee and D.G. Messerschmitt, Synchronous Data Flow. Proceedings of the IEEE, 75: 1235–1245, 1987. [52] E.A. Lee, Consistency in Dataflow Graphs. IEEE Transactions on Parallel and Distributed Systems, 2: 223–235, 1991. [53] P. Huber, K. Jensen, and R.M. Shapiro, Hierarchies in Coloured Petri Nets. In Proceedings of the 10th International Conference on Applications and Theory of Petri Nets. Bonn, Springer-Verlag, London, 1991, pp. 313–341.
© 2006 by Taylor & Francis Group, LLC
6 System Validation

J.V. Kapitonova, A.A. Letichevsky, and V.A. Volkov
National Academy of Science of Ukraine

Thomas Weigert
Motorola

6.1 Introduction
6.2 Mathematical Models of Embedded Systems
    Transition Systems • Agents • Environments • Classical Theories of Concurrency
6.3 Requirements Capture and Validation
    Approaches to Requirements Validation • Tools for Requirements Validation
6.4 Specifying and Verifying Embedded Systems
    System Descriptions and Initial Requirements • Static Requirements • Dynamic Requirements • Example: Railroad Crossing Problem • Requirement Specifications • Reasoning about Embedded Systems • Consistency and Completeness
6.5 Examples and Results
    Example: Embedded Operating System • Experimental Results in Various Domains
6.6 Conclusions and Perspectives
References
6.1 Introduction

Toward the end of the 1960s, system designers and software engineers faced what was then termed the "software crisis." This crisis was the direct outcome of the introduction of a new generation of computer hardware. The new machines were substantially more powerful than the hardware available until then, making large applications and software systems feasible. The strategies and skills employed in building software for the new systems did not match the new capabilities provided by the enhanced hardware. The results were delayed projects, sometimes for years, considerable cost overruns, and unreliable applications with poor performance. The need arose for new techniques and methodologies to implement large software systems. The now classic "waterfall" software life-cycle model was then introduced to meet these needs. The 1960s have long gone by, but the software crisis still remains. In fact, the situation has worsened: the implementation disasters of the 1960s are being succeeded by design disasters. Software systems have reached levels of complexity at which it is extremely difficult to arrive at complete, or even consistent, specifications, and it is nearly impossible to know all the implications of one's requirement decisions. Further, the availability of hardware and software components may change during the course of the development of a system, forcing a change in the requirements. The customer may be unsure of the requirements altogether. The situation is even worse for embedded systems: the real-time and distributed aspects of such systems impose additional design difficulties and introduce the possibility of concurrency
pathologies, such as deadlock or livelock, resulting from unforeseen interactions of independently executing system components. A number of new methodologies, such as rapid prototyping, executable specifications, and transformational implementation, have been introduced to address these problems, in order to arrive at shorter cycle times and increased quality of the developed systems. Although each of these methodologies addresses different concerns, they share the underlying assumption that verification and validation should be performed as close to the customer requirements as possible. While verification tries to ensure that the system is built "right," that is, without defects, validation attempts to ensure that the "right" system is developed, that is, a system that matches what the customer actually wants. The customer's needs are captured in the system requirements. Many studies have demonstrated that errors in system requirements are the most expensive, as they are typically discovered late, when one first interacts with the system; in the worst case such errors can force complete redevelopment of the system. In this chapter, we examine techniques aimed at discovering the unforeseen consequences of requirements as well as omissions in requirements. Requirements should be consistent and complete. Roughly speaking, consistency means the existence of an implementation that meets the requirements; completeness means that the implementation (its function or behavior) is defined uniquely by the requirements. Validation of a system establishes that the system requirements are consistent and complete. Embedded systems [1–3] consist of several components that are designed to interact with one another and with their environment. In contrast to functional systems, which are specified as functions from input to output values, an embedded system is defined by its properties. A property is a set of desired behaviors that the system should possess.
In Section 6.2, we present a mathematical model of embedded systems. Labeled transition systems, representing the environment and the agents inserted into this environment by means of a continuous insertion function, are used to represent system requirements at all levels of detail. Section 6.3 presents a survey of freely available systems that can be used to validate embedded systems, as well as references to commercially available systems. In Section 6.4, we present a notation to describe the requirements of embedded systems. Two kinds of requirements are distinguished: static requirements define the properties of system and environment states and the insertion function of the environment; dynamic requirements define the properties of the histories and behavior of system and environment. Hoare-style triples are used to formulate static requirements; logical formulae with temporal constraints are used to formulate dynamic requirements. To define transition systems with a complex structure of states we rely on attributed transition systems, which allow us to split the definition of a transition relation into a definition of transitions on a set of attributes and to formulate general transition rules for entire environment states. We also present a tool for reasoning about embedded systems and discuss more formally the consistency and completeness conditions for a set of requirements. Our approach does not require the developers to build software prototypes, which are traditionally used for checking consistency of a system under development. Instead, one develops formal specifications and uses proofs to determine the consistency of the specification. Finally, Section 6.5 presents the specification of a simple scheduler as a case study and reports the results of applying these techniques to systems in various application domains.
We have observed the following time distribution in the software development cycle: 40% of the cycle time is spent on requirements capture, 20% on coding, and 40% on testing. Requirements capture includes not only the development of requirements but also their correction and refinement during the entire development cycle. According to Brooks [4], 15% of the development effort is spent on validation, that is, on ensuring that the system requirements are correct. Therefore, improving validation has a significant impact on development time, even for successful requirement specifications. For failed requirements that force major system redevelopment, the impact is obviously much higher.
6.2 Mathematical Models of Embedded Systems In the embedded domain, the main properties of the systems concern the interaction of components with each other and with the environment. The primary mathematical notion to formally represent interacting
and communicating systems is that of a labeled transition system. When formulating requirements or developing high-level specifications we are not interested in the internal structure of the states of a system and consider these states, and therefore also the systems, as identical up to some equivalence. The abstraction afforded by this equivalence leads to the general notion of an agent and its behavior. Agents exist in some environment and an explicit definition of the interaction of agents and environments in terms of a function that embeds the agent in this environment helps to specialize the mathematical models to particular characteristics of the subject domain.
6.2.1 Transition Systems

The most general abstract model of software and hardware systems, which evolve in time and change states in a discrete way, is that of a discrete dynamic system. It is defined as a set of states and a set of histories describing the evolution of the system in time (either discrete or continuous). As a special case, a labeled transition system over a set of actions A is a set S of states together with a transition relation T ⊆ S × A × S. If (s, a, s′) ∈ T, we usually write this as s −a→ s′ and say that the system S moves from the state s to the state s′ while performing the action a. (Sometimes the term "event" is used instead of "action.") An automaton is a more special case, where the set of actions is the set of input/output values. Continuity of time, if necessary, can be introduced by supplying actions with a duration, that is, by considering complex actions (a, t), where a is a discrete component of an action (its content) and t is a real number representing the duration of a. In timed automata, duration is defined nondeterministically and intervals of possible durations are used instead of specific moments in time.

Transition systems separate the observable part of a system, which is represented by actions, from the hidden part, which is represented by states. Actions performed by a system are observable by an external observer and by other systems, which can communicate with the given system, synchronizing their actions and combining their behaviors. The internal states of a system are not observable; they are hidden. Therefore, the representation of states can be ignored when considering the external behavior of a system. The activity of a system can be described by its history, which is a sequence of transitions beginning from an initial state:

s0 −a1→ s1 −a2→ · · · sn −a(n+1)→ s(n+1) · · ·

A history can be finite or infinite. Each history has an observable part (the sequence of actions a1, a2, . . . , an, . . .) and a hidden part (the sequence of states). The former is called a trace generated by the initial state s0 (in Reference 5, the term behavior is used instead of trace). Two states are called trace-equivalent if the sets of all traces generated by these states coincide. A final history cannot be continued: either it is infinite, or for the last state sn in the sequence there are no transitions sn −a→ s(n+1) from this state; such a state is called a final state. We distinguish final states representing successful termination from deadlock states (states where one part of a system is waiting for an event caused by another part and the latter is waiting for an event caused by the former) and from divergent or undefined states. Such states can be defined later or constitute livelocks (states that contain hidden infinite loops or infinite recursive unfolding without observable actions).

Transition systems can be nondeterministic, in which case a system can move from a given state s into different states performing the same action a. A labeled transition system (without hidden transitions) is deterministic if for arbitrary transitions s −a→ s′ and s −a→ s″ it follows that s′ = s″, and there are no states representing both successful termination and divergence.

To define transition systems with a complex structure of states we rely on attributed transition systems. If e is a state of an environment and f is an attribute of this environment, then the value of this attribute will be denoted as e · f. We will represent a state of an environment with attributes f1, . . . , fn as an object with public (observable) attributes f1 : t1, . . . , fn : tn, where t1, . . . , tn are types, and some hidden private part.
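To make the definition concrete, the transition relation T and the traces it generates can be sketched in a few lines of Python. This is our own illustrative example (the vending-machine states and actions are invented, not taken from the chapter):

```python
# Labeled transition system T ⊆ S × A × S, encoded as a set of (s, a, s') triples.
# Illustrative vending-machine example; the names are ours, not the chapter's.
T = {
    ("s0", "coin", "s1"),
    ("s1", "coffee", "s0"),
    ("s1", "tea", "s0"),
}

def successors(T, s):
    """All (action, next state) pairs enabled in state s."""
    return [(a, t) for (src, a, t) in T if src == s]

def traces(T, s, depth):
    """Observable traces of length <= depth generated from state s."""
    if depth == 0:
        return {()}
    result = {()}  # the empty trace is always observable
    for a, t in successors(T, s):
        result |= {(a,) + rest for rest in traces(T, t, depth - 1)}
    return result

print(sorted(traces(T, "s0", 2)))
```

Note that only the actions appear in the traces; the states s0 and s1 (the hidden part) are invisible to an external observer, exactly as in the definition above.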
6.2.2 Agents

Agents are objects that can be recognized as separate from the "rest of the world," that is, from other agents or the environment. They change their internal state and can interact with other agents and the environment, performing observable actions. The notion of an agent formalizes such diverse objects as software components, programs, users, clients, servers, active components of distributed systems, and so on. In mathematical terms, agents are labeled transition systems with states considered up to bisimilarity. We are not interested in the structure of the internal states of an agent but only in its observable behavior. The notion of an agent as a transition system considered up to some equivalence has been studied extensively in concurrency theory; van Glabbeek [6] presents a survey of the different equivalence relations that have been proposed to describe concurrent systems. These theories use an algebraic representation of agent states and develop a corresponding algebra so that equivalent expressions define equivalent states. The transition relation is defined on the set of algebraic expressions by means of rewriting rules and recursive definitions. Some representations avoid the notion of a state; instead, if for some agent E a transition for action a is defined, it is said that the agent performs the action a and thus becomes another agent E′.

6.2.2.1 Behaviors

Agents with the same behavior (i.e., agents which cannot be distinguished by observing their interaction with other agents and environments) are considered equivalent. We characterize the equivalence of agents in terms of the complete continuous algebra of behaviors F(A). This algebra has two sorts of elements — behaviors u ∈ F(A), represented as finite or infinite trees, and actions a ∈ A — and two operations — prefixing and nondeterministic choice. If a is an action and u is a behavior, prefixing results in a new behavior denoted as a · u.
Nondeterministic choice is an associative, commutative, and idempotent binary operation over behaviors denoted as u + v, where u, v ∈ F(A). The neutral element of nondeterministic choice is the deadlock element (impossible behavior) 0. The generating relations for the algebra of behaviors are as follows:

u + v = v + u
(u + v) + w = u + (v + w)
u + u = u
u + 0 = u
∅ · u = 0

where ∅ is the impossible action. Both operations are continuous functions on the set of all behaviors over A. The approximation relation ⊆ is a partial order with minimal element ⊥. Both prefixing and nondeterministic choice are monotonic with respect to this approximation:

⊥ ⊆ u
u ⊆ v ⇒ u + w ⊆ v + w
u ⊆ v ⇒ a · u ⊆ a · v

The algebra F(A) is constructed so that prefixing and nondeterministic choice are also continuous with respect to the approximation, and it is closed relative to the limits (least upper bounds) of directed sets of finite behaviors. Thus, we can use the fixed point theorem to give a recursive definition of behaviors starting
from the given behaviors. Finite elements are generated by three termination constants: Δ (successful termination), ⊥ (the minimal element of the approximation relation), and 0 (deadlock). F(A) can be considered as a transition system with the transition relation defined by u −a→ v if u can be represented in the form u = a · v + u′. The terminal states are those that can be represented in the form u + Δ; divergent states are those that can be represented in the form u + ⊥. In algebraic terms we can say that u is terminal (divergent) iff u = u + Δ (u = u + ⊥), which follows from the idempotence of nondeterministic choice. Thus, behaviors can be considered as states of a transition system. Let beh(s) denote the behavior of an agent in a state s; then the behavior of an agent in state s can be represented as the solution u_s ∈ F(A) of the system

u_s = Σ_{s −a→ t} a · u_t + ε_s    (6.1)

where ε_s = 0 if s is neither terminal nor divergent, ε_s = Δ if s is terminal but not divergent, ε_s = ⊥ for divergent but not terminal states, and ε_s = Δ + ⊥ for states which are both terminal and divergent. If all summands in the representation (6.1) are different, then this representation is unique up to associativity and commutativity of nondeterministic choice.

As an example, consider the behavior u defined as u = tick.u. This behavior models a clock that never terminates. It can be represented by a transition system with only one state u which generates the infinite history

u −tick→ u −tick→ · · ·

The infinite tree with only one path representing this behavior can be obtained as the limit of the sequence of finite approximations u(0) = ⊥, u(1) = tick.⊥, u(2) = tick.tick.⊥, . . . . Now consider u = tick.u + stop.Δ. This is a model of a clock which can terminate by performing the action stop, but the number of steps to be done before terminating is not known in advance. The transition system representing this clock has two states, one of which is a terminal state. The first two approximations of this behavior are

u(1) = tick.⊥ + stop.Δ
u(2) = tick.(tick.⊥ + stop.Δ) + stop.Δ

Note that the second approximation cannot be written in the form tick.tick.⊥ + tick.stop.Δ + stop.Δ because distributivity of choice does not hold in the behavior algebra. Finally, u = tick.u + tick.0 describes a similar behavior, but one terminated by deadlock rather than successfully.

6.2.2.2 Bisimilarity

Trace equivalence is too weak to capture the notion of the behavior of a transition system. Consider the systems shown in Figure 6.1. Both systems in Figure 6.1 start by performing the action a. But the system on the left-hand side has a choice at the second step to perform either action b or c. The system on the right can only perform action b and can never perform c, or it can only perform c and never perform b, depending on what decision was made at the first step. The notion of bisimilarity [7] captures the difference between these two systems.
FIGURE 6.1 Two systems which are trace equivalent but have different behaviors.
A binary relation R ⊆ S × S on the set of states S of a transition system without terminal and divergent states is called a bisimulation if for each s and t such that (s, t) ∈ R and for each a ∈ A:

1. If s −a→ s′ then there exists t′ ∈ S such that t −a→ t′ and (s′, t′) ∈ R.
2. If t −a→ t′ then there exists s′ ∈ S such that s −a→ s′ and (s′, t′) ∈ R.

Two states s and t are called bisimilar if there exists a bisimulation relation R such that (s, t) ∈ R. Bisimilarity is an equivalence relation whose definition is easily extended to the case when R is defined as a relation between the states of two different systems, by considering the disjoint union of their sets of states. Two transition systems are bisimilar if each state of one of them is bisimilar to some state of the other. For systems with nontrivial sets of terminal states S_Δ and divergent states S_⊥, partial bisimulation is considered instead of bisimulation. A binary relation R ⊆ S × S is a partial bisimulation if for all s and t such that (s, t) ∈ R and for all a ∈ A:

1. If s ∈ S_Δ then t ∈ S_Δ, and if s ∉ S_⊥ then t ∉ S_⊥.
2. If s −a→ s′ then there exists t′ such that t −a→ t′ and (s′, t′) ∈ R.
3. If t −a→ t′ then there exists s′ such that s −a→ s′ and (s′, t′) ∈ R.

A state s of a transition system S is called a bisimilar approximation of t, denoted by s ⊆_B t, if there exists a partial bisimulation R such that (s, t) ∈ R. Bisimilarity s ∼_B t can then be introduced as the relation s ⊆_B t ∧ t ⊆_B s. For attributed transition systems, there is the additional requirement that if (s, t) ∈ R, then s and t have the same attributes. A divergent state without transitions approximates arbitrary other states that are not terminal. If s approximates t and s is convergent (not divergent), then t is also convergent, s and t have transitions for the same sets of actions, and they satisfy the same conditions as for bisimulation without divergence. Otherwise, if s is divergent, the set of actions for which s has transitions is only included in the set of actions for which t has transitions, that is, s is less defined than t.
For the states of a transition system it can be proved that

s ⊆_B t ⇔ beh(s) ⊆ beh(t)
s ∼_B t ⇔ beh(s) = beh(t)

and, therefore, the states of an agent considered up to bisimilarity can be identified with the corresponding behaviors. If S is the set of states of an agent, then the set U = {beh(s) | s ∈ S} is the set of all its behaviors. This set is transition closed, which means that u ∈ U and u −a→ v implies v ∈ U. Therefore, U is also a transition system, equivalent to S, and can be used as a standard behavior representation of an agent.
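For finite systems, a naive way to decide bisimilarity is to start from the full relation S × S and repeatedly delete pairs that violate the two transfer conditions until a greatest fixed point is reached. The Python sketch below is our own illustration (it ignores terminal and divergent states, so it checks plain bisimulation, not partial bisimulation) applied to the two systems of Figure 6.1:

```python
# Naive greatest-fixpoint bisimulation check; illustrative only, not the
# efficient partition-refinement algorithm used by real tools.

def bisimilar(T1, s0, T2, t0):
    S1 = {s for (s, _, _) in T1} | {s for (_, _, s) in T1}
    S2 = {t for (t, _, _) in T2} | {t for (_, _, t) in T2}
    R = {(s, t) for s in S1 for t in S2}  # start from the full relation

    def step(T, s):
        return [(a, v) for (u, a, v) in T if u == s]

    changed = True
    while changed:
        changed = False
        for (s, t) in set(R):
            # condition 1: every move of s must be matched by t within R
            ok = all(any(a == b and (s1, t1) in R for (b, t1) in step(T2, t))
                     for (a, s1) in step(T1, s))
            # condition 2: every move of t must be matched by s within R
            ok = ok and all(any(a == b and (s1, t1) in R for (b, s1) in step(T1, s))
                            for (a, t1) in step(T2, t))
            if not ok:
                R.discard((s, t))
                changed = True
    return (s0, t0) in R

# Left-hand system of Figure 6.1: the choice between b and c is made after a.
T1 = {("p0", "a", "p1"), ("p1", "b", "p2"), ("p1", "c", "p3")}
# Right-hand system: the choice is already resolved at the first a-step.
T2 = {("q0", "a", "q1"), ("q0", "a", "q2"),
      ("q1", "b", "q3"), ("q2", "c", "q4")}

print(bisimilar(T1, "p0", T2, "q0"))  # trace-equivalent, yet not bisimilar
```

The two systems generate the same traces, but the check returns False: the pair (p1, q1) is deleted because q1 cannot match the c-move of p1, and (p0, q0) falls with it.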
For many applications, a weaker equivalence, such as the weak bisimilarity introduced by Milner [8] or the insertion equivalence discussed in Section 6.2.3, has been considered. Note that, for deterministic systems, if two systems are trace-equivalent, they are also bisimilar.

6.2.2.3 Composition of Behaviors

Composition of behaviors is defined as an operation over agents and is expected to preserve equivalence; it can, therefore, also be defined as an operation on behaviors. The sequential composition of behaviors u and v is a new behavior denoted as (u; v) and defined by means of the following inference rule and equations:

u −a→ u′ ⟹ (u; v) −a→ (u′; v)    (6.2)
((u + Δ); v) = (u; v) + v    (6.3)
((u + ⊥); v) = (u; v) + ⊥    (6.4)
(0; u) = 0    (6.5)

We consider a transition system with states built from arbitrary behaviors over the set of actions A by means of the operations of the behavior algebra F(A) and the new operation (u; v). Expressions are considered up to the equivalence defined by the above equations (thus, the extension of the behavior algebra by this operation is conservative). The inference rule (6.2) defines a transition relation on the set of equivalence classes. From rule (6.2) and equations (6.3)–(6.5) it follows that (Δ; v) = v and (⊥; v) = ⊥. One can prove that (u; Δ) = u and that sequential composition is associative and distributes on the left:

((u + v); w) = (u; w) + (v; w)

Sequential composition can also be defined explicitly by the following recursive definition:
(u; v) = Σ_{u −a→ u′} a · (u′; v) + Σ_{u = u + ε} (ε; v)
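This recursive definition can be transcribed almost literally for finite behaviors. In the sketch below (our own encoding; the class name Beh and the tuple representation are invented for illustration), a behavior is a tuple of (action, continuation) moves plus flags for the Δ and ⊥ summands:

```python
from typing import NamedTuple

class Beh(NamedTuple):
    moves: tuple             # tuple of (action, Beh) pairs: the summands a.u'
    terminal: bool = False   # summand Δ present (successful termination)
    divergent: bool = False  # summand ⊥ present (divergence)

DELTA = Beh((), terminal=True)
BOTTOM = Beh((), divergent=True)
ZERO = Beh(())  # deadlock 0: no moves, no termination constants

def seq(u: Beh, v: Beh) -> Beh:
    """Sequential composition (u; v) for finite behaviors."""
    # rule (6.2): each move u -a-> u' yields (u; v) -a-> (u'; v)
    moves = tuple((a, seq(u1, v)) for (a, u1) in u.moves)
    terminal, divergent = False, u.divergent   # ((u + ⊥); v) = (u; v) + ⊥
    if u.terminal:                             # ((u + Δ); v) = (u; v) + v
        moves += v.moves
        terminal |= v.terminal
        divergent |= v.divergent
    return Beh(moves, terminal, divergent)

a_delta = Beh((("a", DELTA),))   # the behavior a.Δ
b_delta = Beh((("b", DELTA),))   # the behavior b.Δ
print(seq(a_delta, b_delta))     # behaves as a.b.Δ
```

With this encoding the derived laws (Δ; v) = v, (⊥; v) = ⊥, and (0; v) = 0 hold by computation rather than by fiat.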
6.2.2.3.1 Parallel Composition of Behaviors

We define an algebraic structure on the set of actions A by introducing the combination a × b of actions a and b. This operation is commutative and associative, with the impossible action ∅ as the zero element (a × ∅ = ∅). As ∅ · u = 0, there are no transitions labeled ∅. The inference rules and equations defining the parallel composition u ‖ v of behaviors u and v are:

u −a→ u′, v −b→ v′, a × b ≠ ∅ ⟹ u ‖ v −(a×b)→ u′ ‖ v′
u −a→ u′ ⟹ u ‖ v −a→ u′ ‖ v
u −a→ u′ ⟹ u ‖ (v + Δ) −a→ u′
v −a→ v′ ⟹ u ‖ v −a→ u ‖ v′
v −a→ v′ ⟹ (u + Δ) ‖ v −a→ v′
(u + Δ) ‖ (v + Δ) = (u + Δ) ‖ (v + Δ) + Δ
(u + ⊥) ‖ v = (u + ⊥) ‖ v + ⊥
u ‖ (v + ⊥) = u ‖ (v + ⊥) + ⊥

The following equations for the termination constants are direct consequences of these definitions:

Δ ‖ ε = ε ‖ Δ = ε
⊥ ‖ ε = ε ‖ ⊥ = ⊥
0 ‖ ε = ε ‖ 0 = 0 if ε ≠ ε + ⊥
0 ‖ ε = ε ‖ 0 = ⊥ if ε = ε + ⊥

Parallel composition is commutative and associative. Parallel composition is the primary means for describing the interaction of agents. The simplest interaction is interleaving, which trivially defines the composition as a × b = ∅ for arbitrary actions. Agents in a parallel composition interact with each other and can synchronize via combined actions. Parallel composition can also be defined explicitly by the following recursive definition:

(u ‖ v) = Σ_{u −a→ u′, v −b→ v′} (a × b) · (u′ ‖ v′) + Σ_{u −a→ u′} a · (u′ ‖ v) + Σ_{v −b→ v′} b · (u ‖ v′) + ε_u ‖ ε_v

where ε_u is the termination constant in the equational representation of the behavior u.
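The same style of encoding used for sequential composition works here. The sketch below is our own illustration (divergence and the rules that drop a terminated partner are omitted for brevity); synchronization is supplied as a combination function, with pure interleaving as the special case a × b = ∅:

```python
from typing import NamedTuple

class Beh(NamedTuple):
    moves: tuple            # tuple of (action, Beh) pairs
    terminal: bool = False  # summand Δ present

DELTA = Beh((), terminal=True)

def par(u: Beh, v: Beh, comb) -> Beh:
    """Parallel composition u ‖ v; comb(a, b) returns None for ∅."""
    moves = []
    # synchronized moves: u -a-> u', v -b-> v', a × b ≠ ∅
    for (a, u1) in u.moves:
        for (b, v1) in v.moves:
            c = comb(a, b)
            if c is not None:
                moves.append((c, par(u1, v1, comb)))
    # interleaved moves of each component alone
    moves += [(a, par(u1, v, comb)) for (a, u1) in u.moves]
    moves += [(b, par(u, v1, comb)) for (b, v1) in v.moves]
    # Δ ‖ Δ = Δ: the composition terminates when both components can
    return Beh(tuple(moves), u.terminal and v.terminal)

interleave = lambda a, b: None  # a × b = ∅ for arbitrary actions
u, v = Beh((("a", DELTA),)), Beh((("b", DELTA),))
uv = par(u, v, interleave)
print(sorted(a for (a, _) in uv.moves))  # first-step actions of a.Δ ‖ b.Δ
```

Under interleaving, a.Δ ‖ b.Δ offers a and b as its first step, and it is terminal only once both components have terminated.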
6.2.3 Environments

An environment E is an agent over an action algebra C with an insertion function. All states of the environment are initial states. The insertion function, denoted by e[u], takes the behavior e of an environment and the behavior u of an agent over an action algebra A (the action algebra of the agents may be a parameter of the environment) and yields a new behavior of the same environment. The insertion function is continuous in both of its arguments. We consider agents up to a weaker equivalence than bisimilarity. Consider the example in Figure 6.2. Clearly, these systems are not bisimilar. However, if a represents the transmission of a message and b represents the reception of that message, the second trace of the left-hand system would not be possible within an environment that supports asynchronous message passing. Consequently, both systems would always behave the same. Insertion equivalence captures this difference: the environment can impose constraints on the inserted agent, such as disallowing the behavior b · a in this example. In such an environment, both behaviors shown in Figure 6.2 are considered equivalent. Insertion equivalence depends on the environment and its insertion function. Two agents u and v are insertion equivalent with respect to an environment E, written u ∼E v, if for all e ∈ E, e[u] = e[v]. Each agent u defines a transformation on the set of environment states; two agents are equivalent with respect to a given environment if they define the same transformation of the environment.
FIGURE 6.2 Two systems which are not bisimilar, but may be insertion equivalent.
FIGURE 6.3 Agents in environment.
After insertion of an agent into an environment, the new environment is ready to accept new agents to be inserted. Since insertion of several agents is a common operation, we shall use the notation e[u1, . . . , un] = e[u1] · · · [un] as a convenient shortcut for the insertion of several agents. In this expression, u1, . . . , un are agents inserted into the environment simultaneously, but the order of insertion may be essential for some environments. If we want an agent v to be inserted after an agent u, we must find some transition e[u] −a→ s and consider the expression s[v]. Some environments can move independently, suspending the actions of an agent inserted into them. In this case, if e[u] −a→ e′[u], then e′[u, v] describes the simultaneous insertion of u and v into the environment in state e′ as well as the insertion of u when the environment is in the state e, followed by the insertion of v. An agent can be inserted into the environment e[u1, u2, . . . , un], or that environment can itself be considered as an agent which can be inserted into a new external environment e′ with a different insertion function. An environment with inserted agents, as a transition system, is considered up to bisimilarity, but after insertion into a higher-level environment it is considered up to insertion equivalence (Figure 6.3). Some example environments arising in real-life situations are:

• A vehicle with sensors is an environment for a computer system.
• A computer system is an environment for programs.
• The operating system is an environment for application programs.
• A program is an environment for data, especially when considering interpreters or higher-order functional programs.
• The web is an environment for applets.

6.2.3.1 Insertion Functions

Each environment is defined by its insertion function. The restriction that the insertion function be continuous is too weak, and in practice more restricted types of insertion functions are considered. The states of environments and agents can be represented in algebraic form as expressions of a behavior algebra. To define an insertion function it is sufficient to define transitions on the set of expressions of the type e[u]. We use rules in the form of rewriting logic to define these transitions. The typical forms of such rules are:

F(x)[G(y)] → d · F′(z)[G′(z)]
F(x)[G(y)] → F′(z)[G′(z)]

where x = (x1 , . . . , xn ), y = (y1 , . . . , yn ), z = (x1 , x2 , . . . , y1 , y2 , . . .); x1 , x2 , . . . , y1 , y2 , . . . are action or behavior variables; and F, G, F′, G′ are expressions in the behavior algebra, that is, expressions built by nondeterministic choice and prefixing. More complex rules allow arbitrary expressions on the right-hand side in the behavior algebra extended by insertion as a two-sorted operation. The first type of rule defines observable transitions

F(x)[G(y)] −d→ F′(z)[G′(z)]

The second type of rule defines unlabeled transitions, which can be used as auxiliary rules. They are not observable outside the environment and can be reduced by the rule

e[u] −d→ e′[u′], e′[u′] −∗→ e″[u″]
─────────────────────────────────
        e[u] −d→ e″[u″]

where −∗→ denotes the transitive closure of unlabeled transitions. Special rules or equations must be added for termination constants. Rewriting rules must be left linear with respect to the behavior variables, that is, none of the behavior variables can occur more than once in the left-hand side. Additional completeness conditions must be present to ensure that all possible states of the environment are covered by the left-hand sides of the rules. Under these conditions, the insertion function will be continuous even if there are infinitely many rules. This is because, to compute the function e[u], one needs to know only some finite approximations of e and u. If e and u are defined by means of a system of fixed-point equations, these approximations can easily be constructed by unfolding these equations sufficiently many times. Insertion functions that are defined by means of rewriting rules can be classified on the basis of the height of the terms F(x) and G(y) in the left-hand side of the rules. The simplest case is when this height is no more than 1, that is, the terms are sums of variables and expressions of the form c · z, where c is an action and z is a variable. Such insertion functions are called one-step insertions; other important classes are head insertion and look-ahead insertion functions. For head insertion, the restriction that the height not exceed 1 refers only to the agent behavior term G(y); the term F(x) can be of arbitrary height. Head insertion can be reduced to one-step insertion by changing the structure of the environment while preserving the insertion equivalence of agents. In head insertion, the interaction between the environment and the agent is similar to the interaction between a server and a client: the server has information only about the next step in the behavior of the client but knows everything about its own behavior.
In a look-ahead insertion environment, the behavior of an agent can be analyzed arbitrarily far (but finitely) into the future. Such an environment can be likened to the interaction between an interpreter and a program.
We now consider one-step insertion, which applies in many practical cases, restricting ourselves to purely additive insertion functions, that is, functions satisfying

(Σi ei)[u] = Σi ei[u],  e[Σi ui] = Σi e[ui]

Given two functions D1 : A × C → 2^C and D2 : C → 2^C, the transition rules for the insertion function are

u −a→ u′, e −c→ e′, d ∈ D1(a, c)
────────────────────────────────
        e[u] −d→ e′[u′]

e −c→ e′, d ∈ D2(c)
───────────────────
  e[u] −d→ e′[u]

We refer to D1 and D2 as residual functions. The first rule (the interaction rule) defines the interaction between the agent and the environment, which consists of choosing a matching pair of actions a ∈ A and c ∈ C. Note that the environment and the agent move independently. If the choice of action is made first by the environment, then the choice of action c by the environment defines a set of actions that the agent may take: a can be chosen only so that D1(a, c) ≠ ∅. The observable action d must be selected from the set D1(a, c). This selection can be restricted by the external environment, if e[u] considered as an agent is inserted into an external environment, or by other agents inserted into the environment e[u] after u. This rule can be combined with rules for unobservable transitions if some action, say τ (as in Milner's CCS), is selected in C to hide the transition. For this case we formulate the interaction rule to account for hidden interactions:

u −a→ u′, e −c→ e′, τ ∈ D1(a, c)
────────────────────────────────
         e[u] → e′[u′]

The second rule (the environment move rule) describes the case when the environment transitions independently of the inserted agent, and the agent waits until the environment allows it to move. Unobservable transitions can also be combined with environment moves. Some equations must be added for the case when e or u is a termination constant. We shall assume that ⊥[u] = ⊥, 0[u] = 0, e[Δ] = e, e[⊥] = ⊥, and e[0] = 0. There are no specific assumptions about Δ[u], but usually neither Δ nor 0 belongs to E. Note that in the case when Δ ∈ E and Δ[u] = u, insertion equivalence coincides with bisimulation. The definition of the insertion function for one-step insertion discussed earlier is complete if we assume that there are no transitions other than those defined by the rules. The definition above can be expressed in the form of rewriting rules as follows:

d ∈ D1(a, c) ⇒ (c · x)[a · y] → d · x[y]
d ∈ D2(c) ⇒ (c · x)[y] → d · x[y]

and in the form of an explicit recursive definition as

e[u] = Σ_{e −c→ e′, u −a→ u′, d ∈ D1(a,c)} d · e′[u′] + Σ_{e −c→ e′, d ∈ D2(c)} d · e′[u] + εe[u]

where εe is the termination constant of e.
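For finite environments and agents, the two transition rules above can be turned directly into executable code. The following Python sketch is a hypothetical encoding (the dictionary representation of transition systems and the toy residual functions are assumptions, not from the text):

```python
# Sketch of one-step insertion for finite agents and environments.
# Transition systems are dicts mapping a state to a set of (action, next_state).

def insertion_transitions(e, u, env_trans, ag_trans, D1, D2):
    """All transitions of the combined state e[u] under the two rules."""
    result = set()
    for (c, e2) in env_trans.get(e, ()):
        # Interaction rule: u -a-> u', e -c-> e', d in D1(a,c)  =>  e[u] -d-> e'[u']
        for (a, u2) in ag_trans.get(u, ()):
            for d in D1(a, c):
                result.add((d, (e2, u2)))
        # Environment move rule: e -c-> e', d in D2(c)  =>  e[u] -d-> e'[u]
        for d in D2(c):
            result.add((d, (e2, u)))
    return result

# Toy instance: the environment offers c, which combines with agent action a
# into the observable residual "d"; the environment never moves alone.
env_trans = {"e0": {("c", "e1")}}
ag_trans = {"u0": {("a", "u1")}}
D1 = lambda a, c: {"d"} if (a, c) == ("a", "c") else set()
D2 = lambda c: set()

assert insertion_transitions("e0", "u0", env_trans, ag_trans, D1, D2) == {("d", ("e1", "u1"))}
```

Iterating this step function from an initial pair (e, u) unfolds the behavior of e[u] as a transition system, which is how the explicit recursive definition above would be computed for finite-state examples.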
To compute transitions for the multiagent environment e[u1 , u2 , . . . , un ] we recursively compute transitions for e[u1 ], then for e[u1 , u2 ] = (e[u1 ])[u2 ], and eventually for e[u1 , u2 , . . . , un ] = (e[u1 , u2 , . . . , un−1 ])[un ]. Important special cases of one-step insertion functions are parallel and sequential insertion. An insertion function is called a parallel insertion if

e[u, v] = e[u ∥ v]

This means that the subsequent insertion of two agents can be replaced by the insertion of their parallel composition. The simplest example of a parallel insertion is defined as e[u] = e ∥ u. This special case holds when the sets of actions of the environment and the agents are the same (A = C), D1(a, a × b) = {b}, and D2(a) = A. In the case when Δ ∈ E, this environment is the set of all other agents interacting with a given agent in parallel, and insertion equivalence coincides with bisimilarity. Sequential insertion is introduced in a similar way:

e[u, v] = e[u; v]

This situation holds, for example, when D1(a, c) = ∅, D2(c) = C, and Δ[u] = u.

6.2.3.2 Example: Agents over a Shared and Distributed Store

As an example, consider a store, which generalizes the notions of memory, databases, and other information environments used by programs and agents to hold data. An abstract store environment E is an environment over an action algebra C, which contains a set of actions A used by agents inserted into this environment. We shall distinguish between local and shared store environments. The former can interact with an agent inserted into it while this agent is not in a final state; if another agent is inserted into this environment, the activity of the latter is suspended until the former completes its work. The shared store admits interleaving of the activity of the agents inserted into it, and they can interact concurrently through this shared store.
6.2.3.2.1 Local and Shared Store

The residual functions for a local store are defined as

D1(a, c) = {d | c = a × d, d ∈ C\A or d = δ},  D2(c) = C

and for a shared store as

D1(a, c) = {d | c = a × d, d ∈ C},  D2(c) = C
It can be proved that the one-step insertion function for a local store is a sequential insertion and that the one-step insertion for a shared store is a parallel insertion. In other words, e[u1 , u2 , . . .] = e[u1 ; u2 ; . . .] for a local store, and e[u1 , u2 , . . .] = e[u1 ∥ u2 ∥ . . .] for a shared store. The interaction move for the local store is defined as

u −a→ u′, e −a×d→ e′
────────────────────
  e[u] −d→ e′[u′]
When the store moves according to this rule, an agent inserted into it plays the role of control for this store. A store in a state e[u] can only perform actions which are allowed by the agent u. An action of the agent can be combined only with an action d which is not from the action set A and hence cannot be used by another agent in a transition. The actions returned by the residual function are external actions and can be observed and used only from outside the store environment. Unlike a local store, in a shared store environment several agents can perform their actions in parallel according to the rule

u1 −a1→ u1′, . . . , un −an→ un′, e −a1×···×an×d→ e′
──────────────────────────────────────────────────
e[u1 ∥ · · · ∥ un ∥ v] −d→ e′[u1′ ∥ · · · ∥ un′ ∥ v]

An important special case of the store environment E is a memory over a set of names R and a data domain D. The memory can be represented by an attributed transition system with attributes R and states e : R → D. Agent actions are assignments and conditions, and their combinations are possible if they can be performed simultaneously. If a is a set of assignments, then in a transition e −a→ e′ the state e′ results from applying a to e. A conjunction of conditions c enables a transition e −c×a→ e′ if c is valid on e and e −a→ e′.

6.2.3.2.2 Multilevel Store

For a shared memory store, the residual action d in the transition

e[u1 ∥ · · · ∥ un ∥ v] −d→ e′[u1′ ∥ · · · ∥ un′ ∥ v]

is intended to be used by external agents inserted later, but in a multilevel store it is convenient to restrict the interaction with the environment to a given set of agents which have already been inserted. For this purpose, a shared memory can be inserted into a higher-level closure environment with an insertion function defined by the equation

g[e[u]][v] = g[e[u ∥ v]]

where g is a state of this environment and e is a shared memory environment, and only the following two rules are used for transitions in the closure environment:

e[u] −c→ e′[u′], c ∈ Cext ∧ c ≠ δ
─────────────────────────────────
  g[e[u]] −ϕext(c,e)→ g[e′[u′]]

 e[u] −δ→ e′[u′]
─────────────────
g[e[u]] → g[e′[u′]]

Here Cext is a distinguished set of external actions. Some external actions can contain occurrences of names from e. The function ϕext substitutes the values of these names into c and performs other transformations to make an action observable to the external environment. Two-level insertion can be described in the following way. Let R = R1 ∪ R2 be divided into two nonintersecting parts: the external and the internal memory. Let A1 be the set of actions which change only the values of R1 but can use the values of R2 (external output actions), let A2 be the set of actions which change only the values of R2 but can use the values of R1 (external input actions), and let A3 be the set of actions which change and use only the values of R2 (internal actions). These sets are assumed to be defined on the syntactical level. Redefine the residual function D1 and the transitions of E: let a ∈ A and split a into a combination of actions ϕ1(a) × ϕ2(a) × ϕ3(a) so that ϕ1(a) ∈ A1, ϕ2(a) ∈ A2, and ϕ3(a) ∈ A3
(some of these actions may be equal to δ). Define the interaction rule in the following way:

u −a→ u′, e −(ϕ2(a)σ)×ϕ3(a)→ e′
───────────────────────────────
   e[u] −cσ×ϕ1(a)→ e′[u′]

where σ is an arbitrary substitution of the names used in the conditions and in the right-hand sides of the assignments of ϕ2(a) by elements of the set of their values, bσ is the application of the substitution σ to b, and cσ is the substitution σ written in the form of the condition r1 = σ(r1) ∧ r2 = σ(r2) ∧ · · · . Define ϕext(b, e) = be, that is, the substitution of the values of R2 into b. Consider a two-level structure of a store state

t[g[e1[u1]] ∥ g[e2[u2]] ∥ · · · ]

where t ∈ D^R1 is a shared store and e1 , e2 , . . . ∈ D^R2 represent the distributed store (memory). When a component g[ei[ui]] performs internal actions, these are hidden and do not affect the shared memory. Performing external output actions changes the values of names in the shared memory, and external input actions receive values from the shared memory to change components of the distributed memory. This construction is easily iterated, as the components of a distributed memory can themselves have a multilevel structure.

6.2.3.2.3 Message Passing

Distributed components can interact via shared memory. We now introduce direct interaction via message passing. Synchronous communication can be organized by extending the set of actions with a combination of actions in parallel composition, independently of the insertion function. To describe synchronous data exchange in the most general abstract schema, let

u = Σ_{d∈D} a(d) · F(d),  u′ = Σ_{d′∈D} a′(d′) · F′(d′)

be two agents which use a data domain D for the exchange of information. The functions a and a′ map the data domain onto the action algebra; the functions F and F′ map elements of the data domain onto behaviors. The parallel composition of u and u′ is

u ∥ u′ = Σ_{a(d)×a′(d′)≠∅} (a(d) × a′(d′)) · (F(d) ∥ F′(d′)) + Σ_{d∈D} a(d) · (F(d) ∥ u′) + Σ_{d′∈D} a′(d′) · (F′(d′) ∥ u)
(note that εu = εu′ = 0, i.e., this is a special case of parallel composition where there are no termination constants). The first summand corresponds to the interaction of the two agents. The other two summands reflect the possibility of interleaving. The interaction can be deterministic, even if u and u′ are nondeterministic, if a(d) × a′(d′) ≠ ∅ has only one solution. Interleaving makes it possible to select other actions if u ∥ u′ is embedded into another parallel composition. Interleaved actions can also be hidden by a closure environment (similar to restriction in the Calculus of Communicating Systems, CCS). The exchange of information through combination is bidirectional. An important special case of information exchange is the use of send/receive pairs. For example, consider the following combination rule:

send(addr, d) × receive(addr′, d′) = exch(addr), if addr = addr′ and d = d′; ∅ otherwise
In the latter case, if

u = send(addr, d) · v

and

u′ = Σ_{d′∈D} receive(addr, d′) · F(d′)

then the interaction summand of the parallel composition will be exch(addr) · (v ∥ F(d)). Asynchronous message passing via channels can be described by introducing a special communication environment. The attributes of this environment are channels, and their values are sequences (queues) of stored messages. It is organized similarly to the memory environment, but queue operations are used instead of storing. In addition, send and receive actions are separated in time. This environment is a special case of a store environment and can be combined with a store environment, keeping the different types of attributes and actions separate.
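The send/receive combination rule can be sketched in a few lines of hypothetical Python (the triple representation of actions is an assumption, not the book's notation):

```python
# Sketch of the send/receive combination rule (illustrative encoding).
# An action is a tuple (kind, address, datum); None plays the role of the
# empty combination result.

def combine(act1, act2):
    """send(addr, d) x receive(addr', d') = exch(addr) iff addr = addr', d = d'."""
    k1, addr1, d1 = act1
    k2, addr2, d2 = act2
    if {k1, k2} == {"send", "receive"} and addr1 == addr2 and d1 == d2:
        return ("exch", addr1)
    return None  # the actions do not combine

assert combine(("send", "ch", 42), ("receive", "ch", 42)) == ("exch", "ch")
assert combine(("send", "ch", 42), ("receive", "ch", 7)) is None
```

Since the receiving agent is a sum over all d′ ∈ D, the rule selects the unique matching summand; this is why the rendezvous can be deterministic even though the receiver is nondeterministic.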
6.2.4 Classical Theories of Concurrency The theory of interaction of agents and environments [9–11] focuses on the description of multi-agent systems comprised of agents cooperatively working within a distributed information environment. Other mathematical models for specifications of dynamic and real time systems interacting with environments have been developed based on process algebras (CSP, CCS, ACP, etc.), automata models (timed Büchi and Muller automata, abstract state machines [ASM]), and temporal logic (LPTL, LTL, CTL, CTL∗ ). New models are being developed to support different peculiarities of application areas, such as Milner’s π-calculus [12] for mobility and its recent extension to object-oriented descriptions. The environment may change the predefined behavior of an agent. For example, it may contain some other agents designed independently and intended to interact and communicate with the agent during its execution. The classical theories of communication consider this interaction as part of the parallel composition of agents. The influence of the environment can be expressed as an explicit language operation such as restriction (CCS) or hiding (CSP). In contrast to the classical theories of interaction which are based on an implicit and hence not formalized notion of an environment, the theory of interaction of agents and environments studies them as objects of different types. In our approach the environment is considered as a semantic notion and is not explicitly included in the agent. Instead, the meaning of an agent is defined as a transformation of an environment which corresponds to inserting the agent into its environment. When the agent is inserted into the environment, the environment changes and this change is considered to be a property of the agent described. 
6.2.4.1 Process Algebras An algebraic theory of concurrency and communication that deals with the occurrence of events rather than with updates of stored values is called a process algebra. The main variants of process algebra are generally known by their acronyms: CCS [8] — Calculus of Concurrent Systems developed by Milner, CSP [13] — Hoare’s Communicating Sequential Processes, and ACP — Algebra of Communicating Processes of Bergstra and Klop [14]. These theories are based on transition systems and bisimulation, and consider interaction of composed agents. They employ nondeterministic choice as well as parallel and sequential compositions as primitive constructs. The influence of the environment on the system may be expressed as an explicit language operation, such as restriction in CCS or hiding in CSP. These theories consider communicating agents as objects of the same type (this type may be parameterized by the alphabets for events or actions) and define operations on these types. The CCS model specifies sets of states of systems (processes) and transitions between these states. The states of a process are terms and the transitions are defined by the operational semantics of the computation, which indicates how and under which conditions a term transforms itself into another term. Processes are represented by the synchronization tree (or process graph). Two processes are identified through bisimulation.
CCS introduces a special action τ, called the silent action, which represents an internal and invisible transition within a process. Other actions are split into two classes: output actions, which are indicated by an overbar, and input actions, which are not decorated. Synchronization takes place only between a single input and a single output, and the result is always the silent action τ. Thus, a × ā = τ for all actions a. Consequently, communication serves only as synchronization; its result is not visible. The π-calculus [12] is an enhancement of CCS that models concurrent computation by processes exchanging messages over named channels. A distributed interpretation of the π-calculus provides for synchronous message passing and nondeterministic choice. The π-calculus focuses on the specification of the behavior of mobile concurrent processes, where "mobility" means that channel names themselves can be communicated; named channels are the main entities of the π-calculus. Synchronization takes place between two channel agents only when they are available for interchange (a named output channel is indicated by an overbar, while an input channel with the same name is not decorated). The influence of the environment in the π-calculus is expressed as an explicit operation of the language (hiding). As a result of this operation, a channel is declared inaccessible to the environment. CSP explicitly differentiates the sets of atomic actions that are allowed in each of the parallel processes. The parallel combinator is indexed by these sets: in (P{A} ∥ Q{B}), P engages only in events from the set A, and Q only in events from the set B. Each event in the intersection of A and B requires the synchronous participation of both processes, whereas other events require the participation of only the relevant single process. As a result, a × a = a for all actions a.
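The two synchronization disciplines can be contrasted in a small sketch (hypothetical Python encoding; representing a CCS co-action ā as a tagged tuple is an assumption):

```python
# Toy contrast of the synchronization disciplines described above.
# CCS: an action synchronizes with its co-action, yielding the silent action tau.
# CSP: both processes must engage in the same shared event; the result is visible.

def ccs_sync(a, b):
    """CCS handshake: a and ("bar", a) combine into the invisible tau."""
    if a == ("bar", b) or b == ("bar", a):
        return "tau"
    return None

def csp_sync(a, b, shared):
    """CSP handshake: identical events in the shared alphabet synchronize."""
    if a == b and a in shared:
        return a  # joint event remains visible: a x a = a
    return None

assert ccs_sync("a", ("bar", "a")) == "tau"   # a x a-bar = tau
assert ccs_sync("a", "a") is None             # no co-action, no CCS handshake
assert csp_sync("a", "a", {"a"}) == "a"       # a x a = a
assert csp_sync("a", "b", {"a", "b"}) is None
```

The design difference the sketch exposes is exactly the one discussed in the text: CCS communication hides its result, while CSP synchronization keeps the joint event observable so that further processes may also participate in it.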
The associative and commutative binary operator × describes how the output data supplied by two processes are combined before transmission to their common environment. In CSP, a process is considered to run in an environment which can veto the performance of certain atomic actions. If, at some moment during the execution, no action in which the process is prepared to engage is allowed by the environment, then a deadlock occurs, which is considered observable. Since in CSP a process is fully determined by the observations obtainable from all possible finite interactions, a process is represented by its failure set. To define the meaning of a CSP program, we determine the set of states corresponding to normal termination of the program and the set of states corresponding to its failures. Thus, the CSP semantics is presented in model-theoretic terms: two CSP processes are identified if they have the same failure set (failure equivalence). The main operations of ACP are prefixing and nondeterministic choice. This algebra allows an event to occur with the participation of only a subset of the concurrently active processes, perhaps omitting any that are not ready. As a result, the parallel composition of processes is a mixture of synchronization and interleaving, where each of the processes either acts independently or is combined by × with a corresponding event of another process. The merge operator is defined as Merge(a, b) = (a × b) + (a; b) + (b; a). ACP defines its semantics algebraically; processes are identified through bisimulation. Most differences between CCS, ACP, and CSP can be attributed to differences in the chosen style of presentation of the semantics: the CSP theory provides a model, illustrated with algebraic laws. CCS is a calculus, but the rules and axioms in this calculus are presented as laws valid in a given model. ACP is a calculus that forms the core of a family of axiomatic systems, each describing some features of concurrency.
6.2.4.2 Temporal Logic

Temporal logic is a formal specification language for describing various properties of systems. A temporal logic is a logic augmented with temporal modalities that allow one to specify the order of events in time without introducing time explicitly as a concept. Whereas traditional logics can specify properties relating to the initial and final states of terminating systems, a temporal logic is better suited to describing the ongoing behavior of nonterminating and interacting (reactive) systems. As an example, Lamport's TLA (Temporal Logic of Actions) [5,15] is based on Pnueli's temporal logic [16] with assignment and an enriched signature. It supports syntactic elements taken from programming
languages to ease the maintenance of large specifications. TLA uses formulae on behaviors, where a behavior is a sequence of states. States in TLA are assignments of values to variables. A system satisfies a formula iff that formula is true of all behaviors of the system. Formulae whose arguments are only the old and the new states are called actions. Here, we distinguish between linear and branching temporal logics. In a linear temporal logic, each moment of time has a unique possible future, while in a branching temporal logic, each moment of time may have several possible futures. Linear temporal logic formulae are interpreted over linear sequences of points in time and specify the behavior of a single computation of a system. Formulae of a branching temporal logic, on the other hand, are interpreted over tree-like structures, each describing the behavior of the possible computations of a nondeterministic system. Many temporal logics are decidable, and corresponding decision procedures exist for linear and branching time logics [17], propositional modal logic [18], and some variants of CTL∗ [19]. These decision procedures proceed by building a canonical model for a set of temporal formulae representing properties of the system to be verified, using techniques from automata theory, semantic tableaux, or binary decision diagrams [20]. Determining whether such properties hold for a system amounts to establishing that the corresponding formulae are true in a model of the system. Model checking based on these decision procedures has been successfully applied to find subtle errors in industrial-size specifications of sequential circuits, communication protocols, and digital controllers [21]. Typically, a system to be verified is modeled as a (finite) state transition graph, and its properties are formulated in an appropriate propositional temporal logic.
An efficient search procedure is then used to determine whether the state transition graph satisfies the temporal formulae or not. This technique was first developed in the 1980s by Clarke and Emerson [22] and by Queille and Sifakis [23] and was extended later by Burch et al. [21]. Examples of temporal properties (properties of the interaction between processes in a reactive system) are as diverse as their applications (this classification was introduced in References 2 and 24):

• Safety properties state that "something bad never happens" (a program never enters an unacceptable state).
• Liveness properties state that "something good will eventually happen" (a program eventually enters a desirable state).
• Guarantee properties specify that an event will eventually happen but do not promise repetition.
• Obligation properties are disjunctions of safety and guarantee formulae.
• Response properties specify that an event will happen infinitely many times.
• Persistence properties specify the eventual stabilization of a system condition after an arbitrary delay.
• Reactivity is the maximal class, formed from disjunctions of response and persistence properties.
• Unconditional fairness states that a property p holds infinitely often.
• Weak fairness states that if a property p is continuously true, then a property q must be true infinitely often.
• Strong fairness states that if a property p is true infinitely often, then a property q must be true infinitely often.

6.2.4.3 Timed Automata

Timed automata accept timed words — infinite sequences in which a real-valued time of occurrence is associated with each symbol. A timed automaton is a finite automaton with a finite set of real-valued clocks. The clocks can be reset to 0 (independently of each other) by the transitions of the automaton and keep track of the time elapsed since the last reset.
Transitions of the automaton put constraints on the clock values: a transition may be taken only if the current values of the clocks satisfy the associated constraints. Timed automata can capture qualitative features of real-time systems such as liveness, fairness, and nondeterminism, as well as quantitative features such as periodicity, bounded response, and timing delays.
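A clock valuation, a time-elapse step, and a guarded transition with resets can be sketched as follows (hypothetical Python encoding; representing guards as closed intervals per clock is an assumption, since in general timed automata allow any conjunction of comparisons with constants):

```python
# Sketch of the two kinds of moves of a timed automaton (illustrative encoding).

def elapse(clocks, delta):
    """Let time pass: every clock advances by the same amount."""
    return {c: v + delta for c, v in clocks.items()}

def take(clocks, guard, resets):
    """Take a transition if the guard holds; reset the listed clocks to 0."""
    if not all(lo <= clocks[c] <= hi for c, (lo, hi) in guard.items()):
        return None  # guard violated: the transition is disabled
    return {c: (0.0 if c in resets else v) for c, v in clocks.items()}

clocks = elapse({"x": 0.0}, 1.5)
after = take(clocks, guard={"x": (1.0, 2.0)}, resets={"x"})  # enabled: 1.0 <= 1.5 <= 2.0
assert after == {"x": 0.0}
assert take(elapse(after, 0.5), guard={"x": (1.0, 2.0)}, resets=set()) is None
```

The alternation of elapse and take steps is what produces a timed word: each accepted symbol carries the real-valued time at which its transition was taken.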
Timed automata are a generalization of finite ω-automata (Büchi automata and Muller automata [25,26]). When Büchi automata are used for modeling finite-state concurrent processes, the verification problem reduces to one of language inclusion [27,28]. While the inclusion problem for ω-regular languages is decidable [26], for timed automata the inclusion problem is undecidable, which constitutes a serious obstacle to using timed automata as a specification language for the validation of finite-state real-time systems [29].

6.2.4.4 Abstract State Machines

Gurevich's ASM project [30,31] attempts to apply formal models of computation to practical specification methods. ASM assumes the "Implicit Turing Thesis," according to which every algorithm can be modeled at its appropriate abstraction level by a corresponding ASM. ASM descriptions are based on the concept of evolving algebras, which are transition systems on static algebras. Each static algebra represents a state of the modeled system, and transition rules are transitions of the modeled system. To simplify the semantics and ease proofs, transitions are limited: they can change only functions, not sorts, and cannot directly change the universe. A single-agent ASM is defined over a vocabulary (a set of functions, predicates, and domains). Its states are defined by assigning an interpretation to the elements of the vocabulary. An ASM program describes the rules for transitioning between states. An ASM program is defined by basic transition rules, such as updates (changes to the interpretation of the vocabulary), conditions (apply a rule only if some specific condition holds), or choice (extract from a state elements with given properties), or by combinations of transition rules into complex rules. Multiagent ASMs consist of a number of agents that execute their ASM programs concurrently and interact through globally shared locations of a state.
Concurrency between agents is modeled by partially ordered runs. The program steps executed by each agent are linearly ordered; in addition, program steps of different agents are ordered if they stand in a causality relation. Multiagent ASMs rely on a continuous global system time to model time-related aspects.

6.2.4.5 Rewriting Logic

Rewriting logic [32] allows one to prove assertions about concurrent systems whose states change under transitions. Rewriting logic extends equational logic and constitutes a logical framework in which many logics and semantic formalisms can be represented naturally (i.e., without distorting encodings). Just as algebras provide semantic interpretations of equational logic, the models of rewriting logic are concurrent systems. Moreover, models of concurrent computation, object-oriented design languages, architectural description languages, and languages for distributed components also have natural semantics in rewriting logic [33]. In rewriting logic, system states are in bijective correspondence with formulae (modulo whatever structural axioms such formulae satisfy, e.g., modulo associativity or commutativity of connectives), and concurrent computations in a system are in bijective correspondence with proofs (modulo appropriate notions of equivalence). Given this equivalence between computation and logic, a rewriting logic axiom of the form t → t′ has two readings. Computationally, it means that a fragment of a system state that is an instance of the pattern t can change to the corresponding instance of t′, concurrently with any other state changes; the computational meaning is that of a local concurrent transition. Logically, it means that we can derive the formula t′ from the formula t; that is, the logical reading is that of an inference rule.
Computation consists of rewriting to a normal form, that is, to an expression that can be rewritten no further; when the normal form is unique, it is taken as the value of the initial expression. When rewriting equal terms always leads to the same normal form, the set of rules is said to be confluent, and rewriting can be used to check for equality.
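Rewriting to a normal form under a confluent rule set can be illustrated with Peano addition (a standard textbook example, encoded here in hypothetical Python rather than in a rewriting-logic language such as Maude):

```python
# Sketch: rewriting to normal form with a confluent, terminating rule set.
# Terms are nested tuples; rules encode Peano addition:
#   ("add", "z", y)        -> y
#   ("add", ("s", x), y)   -> ("s", ("add", x, y))

def rewrite(t):
    """Apply one rewrite step anywhere in the term, or return None if none applies."""
    if isinstance(t, tuple):
        if t[0] == "add" and t[1] == "z":
            return t[2]
        if t[0] == "add" and isinstance(t[1], tuple) and t[1][0] == "s":
            return ("s", ("add", t[1][1], t[2]))
        for i, sub in enumerate(t):  # otherwise, try to rewrite a subterm
            r = rewrite(sub)
            if r is not None:
                return t[:i] + (r,) + t[i + 1:]
    return None

def normal_form(t):
    """Rewrite until no rule applies; unique here because the rules are confluent."""
    while True:
        r = rewrite(t)
        if r is None:
            return t
        t = r

two, one = ("s", ("s", "z")), ("s", "z")
assert normal_form(("add", two, one)) == ("s", ("s", ("s", "z")))  # 2 + 1 = 3
```

Because this rule set is confluent and terminating, two terms denote equal numbers exactly when their normal forms coincide, which is the equality check described above.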
Rewriting logic is reflective [34,35], and thus important aspects of its metatheory can be represented at the object level in a consistent way. The language Maude [36,37] has been developed at SRI to implement a framework for rewriting logic. The language design and implementation of Maude systematically exploit the reflective nature of rewriting logic and make its metatheory accessible to the user, allowing the user to create within the logic a formal environment for the logic, with tools for formal analysis, transformation, and theorem proving.
6.3 Requirements Capture and Validation

In Reference 38, requirements capture is defined as an engineering process for determining what artifacts are to be produced as the result of the development effort. The process involves the following steps:

• Requirements identification
• Requirements analysis
• Requirements representation
• Requirements communication
• Development of acceptance criteria and procedures
Requirements can be considered an agreement between the customer and the developer. As agreements, they must be understandable to the customer as well as to the developer, and the level of formalization depends on the common understanding and the previous experience of those involved in the process of requirements identification. The main properties of the system to be developed, including its purpose, functionality, conditions of use, and efficiency, safety, liveness, or fairness properties, are specified in the requirements along with the main goals of the development project. The requirements also include an explanation of terms referenced, information about already developed systems that should be reused in the development process, and possible decisions on the implementation and structuring of the system. It is well recognized that identifying and correcting problems in the requirements and early design phases avoids far more expensive repairs later. Boehm quotes late life-cycle fixes as a hundred times more expensive than corrections made during the early phases of system development [39]. In Reference 40, Boehm documents that the relative cost of repairing errors increases exponentially with the life-cycle phase at which the error is detected. Kelly et al. [41] document a significantly higher density of defects found during the requirements phase as compared with later life-cycle phases. Early life-cycle defects are also very prevalent: in Reference 42, it was shown that of 197 critical faults found during integration testing of spacecraft software, only 3 were programming mistakes; the other 194 were introduced at earlier stages. Fifty percent of these faults were owing to flawed requirements (mainly omissions) for individual components, 25% to flawed designs for these components, and the remaining 25% to flawed interfaces between components and incorrect interactions between them.
A number of other studies [43,44] reveal that most errors in software systems originate in the requirements and functional specifications. The earlier such errors are detected, the less expensive their repair. A requirements specification must describe the external behavior of a system in terms of observable events and actions. The latter describe the interaction of system components with their environments, including other components and those parts of the external physical world that can influence the behavior of the system. We consider requirements to be correct if they are consistent and complete. Checking the consistency and completeness of requirements is the final task of requirements representation. The informal understanding of the consistency of requirements is the existence of an implementation that satisfies the requirements. In other words, if requirements are inconsistent, then an implementation free of errors (bugs) that satisfies all the requirements cannot be created. Unfortunately, most means of describing requirements and preliminary designs (natural language, diagrams, pseudocode) do not offer mechanized means of establishing correctness, so the primary means
to deduce their properties and consequences is through inspections and reviews. To be amenable to analysis, the requirements must first be formalized, that is, rewritten in some formal language. We call this formalized description a requirements specification. It should be free of implementation details and can be used for the development of an executable model of a system, which is called an executable specification. Note that any description that is formal enough to be amenable to operational interpretation will also provide some method of implementation, albeit usually a rather inefficient one. The existence of an executable specification that satisfies the formalized requirements is a sufficient condition of consistency. It is sufficient because a final implementation that is free of errors can be extracted from executable specifications using formal methods such as stepwise refinement. Completeness of requirements is understood as the uniqueness of the executable specification, considered up to some equivalence. The intuitive understanding of equivalence is as follows: two executable specifications are equivalent if they demonstrate the same behaviors in the same environments. Completeness can also be expressed in terms of determinism of the executable specification (the requirements are sufficient for constructing a unique deterministic model). In some cases, incompleteness of requirements is not harmful, because it can be motivated by the necessity of postponing implementation decisions until later stages of development. The correspondence between the original requirements and the requirements specification is not formal, and experience has shown that special skills are required to check the correspondence between informal and formal requirements. Incompleteness and inconsistencies discovered at this stage are used for improving and correcting the requirements, which in turn are used to correct the requirements specification.
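The role of an executable specification as a witness of consistency can be made concrete. The following Python sketch (all names are hypothetical illustrations) formalizes two requirements for a bounded stack as predicates over observable state and checks them against an executable specification by running a scenario, a sequence of observable events; a specification that satisfies every requirement on every scenario witnesses that the requirements are consistent.

```python
class StackSpec:
    """Executable specification of a bounded stack, free of implementation
    detail: behavior is defined only through observable operations."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []

    def push(self, x):
        if len(self.items) < self.capacity:   # precondition: not full
            self.items.append(x)

    def pop(self):
        if self.items:                        # precondition: not empty
            return self.items.pop()

# Formalized requirements as predicates on the observable state.
requirements = [
    ("size never exceeds capacity", lambda s: len(s.items) <= s.capacity),
    ("size is never negative",      lambda s: len(s.items) >= 0),
]

def check(spec, scenario):
    """Run a scenario of observable events, checking every requirement
    after each step; raises AssertionError on any violation."""
    for event, args in scenario:
        getattr(spec, event)(*args)
        for name, holds in requirements:
            assert holds(spec), f"requirement violated: {name}"

spec = StackSpec(capacity=2)
check(spec, [("push", (1,)), ("push", (2,)), ("push", (3,)), ("pop", ())])
print("all requirements hold on this scenario")
```

Scenario-based checking of this kind only samples behaviors; establishing the requirements over all behaviors is precisely the task that model checking and theorem proving, discussed below, mechanize.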
6.3.1 Approaches to Requirements Validation

Standard approaches to requirements analysis and validation typically involve manual processes such as “walk-throughs” or Fagan-style inspections [45,46]. The term walk-through refers to a range of activities that can vary from cursory peer reviews to formal inspections, although walk-throughs usually do not involve the replicable processes and methodical data collection that characterize Fagan-style inspections. Fagan’s highly structured inspection process was originally developed for hardware logic, later applied to software design and code, and eventually extended to all life-cycle phases, including requirements development and high-level design [45]. A Fagan inspection involves a review team with the following roles: a Moderator, an Author, a Reader, and a Tester. The Reader presents the design or code to the other team members, systematically walking through every piece of logic and every branch at least once. The Author represents the viewpoint of the designer or coder, and the test perspective is represented by the Tester. The Moderator is trained to facilitate intensive but constructive discussion. When the functionality of the system is well understood, the focus shifts to a search for faults, possibly using a checklist of likely errors to guide the process. The inspection process includes highly structured rework. One of the main advantages of Fagan-style inspections over other conventional forms of verification and validation is that inspections can be applied early in the life cycle. Thus, potential anomalies can be detected before they become entrenched in low-level design and implementation.
Rushby [47] gives an overview of techniques for mechanizing formal methods: decision procedures for specialized, but ubiquitous, theories such as arithmetic, equality, and the propositional calculus are helpful in discovering false theorems (especially if they can be extended to provide counterexamples) as well as in proving true ones, and their automation dramatically improves the efficiency of proof. Rewriting is essential to the efficient mechanization of formal methods. Unrestricted rewriting provides a decision procedure for theories axiomatized by terminating and confluent sets of rewrite rules, but few such theories arise in practice. Integration of various techniques increases the efficiency of these methods. For example, theorem proving attempts to show that a formula follows from given premises, while model checking attempts to show that a given system description is a model for the formula. An advantage of model checking is that, for certain finite-state systems and temporal logic formulae, it can be automated and is more efficient than
theorem proving. The benefits of an integrated system (providing both theorem proving and model checking) are that model checking can be used to discharge some cases in a larger proof and theorem proving can be used to justify the reduction to a finite state that is required for automated model checking. Integration of these techniques can provide a further benefit: before undertaking a potentially difficult and costly proof, we may be able to use model checking to examine special cases. Any errors that can be discovered and eliminated in this way will save time and effort during theorem proving. Model checking determines whether a given formula stating a property of the specification is satisfied in a Kripke model (i.e., in the specification represented as a Kripke model). In the worst case, these algorithms must traverse the whole model, that is, visit all states of the corresponding transition system. Consequently, model checking applies mainly to finite-state systems, even in the presence of sophisticated means of representing the set of states, such as binary decision diagram (BDD) methods [20], while theorems about infinite-state systems can be proved only by deductive methods.
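The state-traversal view of model checking can be sketched in a few lines. The following Python fragment is a toy illustration, not a real model checker: it exhaustively explores the reachable states of a finite transition system by breadth-first search and checks an invariant (a safety property), returning a counterexample trace if the invariant fails and None if it holds in every reachable state.

```python
from collections import deque

def model_check_invariant(initial, successors, invariant):
    """Exhaustive BFS over a finite-state transition system.
    Returns None if the invariant holds in every reachable state,
    otherwise a counterexample trace from the initial state."""
    visited = {initial}
    queue = deque([(initial, [initial])])
    while queue:
        state, trace = queue.popleft()
        if not invariant(state):
            return trace              # counterexample found
        for nxt in successors(state):
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, trace + [nxt]))
    return None                       # invariant holds everywhere

# Toy system: two processes sharing one lock.
# State = (pc0, pc1, lock_holder); 'n' = noncritical, 'c' = critical.
def successors(state):
    pc0, pc1, lock = state
    result = []
    if pc0 == 'n' and lock is None:
        result.append(('c', pc1, 0))      # process 0 enters
    if pc0 == 'c':
        result.append(('n', pc1, None))   # process 0 leaves
    if pc1 == 'n' and lock is None:
        result.append((pc0, 'c', 1))      # process 1 enters
    if pc1 == 'c':
        result.append((pc0, 'n', None))   # process 1 leaves
    return result

mutual_exclusion = lambda s: not (s[0] == 'c' and s[1] == 'c')
print(model_check_invariant(('n', 'n', None), successors, mutual_exclusion))  # None
```

The `visited` set is the explicit-state analogue of the symbolic state representations mentioned above; BDD-based checkers replace it with a symbolic encoding so that far larger state spaces become tractable.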
6.3.2 Tools for Requirements Validation

Formal methods may be classified according to their primary purpose as descriptive or analytic. Descriptive methods focus largely on specifications as a medium for review and discussion, whereas analytic methods focus on the utility of a specification as a mathematical model for analyzing and predicting the behavior of systems. Not surprisingly, the different emphasis is reflected in the type of formal language favored by each. Descriptive formal methods emphasize the expressive power of the underlying language and provide a rich type system, often leveraging the notations of conventional mathematics or set theory. These choices in language elements do not readily support automation; instead, descriptive methods typically offer attractive user interfaces and little in the way of deductive machinery. These methods assume that the specification process itself serves as verification, as expressing the requirements in mathematical form exposes inconsistencies that are typically overlooked in natural language descriptions. Examples of primarily descriptive formal methods are VDM [48], Z [49], B [50], and LOTOS [51]. Analytic formal methods place the emphasis on mechanization and favor specification languages that are less expressive but capable of supporting efficient automated deduction. These methods vary in the degree of automation provided by the theorem prover or, conversely, in the amount of user interaction in the proof process. They range from automatic theorem proving without user interaction to proof checking without automatic proof steps. The former typically have restricted specification languages and powerful provers that can be difficult to control and offer little feedback on failed proofs, but perform impressively in the hands of experts, for example, Nqthm [52].
Proof checkers generally offer more expressive languages but require significant manual input for theorem proving, for example, higher-order logic (HOL) [53]. Many tools fall somewhere in between, depending on language characteristics and proof methodology, for example, Eves [54,55] or PVS [56]. The goal of mechanized analysis may be either to prove the equivalence between different representations of the initial requirements or to establish properties that are considered critical for correct system behavior (safety, liveness, etc.). Tools for analytic formal methods fall into two main categories: state exploration tools (model checkers) and deductive verifiers (automated theorem provers). Model checking [57] is an approach for formally verifying finite-state systems. Formalized requirements are expressed as temporal logic formulae, and efficient symbolic algorithms are used to process a model of the system and check whether the specification holds in that model. Widely known tools are VeriSoft [58], SMV [59], and SPIN [60]. Some deductive verifiers support inference in first-order logic (Larch [61]); others, such as PVS [62], are based on higher-order languages integrated with supporting tools and interactive theorem provers. Today, descriptive methods are often augmented by facilities for mechanized analysis. Also, automated theorem proving and model checking approaches may be integrated in a single environment; for example, a BDD-based model checker can be used as a decision procedure in the PVS [63] theorem prover. In addition to the prover or proof checker, a key feature of analytic tools is the type checker, which checks
specifications for semantic consistency, possibly adding semantic information to the internal representation built by the parser. If the type system of the specification language is not decidable, theorem proving may be required to establish the type consistency of a specification. An overview of verification tools and their underlying methods can be found in Reference 64. The most severe limitation on the deployment of formal methods is the need for mathematically sophisticated users, for example, with respect to the logical notation these methods use; this is often an insurmountable obstacle to the adoption of these methods in engineering disciplines. General-purpose decision procedures, as implemented in most of these methods, do not scale to the size of industrial projects. Another obstacle to the application of deductive tools like PVS is the need to develop a mathematical theory formalized at a very detailed level to implement even very simple predicates. Recently, formal methods have been successfully applied to specification and design languages widely accepted in the engineering community, such as MSC [65], SDL [51], and UML [66]. Several tool vendors participate in the OMEGA project, aimed at the development of formal tools for the analysis and verification of design steps based on UML specifications [67]. SDL specifications can be checked by model checkers as well as automated theorem provers: for example, the IF system from Verimag converts SDL to Promela as input to the SPIN model checker [68]. At Siemens, verification of the GSM protocol stack was conducted using the BDD-based model checker SVE [69]. An integrated framework for processing SDL specifications has been implemented based on the automated theorem prover ACL2 [70]. Ptk [71] provides semantic analysis of MSC diagrams and generates test scripts from such diagrams in a number of different languages, including SDL, TTCN, and C. FatCat [72] locates situations of nondeterminacy in a set of MSC diagrams.
In the following, we give a cursory overview of some of the more widely known tools supporting the application of formal methods to specifications. This survey reviews only tools that are freely available, at least for research use. A number of powerful commercial verification technologies have been developed, for example: ACL2 (Computational Logic, USA), ASCE (Adelard, UK), Atelier B (STERIA Méditerranée, France), B-Toolkit (B-Core, UK), CADENCE (Cadence, USA), Escher (Escher Technologies, UK), FDR (Formal Systems, UK), ProofPower (ICL, UK), Prover (Prover Technology, Sweden), TAU (Telelogic), Valiosys (Valiosys, France), and Zola (Imperial Software, UK). A more detailed survey of commercial tools aimed at formal verification is given in Reference 73. The tools and environments surveyed provide only a sample of the wide variety of tools available. In particular, in the area of model checking a large number of implementations support the verification of specifications written in different notations and supporting different temporal logics. For example, Kronos [74] allows modeling of real-time systems by timed automata, that is, automata extended with a finite set of real-valued clocks used to express timing constraints. It supports TCTL, an extension of temporal logic that allows quantitative temporal claims. UPPAAL [75] represents systems as networks of automata extended with clocks and data variables. These networks are compiled from a nondeterministic guarded command language with data types. The VeriSoft [58] model checker explores the state space of systems composed of concurrent processes executing arbitrary code (written in any language) and searches for concurrency pathologies such as deadlock, livelock, and divergence, and for violations of user-specified assertions.
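A VeriSoft-style search for deadlocks can be illustrated in miniature. The sketch below is hypothetical (it is not VeriSoft's actual algorithm or API): two processes acquire two locks in opposite orders, and a breadth-first search over all interleavings looks for a state in which no process can move even though some process has not terminated, the classic lock-ordering deadlock.

```python
from collections import deque

# State = (pcs, held): program counters per process and frozensets of held locks.
# pc 0/1 = about to acquire first/second lock, 2 = holds both, 3 = terminated.
ORDER = (('A', 'B'), ('B', 'A'))  # opposite acquisition orders -> deadlock risk

def successors(state):
    pcs, held = state
    out = []
    for p in (0, 1):
        pc = pcs[p]
        if pc < 2:                                   # still acquiring
            lock = ORDER[p][pc]
            if all(lock not in h for h in held):     # lock is free
                new_held = list(held)
                new_held[p] = held[p] | {lock}
                new_pcs = list(pcs)
                new_pcs[p] += 1
                out.append((tuple(new_pcs), tuple(new_held)))
        elif pc == 2:                                # release both, terminate
            new_held = list(held)
            new_held[p] = frozenset()
            new_pcs = list(pcs)
            new_pcs[p] = 3
            out.append((tuple(new_pcs), tuple(new_held)))
    return out

def find_deadlock(initial):
    """BFS; a deadlock is a state with no successors in which some
    process has not yet terminated (pc != 3)."""
    seen, queue = {initial}, deque([initial])
    while queue:
        s = queue.popleft()
        succs = successors(s)
        if not succs and any(pc != 3 for pc in s[0]):
            return s
        for n in succs:
            if n not in seen:
                seen.add(n)
                queue.append(n)
    return None

init = ((0, 0), (frozenset(), frozenset()))
print(find_deadlock(init))  # each process holds one lock, waits for the other
```

State exploration finds the bad interleaving regardless of how unlikely it is under real scheduling, which is exactly what makes this class of tool effective against concurrency pathologies that testing rarely triggers.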
6.3.2.1 Descriptive Tools

6.3.2.1.1 Vienna Development Method [76]
The Vienna Development Method (VDM) is a model-oriented formal specification and design method based on discrete mathematics, originally developed at IBM’s Vienna Laboratory. Tools to support formalization using VDM include parsers, typecheckers, proof support, animation, and test case generators. In VDM, a system is developed by first specifying it formally and proving that the specification is consistent, then iteratively refining and decomposing the specification, provided that each refinement satisfies the previous specification. This process continues until the implementation level is reached. The VDM Specification Language (VDM-SL) was standardized by ISO in 1996 and is based on first-order logic with abstract data types. Specifications are written as constructive specifications of an abstract
data type, by defining a class of objects and a set of operations that act upon these objects. The model of a system or subsystem is then based on such an abstract data type. A number of primitive data types are provided in the language, along with facilities for user-defined types. VDM has been used extensively in Europe [48,77,78]. Tools have been developed to support formalization using VDM; for example, Mural [79] aids formal reasoning via a proof assistant.

6.3.2.1.2 Z [49,80]
Z evolved from a loose notation for formal specifications into a standardized language with tool support provided by a variety of third parties. The formal specification notation was developed by the Programming Research Group at Oxford University. It is based on Zermelo–Fraenkel set theory and first-order predicate logic. Z is supported by graphical representations, parsers, typecheckers, pretty-printers, and a proof assistant implemented in HOL, providing proof checkers as well as a full-fledged theorem prover. The standardization of Z through ISO solidified the tool base and enhanced interest in mechanized support. The basic Z form is called a schema, which is used to introduce functions. Models are constructed by specifying a series of schemata in a state transition style. Several object-oriented extensions to Z have been proposed. Z has been used extensively in Europe (primarily in the United Kingdom) to write formal specifications for various industrial software development efforts and has resulted in two awards for technological achievement: for the IBM CICS project and for a specification of the IEEE standard for floating-point arithmetic. To leverage Z in embedded systems design, Reference 81 extended Z with temporal interval logic and automated reasoning support through Isabelle and the SMV model checker.
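The schema and state-transition style of Z can be hinted at in executable form. The following Python sketch is an informal analogy, not Z tooling: it mimics Spivey’s classic BirthdayBook example, with the class playing the role of a state schema carrying an invariant, and the operation checking its precondition and re-establishing the invariant, much as a Z operation schema relates before- and after-states.

```python
class BirthdayBook:
    """State schema: known is a set of names, birthday maps names to dates.
    Invariant (as in the classic Z example): known = dom birthday."""
    def __init__(self):
        self.known = set()
        self.birthday = {}

    def invariant(self):
        return self.known == set(self.birthday)

    def add_birthday(self, name, date):
        """Operation schema AddBirthday:
        precondition:  name not in known
        postcondition: birthday' = birthday union {name -> date}"""
        assert name not in self.known, "precondition violated"
        self.birthday[name] = date
        self.known.add(name)
        assert self.invariant(), "invariant violated"

book = BirthdayBook()
book.add_birthday("Mike", "25-Mar")
print(book.birthday["Mike"])  # 25-Mar
```

In Z proper, the analogous obligation (that each operation preserves the state invariant) is discharged by proof rather than by runtime assertion; the sketch only makes the structure of the obligation visible.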
6.3.2.1.3 B [50]
Following the B method, initial requirements are represented as a set of abstract machines, for which an object-based approach is employed at all stages of development. B relies on a wide-spectrum notation to represent all levels of description, from specification through design to implementation. After requirements are specified, they can be checked for consistency, which for B means preservation of invariants. In addition, B supports checking the correctness of the refinement steps to design and implementation. B is supported through toolkits providing syntax checkers, type checkers, a specification animator, a proof-obligation generator, provers allowing different degrees of mechanization, and a rich set of coding tools. It also includes convenient facilities for documenting, cross-referencing, and reviewing specifications. The B method is popular in industry as well as in the academic community, and several international conferences on B have been held. An example of the use of B in embedded systems design is reported in Reference 82, which promotes the development of correct software for smart cards through the translation of B specifications into embedded C code.

6.3.2.1.4 Rigorous Approach to Industrial Software Engineering [83,84]
The Rigorous Approach to Industrial Software Engineering (RAISE) is based on a development methodology that evolved from the VDM approach. Under the RAISE methodology, development steps are carefully organized and formally annotated using the RAISE specification language, a powerful wide-spectrum language for specifying operations and processes that allows derivations between levels of specification. It provides different styles of specification: model-oriented, algebraic, functional, imperative, and concurrent. The CORE requirements method is also provided as an approach for front-end analysis. Supporting tools provide a window-based editor, parser, typechecker, proof tools, a database, and translators to C and Ada.
Derivations from one level to the next generate proof obligations. These obligations may be discharged using proof tools, which are also used to perform validation (establishing system properties). Detailed descriptions of the development steps and overall process are available for each tool. The final implementation step has been partially mechanized for common implementation languages.
6.3.2.1.5 Common Algebraic Specification Language [85]
The Common Algebraic Specification Language (CASL) was developed as a language for the formal specification of functional requirements and modular software design that subsumes many algebraic specification frameworks and also provides tool interoperability. CASL is a complex language with a complete formal semantics, comprising a family of formally defined specification languages meant to constitute a common framework for algebraic specification and development [86]. To make tool construction manageable, it allows for the reuse of existing tools, for the interoperability of tools developed at different sites, and for the construction of generic tools that can be used for several languages. The CASL Tool Set, CATS, combines a parser, a static checker, a pretty printer, and facilities for translating CASL to a number of different theorem provers. Encoding eliminates subsorting and partiality, and thus allows reuse of existing theorem-proving tools and term rewriting engines for CASL. Typical applications of a theorem prover in the context of CASL are checking semantic correctness (according to the model semantics) by discharging proof obligations generated during the checking of static semantic constraints, and validating intended consequences, which can be added to a specification using annotations. This allows a check for consistency with informal requirements. In the scope of embedded systems verification, the Universal Formal Methods (UniForM) Workbench has been deployed in the development of railway control and space systems [87]. This system aims to provide a basis for the interoperability of tools and the combination of languages, logics, and methodologies. It supports verification of basic CASL specifications encoded in Isabelle and the subsequent implementation of transformation rules for CASL to support correct development by transformation.
6.3.2.1.6 Software Cost Reduction [88]
Software Cost Reduction (SCR) is a formal method for modeling and validating system requirements. SCR models a system as a black box computing output data from input data; system behavior is represented as a finite-state machine. SCR is based on tabular notations that are relatively easy to understand. To develop correct requirements with SCR, a user performs four types of activities supported by the SCR tools. First, a specification is developed in the SCR tabular notation using the specification editor. Second, the specification is automatically analyzed for violations of application-independent properties, such as nondeterminism and missing cases, using an extension of the semantic tableaux algorithm. Third, to validate the specification, the user may run scenarios (sequences of observable events) through the SCR simulator. Fourth, to check application-dependent properties, the user can apply the Spin model checker by translating the specification into Promela. A toolset has been developed by the Naval Research Laboratory, including a specification editor, a simulator for symbolically executing the specification, and formal analysis tools for testing the specification for selected properties. SCR has been applied primarily to the development of embedded control systems, including the A-7E aircraft’s operational flight program, a submarine communications system, and safety-critical components of two nuclear power plants [89].

6.3.2.1.7 EVES [55]
EVES is an integrated environment supporting the formal development of systems from requirements to code. Additionally, it may be used for formal modeling and mathematical analysis. To date, EVES applications have primarily been in the realm of security-critical systems. EVES relies on the wide-spectrum language Verdi, ranging from a variant of classical set theory with a library mechanism for information hiding and abstraction to an imperative programming language.
The EVES mathematics is based on ZFC set theory without the conventional distinction between terms and formulae. Supporting tools are a well-formedness checker, the integrated automated deduction system NEVER, a proof checker, a reusable library framework, an interpreter, and a compiler. Development is treated as theory extension: each declaration extends the current theory with a set of symbols and axioms pertaining to those symbols. Proof obligations are associated with every declaration
to guarantee conservative extension. The EVES library is a repository of reusable concepts (e.g., a variant of the Z mathematical toolkit is included with EVES) and is the main support for scaling, information hiding, and abstraction. Library units are either specification units (axiomatic descriptions), model units (models or implementations of specifications), or freeze units (for saving work in progress).

6.3.2.2 Deductive Verifiers

6.3.2.2.1 Higher-Order Logic [53]
HOL is an environment for interactive theorem proving in higher-order logic, that is, predicate calculus with terms from the typed lambda calculus. HOL provides a parser, pretty-printer, and typechecker, as well as forward and goal-oriented theorem provers. It interfaces to the MetaLanguage (ML), which allows the representation of terms and theorems of the logic, of proof strategies, and of logical theories. The HOL system is an interactive mechanized proof assistant supporting both forward and backward proofs. The forward proof style applies inference rules to existing theorems in order to obtain new theorems and eventually the desired goal. Backward, or goal-oriented, proofs start with the goal to be proven; tactics are applied to the goal and its subgoals until the goal is decomposed into simpler existing theorems. HOL provides a general and expressive vehicle for reasoning about various classes of systems. Applications of HOL include the specification and verification of compilers, microprocessors, interface units, and algorithms, as well as the formalization of process algebras, program refinement tools, and distributed algorithms. Initially, HOL was aimed at hardware specification and verification, but its application was later extended to many other domains. Since 1988, an annual meeting of the HOL community has evolved into a large international conference.
6.3.2.2.2 Isabelle [90]
Isabelle, developed at Cambridge University, is a generic theorem prover providing a high degree of automation and supporting a wide variety of built-in logics: many-sorted first-order logic (constructive and classical versions), higher-order logic, Zermelo–Fraenkel set theory, an extensional version of Martin-Löf’s Type Theory, two versions of the Logic for Computable Functions, the classical first-order sequent calculus, and modal logic. New logics are introduced by specifying their syntax and inference rules. Proof procedures can be expressed using tactics. A generic simplifier performs rewriting with equality relations, handles conditional and permutative rewrite rules, performs automatic case splits, and extracts rewrite rules from context. A generic package supports classical reasoning in first-order logic, set theory, and other theories. The proof process is automated to allow long chains of proof steps, reasoning with and about equations, and proofs of facts of linear arithmetic. Isabelle aims at the formalization of mathematical proofs. Some large mathematical theories have been formally verified and are available to users; these include elementary number theory, analysis, and set theory. For example, Isabelle’s Zermelo–Fraenkel set theory derives general theories of recursive functions and data structures (including mutually recursive trees and forests, infinite lists, and infinitely branching trees). Isabelle has also been applied to formal verification and to reasoning about the correctness of computer hardware, software, and computer protocols. Reference 91 applied Isabelle to prove the correctness of safety-critical embedded software: a higher-order logic implemented in Isabelle was used to model both the specification and the implementation of the initial requirements, reducing the problem of implementation correctness to a mathematical theorem to be proven.
6.3.2.2.3 PVS [56,62]
PVS provides an integrated environment for the development and analysis of formal specifications. It is intended primarily for the formalization of requirements and design-level specifications and for the rigorous analysis of difficult problems, and it has been designed to benefit from the synergetic use of different formalisms in its unified architecture.
The PVS specification language is based on classical, typed higher-order logic with predicate subtypes, dependent typing, and abstract data types. The highly expressive language is tightly integrated with its proof system and allows automated reasoning about type dependencies. PVS offers a rich type system, strict typechecking, powerful automated deduction with integrated decision procedures for linear arithmetic and other useful domains, and a comprehensive support environment. PVS specifications are organized into parameterized theories that may contain assumptions, definitions, axioms, and theorems. Definitions are guaranteed to provide conservative extension. Libraries of proved specifications from a variety of domains are available. The PVS prover supports a fully automated mode as well as an interactive mode. In the latter, the user chooses among various inference primitives (induction, quantifier reasoning, conditional rewriting, simplification using specific decision procedures, etc.). Automated proofs are based on user-defined strategies composed from inference primitives. Proofs yield scripts that can be edited and reused. Model-checking capabilities are integrated with the verification system and can be applied for the automated checking of temporal properties. PVS has been applied to algorithms and architectures for fault-tolerant flight control systems, to problems in real-time system design, and to hardware verification. Reference 92 combined PVS with industrial, UML-based development. Similarly, the Boderc project at the Embedded Systems Institute aims to integrate UML-based software design for embedded systems into a common framework suitable for multidisciplinary system engineering.

6.3.2.2.4 Larch [61]
Larch is a first-order specification language supporting equational theories embedded in a first-order logic. The Larch Prover (LP) is designed to treat equations as rewrite rules and to carry out other inference steps such as induction and proof by cases.
The user may introduce operators and assertions about operators as part of the formalization process. The system comprises a parser, a typechecker, and a user-directed prover. The Larch Prover is designed to work midway between proof checking and fully automatic theorem proving: users may direct the proof process at a fairly high level. LP attempts to carry out routine steps in a proof automatically and to provide useful information about why proofs fail, but it is not designed to find difficult proofs automatically.
6.3.2.2.5 Nqthm [52]
Nqthm is a toolset based on the powerful heuristic Boyer–Moore theorem prover for a restricted logic (a variant of pure applicative Lisp). There is no explicit specification language; rather, one writes specifications directly in the Lisp-like language that encodes the quantifier-free, untyped logic. Recursion is the main technique for defining functions and, consequently, mathematical induction is the main technique for proving theorems. The system consists of a parser, a pretty-printer, a limited typechecker (the language is largely untyped), a theorem prover, and an animator. The highly automated prover can be driven by large databases of previously supplied (and proven) lemmas, and the tool distribution comes with many examples of formalized and proved applications. For over a decade, the Nqthm series of provers has been used to formalize a wide variety of computing problems including safety-critical algorithms, operating systems, compilers, security devices, microprocessors, and pure mathematics. Two well-known industrial applications are a model of a Motorola digital signal processing (DSP) chip and the proof of correctness of the floating-point division algorithm for the AMD5K86 microprocessor.
6.3.2.2.6 Nuprl [93]
Nuprl was originally designed by Bates and Constable at Cornell University and has been expanded and improved over the past 15 years by a large group of students and research associates.
Nuprl is a highly extensible open system that provides for interactive creation of proofs, formulae, and terms in a typed language, a constructive type theory with extensible syntax. The Nuprl system supports higher-order logics and rich type theories. The logic and the proof systems are built on a highly regular untyped term structure,
© 2006 by Taylor & Francis Group, LLC
System Validation
a generalization of the lambda calculus, with mechanisms given for reduction of such terms. The style of the Nuprl logic is based on the stepwise refinement paradigm for problem solving, in which the system encourages the user to work backwards from goals to subgoals until one reaches what is known. Nuprl provides a window-based interactive environment for editing, proof generation, and function evaluation. The system incorporates a sophisticated display mechanism that allows users to customize the display of terms; based on structure editing, the system is free to display terms without regard to parsing of syntax. The system also includes the functional programming language ML as its metalanguage; users extend the proof system by writing their own proof-generating programs (tactics) in ML. Since tactics invoke the primitive Nuprl inference rules, user extensions via tactics cannot corrupt system soundness. The system includes a library mechanism and is provided with a set of libraries supporting the basic types including integers, lists, and Booleans, as well as an extensive collection of tactics. The Nuprl system has been used as a research tool to solve open problems in constructive mathematics. It has been used in formal hardware verification, as a research tool in software engineering, and to teach mathematical logic to Cornell undergraduates. It is now being used to support parts of computer algebra and is linked to the Weyl computer algebra system.
6.3.2.3 State Exploration Tools
6.3.2.3.1 Symbolic Model Verifier (SMV) [59]
The SMV system is a tool for checking finite-state systems against specifications of properties. Its high-level description language supports modular hierarchical descriptions and the definition of reusable components. Properties are described in Computation Tree Logic (CTL), a propositional, branching-time temporal logic, which covers a rich class of properties including safety, liveness, fairness, and deadlock freedom.
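As an illustration, the simplest of these, the safety property AG p ("p holds in all reachable states"), can be sketched as an explicit-state search; SMV itself computes the same answer symbolically on BDDs, and the wrap-around counter model below is invented purely for the example:

```python
from collections import deque

def check_ag(init, transitions, prop):
    """Check the CTL safety property AG prop by breadth-first search:
    every state reachable from an initial state must satisfy prop.
    Returns (True, None) or (False, counterexample_state)."""
    seen, frontier = set(init), deque(init)
    while frontier:
        s = frontier.popleft()
        if not prop(s):
            return False, s          # counterexample state found
        for t in transitions(s):
            if t not in seen:
                seen.add(t)
                frontier.append(t)
    return True, None

# Hypothetical model: a counter that wraps around at 6.
def step(s):
    return [(s + 1) % 6]

assert check_ag([0], step, lambda s: s < 6) == (True, None)
assert check_ag([0], step, lambda s: s != 3) == (False, 3)
```

A symbolic checker represents the visited set and the transition relation as BDDs and computes the same reachability fixpoint without ever listing the states individually.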
The SMV input language offers a set of basic data types consisting of bounded integer subranges and symbolic enumerated types, which can be used to construct static, structured types. SMV can handle both synchronous and asynchronous systems, and arbitrary safety and liveness properties. SMV uses a BDD-based symbolic model-checking algorithm to avoid explicitly enumerating the states of the model; with carefully tuned variable ordering, the BDD algorithm yields a system capable of verifying circuits with extremely large numbers of states. The SMV system has been distributed widely and has been used to verify industrial-scale circuits and protocols, including the cache coherence protocol described in the IEEE Futurebus+ standard and the cache consistency protocol developed at Encore Computer Corporation for their Gigamax distributed multiprocessor. Formal verification of embedded systems using symbolic model checking with SMV has been demonstrated in Reference 94: a Petri net-based system model is translated into the SMV input language along with the specification of timing properties.
6.3.2.3.2 Spin [60,95,96]
Spin is a widely distributed software package that supports the formal verification of distributed systems. It was developed by the formal methods and verification group at Bell Laboratories. Spin relies on the high-level specification language PROMELA (Process MetaLanguage), a nondeterministic language based on Dijkstra's guarded command language notation and CSP. PROMELA contains primitives for specifying asynchronous (buffered) message passing via channels with an arbitrary number of message parameters. It also allows for the specification of synchronous message passing (rendezvous) and mixed systems using both synchronous and asynchronous communications. The language can model dynamically expanding and shrinking systems, as new processes and message channels can be created and deleted on the fly.
Message channel identifiers can be passed in messages from one process to another. Correctness properties can be specified as standard system or process invariants (using assertions) or as general Linear Temporal Logic (LTL) requirements, either directly in the syntax of next-time-free LTL or indirectly as Büchi automata (expressed in PROMELA syntax as never claims). Spin can be used in three modes: for rapid prototyping with random, guided, or interactive simulation; as an exhaustive verifier,
capable of rigorously proving the validity of user-specified correctness requirements (using partial order reduction to optimize the search); and as a proof approximation system that can validate very large protocol systems with maximal coverage of the state space. Spin has been applied to the verification of data transfer and bus protocols, controllers for reactive systems, distributed process scheduling algorithms, fault-tolerant systems, multiprocessor designs, local area network controllers, microkernel design, and many other applications. The tool checks the logical consistency of a specification: it reports on deadlocks and unspecified receptions, and flags incompleteness, race conditions, and unwarranted assumptions about the relative speeds of processes.
6.3.2.3.3 COordination SPecification ANalysis (COSPAN) [97]
COSPAN is a general-purpose, rapid prototyping tool developed at AT&T that provides a theoretically seamless interface between an abstract model and its target implementation, thereby supporting top-down system development and analysis. It includes facilities for documentation, conformance testing, software maintenance, debugging, and statistical analysis, as well as libraries of abstract data types and reusable pretested components. The COSPAN input language, S/R (selection/resolution), belongs to the omega-regular languages, which are expressible as finite-state automata on infinite strings or behavioral sequences. COSPAN is based on homomorphic reduction and refinement of omega-automata, that is, the use of homomorphisms to relate two automata in a process of successive refinement that guarantees that properties verified at one level of abstraction hold at all successive levels. Reduction of the state space is achieved by exploiting symmetries and modularity inherent in large, coordinating systems.
Verification is framed as a language-containment problem: checking consists of determining whether the language of the system automaton is contained in the language of the specification automaton. Omega-automata are particularly well suited to expressing liveness properties, that is, events that must occur at some finite, but unbounded, time. COSPAN has been used in the commercial development of both software and hardware systems: high-level models of several communications protocols, for example, the X.25 packet switching link layer protocol, the ITU file transfer and management protocol (FTAM), and AT&T's Datakit universal receiver protocol (URP) level C; verification of a custom VLSI chip implementing a packet layer protocol controller; and analysis and implementation of AT&T's Trunk Operations Provisioning Administration System (TOPAS).
6.3.2.3.4 MEIJE [98]
The MEIJE project at INRIA and the Ecole des Mines de Paris has long investigated concurrency theory and implemented a wide range of tools to specify and verify both synchronous and asynchronous reactive systems. It uses Esterel, a language designed to specify and program synchronous reactive systems, and a graphical notation to describe labeled transition systems. The tools (graphical editors, model checkers, observer generation) operate on the internal structure of automata combined by synchronized product, which are generated either from the Esterel programs or from the graphical representations of these automata. MEIJE supports both explicit representation of the automata, supporting model checking and compositional reduction of systems using bisimulation or hiding, and implicit representation of the automata, favoring verification through observers and forward search for properties to verify.
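The language-containment question at the heart of COSPAN, described above, can be illustrated on ordinary finite-word automata (COSPAN itself works with omega-automata over infinite sequences, where the check is harder); the two DFAs below are hypothetical examples:

```python
from collections import deque

def dfa_contained(a, b, alphabet):
    """Check L(A) ⊆ L(B) for complete DFAs over `alphabet`.
    A DFA is a tuple (initial_state, delta, accepting) with delta a
    dict {(state, symbol): state}.  A reachable product state that A
    accepts and B rejects witnesses a word in L(A) \\ L(B)."""
    ia, da, fa = a
    ib, db, fb = b
    seen = {(ia, ib)}
    frontier = deque([(ia, ib)])
    while frontier:
        sa, sb = frontier.popleft()
        if sa in fa and sb not in fb:
            return False
        for sym in alphabet:
            nxt = (da[(sa, sym)], db[(sb, sym)])
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return True

# Hypothetical example: A accepts strings ending in 'ab';
# B accepts all strings containing an 'a' anywhere.
A = ('q0', {('q0', 'a'): 'q1', ('q0', 'b'): 'q0',
            ('q1', 'a'): 'q1', ('q1', 'b'): 'q2',
            ('q2', 'a'): 'q1', ('q2', 'b'): 'q0'}, {'q2'})
B = ('p0', {('p0', 'a'): 'p1', ('p0', 'b'): 'p0',
            ('p1', 'a'): 'p1', ('p1', 'b'): 'p1'}, {'p1'})
assert dfa_contained(A, B, 'ab')        # every ...ab string contains 'a'
assert not dfa_contained(B, A, 'ab')    # 'a' alone does not end in 'ab'
```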
To deal with the large state spaces induced by realistically sized specifications, the MEIJE tools provide various abstraction techniques, such as behavioral abstraction (which replaces a sequence of actions by a single abstract behavior), state compression and encoding in BDDs, and on-the-fly model checking. Observers can either be written directly in Esterel or be generated automatically from temporal logic formulae.
6.3.2.3.5 CADP [99]
CADP, developed at INRIA and Verimag, is a toolbox for designing and verifying concurrent protocols specified in the ISO language LOTOS and for studying formal verification techniques. Systems are specified as networks of communicating automata synchronizing through rendezvous, or as labeled transition systems.
The CAESAR compiler translates the behavioral part of LOTOS specifications into a C program that can be simulated and tested, or into a labeled transition system to be verified by ALDEBARAN. The latter allows the comparison and reduction of labeled transition systems by equivalences of various strengths (such as bisimulation, weak bisimulation, branching bisimulation, observational equivalence, tau bisimulation, or delay bisimulation). Diagnostic facilities provide the user with explanations when the tools fail to establish equivalence between two labeled transition systems. The OPEN environment provides a framework for developing verification algorithms in a modular way, and various tools are included: interactive simulators, deadlock detection, reachability analysis, path searching, an on-the-fly model checker to search for safety, liveness, and fairness properties, and a tool for generating test suites.
6.3.2.3.6 Murphi [100]
Murphi is a complete finite-state verification system that has been tested on extensive industrial-scale examples, including cache coherence protocols and memory models for commercially designed multiprocessors. The Murphi verification system consists of the Murphi compiler and the Murphi description language for finite-state asynchronous concurrent systems, which is loosely based on Chandy and Misra's Unity model and includes user-defined data types, procedures, and parameterized descriptions. A version for synchronous concurrent systems is under development. A Murphi description consists of constant and type declarations, variable declarations, rule definitions, start states, and a collection of invariants. The Murphi compiler takes a Murphi description and generates a C++ program that is compiled into a special-purpose verifier that checks for invariant violations, error statements, assertion violations, deadlock, and liveness.
The verifier attempts to enumerate all possible states of the system, while the simulator explores a single path through the state space. Efficient encodings, including symmetry-based techniques, and effective hash-table strategies are used to alleviate state explosion.
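The core loop of such a generated verifier can be sketched as a breadth-first enumeration of the states reachable by firing guarded rules, checking an invariant in each state; the two-token ring model and all names below are hypothetical, and the tuned hashing, symmetry reduction, and liveness machinery of the real tool are omitted:

```python
from collections import deque

def explicit_state_check(start_states, rules, invariant):
    """Enumerate all states reachable by firing guarded rules and
    check an invariant in every state.  States must be hashable;
    each rule is a (guard, action) pair of functions.  Also records
    deadlocked states (no rule enabled)."""
    visited = set(start_states)       # stands in for the tuned hash table
    queue = deque(start_states)
    deadlocks = []
    while queue:
        s = queue.popleft()
        if not invariant(s):
            return 'invariant violated', s
        enabled = [act for guard, act in rules if guard(s)]
        if not enabled:
            deadlocks.append(s)
        for act in enabled:
            t = act(s)
            if t not in visited:
                visited.add(t)
                queue.append(t)
    return 'ok', deadlocks

# Hypothetical model: two tokens moving independently on a 4-cell ring.
rules = [(lambda s: True, lambda s: ((s[0] + 1) % 4, s[1])),
         (lambda s: True, lambda s: (s[0], (s[1] + 1) % 4))]
status, _ = explicit_state_check([(0, 2)], rules, lambda s: True)
assert status == 'ok'
# "Tokens never collide" is NOT invariant for these rules:
status, bad = explicit_state_check([(0, 2)], rules, lambda s: s[0] != s[1])
assert status == 'invariant violated'
```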
6.4 Specifying and Verifying Embedded Systems
The problems of consistency and completeness of requirements, viewed as mathematical problems, are well known to be algorithmically unsolvable even for notations based solely on the first-order predicate calculus. To overcome these difficulties, we have studied the general form of the requirements used in specific subject domains and developed methods of proving sufficient conditions of consistency and completeness. Each requirements specification defines a class of systems compatible with the requirements. All systems in this class are defined at least up to bisimulation, that is, systems with the same observable behavior are considered equal. However, systems often operate in the context of some environment; when requirements describe the properties of the environment into which a system is inserted, the weaker notion of insertion equivalence is used to distinguish different systems. If the class of systems compatible with the requirements is not empty, we consider the requirements to be consistent. If the class contains only one system (up to bisimulation or insertion equivalence, respectively), the requirements are said to be complete. To represent requirements, we distinguish between a class of logical requirement languages, representing behavior in logical form; a class of trace languages, representing behavior in the form of traces; and a class of automata-network languages, representing behavior in terms of states and transitions. The latter are model-oriented [101], in that desired properties or behaviors are specified by giving a mathematical model that exhibits those properties. The disadvantage of model-oriented specifications is that they state what should be done or how something should be implemented, rather than the properties that are required.
For property-oriented [101] languages, each description defines not a single model, but a class of models which satisfy the properties defined by the description. The properties expressed are properties of attributed transition systems representing environments and agents inserted into these environments. Only actions and the values of attributes are observable for inserted agents.
6.4.1 System Descriptions and Initial Requirements
The descriptions of observable entities (attributes) of a system and the external environment, their types, and the parameters they depend on, are captured in first-order predicate logic extended by types and certain predefined predicates such as equality or arithmetic inequality. The signature of the language contains a set of attribute names, where each attribute has a type which is defined using direct or axiomatic definitions. Operations defined on the sets of values of different types are used to construct value expressions (terms) from attribute symbols, variables, and constants. If an expression contains attributes, the value of this expression depends not only on the values of its variables, if any, but also on the current state of the environment; consequently, it defines a function on the set of states of an environment. The language includes the temporal modalities always and sometimes. Logical statements define properties of the environment, characterize agent actions, or define abstract data types. Initial requirements describe the initial state as a logical statement. If initial requirements are present, the requirements refer only to the states reachable from initial states, which are those states satisfying the initial requirements. To describe initial requirements, we use a temporal modality initially. Axioms about the system are introduced by the form

let : <statement>;
6.4.2 Static Requirements
Static requirements define the change of attributes at any given moment or interval of time, depending on the occurrence of events and the previous history of system behavior. Static requirements describe the local behavior of a system, that is, all possible transitions which may be taken from the current state if it satisfies the precondition, after the event forcing these transitions has happened. The general form of static requirements is

req :[<prefix>]([<precondition>] -> [after <event description>] <postcondition>);

The precondition is a predicate formula of first-order logic, true before the transition; the postcondition is a predicate formula of first-order logic, true after the transition. Both precondition and postcondition may refer to a set of attributes used in the system. Variables are typed and may occur in precondition, event, and postcondition; they link the values of attributes before and after the transition. Only universal quantifiers are allowed in the quantifier prefix. All attributes occur free in the requirements, but if an attribute depends on parameters, the algebraic expressions substituted for the parameters may contain bound variables. If the current state of the environment satisfies the precondition and allows the event, then after performing this action the new state of the system will satisfy the property expressed by the postcondition. This notation corresponds to Hoare-style triples. Predicates are defined on sets of values, and predicate expressions can be constructed using predicate symbols and value expressions to express properties of states of environments (if they include attributes). The quantifiers forall and exists can be used, as well as the usual propositional connectives. The precondition is a logical statement without explicit temporal modalities and describes observable properties of a system state.
It may include predefined temporal functionals that depend on the past behavior of a system or its attributes, for example, the duration functional dur: if P is a statement, then dur P denotes the time passed from the last moment when P became true; its value is a number (real for continuous time and an integer for discrete time). The event denotes the cause of a transition or a change of attribute values. The simplest case of an event is an action. More complex examples are sequences of actions or finite behaviors (i.e., the event is an algebraic expression generated by actions, prefixing, nondeterministic choice, and sequential and parallel compositions). To describe histories we use product P ∗ Q and iteration It(P) over logical statements.
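A minimal sketch of the dur functional over a finite discrete trace follows; it assumes trace[i] holds the truth value of P at time i, and (by a convention of this sketch, not the source) returns t + 1 when P has held at every moment since time 0, a case in which the formal definition has no witness:

```python
def dur(trace, t):
    """Value of (dur P) at discrete time t, following
    (dur P = s) time t  <=>
      t >= s, and not-P at time (t - s),
      and P at every s' with t >= s' > t - s.
    I.e., the length of the run of 'true' moments ending at t
    (0 when P is false at t)."""
    s = 0
    while t - s >= 0 and trace[t - s]:
        s += 1
    return s   # t + 1 if P has held at every moment 0..t (see lead-in)

trace = [False, True, True, False, True]   # truth value of P at times 0..4
assert dur(trace, 2) == 2   # P held at times 1 and 2, false at 0
assert dur(trace, 3) == 0   # P false at time 3
assert dur(trace, 4) == 1
```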
The postcondition is a logical statement denoting the property of attribute values after the transition defined by the static requirement has been completed. It can include neither temporal modalities nor functionals depending on the past. As an example, consider a system which counts the number of people entering a room. The requirement for an action enter could be written as:

req Enter: Forall(n : int)((num = n) -> after(enter) (num = n + 1));

where num is a system attribute representing the number of people in the room. The variable n links the value of the attribute num before and after the transition caused by the enter action.
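The Enter requirement can be checked against a finite set of observed transitions in the obvious way; the harness and the traces below are hypothetical and serve only to illustrate how the variable n links pre- and post-state values:

```python
def check_static_req(pre, event, post, transitions):
    """Check a static requirement  pre -> after event post  against
    observed transitions (state, action, state').  Requirement
    variables linking pre- and post-state values (like n in the
    Enter example) are handled by giving post both states."""
    for before, action, after in transitions:
        if action == event and pre(before) and not post(before, after):
            return False, (before, action, after)   # violating transition
    return True, None

# req Enter: Forall(n : int)((num = n) -> after(enter) (num = n + 1));
enter_post = lambda b, a: a['num'] == b['num'] + 1
good = [({'num': 0}, 'enter', {'num': 1}),
        ({'num': 1}, 'enter', {'num': 2})]
assert check_static_req(lambda b: True, 'enter', enter_post, good) == (True, None)

bad = good + [({'num': 2}, 'enter', {'num': 2})]   # count not incremented
ok, witness = check_static_req(lambda b: True, 'enter', enter_post, bad)
assert not ok and witness[0] == {'num': 2}
```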
6.4.3 Dynamic Requirements
Dynamic requirements are arbitrary logical statements including temporal modalities and functionals. They should be consequences of the static requirements and, therefore, dynamic requirements are formulated as calls to the prover to establish a logical statement:

prove <statement>;

We use the temporal modalities always and sometimes, as well as the temporal functional dur and Kleene operations (product and iteration) over logical statements. Temporal statements refer to properties of attributed transition systems with initial states. A logical statement without modalities describes a property of a state of a system, while temporal statements describe properties of all states reachable from admissible initial states: if P is a predicate formula, the formula always P means that any state reachable from an admissible initial state possesses property P; the formula sometimes P means that there exists a reachable state that possesses property P. The notion of reachability can be expressed in set-theoretical terms. Temporal modalities can be translated to first-order logic by introducing quantifiers on states and, consequently, it is possible to use first-order provers for proving properties of environments. For synchronous systems (systems with explicit time assignments at each state) we introduce the functional (E time t) for an arbitrary expression E (a term or a statement) and integer expression t, which denotes the value of E at time t. Then always and sometimes are defined as:

always P ⇔ (∀t ≥ 0)(P time t)
sometimes P ⇔ (∃t ≥ 0)(P time t)

The functional dur is defined in the following way [102]:

(dur P = s) time t ⇔ (t ≥ s) ∧ (∼P time (t − s)) ∧ (∀s′)((t ≥ s′ > t − s) → P time s′)

Statements which do not explicitly mention state or time are considered as referring to an arbitrary current state or an arbitrary current moment of time.
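On a finite trace, the quantifier translations of always and sometimes can be evaluated directly; the trace of attribute values below is hypothetical:

```python
def always(trace, P):
    """always P: P time t holds at every t >= 0 (finite-trace version)."""
    return all(P(s) for s in trace)

def sometimes(trace, P):
    """sometimes P: P time t holds at some t >= 0."""
    return any(P(s) for s in trace)

# Hypothetical synchronous trace: the value of attribute x at times 0..4.
trace = [{'x': 0}, {'x': 1}, {'x': 2}, {'x': 3}, {'x': 4}]
assert always(trace, lambda s: s['x'] >= 0)
assert sometimes(trace, lambda s: s['x'] == 3)
assert not always(trace, lambda s: s['x'] < 3)
```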
Statements with Kleene operations refer to discrete time and are reduced to logical statements as follows:

(P1 ∗ P2 ∗ · · · ∗ Pn) time t ⇔ (P1 time (t − n + 1)) ∧ (P2 time (t − n + 2)) ∧ · · · ∧ (Pn time t)
It(P) time t ⇔ (∃s ≤ t)(∀s′)((t − s ≤ s′ ≤ t) → P time s′)
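Both reductions can be evaluated mechanically on a finite discrete trace; the boolean trace below is hypothetical:

```python
def product_at(props, trace, t):
    """(P1 * ... * Pn) time t  <=>  Pi time (t - n + i) for i = 1..n."""
    n = len(props)
    if t - n + 1 < 0:
        return False                      # history shorter than the product
    return all(props[i](trace[t - n + 1 + i]) for i in range(n))

def iteration_at(P, trace, t):
    """It(P) time t  <=>  exists s <= t with P at every s' in [t - s, t]."""
    return any(all(P(trace[u]) for u in range(t - s, t + 1))
               for s in range(t + 1))

trace = [False, True, True, False, True]  # value of P at times 0..4
P = lambda v: v
assert product_at([P, P], trace, 2)       # P at times 1 and 2
assert not product_at([P, P], trace, 4)   # P false at time 3
assert iteration_at(P, trace, 4)          # s = 0: P on [4, 4]
assert not iteration_at(P, trace, 3)      # P false at time 3
```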
6.4.4 Example: Railroad Crossing Problem The railroad crossing problem is a well-known benchmark to assess the expressiveness of development techniques for interactive systems. We illustrate the description of a synchronous system (in discrete time)
relying on duration functionals. The problem statement is to develop a control device for a railroad crossing so that safety and liveness conditions are satisfied. The system has three components, as shown in Figure 6.4. The n-track railroad has the following observable attributes: InCr is a Boolean variable equal to 1 if a train is at the crossing; Cmg(i) is a Boolean variable equal to 1 if a train is coming on track number i. At the moment this attribute becomes equal to 1, the time left until the train reaches the crossing is not less than d_min, and it remains 1 until the train reaches the crossing. Cmg(i) is an input signal to the controller, which has a single output signal DirOp. When DirOp equals 1, the gate starts opening, and when it becomes 0, the gate starts closing. The attribute gate shows the position of the gate: it is equal to opened when the gate is completely open and closed when it is completely closed. The time taken for the gate to open is d_open; the time taken to close is d_close. The requirements text below omits the straightforward static requirements. The dynamic properties of the system are safety and liveness. Safety means that when the train is at the crossing, the gate is closed. Liveness means that the gate will open when the train is at a safe distance (Code 6.1).
FIGURE 6.4 Railroad crossing problem: an n-track railroad, a controller, and a gate, connected by the signals Cmg, InCr, and DirOp.
Code 6.1
parameters( d_min, d_close, d_open, WT );
attributes(n:int)( InCr:bool, Cmg(n):bool, DirOp:bool, gate );
let C1: (d_min > d_close);
let C2: (d_close > 0);
let Duration Theorem:
    Forall(x,d)( always(dur Cmg(x) > d -> ~(DirOp)) ->
                 always(dur ~(DirOp) > dur Cmg(x) + (-1)*(d+1)) );
/* ------------- Environment spec ------------------------ */
let CrCm:    always( InCr -> Exist x (dur Cmg(x) > d_min) );
let OpnOpnd: always( dur DirOp > d_open  -> (gate = opened) );
let ClsClsd: always( dur ~(DirOp) > d_close -> (gate = closed) );
/* ------------- Controller spec ------------------------- */
let Contr1: always( Exist x (dur Cmg(x) > WT) -> ~(DirOp) );
let Contr2: always( Forall x (WT >= dur Cmg(x)) -> DirOp );
/* ------------- Safety and Liveness --------------------- */
let (WT = d_min + (-1)*d_close);
prove Safety:   always( InCr -> (gate = closed) );
prove Liveness: always( Forall x ((WT > dur Cmg x) -> (gate = opened)) );
Note the assumption of the Duration Theorem in the requirements to shorten the proofs of safety and liveness.
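A discrete-time simulation can give quick confidence in the controller before (or alongside) a formal proof. The sketch below fixes hypothetical parameter values satisfying C1 and C2, restricts to a single track, and assumes the worst case in which a train reaches the crossing exactly d_min steps after Cmg is raised; DirOp <=> (dur Cmg <= WT) is the single-track reading of Contr1/Contr2, and the train-occupancy model is invented for the example:

```python
# Hypothetical parameters satisfying C1 (d_min > d_close) and C2 (d_close > 0)
D_MIN, D_CLOSE, D_OPEN = 10, 4, 3
WT = D_MIN - D_CLOSE                     # let (WT = d_min + (-1)*d_close)

def simulate(arrival_times, horizon=60):
    """One-track crossing: a train announced at time a (Cmg becomes 1)
    reaches the crossing exactly D_MIN steps later and occupies it for
    3 steps.  Returns the trace of (InCr, gate) pairs."""
    trace = []
    dur_cmg, dur_dirop, dur_not_dirop = 0, D_OPEN + 1, 0  # gate starts open
    for t in range(horizon):
        cmg = any(a <= t < a + D_MIN + 3 for a in arrival_times)
        in_cr = any(a + D_MIN <= t < a + D_MIN + 3 for a in arrival_times)
        dur_cmg = dur_cmg + 1 if cmg else 0
        dir_op = not (dur_cmg > WT)      # controller: Contr1/Contr2
        if dir_op:
            dur_dirop, dur_not_dirop = dur_dirop + 1, 0
        else:
            dur_dirop, dur_not_dirop = 0, dur_not_dirop + 1
        gate = ('opened' if dur_dirop > D_OPEN else
                'closed' if dur_not_dirop > D_CLOSE else 'moving')
        trace.append((in_cr, gate))
    return trace

trace = simulate([5, 30])
# Safety: always (InCr -> gate = closed)
assert all(gate == 'closed' for in_cr, gate in trace if in_cr)
# Liveness (weak, finite-trace form): the gate is opened at safe moments
assert any(gate == 'opened' for in_cr, gate in trace if not in_cr)
```

With WT = d_min - d_close the gate finishes closing exactly when the earliest possible train arrives, which is why C1 (d_min > d_close) is needed; a simulation like this explores only the chosen scenarios, whereas the proof covers all of them.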
6.4.5 Requirement Specifications
The example in Section 6.4.4 is rather simple in the number of requirements. Requirement specifications used in practice to describe embedded systems are typically much more complex: they may consist of hundreds or thousands of static requirements and large domain descriptions through attributes and parameters. Each requirement is usually simple, but taken together the resulting behavior may be complex and contain inconsistencies or be incomplete. We use attributed transition systems to describe the requirements for embedded systems. The formal specification of requirements consists of the environment description, the description of common system properties in the form of axioms, the insertion function defined by static requirements, and intended properties of the system as a whole defined as dynamic requirements. A typed list of system parameters and a typed list of system attributes are used to describe the structure of the environment. The parameters of the system are variables which have influence on the behavior of the environment and can change their values from one configuration of the system to another, but they never change their values during the execution of the system. Examples of system parameters are the set of tasks for an embedded operating system, the bus threshold for a device controller, etc. System attributes are variables that differ between the observable states of the environment. Attributes may change their values during runtime. Examples of attributes are the queue of tasks which are ready to be executed by the operating system, or the current data packet for a device controller. As an example, we consider (in simplified form) several fragments of the formalized requirements for an embedded operating system for automotive electronics, OSEK [103]. A typed list of system parameters and a typed list of system attributes describes the structure of the environment (Code 6.2).
Code 6.2
parameters (
    tasks: Set of name,
    resources: Set of name );
attributes (
    suspended: Set of name,
    ready: Set of name,
    running: name );
The operating system (environment) and executing tasks (agents) interact via service calls. The list of actions contains the names of the services provided by the system, including service parameters, if any (Code 6.3). Common system properties are defined as propositions in first-order logic extended with temporal modalities. For example, consider the following requirement: "the length of the queue of suspended tasks can never be greater than the number of defined tasks." We formalize this requirement as follows (Code 6.4). To define the transitions of the system when processing the request for a service, we use the Hoare-style triple notation defined above (Code 6.5).
© 2006 by Taylor & Francis Group, LLC
6-34
Embedded Systems Handbook
Code 6.3
actions (a: name) (
    Activate a,
    Terminate,
    Schedule );
Code 6.4
let SuspendedLengthReq:
    Always (length(suspended) <= length(tasks));

Code 6.5
req Activate: Forall (a: name, s: Set of name, r: Set of name)(
    ((suspended = s) & (ready = r)) -> after (Activate a)
    ( (suspended = (s setminus a)) & (ready = (r union a)) ));
The insertion function expressed by this rule is sequential, in that only one running task can be performed at a time; all others are in a suspended or ready state. A task becomes running as a result of performing a Schedule action; it is selected from a queue of ready tasks ordered by priorities. Agents can change the behavior of the environment by service requests. The interaction between the environment and the agents is defined by an insertion function, which computes the new behavior of the environment with inserted agents. The part of the description of requirements specific to sequential environments is the definition of the interaction of agents and environments, where this interaction is described by the insertion function. The most straightforward way to define this function is through interactive requirements: an action is allowed to be processed if and only if the current state of the environment matches one of the preconditions for service requests. This is denoted E -(act)-> E', intuitively meaning that the environment E allows the action act and, if the action is processed, the environment will then be equal to E'. The agent (the composition of all agents) interacting with the environment requests the service act if and only if it transits from its current state u into a state u' with the action act: u -(act)-> u'. The composition of environment and agent is denoted env(E,u). To define the transitions of env(E,u) we use interactive rules (Code 6.6):

Code 6.6
req ActivateInteract1:
    Forall (E: env, E': env, u: agent, u': agent, a: name)(
        ( (E -(Activate a)-> E') & (u -(Activate a)-> u') ) ->
        ( env(E,u) -(Activate a)-> env(E', (Schedule;u')) ));
req ActivateInteract2:
    Forall (E: env, E': env, u: agent, u': agent, a: name)(
        ( ~(E -(Activate a)-> E') & (u -(Activate a)-> u') ) ->
        ( (E'.error = 1) & (env(E,u) -(Activate a)-> env(E', bot)) ));
Intuitively, this rule means that when the environment allows the action Activate a and the agent requests the action Activate a, then the composition of agent and environment env(E,u) transitions with the action Activate a to the state env(E',(Schedule;u')): at the next step the environment will be equal to E' and the current agent will be (Schedule;u').
6.4.5.1 Requirements for Sequential Environments
This class of models includes such products as single-processor operating systems and single-client devices. The defining characteristic of such systems is that at any moment of time only one service request can be processed by the environment. Agents request services from the environment; they are defined by their behavior. The only way for the environment and the agents to interact is through service requests, and this determines the level of abstraction that we use in the formal definition of the behavior of agents. The insertion functions used for the description of sequential systems are broader than the insertion functions discussed earlier: an inserted agent can start its activity before agents inserted earlier terminate. The active agent can be selected by the environment using various criteria such as priority or other static or dynamic characteristics. To compare agent behaviors, in some cases a look-ahead insertion may be used. Usually, sequential environments are deterministic systems, and the static requirements should be consistent so as to define deterministic transitions. Consistency requirements reduce to the following condition: the preconditions of each pair of static requirements referring to the same action must be nonintersecting. In other words, for arbitrary values of attributes, at least one of the two requirements must have a false precondition. Completeness can also be checked for the set of all static requirements that refer to the same action.
Every such set of requirements must satisfy the condition that for arbitrary values of the attributes there is at least one requirement that is applicable with a true precondition.

6.4.5.2 Requirements for Parallel Environments

A parallel environment is an environment whose inserted agents work in parallel and independently. Agents are considered as transition systems. An agent can transition from one state to another by performing an action at any time when the environment accepts this action. Once the agent has completed the action, it transitions into a new state and, in addition, causes a transition of the environment.

As an example, consider the modeling of device interaction protocols. Devices are independent and connected through the environment. They interact by sending and receiving data packets. The protocol is the algorithm used by devices to interact with other components. Such a device is an agent in the parallel environment. It is represented as a transition system that can cause transitions between states by one of two actions: sending or receiving a packet, the packet being a parameter of these actions. We formalize such requirements using the notation of Hoare-style triples. Asynchronous parallel environments are highly nondeterministic. Such specifications are easily expressed in sequence diagrams or message sequence charts, as shown in Figure 6.5. The preconditions and postconditions are the conditions and states on the message sequence diagram, while the actions represent the message arrows and tasks shown on the diagram.

6.4.5.3 Requirements for Synchronous Agents

As an example of a synchronous system, consider a processor with a bus and its time-dependent behavior. The processor interacts with the bus through signals which appear at every bus cycle (discrete time step). Interaction protocols in the processor-bus system are defined by signal behavior. Every signal can have a value of either 0 or 1.
After some event, every signal switches to one of these values. For every signal of the system there is a history describing the conditions of signal switching. Such conditions are called the assertion condition (when the signal switches to 1) and the deassertion condition (when the signal switches to 0). Formally, a history of signals is a sequence of conjunctions of signals. The situation in which signal S1 equals 1 and signal S2 equals 0 at moment tn, while S1 equals 0 and S2 equals 1 at moment tn+1, can be described in the following way:

(S1 & ∼S2) ∗ (∼S1 & S2)
FIGURE 6.5 Sample MSC diagram. (Message exchange between the CCCH(k), ACG, and DAP entities: SDGC_Call(i, Group_ID), SDGC_Page_Request(Group_ID, LAI), SDGC_Paging_Request(List_Of_Targets), SDGC_Page_Request_Type_1 for ms_id = 1, 2, 3, each followed by SDGC_Page_Response_Type_1(ms_id) and SDGC_Complete(MS_ID, Group_Id = 0).)
Let signal S3 have the assertion condition that it will be equal to 1 after the above history of signals S1 and S2. In triple notation, this fact can be described as

(S1 & ∼S2) ∗ (∼S1 & S2) -> after(1) S3

This condition can also be reflected on a wave (or timing) diagram (Figure 6.6). The consistency condition is fulfilled if no signal q is set to both 1 and 0 in the same cycle. In other words, if there are two requirements P → q and P' → ∼q, then the preconditions P and P' cannot be true simultaneously. Static requirements for synchronous systems can use Kleene expressions over conditions and duration functions with numeric inequalities in preconditions. These requirements are converted into a standard form with logic statements relating adjacent time intervals.
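Checking such an assertion condition against a sampled signal trace can be sketched as follows (a sketch of ours; signal names and functions are invented, and the 3CR tooling works on symbolic requirements rather than traces):

```python
# Sketch: checking an assertion condition "history -> after(1) S" on a trace.
# A history is a sequence of conjunctions; each conjunction is a list of
# (signal, value) literals that must all hold in the corresponding cycle.

def matches(history, trace, start):
    """Check that trace[start:start+len(history)] satisfies each conjunction."""
    for step, conj in enumerate(history):
        cycle = trace[start + step]
        if any(cycle[sig] != val for sig, val in conj):
            return False
    return True

def check_assertion(history, signal, trace):
    """Whenever the history occurs, `signal` must be 1 in the next cycle."""
    for start in range(len(trace) - len(history)):
        if matches(history, trace, start):
            if trace[start + len(history)][signal] != 1:
                return False
    return True

# (S1 & ~S2) * (~S1 & S2) -> after(1) S3, as in the text
history = [[("S1", 1), ("S2", 0)], [("S1", 0), ("S2", 1)]]
trace = [
    {"S1": 1, "S2": 0, "S3": 0},
    {"S1": 0, "S2": 1, "S3": 0},
    {"S1": 0, "S2": 0, "S3": 1},  # S3 asserted one cycle after the history
]
print(check_assertion(history, "S3", trace))  # True
```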
FIGURE 6.6 Sample wave diagram. (Signals S1, S2, and S3 over cycles t1, t2, t3.)
6.4.6 Reasoning about Embedded Systems

The theory of agents and environments has been implemented in the system 3CR [104]. The kernel of our system [105] consists of a simulator for a generic Action Language (AL) [10,11] for the description of system behaviors, of services for the automatic exploration of the behavior tree of a system, and of a theorem prover for first-order predicate logic enriched with a theory of linear equations and inequalities. It provides the following technologies supporting the development, verification, and validation of requirements for embedded systems:

• Prove the internal consistency and completeness of the static requirements of a system.
• Prove dynamic properties of the system defined by static requirements, including safety, liveness, and integrity conditions.
• Translate systems described in standard engineering languages (e.g., MSC, SDL, or wave diagrams) into the first-order format described earlier and simulate these models in user-defined environments.
• Generate test suites for a system defined by verified requirements specifications and validate implementations of the system against these test cases.

These facilities can be used in automated as well as in interactive mode. To determine the consistency and completeness of requirements for interactive systems we rely on the theory of interaction of agents and environments as the underlying formal machinery.

6.4.6.1 Algebraic Programming

The mathematical models described in Section 6.2 can be made more concrete by imposing structure on the state space of transition systems. A universal approach is to consider an algebraic structure on the set of states of a system. States are then represented by algebraic expressions, and transitions can conveniently be defined by (conditional) rewriting rules. A combination of conditional rewriting rules with a congruence on the set of algebraic expressions can be defined in terms of rewriting logic [32].
Most modern rewriting techniques are considered primarily in the context of equational theories, but they can also be applied to first-order or higher-order, clausal or nonclausal theorem proving. The main disadvantage of computations with such systems is their relatively weak performance; for instance, rewriting modulo associativity and commutativity (AC-matching) is NP-complete. Consequently, these systems are usually not powerful enough when "real-life" problems are considered. Our environment [105] supports reasoning in noncanonical rewriting systems. It is possible to combine arbitrary systems of rewriting rules with different rewrite strategies. The equivalence relation (basic congruence) on the set of algebraic expressions is introduced by means of interpreters for operations
which define a canonical form. The primary strategy of rewriting is a one-step syntactic rewriting with postcanonization, that is, reducing the rewritten node to this canonical form. All other strategies are combinations of the primary strategy with different traversals of the tree representing the term structure. Rewrite strategies can be chosen from a library of strategies or written as procedures or functions.

The generic AL [11,106] is used for the syntactical representation of agents as programs and is based on the behavior algebra defined in Section 6.2. The main syntactic constructs of AL are prefixing, nondeterministic choice, sequential composition, and parallel composition. Actions and procedure calls are primitive statements. The language provides the standard termination constants (successful termination, divergence, deadlock). Its semantics is parameterized by an intensional semantics, defined through an unfolding function for procedure calls, and an interaction semantics, defined by the insertion function of the environment into which the program will be inserted. The intensional semantics and the interaction semantics are defined as systems of rewriting rules. The intensional semantics of an AL program is an agent obtained by unfolding procedure calls in the program and defining transitions on a set of program states. It is defined independently of the environment, up to bisimulation, by means of rewriting rules for the unfolding function (unfolding rules). The left-hand side of an unfolding rule is an expression representing a procedure call. The right-hand side of an unfolding rule is an AL program which may be unfolded further, generating more and more exact approximations of the behavior under recursive computation. The only built-in compositions of AL are prefixing and nondeterministic choice. The unfolding of parallel and sequential compositions is flexible and can be adjusted by the user.
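As a loose illustration of unfolding rules, the following sketch (our own, not the AL implementation; the term shapes and function names are invented) unfolds a toy parallel composition into prefixing and nondeterministic choice, that is, into interleaving:

```python
# Toy behavior-algebra terms (invented for this sketch, not AL syntax):
#   ("nil",)            successful termination
#   ("pre", a, p)       prefixing a.p
#   ("choice", p, q)    nondeterministic choice p + q
#   ("par", p, q)       parallel composition, unfolded here as interleaving

def unfold_par(p, q):
    """One unfolding step of p || q into prefixing and choice."""
    branches = []
    for act, rest in initial_moves(p):
        branches.append(("pre", act, ("par", rest, q)))
    for act, rest in initial_moves(q):
        branches.append(("pre", act, ("par", p, rest)))
    if not branches:
        return ("nil",)
    result = branches[0]
    for b in branches[1:]:
        result = ("choice", result, b)
    return result

def initial_moves(p):
    """First actions of a term built from nil, prefixing, and choice."""
    tag = p[0]
    if tag == "nil":
        return []
    if tag == "pre":
        return [(p[1], p[2])]
    if tag == "choice":
        return initial_moves(p[1]) + initial_moves(p[2])
    if tag == "par":
        return initial_moves(unfold_par(p[1], p[2]))
    raise ValueError(tag)

p = ("pre", "a", ("nil",))
q = ("pre", "b", ("nil",))
print(unfold_par(p, q))
# ('choice', ('pre', 'a', ('par', ('nil',), ('pre', 'b', ('nil',)))),
#            ('pre', 'b', ('par', ('pre', 'a', ('nil',)), ('nil',))))
```

Excluding the interleaving branches and combining matching actions instead would, in the same style, give the synchronous alternative mentioned in the text.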
Alternatives for parallel composition are defined by the choice of the combination operator. For example, when the combination of arbitrary actions is the impossible action, parallel composition reduces to interleaving. On the other hand, excluding interleaving from the unfolding rules defines parallel composition as synchronization at each step (similar to handshaking in Milner's π-calculus).

The interaction semantics of AL programs is defined through the insertion function. Programs are considered up to insertion equivalence. The rewriting rules which define the insertion function (insertion rules) have the following structure: the left-hand side of an insertion rule is the state or behavior of the environment with a sequence of agents inserted into this environment (represented as AL programs); the right-hand side is a program in AL augmented by "calls" to the insertion function, denoted env(E, u), where E is an environment state expression and u is an AL program. To compute the interaction semantics of an AL program, one uses both the unfolding rules for procedure calls and the insertion rules to unfold calls to the insertion function.

In this approach, the environment is a semantic notion and is not explicitly included in the agent. Instead, the meaning of an agent is defined as the transformation of an environment which corresponds to inserting the agent into that environment. When the agent is inserted into the environment, the environment changes, and this change is considered to be a property of the agent described.

6.4.6.2 Simulating Transition Systems

The AL has been implemented by means of a simulator [10,106,107], an interactive program which generates all histories of an environment with inserted agents and which can explore the behavior of this environment step by step, starting from any possible initial state, with branching at nondeterministic points and backtracking to previous states.
The simulator permits forward and backward moves along histories; in automatic mode it can search for states satisfying predefined properties (deadlock, successful termination, etc.) or properties defined by the user. The generation of histories may be user-guided and thus permits the examination of different histories. The user can retrieve information about the current state of a system and change this state by inserting new agents using different insertion functions. Arbitrary data structures can be used for the representation of the states of an environment and the environment actions. The set of states of an environment is closed under the insertion function e[u], which is denoted in the simulator as env(e, u). The agent u is represented by an AL expression. Arbitrary algebraic data structures can be used for the representation of agent actions and procedure calls.
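The automatic search for states with a given property can be pictured as a depth-first exploration with backtracking; the following is a minimal sketch (the state space and function names are ours, not the simulator's interface):

```python
# Minimal exploration loop in the spirit of the simulator described here
# (our own sketch; states and transitions are invented for the example).

def successors(state):
    """Hypothetical one-step transition function: state -> [(action, state)]."""
    graph = {
        "s0": [("a", "s1"), ("b", "s2")],
        "s1": [("c", "s0")],
        "s2": [],           # no moves: a deadlock state
    }
    return graph[state]

def search(state, is_goal, history=()):
    """Depth-first search with backtracking; returns a history reaching a goal."""
    if is_goal(state):
        return list(history) + [state]
    for action, nxt in successors(state):
        if nxt in history:      # avoid revisiting along the current history
            continue
        found = search(nxt, is_goal, history + (state,))
        if found:
            return found
    return None

deadlock = lambda s: not successors(s)
print(search("s0", deadlock))  # ['s0', 's2']
```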
The core of the simulator is specified as a nondeterministic transition system that functions as an environment for the system model. Actions of the simulating environment are expressed by means of calls for services of the simulator. Local services define the one-step transitions of the simulated system. Global services permit the user to compute different properties of the behavior of a simulated system. The user can formulate a property of a state by means of a rewriting rule system or some other predicate function, and the simulator will search for a state satisfying the property among the states reachable from the current state. Examples of such properties are deadlock, successful termination, undefined states, and so on.

6.4.6.3 Theorem Proving

The proof system [108] is based on the interactive evidence algorithm [109–111], a Gentzen-style calculus with unification used for first-order reasoning. The interactive evidence algorithm is a sequent calculus and relies on the construction of an auxiliary goal as the main inference step, which allows easy control of the direction of the proof search at each step through the choice of auxiliary goals. This algorithm can be represented as a combination of two calculi: inference in the calculus of auxiliary goals is used as a single-step inference in the calculus of conditional sequents. In a sense, the interactive evidence algorithm generalizes logic programming: in the latter, auxiliary goals are extracted from Horn disjuncts, while in the interactive evidence algorithm they are extracted from arbitrary formulae with quantifiers (which need not be skolemized). The interactive evidence algorithm is implemented as a nondeterministic algebraic program extracted from the calculus, based on the simulator for AL.
This program is inserted as an agent into a control environment which searches for a proof, organizes the interaction with the user and the knowledge bases, and implements strategies and heuristics to speed up the proof search. The control environment contains the assumptions of a conditional sequent, so local information can be combined with other information taken from knowledge-base agents and used in search strategies. The prover is invoked by the function prove, implemented as a simple recursive procedure with backtracking, which takes an initial conditional sequent as argument and searches for a path from the initial statement to axioms; this path is then converted to a proof. The inference search is nondeterministic owing to the disjunction rules.

Predicates are considered up to the equivalence defined by all Boolean equations except distributivity. A function Can, defined by a system of rewriting rules, reduces predicate formulae as well as propositional formulae to a normal form. Predicate formulae are considered up to renaming of bound variables and the equations ¬(∃x)p = (∀x)¬p and ¬(∀x)p = (∃x)¬p. Associativity, commutativity, and idempotence of conjunction and disjunction, as well as the laws of contradiction, the excluded middle, and the laws for propositional constants, are used implicitly in these equations.
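The quantifier equations above can be illustrated by a small normalization sketch (our own, not the actual Can function, which also handles the Boolean equations):

```python
# Sketch: pushing negations through quantifiers using the equations
# ~(exists x) p = (forall x) ~p and ~(forall x) p = (exists x) ~p.
# Formulae are nested tuples; atoms and other connectives are left untouched.

def push_neg(f):
    tag = f[0]
    if tag == "not":
        g = f[1]
        if g[0] == "exists":
            return ("forall", g[1], push_neg(("not", g[2])))
        if g[0] == "forall":
            return ("exists", g[1], push_neg(("not", g[2])))
        if g[0] == "not":          # double negation
            return push_neg(g[1])
        return f
    if tag in ("exists", "forall"):
        return (tag, f[1], push_neg(f[2]))
    return f

f = ("not", ("exists", "x", ("not", ("forall", "y", ("p", "x", "y")))))
print(push_neg(f))  # ('forall', 'x', ('forall', 'y', ('p', 'x', 'y')))
```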
6.4.7 Consistency and Completeness

The notion of consistency of requirements is, in general, equivalent to the existence of an implementation or model of a system that satisfies these requirements. Completeness means that this model is unique up to some predefined equivalence. The traditional way of proving consistency is to develop a model coded in some programming or simulation language and to prove that this code is correct with respect to the requirements. However, direct proving of correctness is difficult because it demands computing the necessary invariant conditions for the states of a program. Another method is to generate the space of all states of a system reachable from the initial states and to check whether the dynamic requirements are satisfied in each state. This approach is known as model checking, and many systems which support model checking have been developed. Unfortunately, model checking is realistic only if the state space is finite and all reachable states can be generated in a reasonable amount of time.

Our approach proves consistency and completeness of requirements directly, without developing a model or implementation of the system. We prove that the static requirements define the system completely and that dynamic properties of consistent requirements are all the logical consequences of static
requirements. Based on this assumption, one can define an executable specification using only static requirements and then execute it using a simulator.

We distinguish between the consistency and completeness of static requirements and dynamic consistency. The first is defined in terms of static requirements only and reflects the property of a system to be deterministic with respect to actions of the environment. For example, a query from a client to a server, as the action of an inserted agent, can be selected nondeterministically, but the response must be defined by static requirements selected in a deterministic manner. When all dynamic requirements are consequences of the static requirements, we say the system is dynamically consistent.

Sufficient conditions for the consistency of static requirements depend on the subject domain and on implicit assumptions about the change of observable attributes. For example, for the classes of asynchronous systems considered previously, the condition for internal consistency is simply that the conjunction of two preconditions corresponding to different rules with the same action is not satisfiable. Completeness means that the disjunction of all preconditions of all rules corresponding to the same action is generally valid. For synchronous systems, on the other hand, it is the nonsatisfiability of the conjunction of two preconditions corresponding to rules which define conflicting changes to the same (usually binary) attribute.

The incompleteness of static requirements usually is not harmful; it merely postpones design decisions to the implementation stage. However, it is harmful if there exists an implementation which meets the static requirements but does not meet the dynamic requirements. Dynamic consistency of requirements (the invariance of dynamic conditions expressed using the temporal modality "always") can be proven inductively using the structure of static requirements.
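For requirements whose preconditions range over a finite set of Boolean attributes, both conditions can in principle be checked by exhaustive evaluation rather than theorem proving; a sketch under our own encoding of preconditions as Python predicates:

```python
from itertools import product

# Toy encoding (ours): each static requirement for one action contributes a
# precondition, a predicate over a valuation of Boolean attributes (a dict).

def consistent(preconds, attrs):
    """No two preconditions are simultaneously true for any valuation."""
    for vals in product([0, 1], repeat=len(attrs)):
        v = dict(zip(attrs, vals))
        if sum(1 for p in preconds if p(v)) > 1:
            return False
    return True

def complete(preconds, attrs):
    """Some precondition is true for every valuation."""
    for vals in product([0, 1], repeat=len(attrs)):
        v = dict(zip(attrs, vals))
        if not any(p(v) for p in preconds):
            return False
    return True

attrs = ["busy", "ready"]
reqs = [lambda v: v["busy"] == 1,      # e.g., "if busy, refuse the request"
        lambda v: v["busy"] == 0]      # e.g., "if not busy, accept it"
print(consistent(reqs, attrs), complete(reqs, attrs))  # True True
```

The prover performs the same checks symbolically, which is what makes infinite attribute domains tractable.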
Consistency checking proceeds by formulating and proving consistency conditions for every pair of static requirements with the same starting event. Every such pair of requirements must satisfy the condition that, for arbitrary values of the attributes, at least one of the two requirements has a false precondition or the postconditions are equivalent.

Completeness of requirements means that there exists exactly one model for the requirements up to some equivalence. We distinguish two main cases depending on the focus of the requirements specification. If the specification defines the environment, the equivalence of environments needs to be considered. Otherwise, if an agent is defined by the requirements, the equivalence of agents needs to be examined.

Let e and e' be two environment states (of the same or different environments). We say that e and e' are equivalent if for an arbitrary agent u the states e[u] and e'[u] are bisimilar (from the equivalence of two environment states it follows that for insertion equivalent agents u and u', e[u] and e'[u'] are also bisimilar). If there are restrictions on the possible behaviors of the agents, we consider admissible agents rather than arbitrary agents. Let E and E' be two environments (each being a set of environment states and an insertion function). These environments are equivalent if each state of one of the environments is equivalent to some state of the other.

If the set of requirements defines an agent for a given environment E, logical completeness (with respect to the agent definition) means that all agents satisfying these requirements are insertion equivalent with respect to the environment E; that is, if u and u' satisfy the requirements, then for all e ∈ E, e[u] ∼E e[u']. We check completeness for the set of all static requirements that refer to the same starting event.
Every such set of requirements must satisfy the condition that for arbitrary values of attributes there must be at least one among the requirements that is applicable with a true precondition.
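Bisimilarity, the equivalence used throughout this section, can be computed for finite-state systems by iterated refinement; a naive sketch of ours for finite labeled transition systems:

```python
# Naive bisimulation check for finite labeled transition systems (our sketch).
# An LTS is a dict: state -> set of (action, next_state) pairs.

def bisimilar(lts, s, t):
    """Greatest-fixpoint check: shrink the 'related' relation until stable."""
    states = list(lts)
    related = {(p, q) for p in states for q in states}
    changed = True
    while changed:
        changed = False
        for (p, q) in list(related):
            ok = (all(any(a == b and (p2, q2) in related
                          for (b, q2) in lts[q]) for (a, p2) in lts[p])
                  and
                  all(any(a == b and (p2, q2) in related
                          for (b, p2) in lts[p]) for (a, q2) in lts[q]))
            if not ok:
                related.discard((p, q))
                changed = True
    return (s, t) in related

lts = {
    "p0": {("a", "p1")}, "p1": set(),
    "q0": {("a", "q1"), ("a", "q2")}, "q1": set(), "q2": set(),
}
print(bisimilar(lts, "p0", "q0"))  # True
```

Checking insertion equivalence of agents reduces, for a fixed environment state e, to checking bisimilarity of e[u] and e[u'] in this sense.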
6.5 Examples and Results

Figure 6.7 shows a design process using the 3CR [104] tool set. The requirements for a system are represented as input text written in the formal requirements language or translated from engineering notations such as SDL or MSC. Static requirements are sent to the checker, which establishes their consistency and completeness.

FIGURE 6.7 Design process. (Static and dynamic requirements feed the checker and the prover; from verified requirements an executable specification is generated, which, together with behavior, environment, and structure models, drives the simulator, test generation, and validation.)

The checker analyzes a requirement statement and generates a logical statement expressing the consistency of the given requirement with the other requirements already accepted, as well as a statement expressing completeness after all static requirements have been accepted. This statement is then submitted to the prover, which searches for a proof. The prover may return one of three answers: proved, not proved, or unknown. In the case where consistency could not be proven, one of the following types of inconsistency is considered:

• Inconsistent formalization. This type of inconsistency can be eliminated through improved formalization if the postconditions are consistent for the states where all preconditions are true. Splitting the requirements can help.
• Inconsistency resulting from incompleteness. This is the case when two requirements are consistent, but the nonintersection of their preconditions cannot be proved because complete knowledge of the subject domain is not available. A discussion with experts or the authors of the requirements is recommended.
• Inconsistency. The preconditions intersect, but the postconditions are inconsistent after performing an action. This is a sign of a possible error, which can be corrected only by changing the requirements. If the intersection is not reachable, the inconsistency will never actually arise; in this case, a corresponding dynamic property can be formulated and proven.

Dynamic properties are checked after accepting all static requirements. These are logical statements expressing properties of a system in terms of first-order predicate calculus, extended by temporal modalities as well as higher-order functions and types. If an inductive proof is needed, all static requirements are used for generating lemmas to prove the inductive step. After checking the consistency and completeness of the static requirements, the requirements are used for the automatic generation of an executable specification of a system satisfying the static requirements.
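The consistency statement generated for a pair of static requirements is essentially the universally quantified negation of the conjunction of their preconditions; a sketch of emitting such proof obligations (requirement names and the formula syntax are invented for illustration):

```python
# Sketch (ours): a checker emitting pairwise consistency obligations for
# static requirements that share the same action.

def consistency_obligations(reqs):
    """reqs: list of (name, action, precondition_string).
    Yields one proof obligation per pair of requirements with the same action."""
    for i, (n1, act1, pre1) in enumerate(reqs):
        for n2, act2, pre2 in reqs[i + 1:]:
            if act1 == act2:
                yield (f"{n1}/{n2}",
                       f"Forall attrs (~(({pre1}) & ({pre2})))")

reqs = [
    ("R1", "activate", "task.state = suspended"),
    ("R2", "activate", "task.state = ready"),
    ("R3", "terminate", "task.state = running"),
]
for name, formula in consistency_obligations(reqs):
    print(name, ":", formula)
# R1/R2 : Forall attrs (~((task.state = suspended) & (task.state = ready)))
```

Each emitted formula would then be handed to the prover, which answers proved, not proved, or unknown.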
At this point, the dynamic requirements have already been proven to be consequences of static requirements, so the system also satisfies the dynamic requirements. The next step of system design would be the use of
the obtained information in the next stages of development. For example, executable specifications can be used to generate complete test cases for system testing.
6.5.1 Example: Embedded Operating System

In this section, we describe a general model which can be used for developing formal requirements for embedded operating systems such as OSEK [103]. The requirements for the OSEK operating system serve as an example of the application of the general methodology of checking consistency. These requirements comprise two documents: OSEK Concept and OSEK API. The first document contains an informal description of the conformance classes (BCC1, BCC2, ECC1, ECC2, ECC3) and requirements for the main services of the system. The second document refines the requirements in terms of C function headers and the types of service calls.

Two kinds of requirements can be distinguished in these documents. Static requirements define permanent properties of the operating system, which must be true for arbitrary states and any single-step transition. These requirements refer to the structure of operating system states and their changes in response to the performance of services. Dynamic requirements state global system properties, such as the absence of deadlocks or priority inversions.

Using the theory of interaction of agents and environments as the formalism for the description of OSEK, an environment consists of a processor (or processor network), an operating system, and the external world, which interacts with the environment via some kind of communication network; agents are tasks interacting with the operating system and the communication network via services. We use nondeterministic agents over a set of actions representing operating system services as models of tasks. The states of the environment are characterized by a set of observable attributes, with actions corresponding to the actions of task agents. Each attribute defines a partial function from the set E of environment states to the set of values D. E is considered as an algebra with a set of (internal or external) operations defined on it.
The domain D should be defined as abstractly as possible, for example, by means of set-theoretic constructions (functions, relations, powersets) over abstract data types represented as initial algebras, in order to be as independent as possible of the details of implementation when formulating the requirements specifications.

In monoprocessor systems only one agent is in the active state, that is, only one agent holds the processor resource. If e is a state of the environment with no active agents, then in the representation e[u] of the environment the state u is the state of an active agent. All other agents are in nonactive states (suspended and ready states for OSEK) and are included into the state e as parts of the values of attributes.

The properties of an environment can be divided into static and dynamic properties. Static properties define the one-step transitions of a system; dynamic properties define the properties of the total system. The general form of a rule for transitions is:

    e -c-> e',  u -a-> u'
    ---------------------
    e[u] -d-> e'[u']

In this rule d, e', and u' depend on parameters appearing in the assumptions. Usually a = c = d (synchronization) and the e' and u' of the conclusion are those of the premises, albeit there can be special cases such as hiding, scheduling points, or interrupt routines.

To define the properties of a transition e -c-> e' for the environment, we first define transition rules for the attributes, associating a transition system to each attribute. A state of the transition system associated with an attribute p is a pair p : v, where v is a value of the type associated with the attribute p. All transition systems are defined jointly, that is, the transitions of one attribute can depend on the current values or transitions of the other attributes. After defining the transitions for the attributes, the transitions for environment states must be
defined in such a way that the following general rule holds. Let p1, . . . , pn be the attributes of a state e of the environment, and let e · p1 = v1, . . . , e · pn = vn. Let e -c-> e', where e' is not a terminal state (Δ, ⊥, or 0), and let pi : vi -c-> pi : vi' for all i ∈ I ⊆ [1 : n], where I is the set of all indices for which such transitions are defined. Then e' · pi = vi' for i ∈ I and e' · pi = e · pi for i ∉ I. From this definition it follows that if I = ∅ and e -c-> e', then e' · pi = e · pi for all i ∈ [1 : n].

In the case when two states of the environment are bisimilar, this rule is sufficient to define the transitions of the environment. Otherwise, we can introduce a hidden part of the environment state and consider the transitions of attributes jointly with this hidden component. For space considerations, in Section 6.5.1.1 we show only the example of a simple scheduler applicable to this class of operating systems.

6.5.1.1 Requirements Specification for a Simple Scheduler

This example of a simplified operating system, providing initial loading and scheduling for tasks as well as interrupt processing, is used as a benchmark to demonstrate the approach for formalizing and checking the consistency of requirements. We use the terminology of OSEK [103]. The attributes of the scheduler are:

• Active, a name
• Priority, a partial function from names to natural numbers
• Ready, a list of name/agent pairs
• Call, a partial function from names to agents
The empty list and the everywhere-undefined function are both denoted as Nil. These attributes are defined only for nonterminal and deterministic states. The actions of task agents are calls for services:

• new_task(a, i), where a is the name of an agent and i is an integer
• activate a, where a is a name
• terminate
• schedule
In the following requirements we assume that the current state of the environment is e[u] and that u -c-> u' for a given service c. The values of the attributes are their values in the state e. We define the transitions e[u] -d-> e'[u'].

The actions of the environment include all task actions and, in addition, the following actions, which are specific to the environment and are addressed to an external observer of scheduler activity:

• loaded a, where a is a name
• activated a, where a is a name
• activate_error
• schedule_error
• terminated a, where a is a name
• schedule u, where u is an agent
• scheduled a, where a is a name
• wait
• start_interrupt
• end_interrupt
6.5.1.1.1 Requirements for new_task

This action replaces the old task with the same name if one was previously defined in the scheduler, or otherwise adds the task to the environment as a new task. Transitions for the attributes:

    priority : f -new_task(a:v,i)-> priority : f[a := i]
We use the following notation for the redefinition of functions: if f : X → Y and x ∈ X, then f[x := y] is a new function g such that g(x) = y and g(x') = f(x') for x' ≠ x (assignment for functions). If x ∉ X, it is first added to the domain of the function and then the assignment is performed.

    call : f -new_task(a:v,i)-> call : f[a := v]

Now the task agent v becomes the initial state of a task named a. new_task is defined by the following rule:

    e -new_task(a:v,i)-> e',  u -new_task(a:v,i)-> u'
    -------------------------------------------------
    e[u] -loaded a-> e'[u']

6.5.1.1.2 Requirements for Activate

We use the following notation: if p is an attribute whose value is a function and x is in the domain of this function, then p(x) denotes the current value of this function on x.

    call a = v
    ready : r -activate a-> ready : ord(a : v, r)

The function ord is defined on the set of lists of pairs (a : u), where a is a name and u is an agent, and it must satisfy the following system of axioms, in which all parameters are assumed to be universally quantified:

    ord(a : Δ, r) = r
    priority b ≤ priority a ⇒ ord(a : u, (b : v, r)) = (b : v, ord(a : u, r))

Hence ready is a queue of task agents ordered by priorities, and adding a pair (a : u) puts this pair as the last one among all pairs of the same priority as a. The rules are:

    e -activate a-> e',  u -activate a-> u',  a ∈ Dom(call)
    -------------------------------------------------------
    e[u] -activated a-> e'[u']

    u -activate a-> u',  a ∉ Dom(call)
    ----------------------------------
    e[u] -activate_error-> ⊥

An undefined state of the environment only means that a decision about the behavior of the environment in this case is left to the implementation stage. For instance, the definition can be extended so that the environment sends an error message and calls error-processing programs, or continues its functioning, ignoring the incorrect action.
6.5.1.1.3 Requirements for Terminate

    u -terminate-> u'
    -----------------------------------------
    e[u] -terminated (e.active)-> e[schedule]

6.5.1.1.4 Requirements for Schedule

Let P(u, b, v, s) = P1 ∨ P2, where

    P1 = (e.active ≠ Nil) ∧ (ord(e.active : u, e.ready) = (b : v, s))
    P2 = (e.active = Nil) ∧ (u = Δ) ∧ (e.ready = (b : v, s))
© 2006 by Taylor & Francis Group, LLC
System Validation
Let r = e.ready and a = e.active; then the rules for the attributes are:

              P(u, b, v, s)
    -----------------------------------
    ready : r --schedule u--> ready : s

              P(u, b, v, s)
    -------------------------------------
    active : a --schedule u--> active : b

Note that the transitions for attributes, and therefore for the environment, are highly nondeterministic, because the parameter u is an arbitrary agent behavior. But this nondeterminism disappears in the rule for scheduling, which restricts the possible values of u to at most one value. The rules are:

    P(u', b, v, s),  e --schedule u'--> e',  u --schedule--> u'
    -----------------------------------------------------------
                   e[u] --scheduled b--> e'[v]

    u --schedule--> u',  e.active = Nil ∧ u' ≠ Δ
    --------------------------------------------
             e[u] --schedule_error--> ⊥

    u --schedule--> Δ,  e.ready = Nil
    ---------------------------------
          e[u] --wait--> e'[Δ]

Therefore, if a task has no name (this can happen if a task is inserted into an environment initially), it can use scheduling only as its last action; otherwise it is an error. And if there is nothing to schedule, the scheduling action is ignored.

6.5.1.1.5 Interrupts
The simplest way to introduce interrupts into our model is to hide the occurrence of interrupts and the choice of when interrupt processing starts. Only the actions which mark the start and the end of interrupt processing are observable. The rules are:

             e --start_interrupt--> e'
    -------------------------------------------------
    e[u] --start_interrupt--> e'[v; end_interrupt; u]

We have no transitions for attributes labeled by the interrupt action, so in this transition e and e' have the same values for all attributes. The program v is an interrupt-processing routine.

        u --end_interrupt--> u'
    -------------------------------
    e[u] --end_interrupt--> e[u']

Nesting of interrupts can be of arbitrary depth. The action end_interrupt is an environment action, but it is used by the inserted agent, after an interrupt has started, to mark the end of interrupt processing. Therefore, the set of actions of an inserted agent is extended; still, end_interrupt is not an action of an agent before its insertion into the environment.
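The interrupt rules can be sketched as list surgery on the inserted agent's remaining behavior. This is a hypothetical encoding of our own, in Python, where an agent behavior is simply a list of pending actions: start_interrupt prefixes the routine v and an end_interrupt marker, so nested interrupts close in LIFO order, like brackets.

```python
# A hedged sketch (illustrative encoding, not the formal model) of the
# interrupt rule: start_interrupt rewrites the inserted agent's behavior
# u into the sequential composition  v ; end_interrupt ; u .

def start_interrupt(u, v):
    """Replace behavior u by the composition v; end_interrupt; u."""
    return list(v) + ["end_interrupt"] + list(u)


u0 = ["t1", "t2"]                         # the interrupted task's behavior
u1 = start_interrupt(u0, ["i1a", "i1b"])  # first interrupt routine
u2 = start_interrupt(u1, ["i2"])          # nested interrupt before i1 ends
print(u2)
# prints ['i2', 'end_interrupt', 'i1a', 'i1b', 'end_interrupt', 't1', 't2']
```

Scanning u2 from the head, each routine's actions sit strictly before its own end_interrupt marker, which is the bracket-like nesting demanded by the dynamic requirements below.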
Embedded Systems Handbook
6.5.1.1.6 Termination
When all tasks are successfully terminated, the scheduler reaches the waiting state:

    active : a --wait--> active : Nil

    e.ready = Nil,  e --wait--> e'
    ------------------------------
          e[Δ] --wait--> e'[Δ]

6.5.1.1.7 Dynamic Requirements
A state e of an environment is called initial if e.ready = e.active = Nil and the domains of the functions e.priority and e.call are empty. Let E0 be the set of all states reachable from the initial states. Define E_{n+1}, n = 0, 1, ..., as the set of all states reachable from the states e[u], where e ∈ E_n and u is an arbitrary task agent. The set E of admissible states is defined as the union E = E0 ∪ E1 ∪ .... Multiple insertion rules show that the insertion function is sequential. The dynamic requirements for environment states are as follows:

• E does not contain the deadlock state 0.
• There are no undefined states in E except for those which result from error actions.
• Tasks of the same priority are scheduled in FIFO order, tasks of a higher priority are scheduled first, and interrupt actions are nested like brackets.

6.5.1.1.8 Consistency
The only nonconstructive transition in the requirements specification of the simple scheduler is the insertion of an arbitrary agent as an interrupt-processing routine. If we restrict the corresponding transitions to a selection from some finite set (even a nondeterministic one), the requirements become executable. To prove dynamic properties, some invariant properties of E ("always" statements) must first be proved. After their formalization, the dynamic properties are inferred from these invariants:

• Dom(e.priority) = Dom(e.call)
• (a : u) ∈ e.ready ⇒ a ∈ Dom(e.priority)
• e.active ≠ Nil ⇒ e.active ∈ Dom(e.priority)
• e.ready is ordered by priority
In the invariants formulated above, e is assumed to be nonterminal.

6.5.1.2 Input Text to the Consistency Checker
The consistency checker accepts static requirements represented in the form of Hoare-style triples and dynamic requirements in the form of logical formulae. Requirements include the description of typed attributes and actions. The following input text (Code 6.7) was obtained from the description of the simple scheduler considered above. It is statically consistent and can be used for proving dynamic properties of the scheduler. Each requirement describes the change of a state of the environment, with the inserted agent represented as the value of the attribute active_task. The value of this attribute is the behavior of the previously inserted agent that is currently active. The predicate active_task --> a.u represents a transition in which the active task performs the action a and continues as u. The action axiom is needed to prove consistency for the action wait.

Code 6.7

attributes(
    active: name,
    priority: name -> Nat,
    ready: list of (name:agent),
    call: name -> agent,
    active_task: agent
);
actions(a:name,u:agent,i:int)(
    new_task(a:u,i), activate a, terminate, schedule,
    loaded a, activated a, activate_error, schedule_error,
    terminated a, schedule u, scheduled a, wait,
    start_interrupt, end_interrupt
);
Let action axiom: Forall x(~(x.Delta = Delta));
Let ord Delta: Forall(a,r)(ord(a:Delta,r) = r);
Let ord: Forall(a,b,u,v,r)(
    (priority b <= priority a) ->
    (ord(a:u,b:v,r) = (b:v,ord(a:u,r))));
/* ------------ new_task ------------------------------ */
req new_task: Forall(a:name, (u,v):agent, i:int)(
    (active_task --> new_task(a:v,i).u)
    -> after(loaded a)
       ((active_task = u) & (priority a = i) & (call a = v)));
/* ------------ activate ------------------------------ */
req activate success: Forall(a:name, (u,v):agent,
                             r:list of(name:agent))(
    ((active_task --> activate a.u) & (ready = r)
     & (call a = v) & ~(v = Nil))
    -> after(activated a)
       (active_task = u & ready = ord(a:v,r)));
req activate error: Forall(a:name, u:agent)(
    ((active_task --> activate a.u) & (call a = Nil))
    -> after activate_error bot);
/* ------------ terminate ----------------------------- */
req terminate: Forall(a:name, u:agent)(
    ((active_task --> terminate.u) & (active = a))
    -> after(terminated a) (active_task = schedule));
/* ------------ schedule ------------------------------ */
req schedule success active: Forall((u,v):agent,
                                    a:name, s:list of(name:agent))(
    ((active_task --> schedule.u) & ~(active = Nil)
     & (ord(active:u,ready) = (a:v,s)))
    -> after(scheduled a)
       ((active_task = u) & (active = a) & (ready = s)));
req schedule success not active: Forall(v:agent,
                                        a:name, s:list of(name:agent))(
    ((active_task = schedule) & (active = Nil)
     & (ready = (a:v,s)))
    -> after(scheduled a)
       ((active_task = v) & (active = a) & (ready = s)));
req schedule error: Forall(u:agent)(
    ((active_task --> schedule.u) & (active = Nil)
     & ~(u = Delta))
    -> after schedule_error bot);
req schedule final: Forall(v:agent, b:name, s:list of(name:agent))(
    ((active_task --> schedule.Delta) & (ready = Nil))
    -> after wait (active_task = Delta));
/* ------------ interrupt ------------------------------ */
req start interrupt: Forall((u,v):agent)(
    ((active_task = u) & (interrup_process = v))
    -> after start_interrupt
       (active_task = (v;end_interrupt;u)));
req end interrupt: Forall(u:agent)(
    (active_task --> end_interrupt.u)
    -> after end_interrupt (active_task = u));
/* ------------ termination --------------------------- */
req termination: Forall(u:agent)(
    (active_task = Delta) & (ready = Nil)
    -> after wait (active_task = Delta))
/* ------------ dynamic properties -------------------- */
prove always Forall(a:name)(
    a in_set Dom(priority) <-> a in_set Dom(call));
prove always Forall(a:name, u:agent)(
    (a:u) in_list(ready) -> a in_set Dom(priority));
prove always ~(active = Nil) -> active in_set Dom(priority);
prove always is_ord ready
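The four prove always goals at the end of the listing can be read as executable checks on concrete environment states. Below is a hedged Python sketch using our own state encoding, not the checker's input language; is_ord is taken here to mean that priority values are nondecreasing from the head of the ready queue, which is an assumption about the intended order.

```python
# A hedged sketch of the four "prove always" invariants, checked on
# concrete environment states (our own encoding, not the checker's
# language).

def invariants_hold(state):
    pr, call = state["priority"], state["call"]
    ok = set(pr) == set(call)                      # Dom(priority) = Dom(call)
    ok &= all(a in pr for a, _ in state["ready"])  # queued names are known
    ok &= state["active"] is None or state["active"] in pr
    prios = [pr[a] for a, _ in state["ready"]]
    ok &= prios == sorted(prios)                   # is_ord ready
    return ok


good = {"priority": {"a": 1, "b": 2}, "call": {"a": "va", "b": "vb"},
        "ready": [("a", "ua"), ("b", "ub")], "active": None}
bad = dict(good, ready=[("b", "ub"), ("a", "ua")])  # violates is_ord
print(invariants_hold(good), invariants_hold(bad))  # prints True False
```

Such checks do not replace the inductive proofs performed by the checker; they only test the invariants on individual reachable states, for instance during simulation.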
6.5.2 Experimental Results in Various Domains
We have developed specializations for the following subject domains: sequential asynchronous environments, parallel asynchronous environments, and sequential synchronous agents. We have conducted a number of projects in each domain to determine the effectiveness of formal requirements verification. Figure 6.8 shows the performance of our provers. We report the measurements in terms of MSC diagrams, a familiar engineering notation often used to describe embedded systems. The chart on the left shows performance in terms of "arrows," that is, communications between instances on an MSC diagram. Performance is roughly linear in the number of arrows, up to roughly 800 arrows per diagram. Note that a typical diagram has far fewer arrows, no more than a hundred in most cases. The chart on the right shows that performance is linear in the number of MSC diagrams (of typical size). Together, these charts indicate that the system scales to realistically sized applications.

6.5.2.1 OSEK
OSEK [103] is a representative example of an asynchronous sequential environment. The OSEK standard defines an open, embedded operating system for automotive electronics. The OSEK formal model has been described as an environment for application tasks of different types, considered as agents inserted into this environment. The actions common to agents and environment are the services of the operating system. The system is multitasking but has only one processor; only one task is running at any given moment, and the system is therefore considered sequential. The
[Figure 6.8 consists of two charts. Left: proving time per arrow (sec, 0.00 to 0.25) against the number of arrows per MSC diagram (0 to 1200). Right: total proving time (sec, 0 to 800) against the number of MSC diagrams (25 to 150).]

FIGURE 6.8 Performance of the prover in terms of MSC diagrams.
system is asynchronous because all actions performed by tasks independently of the operating system are not observable, and so the time between two services cannot be taken into account. Static requirements are represented by transition rules with preconditions and postconditions. The reachable states of OSEK can be characterized by integrity conditions. After developing the formal requirements for OSEK, the proof system was used to prove static consistency and completeness of the requirements. Several interesting dynamic properties of the requirements were also proven. The formalization of the OSEK requirements led to the discovery of 12 errors in the informal OSEK standard. For example, Section 6.7.5 of the OSEK/VDX specification [103] defines a transition related to the current priority of a task in the case when it has a priority less than the ceiling priority of the resource; however, no transition is defined for the case when the current priority of the task is equal to the ceiling priority. All these errors were documented, and the corrections have been integrated into the OSEK standard. In the formal specification, we covered 10 services defined by the OSEK standard and proved the consistency and completeness of this specification. This covers approximately 40% of the complete OSEK standard. Moreover, we found a number of mistakes in the other parts of the OSEK standard, which prevented formalization of the rest of the standard document. Consistency and completeness of the covered parts of the standard (49 requirements) were proven after making corrections for the above-mentioned defects. The proof of consistency took approximately 7 min on a Pentium III computer with 256 MB of RAM running the Red Hat Linux operating system.

6.5.2.2 RIO
The RapidIO Interconnect Protocol [112] is an example of a parallel asynchronous environment. It is a protocol that allows a set of processor elements to communicate with one another.
Three layers of abstraction are developed: the logic, transport, and physical layers. The static requirements for RIO are standard (pre- and postconditions referring to adjacent moments of time). But while in OSEK an action is uniquely determined by the running task, in RIO an observable action is generated by a nondeterministic choice of one of the processor elements. The formal requirements description of RIO for the logic layer (14 requirements) and the transport layer (6 requirements) was obtained from the documentation and proved consistent and complete (46 sec); 46 requirements for the physical layer were proven consistent in 8.5 min.

6.5.2.3 V'ger
The formal requirements for the protocol used by the SC-V'ger processor [113] for communicating with other processor elements of a system via the MAXbus bus device were extracted from the documentation of the MAXbus and from discussions with experts. V'ger is a representative example of a synchronous
sequential agent inserted into a parallel environment. V'ger is a deterministic automaton with binary input–output signals and shared data available from the bus. The attributes of the system are its input–output signals and its shared data. Originally there are no actions, and we can consider the clock signal synchronizing the system as the only observable action. Static requirements are written using assertion/deassertion conditions for output signals. Each requirement is a rule for setting a signal to a given value (0 or 1). The precondition is a history of conditions represented in a Kleene-like algebra with time. Several rules can be applied at the same moment. The static consistency condition requires that the preconditions of two rules which set the same attribute to different values can never be true in the same clock interval. There are no static completeness conditions, because we define the semantics of the requirements text so that if no rule changes an output value, it remains in the same state as at the previous moment of time. We use binary attribute symbols as predicates, and as long as there are no other predicate symbols, the system represents a propositional calculus. To prove statements with Kleene algebra expressions, these must first be reduced to first-order logic, that is, to requirements with preconditions referring to one moment of time (without histories). A converter has been developed for the automatic translation of subject domains relying on Kleene algebra and the interval calculus notation. The set of reachable states of V'ger is not definable in first-order logic, and the proof of the consistency condition is only a sufficient condition for consistency. A more powerful, yet still sufficient, condition is the provability of the consistency conditions by standard induction from the static requirements. There exists a sequence of increasingly powerful conditions which converge to the results obtained by model checking.
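For purely propositional preconditions, the static consistency condition described above reduces to an unsatisfiability check: two rules driving the same output signal to different values are consistent exactly when no input assignment satisfies both preconditions in the same clock interval. A brute-force Python sketch, with hypothetical signal names req, grant, and ack:

```python
# A hedged sketch of the V'ger-style static consistency check for purely
# propositional preconditions (signal names are illustrative): enumerate
# all assignments of the binary inputs and look for an overlap.

from itertools import product


def consistent(pre_set, pre_clear, inputs):
    """True iff pre_set and pre_clear are never simultaneously true."""
    for values in product([0, 1], repeat=len(inputs)):
        env = dict(zip(inputs, values))
        if pre_set(env) and pre_clear(env):
            return False
    return True


inputs = ["req", "grant"]
set_rule = lambda e: e["req"] and not e["grant"]  # rule driving ack := 1
clear_rule = lambda e: e["grant"]                 # rule driving ack := 0
print(consistent(set_rule, clear_rule, inputs))   # prints True

bad_clear = lambda e: e["req"]                    # overlaps with set_rule
print(consistent(set_rule, bad_clear, inputs))    # prints False
```

Histories in the Kleene-like algebra do not fit this brute-force scheme directly; as the text notes, they must first be reduced to single-moment preconditions before such a propositional check applies.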
All 26 V’ger requirements have been proven to be consistent (192 sec).
6.6 Conclusions and Perspectives
In this chapter, we reviewed tools and methods to ensure that the "right" system is developed, by which we mean a system that matches what the customer really wants. Systems that do not match customer requirements result, at best, in cost overruns owing to later changes to the system and, in the worst case, may never be deployed. Based on the mathematical model of the theory of agents and interactions, we developed a set of tools capable of establishing the consistency and completeness of system requirements. Roughly speaking, if the requirements are consistent, an implementation which meets the requirements is possible; if the requirements are complete, this implementation is defined uniquely by the requirements. We discussed how to represent requirements specifications for formal validation and presented experimental results from deploying these tools to establish the correctness of embedded software systems. This chapter also reviewed other models of system behavior and other tools for system validation and verification. Our experience has shown that dramatic quality improvements are possible through formal validation and verification of systems under development. In practice, deployment of these techniques requires increased upstream development effort: thorough analysis of requirements and their capture in specification languages result in a longer design phase. In addition, significant training and experience are needed before significant benefits can be achieved. Nevertheless, the improvements in quality and the reduction in effort in later development phases warrant this investment, as application of these methods in pilot projects has demonstrated.
References

[1] D. Harel and A. Pnueli. On the development of reactive systems. In K. Apt, Ed., Logics and Models of Concurrent Systems. NATO ASI Series, vol. 13. Springer-Verlag, pp. 477–498. [2] Z. Manna and A. Pnueli. The Temporal Logic of Reactive and Concurrent Systems. Springer-Verlag, Heidelberg, 1992. [3] Z. Manna and A. Pnueli. Temporal Verification of Reactive Systems: Safety. Springer-Verlag, Heidelberg, 1995.
[4] F.P. Brooks. The Mythical Man-Month: Essays on Software Engineering. Addison-Wesley, Reading, MA, 1995. [5] L. Lamport. Introduction to TLA. SRC Technical note 1994-001, 1994. [6] R.J. van Glabbeek. Notes on the methodology of CCS and CSP. Theoretical Computer Science, 177: 329–349, 1997. [7] D.M.R. Park. Concurrency and automata on infinite sequences. In Proceedings of the 5th GI Conference. Lecture Notes in Computer Science, vol. 104. Springer-Verlag, Heidelberg, 1981. [8] R. Milner. Communication and Concurrency. Prentice Hall, New York, 1989. [9] J.V. Kapitonova and A.A. Letichevsky. On constructive mathematical descriptions of subject domains. Cybernetics, 4: 408–418, 1988. [10] A.A. Letichevsky and D.R. Gilbert. Towards an implementation theory of nondeterministic concurrent languages. Second Workshop of the INTAS-93-1702 Project: Efficient Symbolic Computing, St Petersburg, October 1996. [11] A.A. Letichevsky and D.R. Gilbert. A general theory of action languages. Cybernetics and System Analysis, 1: 12–31, 1998. [12] R. Milner. The polyadic π-calculus: a tutorial. In F.L. Bauer, W. Brauer, and H. Schwichtenberg, Eds., Logic and Algebra of Specification. Springer-Verlag, Heidelberg, 1993, pp. 203–246. [13] C.A.R. Hoare. Communicating Sequential Processes. Prentice Hall, New York, 1985. [14] J.A. Bergstra and J.W. Klop. Process algebra for synchronous communication. Information and Control, 60: 109–137, 1984. [15] L. Lamport. The temporal logic of actions. ACM Transactions on Programming Languages and Systems, 16(3): 872–923, 1994. [16] A. Pnueli. The temporal logic of programs. In Proceedings of the 18th Annual Symposium on the Foundations of Computer Science, November 1977, pp. 46–52. [17] E. Emerson and J. Halpern. Decision procedures and expressiveness in the temporal logic of branching time. Journal of Computer and System Science, 30: 1–24, 1985. [18] M.J. Fisher and R.E. Ladner. Propositional modal logic of programs. 
In Proceedings of the 9th ACM Annual Symposium on Theory of Computing, pp. 286–294. [19] E. Emerson. Temporal and modal logic. In J. van Leeuwen, Ed., Handbook of Theoretical Computer Science. MIT Press, Cambridge, MA, 1991, pp. 997–1072. [20] R. Bryant. Graph-based algorithms for Boolean function manipulation. IEEE Transactions on Computers, 35: 677–691. [21] J. Burch, E. Clarke, K. McMillan, D. Dill, and L. Hwang. Symbolic model checking: 10^20 states and beyond. Information and Computation, 98: 142–170, 1992. [22] E. Clarke and E. Emerson. Synthesis of synchronization skeletons for branching time temporal logic. In The Workshop on Logic of Programs. Lecture Notes in Computer Science, vol. 131. Springer-Verlag, Heidelberg, 1981, pp. 128–143. [23] J. Quielle and J. Sifakis. Specification and verification of concurrent systems in CESAR. In Proceedings of the 5th International Symposium on Programming, pp. 142–158. [24] L. Lamport. What good is temporal logic? In R. Mason, Ed., Information Processing-83: Proceedings of the 9th IFIP World Computer Congress, Elsevier, 1983, pp. 657–668. [25] M. Abadi and L. Lamport. Composing specifications. ACM Transactions on Programming Languages and Systems, 15: 73–132, 1993. [26] W. Thomas. Automata on infinite objects. In J. van Leeuwen, Ed., Handbook of Theoretical Computer Science. MIT Press, Cambridge, MA, 1991, pp. 131–191. [27] A.P. Sistla, M. Vardi, and P. Wolper. The complementation problem for Büchi automata with application to temporal logic. Theoretical Computer Science, 49: 217–237, 1987. [28] M. Vardi and P. Wolper. An automata-theoretic approach to automatic program verification. In Proceedings of the 1st IEEE Symposium on Logic in Computer Science, pp. 332–344. [29] H. Rogers. Theory of Recursive Functions and Effective Computability. McGraw-Hill, New York, 1967.
[30] Y. Gurevich. Evolving algebras: an attempt to discover semantics. In G. Rozenberg and A. Salomaa, Eds., Current Trends in Theoretical Computer Science, World Scientific, River Edge, NJ, 1993, pp. 266–292. [31] Y. Gurevich. Evolving algebras 1993: Lipari guide. In E. Börger, Ed., Specification and Validation Methods. University Press, 1995, pp. 9–36. [32] J. Meseguer. Conditional rewriting logic as a unified model of concurrency. Theoretical Computer Science, 96: 73–155, 1992. [33] P. Lincoln, N. Marti-Oliet, and J. Meseguer. Specification, transformation and programming of concurrent systems in rewriting logic. In G. Bleloch et al., Eds., Proceedings of the DIMACS Workshop on Specification of Parallel Algorithms. American Mathematical Society, Providence, 1994. [34] M. Clavel. Reflection in General Logics and Rewriting Logic with Application to the Maude Language. Ph.D. thesis, University of Navarra, 1998. [35] M. Clavel and J. Meseguer. Axiomatizing reflective logics and languages. In G. Kiczales, Ed., Reflection'96. 1996, pp. 263–288. [36] M. Clavel, F. Durán, S. Eker, P. Lincoln, N. Martí-Oliet, J. Meseguer, and J. Quesada. Towards Maude 2.0. In F. Futatsugi, Ed., Proceedings of the 3rd International Workshop on Rewriting Logic and its Applications. Electronic Notes in Theoretical Computer Science, vol. 36, Elsevier, 2000. [37] J. Meseguer and P. Lincoln. Introduction in Maude. Technical report, SRI International, 1998. [38] J. Brackett. Software Requirements. Technical report SEI-CM-19-1.2, Software Engineering Institute, 1990. [39] B. Boehm. Industrial software metrics top 10 list. IEEE Software, 4: 84–85, 1987. [40] B. Boehm. Software Engineering Economics. Prentice Hall, New York, 1981. [41] J.C. Kelly, S.S. Joseph, and H. Jonathan. An analysis of defect densities found during software inspections. Journal of Systems Software, 17: 111–117, 1992. [42] R. Lutz. Analyzing requirements errors in safety-critical embedded systems.
In IEEE International Symposium Requirements Engineering, San Diego, 1993, pp. 126–133. [43] T. DeMarco. Structured Analysis and System Specification. Yourdon Press, New York, 1979. [44] C.V. Ramamoorthy, A. Prakash, W. Tsai, and Y. Usuda. Software engineering: problems and perspectives. Computer, 17: 191–209, 1984. [45] M.E. Fagan. Design and code inspections to reduce errors in program development. IBM Systems Journal, 15: 182–211, 1976. [46] M.E. Fagan. Advances in software inspection. IEEE Transactions on Software Engineering, 12: 744–751, 1986. [47] J. Rushby. Formal Methods and their Role in the Certification of Critical Systems. Technical report CSL-95-1, March 1995. [48] C.B. Jones. Systematic Software Development Using VDM. Prentice Hall, New York, 1990. [49] J.M. Spivey. Understanding Z: A Specification Language and its Formal Semantics. Cambridge University Press, London, 1988. [50] J.-R. Abrial. The B-Book: Assigning Programs to Meanings. Cambridge University Press, London, 1996. [51] International Organization for Standardization — Information Processing Systems — Open Systems Interconnection. Lotos — A Formal Description Technique Based on the Temporal Ordering of Observational Behavior. ISO Standard 8807. Geneva, 1988. [52] R.S. Boyer and J.S. Moore. A Computational Logic Handbook. Academic Press, New York, 1988. [53] M.J.C. Gordon and T.F. Melham, Eds., Introduction to HOL. Cambridge University Press, London, 1993. [54] D. Craigen, S. Kromodimoeljo, I. Meisels, B. Pase, and M. Saaltink. EVES: an overview. In VDM'91: Formal Software Development Methods. Lecture Notes in Computer Science, vol. 551. Springer-Verlag, Heidelberg, 1991, pp. 389–405. [55] M. Saaltink, S. Kromodimoeljo, B. Pase, D. Craigen, and I. Meisels. Data abstraction in EVES. In Formal Methods Europe'93, Odense, April 1993.
[56] S. Owre, N. Shankar, and J.M. Rushby. User Guide for the PVS Specification and Verification System. Technical report, SRI International, 1996. [57] E. Clarke, O. Grumberg, and D. Peled. Model Checking. MIT Press, Cambridge, MA, 2000. [58] P. Godefroid. VeriSoft: A tool for the automatic analysis of concurrent reactive software. In Proceedings of the 9th Conference on Computer Aided Verification. Lecture Notes in Computer Science, vol. 1254. Springer-Verlag, Heidelberg, 1997, pp. 476–479. [59] J. Burch, E. Clarke, D. Long, K. McMillan, and D. Dill. Symbolic model checking for sequential circuit verification. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 13(4): 401–424, 1994. [60] G. Holzmann. The SPIN Model Checker, Primer and Reference Manual. Addison-Wesley, Reading, MA, 2004. [61] S.J. Garland and J.V. Guttag. A Guide to LP, the Larch Prover. Technical report, DEC Systems Research Center Report 82, 1991. [62] J. Crow, S. Owre, J. Rushby, N. Shankar, and M. Srivas. A tutorial introduction to PVS. In WIFT ’95: Workshop on Industrial-Strength Formal Specification Techniques. Boca Raton, FL, April 1995. [63] S. Rajan, N. Shankar, and M. Srivas. An integration of model checking with automated proof checking. In Proceedings of the 7th International Conference on Computer Aided Verification — CAV ’95. Lecture Notes in Computer Science, vol. 939. Springer-Verlag, Heidelberg, 1995, pp. 84–97. [64] B. Berard, Ed., Systems and Software Verification: Model-Checking Techniques and Tools. SpringerVerlag, Heidelberg, 2001. [65] International Telecommunications Union. Recommendation Z.120 — Message Sequence Charts. Geneva, 2000. [66] Object Management Group. Unified Modeling Language Specification, 2.0. 2003. [67] J. Hooman. Towards formal support for UML-based development of embedded systems. In Proceedings of the 3rd PROGRESS Workshop on Embedded Systems, 2002, pp. 71–76. [68] M. Bozga, J. Fernandez, L. Ghirvth, S. Graf, J.P. Krimm, L. 
Mounier, and J. Sifakis. IF: an intermediate representation for SDL and its applications. In Proceedings of the 9th SDL Forum, Montreal, June 1999. [69] F. Regensburger and A. Barnard. Formal verification of SDL systems at the Siemens mobile phone department. In Tools and Algorithms for the Construction and Analysis of Systems — ACAS’98. Lecture Notes in Computer Science, vol. 1384. Springer-Verlag, Heidelberg, 1998, pp. 439–455. [70] O. Shumsky and L. J. Henschen. Developing a framework for verification, simulation and testing of SDL specifications. In M. Kaufmann and J.S. Moore, Eds., Proceedings of the ACL2 Workshop 2000, Austin, 2000. [71] P. Baker, P. Bristow, C. Jervis, D. King, and B. Mitchell. Automatic generation of conformance tests from message sequence charts. In Proceedings of the 3rd SAM (SDL And MSC) Workshop, Telecommunication and Beyond, Aberystwyth. Lecture Notes in Computer Science, 2003, p. 2599. [72] B. Mitchell, R. Thomson, and C. Jervis. Phase automaton for requirements scenarios. In Proceedings of the Feature Interactions in Telecommunications and Software Systems, vol. VII, 2003, pp. 77–87. [73] L. Philipson and L. Hogskola. Survey compares formal verification tools. EETIMES, 2001. http://www.eetimes.com/story/OEG20011128S0037 [74] S. Yovine. Kronos: A verification tool for real-time systems. International Journal of Software Tools for Technology Transfer, 1: 123–133, 1997. [75] P. Pettersson and K. Larsen. UPPAAL2k. Bulletin of the European Association for Theoretical Computer Science, 70: 40–44, 2000. [76] D. Bjorner and C.B. Jones, Eds., The Vienna development method: the meta-language. In Logic Programming. Lecture Notes in Computer Science, vol. 61. Springer-Verlag, Heidelberg, 1978. [77] Y. Ledru and P.-Y. Schobbens. Applying VDM to large developments. ACM SIGSOFT Software Engineering Notes, 15: 55–58, 1990.
[78] A. Puccetti and J.Y. Tixadou. Application of VDM-SL to the development of the SPOT4 programming messages generator. FM ’99: World Congress on Formal Methods, VDM Workshop, Toulouse, 1999. [79] J.C. Bicarregui and B. Ritchie. Reasoning about VDM developments using the VDM support tool in Mural. In VDM 91: Formal Software Development Methods. Lecture Notes in Computer Science, vol. 551. Springer-Verlag, Heidelberg, 1991, pp. 371–388. [80] A. Diller. Z: An Introduction to Formal Methods. John Wiley & Sons, New York, 1990. [81] W. Grieskamp, M. Heisel, and H. Dorr. Specifying embedded systems with statecharts and Z: an agenda for cyclic software components. In Proceedings of the Formal Aspects of Software Engineering — FASE ’98. Lecture Notes in Computer Science, vol. 1382. Springer-Verlag, Heidelberg, 1998. [82] D. Bert, S. Boulmé, M.-L. Potet, A. Requet, and L. Voisin. Adaptable translator of B specifications to embedded C programs. In Formal Methods 2003. Lecture Notes in Computer Science, vol. 2805. Springer-Verlag, Heidelberg, 2003, pp. 94–113. [83] R. Milne. The Semantic Foundations of the RAISE Specification Language. RAISE report REM/11, STC Technology, 1990. [84] M. Nielsen, K. Havelund, K. Wagner, and C. George. The RAISE language, methods, and tools. Formal Aspects of Computing, 1: 85–114, 1989. [85] T. Mossakowski, Kolyang, and B. Krieg-Bruckner. Static semantic analysis and theorem proving for CASL. In F. Parisi Presicce, Ed., Proceedings of the 12th Workshop on Algebraic Development Techniques. Lecture Notes in Computer Science, vol. 1376. Springer-Verlag, Heidelberg, 1998, pp. 333–348. [86] P.D. Mosses. COFI: the common framework initiative for algebraic specification and development. In TAPSOFT’97: Theory and Practice of Software Development. Lecture Notes in Computer Science. vol. 1214. Springer-Verlag, Heidelberg, 1997, pp. 115–137. [87] B. Krieg-Brückner, J. Peleska, E. Olderog, and A. Baer. 
The UniForM workbench, a universal development environment for formal methods. In J. Wing, J. Woodcock, and J. Davies, Eds., FM'99, Formal Methods. Lecture Notes in Computer Science, vol. 1709. Springer-Verlag, Heidelberg, 1999, pp. 1186–1205. [88] C.L. Heitmeyer, J. Kirby, and B. Labaw. Tools for formal specification, verification and validation of requirements. In Proceedings of the 12th Annual Conference on Computer Assurance, Gaithersburg, June 1997. [89] S. Easterbrook, R. Lutz, R. Covington, Y. Ampo, and D. Hamilton. Experiences using lightweight formal methods for requirements modeling. IEEE Transactions on Software Engineering, 24: 4–14, 1998. [90] L.C. Paulson. Isabelle: A Generic Theorem Prover. Lecture Notes in Computer Science, vol. 828. Springer-Verlag, Heidelberg, 1994, pp. 23–34. [91] B.J. Krämer and N. Völker. A highly dependable computer architecture for safety-critical control applications. Real-Time Systems Journal, 13: 237–251, 1997. [92] D. Muthiayen. Real-Time Reactive System Development — A Formal Approach Based on UML and PVS. Technical report, Concordia University, 2000. [93] P.B. Jackson. The Nuprl Proof Development System, Reference Manual and User Guide. Cornell University, Ithaca, NY, 1994. [94] L. Cortes, P. Eles, and Z. Peng. Formal coverification of embedded systems using model checking. In Proceedings of the 26th EUROMICRO Conference, Maastricht, September 2000, pp. 106–113. [95] G. Holzmann. Design and Validation of Computer Protocols. Prentice Hall, New York, 1991. [96] G. Holzmann. The model checker SPIN. IEEE Transactions on Software Engineering, 23: 3–20, 1997. [97] R. Kurshan. Automata-Theoretic Verification of Coordinating Processes. Princeton University Press, Princeton, NJ, 1993. [98] R. de Simone and M. Lara de Souza. Using partial-order methods for the verification of behavioural equivalences. In G. von Bochmann, R. Dssouli, and O. Rafiq, Eds., Formal Description Techniques VIII, 1995.
System Validation
6-55
[99] J. Fernandez, H. Garavel, A. Kerbrat, R. Mateescu, L. Mounier, and M. Sighireanu. CADP: a protocol validation and verification toolbox. In Proceedings of the 8th Conference on ComputerAided Verification. New Brunswick, August 1996, pp. 437–440. [100] D. Dill, A. Drexler, A. Hu, and C. Yang. Protocol verification as a hardware design aid. In IEEE International Conference on Computer Design: VLSI in Computers and Processors. October 1992, pp. 522–525. [101] E. Astegiano and G. Reggio. Formalism and method. Theoretical Computer Science, 236: 3–34, 2000. [102] Z. Chaochen, C.A.R. Hoare, and A.P. Ravn. A calculus of durations. Information Processing Letter, 40: 269–276, 1991. [103] OSEK Group. OSEK/VDX. Operating System.Version 2.1. May 2000. [104] S.N. Baranov, V. Kotlyarov, J. Kapitonova, A. Letichevsky, and V. Volkov. Requirement capturing and 3CR approach. In Proceedings of the 26th International Computer Software and Applications Conference, Oxford, 2002, pp. 279–283. [105] J.V. Kapitonova, A.A. Letichevsky, and S.V. Konozenko. Computations in APS. Theoretical Computer Science, 119: 145–171, 1993. [106] D.R. Gilbert and A.A. Letichevsky. A universal interpreter for nondeterministic concurrent programming languages. In M. Gabbrielli, Ed., Fifth Compulog Network Area Meeting on Language Design and Semantic Analysis Methods, September 1996. [107] T. Valkevych, D.R. Gilbert, and A.A. Letichevsky. A generic workbench for modelling the behaviour of concurrent and probabilistic systems. In Workshop on Tool Support for System Specification, Development and Verification, TOOLS98, Malente, June 1998. [108] A.A. Letichevsky, J.V. Kapitonova, and V.A. Volkov. Deductive tools in algebraic programming system. Cybernetics and System Analysis, 1: 12–27, 2000. [109] A. Degtyarev, A. Lyaletski, and M. Morokhovets. Evidence algorithm and sequent logical inference search. In H. Ganzinger, D. McAllester, and A. 
Voronkov, Eds., Logic for Programming and Automated Reasoning (LPAR’99). Lecture Notes in Computer Science, vol. 1705. Springer-Verlag, 1999, pp. 44–61. [110] V.M. Glushkov, J.V. Kapitonova, A.A. Letichevsky, K.P. Vershinin, and N.P. Malevanyi. Construction of a practical formal language for mathematical theories. Cybernetics, 5: 730–739, 1972. [111] V.M. Glushkov. On problems of automata theory and artificial intelligence. Cybernetics, 5: 3–13, 1970. [112] Motorola. RIO Interconnect Globally Shared Memory Logical Specification. Motorola, 1999. [113] Motorola. SC-V’ger Microprocessor Implementation Definition. Motorola, 1997. [114] S. Abramsky. A domain equation for bisimulation. Information and Computation, 92: 161–218, 1991. [115] R. Alur and D. Dill. A theory of timed automata. Theoretical Computer Science, 126: 183–235, 1994. [116] S.N. Baranov, C. Jervis, V. Kotlyarov, A. Letichevsky, and T. Weigert. Leveraging UML to deliver correct telecom applications. In L. Lavagno, G. Martin, and B. Selic, Eds., UML for Real: Design of Embedded Real-Time Systems. Kluwer Academic Publishers, Amsterdam, 2003. [117] J. Bicarregui, T. Dimitrakos, B. Matthews, T. Maibaum, K. Lano, and B. Ritchie. The VDM+B project: objectives and progress. In World Congress on Formal Methods in the Development of Computing Systems. Toulouse, September 1999. [118] G. Booch, J. Rumbaugh, and I. Jacobson. Unified Modeling Language User Guide. Addison-Wesley, Reading, MA, 1997. [119] S. Chandra, P. Godefroid, and C. Palm. Software model checking in practice: an industrial case study. In Proceedings of the International Conference on Software Engineering, Orlando, May 2002.
© 2006 by Taylor & Francis Group, LLC
Embedded Systems Handbook
Design and Verification Languages

7  Languages for Embedded Systems
   Stephen A. Edwards

8  The Synchronous Hypothesis and Synchronous Languages
   Dumitru Potop-Butucaru, Robert de Simone, and Jean-Pierre Talpin

9  Introduction to UML and the Modeling of Embedded Systems
   Øystein Haugen, Birger Møller-Pedersen, and Thomas Weigert

10 Verification Languages
   Aarti Gupta, Ali Alphan Bayazit, and Yogesh Mahajan
7
Languages for Embedded Systems

Stephen A. Edwards
Columbia University

7.1 Introduction ................................................ 7-1
7.2 Software Languages .......................................... 7-2
      Assembly Languages • The C Language • C++ • Java •
      Real-Time Operating Systems
7.3 Hardware Languages .......................................... 7-8
      Verilog • VHDL
7.4 Dataflow Languages .......................................... 7-12
      Kahn Process Networks • Synchronous Dataflow
7.5 Hybrid Languages ............................................ 7-14
      Esterel • SDL • SystemC
7.6 Summary ..................................................... 7-18
References ...................................................... 7-18
7.1 Introduction

An embedded system is a computer masquerading as a non-computer that must perform a small set of tasks cheaply and efficiently. A typical system might have communication, signal processing, and user interface tasks to perform. Because the tasks must solve diverse problems, a language general-purpose enough to solve them all would be difficult to write, analyze, and compile. Instead, a variety of languages have evolved, each best suited to a particular problem domain. The most obvious divide is between languages for software and hardware, but there are others. For example, a language for signal processing is often more convenient for a particular problem than, say, assembly, but might be poor for control-dominated behavior. This chapter describes popular hardware, software, dataflow, and hybrid languages, each of which excels at certain problems. Dataflow languages are good for signal processing, and hybrid languages combine ideas from the other three classes. Due to space constraints, this chapter only describes the main features of each language. The author's book on the subject [1] provides many more details on all of these languages.
Some of this chapter originally appeared in the Online Symposium for Electrical Engineers (OSEE).
7.2 Software Languages

Software languages describe sequences of instructions for a processor to execute (Table 7.1). As such, most consist of sequences of imperative instructions that communicate through memory: an array of numbers that hold their values until changed. Each machine instruction typically does little more than, say, add two numbers, so high-level languages aim to specify many instructions concisely and intuitively. Arithmetic expressions are typical: coding an expression such as ax² + bx + c in machine code is straightforward, tedious, and best done by a compiler. The C language provides such expressions, control-flow constructs such as loops and conditionals, and recursive functions. The C++ language adds classes as a way to build new data types, templates for polymorphic code, exceptions for error handling, and a standard library of common data structures. Java is a still higher-level language that provides automatic garbage collection, threads, and monitors to synchronize them.
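To make the point concrete, the quadratic above is a single C expression; the compiler alone turns it into the load, multiply, and add instruction sequence. A minimal sketch (the function name is ours, not from the chapter):

```cpp
// The quadratic ax^2 + bx + c as one C expression, written in
// Horner form so the compiler emits two multiplies and two adds.
double quadratic(double a, double b, double c, double x) {
    return (a * x + b) * x + c;
}
```

For instance, quadratic(1, 2, 3, 2) evaluates 1·4 + 2·2 + 3 = 11.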
7.2.1 Assembly Languages

An assembly language program (Figure 7.1) is a list of processor instructions written in a symbolic, human-readable form. Each instruction consists of an operation such as addition along with some operands. For example, add r5,r2,r4 might add the contents of registers r2 and r4 and write the result to r5. Such arithmetic instructions are executed in order, but branch instructions can perform conditionals and loops by changing the processor's program counter, that is, the address of the instruction being executed.

A processor's assembly language is defined by its opcodes, addressing modes, registers, and memories. The opcode distinguishes, say, addition from conditional branch, and an addressing mode defines how and where data is gathered and stored (e.g., from a register or from a particular memory location). Registers can be thought of as small, fast, easy-to-access pieces of memory.

There are roughly four categories of modern assembly languages (Table 7.2). The oldest are those for the so-called complex instruction set computers, or CISC. These are characterized by a rich set of instructions and addressing modes. For example, a single instruction in Intel's x86 family, a typical CISC processor, can add the contents of a register to a memory location whose address is the sum of two other registers and a constant offset. Such instruction sets are usually convenient for human programmers, who are generally fairly skilled at using a heterogeneous set of tools, and the code itself is usually quite compact. Figure 7.1(a) illustrates a small program in x86 assembly. By contrast, reduced instruction set computers (RISC) tend to have fewer instructions and much simpler addressing modes.
The philosophy is that while you generally need more RISC instructions to accomplish something, it is easier for a processor to execute them because it does not need to deal with the complex cases, and easier for a compiler to produce them because they are simpler and more uniform. Figure 7.1(b) illustrates a small program in SPARC assembly.

TABLE 7.1 Software Language Features Compared

                           C     C++    Java
Expressions                •      •      •
Control-flow               •      •      •
Recursive functions        •      •      •
Exceptions                 ◦      •      •
Classes and inheritance           •      •
Templates                         •      ◦
Namespaces                        •      •
Multiple inheritance              •      ◦
Threads and locks                        •
Garbage collection                       •

Note: •, full support; ◦, partial support.
(a)
        jmp   L2
L1:     movl  %ebx, %eax
        movl  %ecx, %ebx
L2:     xorl  %edx, %edx
        divl  %ebx
        movl  %edx, %ecx
        testl %ecx, %ecx
        jne   L1

(b)
        mov   %i0, %o1
        b     .LL3
        mov   %i1, %i0
.LL5:   mov   %o0, %i0
.LL3:   mov   %o1, %o0
        call  .rem, 0
        mov   %i0, %o1
        cmp   %o0, 0
        bne   .LL5
        mov   %i0, %o1

FIGURE 7.1 Euclid's algorithm in (a) i386 assembly (CISC) and (b) SPARC assembly (RISC). SPARC has more registers and must call a routine to compute the remainder (the i386 has a division instruction). The complex addressing modes of the i386 are not shown in this example.

TABLE 7.2 Typical Modern Processor Architectures

CISC      RISC      DSP          Microcontroller
x86       SPARC     TMS320       8051
68000     MIPS      DSP56000     PIC
          ARM       ADSP-21xx    AVR

(a)
        move   #samples, r0
        move   #coeffs, r4
        move   #n-1, m0
        move   m0, m4
        movep  y:input, x:(r0)
        clr    a                x:(r0)+, x0   y:(r4)+, y0
        rep    #n-1
        mac    x0,y0,a          x:(r0)+, x0   y:(r4)+, y0
        macr   x0,y0,a          (r0)-
        movep  a, y:output

(b)
START:  MOV   SP, #030H
        ACALL INITIALIZE
        ORL   P1, #0FFH
        SETB  P3.5
LOOP:   CLR   P3.4
        SETB  P3.3
        SETB  P3.4
WAIT:   JB    P3.5, WAIT
        CLR   P3.3
        MOV   A, P1
        ACALL SEND
        SETB  P3.3
        AJMP  LOOP

FIGURE 7.2 (a) A finite impulse response filter in DSP56001 assembly. The mac instruction (multiply and accumulate) does most of the work, multiplying registers X0 and Y0, adding the result to accumulator A, fetching the next sample and coefficient from memory, and updating circular buffer pointers R0 and R4. The rep instruction repeats the mac instruction in a zero-overhead loop. (b) Writing to a parallel port in 8051 microcontroller assembly. This code takes advantage of the 8051's ability to operate on single bits.
The third category of assembly languages arises from more specialized processor architectures such as digital signal processors (DSPs) and very long instruction word (VLIW) processors. The operations in these instruction sets are simple like those in RISC processors (e.g., add two registers), but they tend to be very irregular (only certain registers may be used with certain operations) and support a much higher degree of instruction-level parallelism. For example, Motorola's DSP56001 can, in a single instruction, multiply two registers, add the result to a third, load two registers from memory, and increase two circular buffer pointers. However, the instruction severely limits which registers (and even which memory) it may use. Figure 7.2(a) shows a filter implemented in 56001 assembly.
The fourth category includes instruction sets on small (4- and 8-bit) microcontrollers. In some sense, these combine the worst of all worlds: there are few instructions and each cannot do much, much like a RISC processor, and there are also significant restrictions on which registers can be used when, much like a CISC processor. The main advantage of such instruction sets is that they can be implemented very cheaply. Figure 7.2(b) shows a routine that writes to a parallel port in 8051 assembly.
7.2.2 The C Language

C is currently the most popular language for embedded system programming. C compilers exist for virtually every general-purpose processor, from the lowliest 4-bit microcontroller to the most powerful 64-bit processor for compute servers. C was originally designed by Dennis Ritchie [2] as an implementation language for the Unix operating system being developed at Bell Labs for a 24K DEC PDP-11. Because the language was designed for systems programming, it provides very direct access to the processor through such constructs as untyped pointers and bit-manipulation operators, things appreciated today by embedded systems programmers. Unfortunately, the language also has many awkward aspects, such as the need to define everything before it is used, that are holdovers from the cramped execution environment in which it was first implemented.

A C program (Figure 7.3) contains functions built from arithmetic expressions structured with loops and conditionals. Instructions in a C program run sequentially, but control-flow constructs such as loops and conditionals can affect the order in which instructions execute. When control reaches a function call in an expression, control is passed to the called function, which runs until it produces a result, and control returns to continue evaluating the expression that called the function.

C derives its types from those a processor manipulates directly: signed and unsigned integers ranging from bytes to words, floating point numbers, and pointers. These can be further aggregated into arrays and structures: groups of named fields.

C programs use three types of memory. Space for global data is allocated when the program is compiled, the stack stores automatic variables allocated and released when their function is called and returns, and the heap supplies arbitrarily-sized regions of memory that can be deallocated in any order.
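The three kinds of memory can be sketched in a few lines of C-style code (our illustration, not one of the chapter's figures; it compiles as C or C++):

```cpp
#include <cstdlib>
#include <cstring>

int table[4];                    // global: space reserved at compile/link time

int sum_to(int n) {              // total and i are automatic variables:
    int total = 0;               // allocated on the stack at each call,
    for (int i = 1; i <= n; i++) // released when the call returns
        total += i;
    return total;
}

char *duplicate(const char *s) { // heap: explicitly allocated regions that
    char *copy = (char *)malloc(strlen(s) + 1);  // may be freed in any order
    if (copy) strcpy(copy, s);
    return copy;
}
```

A caller of duplicate owns the returned block and must eventually free it; the stack and global storage, by contrast, are managed implicitly.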
The C language is an ISO standard, but most people consult the book by Kernighan and Ritchie [3]. C succeeds because it can be compiled into very efficient code and because it allows the programmer almost arbitrarily low-level access to the processor when necessary. As a result, virtually every function can be written in C (exceptions include those that must manipulate specific processor registers) and can be expected to be fairly efficient. C’s simple execution model also makes it fairly easy to estimate the efficiency of a piece of code and improve it if necessary.
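The low-level access mentioned above typically takes the form of bit manipulation on device registers. A small sketch of ours (a plain variable stands in for a real memory-mapped register, whose address would be fixed by the hardware):

```cpp
#include <cstdint>

typedef uint8_t reg_t;   // an 8-bit control register, as on many MCUs

reg_t set_bit(reg_t r, int n)   { return (reg_t)(r |  (1u << n)); }
reg_t clear_bit(reg_t r, int n) { return (reg_t)(r & ~(1u << n)); }
bool  test_bit(reg_t r, int n)  { return (r >> n) & 1u; }
```

On real hardware the register would be accessed through a volatile pointer so the compiler cannot optimize the reads and writes away.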
#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[])
{
    char *c;
    while (++argv, --argc > 0) {
        c = argv[0] + strlen(argv[0]);
        while (--c >= argv[0])
            putchar(*c);
        putchar('\n');
    }
    return 0;
}
FIGURE 7.3 A C program that prints each of its arguments backwards. The outermost while loop iterates through the arguments (count in argc, array of strings in argv), while the inner loop starts a pointer at the end of the current argument and walks it backwards, printing each character along the way. The prefix ++ and -- operators increment or decrement the variable they are attached to before its value is used.
While C compilers for workstation-class machines usually comply closely with the ANSI/ISO C standard, C compilers for microcontrollers are often much less standard. For example, they often omit support for floating-point arithmetic and certain library functions. Many also provide language extensions that, while often very convenient for the hardware for which they were designed, can make porting the code to a different environment very difficult.
7.2.3 C++

C++ (Figure 7.4) [4] extends C with structuring mechanisms for big programs: user-defined data types, a way to reuse code with different types, namespaces to group objects and avoid accidental name collisions when program pieces are assembled, and exceptions to handle errors. The C++ standard library includes a collection of efficient polymorphic data types such as arrays, trees, and strings, for which the compiler generates custom implementations.

A class defines a new data type by specifying its representation and the operations that may access and modify it. Classes may be defined by inheritance, which extends and modifies existing classes. For example, a rectangle class might add length and width fields and an area method to a shape class.

A template is a function or class that can work with multiple types. The compiler generates custom code for each different use of the template. For example, the same min template could be used for both integers and floating-point numbers.

C++ also provides exceptions, a mechanism intended for error recovery. Normally, each method or function can only return directly to its immediate caller. Throwing an exception, however, allows control to return to an arbitrary caller, usually an error-handling mechanism in the main function or similar. Exceptions can be used, for example, to gracefully recover from out-of-memory conditions no matter where they occur, without the tedium of having to check whether every function encountered an out-of-memory condition.

Memory consumption is a disadvantage of C++'s exception mechanism. While most C++ compilers do not generate slower code when exceptions are enabled, they do generate larger executables by including tables that record the location of the nearest exception handler. For this reason, many compilers, such as GNU's gcc, have a flag that completely disables exceptions.
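The shape/rectangle and min examples just mentioned can be sketched as follows (the names and details are ours, not the chapter's Figure 7.4):

```cpp
#include <stdexcept>

// Inheritance: Rectangle adds length and width fields and an
// area method to an abstract Shape base class.
class Shape {
public:
    virtual double area() const = 0;
    virtual ~Shape() {}
};

class Rectangle : public Shape {
    double length, width;
public:
    Rectangle(double l, double w) : length(l), width(w) {
        if (l < 0 || w < 0)                      // error recovery via an
            throw std::invalid_argument("size"); // exception, not a return code
    }
    double area() const { return length * width; }
};

// A template: the compiler generates one instance per type used,
// so the same min works for integers and floating-point numbers.
template <typename T>
T min2(T a, T b) { return b < a ? b : a; }
```

Calling min2(3, 5) and min2(2.5, 1.5) instantiates the template twice, once for int and once for double.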
class Cplx {
  double re, im;
public:
  Cplx(double v) : re(v), im(0) {}
  Cplx(double r, double i) : re(r), im(i) {}
  double abs() const { return sqrt(re*re + im*im); }
  void operator+= (const Cplx& a) { re += a.re; im += a.im; }
};

int main() {
  Cplx a(5), b(3,4);
  b += a;
  cout << b.abs();
}

FIGURE 7.4 A C++ class for complex numbers, with two constructors, a magnitude method, and an overloaded += operator.

Conditional. "x = if b then y else z" defines x by y if b is true and by z if b is false. It can be used without an alternative ("x = if b then y") to sample y at the clock b, as shown in Figure 8.11.

Lustre programs are structured as data-flow functions, also called nodes. A node takes a number of input signals and defines a number of output signals upon the presence of an activation condition. If that condition matches an edge of the input signal clock, then the node is activated and possibly produces output. Otherwise, outputs are undetermined or defaulted. As an example, Figure 8.12 defines a resettable counter. It takes an input signal tick and returns the count of its occurrences. A boolean reset signal can be triggered to reset the count to 0. We observe that the boolean input signals tick and reset are synchronous to the output signal count and define a data-flow function.
FIGURE 8.13 The delay operator in Signal.

FIGURE 8.14 The merge operator in Signal.

process counter = (? event tick, reset ! integer value)
  (| value := (0 when reset)
       default ((value$ init 0 + 1) when tick)
       default (value$ init 0)
   |);

FIGURE 8.15 A resettable counter in Signal.
8.4.2.2 Combinators for Signal

As opposed to nodes in Lustre, equations x := y f z in Signal more generally denote processes that define timing relations between input and output signals. There are three primitive combinators in Signal:

Delay. "x := y$1 init v" initially defines the signal x by the value v and then by the previous value of the signal y. The signal y and its delayed copy "x := y$1 init v" are synchronous: they share the same set of tags t1, t2, .... Initially (at t1), the signal x takes the declared value v. At tag tn, x takes the value of y at tag tn−1. This is displayed in Figure 8.13.

Sampling. "x := y when z" defines x by y when z is true (and both y and z are present); x is present with the value v2 at t2 only if y is present with v2 at t2 and if z is present at t2 with the value true. When this is the case, one needs to schedule the calculation of y and z before x, as depicted by yt2 → xt2 ← zt2.

Merge. "x := y default z" defines x by y when y is present and by z otherwise. If y is absent and z present with v1 at t1 then x holds (t1, v1). If y is present (at t2 or t3) then x holds its value whether z is present (at t2) or not (at t3). This is depicted in Figure 8.14.

The structuring element of a Signal specification is a process. A process accepts input signals originating from possibly different clock domains to produce output signals when needed. Recalling the example of the resettable counter (Figure 8.12), this allows, for instance, to specify a counter (pictured in Figure 8.15) where the inputs tick and reset and the output value have independent clocks. The body of counter consists of one equation that defines the output signal value. Upon the event reset, it sets the count to 0. Otherwise, upon a tick event, it increments the count by referring to the previous value of value and adding 1 to it.
Otherwise, if the count is solicited in the context of the counter process (meaning that its clock is active), the counter just returns the previous count without having to obtain a value from the tick and reset signals. A Signal process is a structuring element akin to a hierarchical block diagram. A process may structurally contain sub-processes. A process is a generic structuring element that can be specialized to the timing context of its call. For instance, a definition of the Lustre counter (Figure 8.12) starting from the specification of Figure 8.15 consists of the refinement depicted in Figure 8.16. The input tick and reset clocks expected by the process counter are sampled from the boolean input signals tick and reset
Synchronous Hypothesis and Languages
process synccounter = (? boolean tick, reset ! integer value)
  (| value := counter (when tick, when reset)
   | reset ˆ= tick ˆ= value
   |);
FIGURE 8.16 Synchronization of the counter interface.
e ::= ˆx | when x | when not x | e ˆ+ e′ | e ˆ* e′ | e ˆ- e′ | 0    (clock expression)
E ::= eˆ= e′ | eˆ< e′ | x → y when e | E E′                         (clock relations)

FIGURE 8.17 The syntax of clock expressions and clock relations (equations).
x := y f z          :  ˆxˆ= ˆyˆ= ˆz
x := y$1 init v     :  ˆxˆ= ˆy
x := y when z       :  ˆxˆ= ˆy ˆ* when z
x := y default z    :  ˆxˆ= ˆy ˆ+ ˆz

FIGURE 8.18 The clock inference system of Signal.
by using the "when tick" and "when reset" expressions. The count is then synchronized to the inputs by the equation reset ˆ= tick ˆ= value.
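The reaction of the synchronized counter can be sketched as an ordinary step function. The following C++ transcription is our illustration, not code produced by a Signal compiler: at each reaction, reset forces the count to 0, otherwise tick increments it, otherwise the previous count is simply returned.

```cpp
// One reaction of synccounter: tick, reset, and value are present
// together at every instant, and value is computed from its delayed
// copy value$ (the stored field below).
struct SyncCounter {
    int value = 0;                         // value$ init 0
    int step(bool tick, bool reset) {
        if (reset)     value = 0;          // (0 when reset)
        else if (tick) value = value + 1;  // (value$ + 1) when tick
        // otherwise the last default branch keeps value$
        return value;
    }
};
```

Three reactions with tick, tick, reset produce the counts 1, 2, 0, mirroring the merge of the three clauses in Figure 8.15.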
8.4.3 Compilation of Declarative Formalisms

The analysis and code generation techniques of Lustre and Signal are necessarily different, tailored to handle the specific challenges determined by the different models of computation and programming paradigms.

8.4.3.1 Compilation of Signal

Sequential code generation starting from a Signal specification starts with an analysis of its implicit synchronization and scheduling relations. This analysis yields the control- and data-flow graphs that define the class of sequentially executable specifications and allow code to be generated.

8.4.3.1.1 Synchronization and Scheduling Analysis

In Signal, the clock ˆx of a signal x denotes the set of instants at which the signal x is present. It is represented by a signal that is true when x is present and that is absent otherwise. Clock expressions (see Figure 8.17) represent control. The clock "when x" (respectively "when not x") represents the time tags at which a boolean signal x is present and true (respectively false). The empty clock is denoted by 0. Clock expressions are obtained using conjunction, disjunction, and symmetric difference over other clocks. Clock equations (also called clock relations) are Signal processes: the equation "eˆ= e′" synchronizes the clocks e and e′, while "eˆ< e′" specifies the containment of e in e′. Explicit scheduling relations "x → y when e" allow the representation of causality in the computation of signals (e.g., x after y at the clock e). A system of clock relations E can be easily associated (using the inference system P : E of Figure 8.18) to any Signal process P, to represent its timing and scheduling structure.

8.4.3.1.2 Hierarchization

The clock and scheduling relations E of a process P define the control- and data-flow graphs that hold all necessary information to compile a Signal specification upon satisfaction of the property of endochrony, as illustrated in Figure 8.19.
A process is said to be endochronous iff given a set of input signals (x and y in Figure 8.19) and flow-equivalent input behaviors (datagrams on the left of Figure 8.19), it has the capability to reconstruct a unique synchronous behavior up to clock-equivalence: the datagrams of the
FIGURE 8.19 Endochrony: from flow-equivalent inputs to clock-equivalent outputs.
input signals in the middle of Figure 8.19 and of the output signal on the right of Figure 8.19 are ordered in clock-equivalent ways.

To determine the order in which signals are processed during the period of a reaction, clock relations E play an essential role. The process of determining this order is called hierarchization. Writing x ∼ y when x and y belong to the same synchronization class and y ⪯ x when y dominates x in the clock hierarchy, it consists of an insertion algorithm that proceeds in three easy steps:

1. First, equivalence classes are defined between signals of the same clock: if E ⇒ ˆxˆ= ˆy then x ∼ y (we write E ⇒ E′ iff E implies E′).
2. Second, elementary partial order relations are constructed between sampled signals: if E ⇒ ˆxˆ= when y or E ⇒ ˆxˆ= when not y then y ⪯ x.
3. Last, assume a partial order of maximum z such that E ⇒ ˆzˆ= ˆy f ˆw (for some f ∈ { ˆ+ , ˆ* , ˆ- }) and a signal x such that x ⪯ y and x ⪯ w; then insertion consists of attaching z to x, by x ⪯ z.

The insertion algorithm proposed in Reference 30 yields a canonical representation of the partial order by observing that there exists a unique minimum clock x below z such that rule 3 holds. Based on the order ⪯, one can decide whether E is hierarchical by checking that its clock relation has a minimum, written min E ∈ vars(E), so that ∀x ∈ vars(E), ∃y ∈ vars(E), y ⪯ x. If E is furthermore acyclic (i.e., E ⇒ x → x when e implies E ⇒ eˆ= 0, for all x ∈ vars(E)) then the analyzed process is endochronous, as shown in Reference 28.

Example 8.1
The implications of hierarchization for code generation can be outlined by considering the specification of a one-place buffer in Signal (Figure 8.20, left). Process buffer implements two functionalities. One is the process alternate, which desynchronizes the signals i and o by synchronizing them to the true and false values of an alternating boolean signal b. The other functionality is the process current. It defines a cell in which values are stored at the input clock ˆi and loaded at the output clock ˆo.
cell is a predefined Signal operation defined by:

x := y cell z init v  =def  (m := x$1 init v | x := y default m | ˆx = ˆy ˆ+ ˆz |) / m

Clock inference (Figure 8.20, middle) applies the clock inference system of Figure 8.18 to the process buffer to determine three synchronization classes. We observe that b, c_b, zb, and zo are synchronous and define the master clock synchronization class of buffer. There are two other synchronization classes, c_i and c_o, that correspond to the true and false values of the boolean flip-flop variable b, respectively:

b ≺ c_b ≺ zb ≺ zo    and    b ⪯ c_i ≺ i    and    b ⪯ c_o ≺ o

This defines three nodes in the control-flow graph of the generated code (Figure 8.20, right). At the main clock, c_b, b, and c_o are calculated from zb. At the sub-clock b, the input signal i is read. At the sub-clock c_o, the output signal o is written. Finally, zb is determined. Notice that the sequence of instructions follows the scheduling relations determined during clock inference.
© 2006 by Taylor & Francis Group, LLC
Synchronous Hypothesis and Languages
8-17
Specification:

    process buffer = (? i ! o)
      (| alternate (i, o)
       | o := current (i)
       |) where
      process alternate = (? i, o ! )
        (| zb := b$1 init true
         | b := not zb
         | o ˆ= when not b
         | i ˆ= when b
         |) / b, zb;
      process current = (? i ! o)
        (| zo := i cell ˆo init false
         | o := zo when ˆo
         |) / zo;

Clock analysis:

    (| c_b ˆ= b
     | b ˆ= zb
     | zb ˆ= zo
     | c_i := when b
     | c_i ˆ= i
     | c_o := when not b
     | c_o ˆ= o
     | i -> zo when ˆi
     | zb -> b
     | zo -> o when ˆo
     |) / zb, zo, c_b, c_o, c_i, b;

Generated code:

    buffer_iterate () {
      b = !zb;
      c_o = !b;
      if (b) {
        if (!r_buffer_i(&i))
          return FALSE;
      }
      if (c_o) {
        o = i;
        w_buffer_o(o);
      }
      zb = b;
      return TRUE;
    }

FIGURE 8.20 Specification, clock analysis, and code generation in Signal.
8.4.3.2 Compilation of Lustre

Whereas Signal uses a hierarchization algorithm to find a sequential execution path starting from a system of clock relations, Lustre leaves this task to engineers, who must provide a sound, fully synchronized program in the first place: well-synchronized Lustre programs correspond to hierarchized Signal specifications. The classic compilation of Lustre starts with a static program analysis that checks the correct synchronization and cycle-freedom of the signals defined within the program. It then partitions the program into elementary blocks activated upon boolean conditions [9] and focuses on generating efficient code for high-level constructs, such as iterators for array processing [31]. Recent efforts have enhanced this compilation scheme by introducing effective activation clocks, whose soundness is checked by typing techniques. In particular, this was applied to the industrial SCADE version, with extensions [32,33].

8.4.3.3 Certification

The simplicity of the single-clocked model of Lustre eases program analysis and code generation. Therefore, its commercial implementation — Scade by Esterel Technologies — provides a certified C code generator. Its combination with Sildex (the commercial implementation of Signal by TNI-Valiosys) as a frontend for architecture mapping and early requirement specification is the methodology advocated in the IST project Safeair (URL: http://www.safeair.org).

The formal validation and certification of synchronous program properties has been the subject of numerous studies. In Reference 34, a coinductive axiomatization of Signal in the proof assistant Coq [35], based on the calculus of constructions [36], is proposed. The application of this model is twofold. First, it allows for the exhaustive verification of formal properties of infinite-state systems. Two case studies have been developed.
In Reference 37, a faithful model of the steam-boiler problem was given in Signal and its properties proved with Signal’s Coq model. In Reference 38, it is applied to proving the correctness of real-time properties of a protocol for loosely time-triggered architectures (TTAs), extending previous work proving the correctness of its finite-state approximation [39]. Another important application of modeling Signal in the proof assistant Coq is being explored: the development of a reference compiler translating Signal programs into Coq assertions. This translation
8-18
Embedded Systems Handbook
allows the model transformations performed by the Signal compiler to be represented as correctness-preserving transformations of Coq assertions, yielding a costly yet correct-by-construction synthesis of the target code. Other approaches to the certification of generated code have been investigated. In Reference 40, validation is achieved by checking a model of the C code generated by the Signal compiler in the theorem prover PVS against a model of its source specification (translation validation). Related work on modeling Lustre has been equally numerous, starting in Reference 41 with the verification of a sequential multiplier using a model of stream functions in Coq. In Reference 42, the verification of Lustre programs is considered by generating proof obligations and discharging them in PVS. In Reference 43, a semantics of Lucid-Synchrone, an extension of Lustre with higher-order stream functions, is given in Coq.
8.5 Success Stories — A Viable Approach for System Design

Synchronous and reactive formalisms appeared in the early nineties, and the theory has matured and expanded since then to cover all the topics presented in this chapter. Research groups were active mostly in France, but also notably in Germany and in the United States. Several large academic projects were completed, including the IST Syrf, Sacres, and Safeair projects, as well as industrial early-adopter projects. S/R modeling and programming environments are today marketed by two French software houses: Esterel Technologies for Esterel and SCADE/Lustre, and TNI-Valiosys for Sildex/Signal. The influence of S/R systems has also tentatively pervaded hardware CAD products such as Synopsys CoCentric Studio and Cadence VCC, despite the dominance of classical HDLs there. The Ptolemy cosimulation environment from UC Berkeley comprises an S/R domain based on the synchronous hypothesis.

There have been a number of industrial take-ups of S/R formalisms, most of them in the aeronautics industry. Airbus Industries is now using Scade for the actual design of parts of the new Airbus A-380 aircraft. S/R languages are also used by Dassault Aviation (for the next-generation Rafale fighter jet) and Snecma ([4] gives an in-depth coverage of these prominent collaborations). Car and phone manufacturers are also paying increasing attention (for instance, at Texas Instruments), as are advanced development teams in embedded hardware divisions of prominent companies (such as Intel).
8.6 Into the Future: Perspectives and Extensions

Future advances in and around synchronous languages can be predicted in several directions:

Certified compilers. As already seen, this is the case for the basic SCADE compiler. But as the demand grows, owing to the safety-critical nature of applications (notably in transportation), the impact of full-fledged operational semantics backing the actual compilers should increase.

Formal models and embedded code targets. Following the trend of exploiting formal models and semantic properties to help define efficient compilation and optimization techniques, one can consider the case of targeting distributed platforms (but still with a global reaction time). Then, the issues of spatial mapping and temporal scheduling of the elementary operations composing the reaction onto a given interconnect topology become a fascinating (and NP-complete) problem. Heuristics for user guidance and semiautomatic approaches are the main topic of the SynDEx environment [44,45]. Of course, this requires an estimation of the time budgets for the elementary operations and communications.

Desynchronized systems. In larger designs, the fully global synchronous assumption is hard to maintain, especially if long propagation chains occur inside a single reaction (in hardware, for instance, the clock tree cannot be distributed to the whole chip). Several types of answers are currently being brought to this issue, trying to instill a looser coupling of synchronous modules in a desynchronized network (one then talks of "Globally Asynchronous Locally Synchronous," GALS, systems). In the theory of latency-insensitive design, all processes are supposed to be able to stall until the full information is synchronously available.
The exact latency durations needed to recover a (slower) synchronous model are computed afterwards, once functional correctness at the more abstract level has been established [46,47]. Fancier approaches, which try to save on communications and synchronizations, are introduced in Section 8.6.1.

Relations between transactional and cycle-accurate levels. If synchronous formalisms can be seen as a global attempt at transferring the notion of cycle-accurate modeling to the design of SW/HW embedded systems, then the existing gap between these levels must also be reconsidered in the light of formal semantics and mathematical models. Currently, there exists virtually no automation for the synthesis of RTL from TLM levels. The previous item, with its well-defined relaxation of the synchronous hypothesis at specification time, could be a definite step in this direction (of formally linking two distinct levels of modeling).

Relations between cycle-accurate and timed models. Physical timing is of course a big concern in synchronous formalisms, if only to validate the synchronous hypothesis and establish converging stabilization of all values across the system before the next clock tick. While in traditional software implementations one can decide that the instant is over when all computations have effectively completed, in hardware or other real-time distributed settings a true compile-time timing analysis is in order. Several attempts have been made in this direction [48,49].
8.6.1 Asynchronous Implementation of Synchronous Specifications

The relations between synchronous and asynchronous models have long remained unclear, but investigations in this direction have recently received a boost owing to demands coming from the engineering world. The problem is that many classes of embedded applications are best modeled (at least in part) under the cycle-based synchronous paradigm, while their desired implementation is not. This covers implementation classes that are becoming increasingly popular (such as distributed software, or complex digital circuits such as Systems-on-a-Chip), hence the practical importance of the problem. Such implementations are formed of components that are only loosely connected through communication lines that are best modeled as asynchronous. At the same time, the existing synchronous tools for specification, verification, and synthesis are very efficient and popular, meaning that they should be used for most of the design process.

In distributed software, the need for global synchronization mechanisms has always existed. However, in order to be used in aerospace and automotive applications, an embedded system must also satisfy very high requirements in the areas of safety, availability, and fault tolerance. These needs prompted the development of integrated platforms, such as TTA [50], which offer higher-level, proven synchronization primitives that are better adapted to specification, verification, and certification. The same correctness and safety goals are pursued in a purely synchronous framework by two approaches: the AAA methodology and the SynDEx software of Sorel and co-workers [45], and the Ocrep tool of Girault and co-workers [51].
Both approaches take as input a synchronous specification, an architecture model, and some real-time and embedding constraints, and produce a distributed implementation that satisfies the constraints and the synchrony hypothesis (supplementary signals simulate at runtime the global clock of the initial specification). The difference is that Ocrep is tailored to control-dominated synchronous programs, while SynDEx works best on data-flow specifications with simple control.

In the (synchronous) hardware world, problems appear when the clock speed and circuit size become large enough to make global synchrony unfeasible (or at least very expensive), most notably concerning the distribution of the clock and the transmission of data over long wires between functional components. The problem is to ensure that no communication error occurs owing to the clock skew or the interconnect delay between the emitter and the receiver. Given the high cost (in area and power consumption) of precise clock distribution, it appears that the only long-term solution is the division of large systems into several clocking domains, accompanied by the use of novel on-chip communication and synchronization techniques. When the multiple clocks are strongly correlated, we talk about mesochronous or plesiochronous systems. However, when the different clocks are unrelated (e.g., for power-saving reasons), the resulting circuit is best modeled as a GALS system where the synchronous domains are connected through asynchronous
communication lines (e.g., FIFOs). Such approaches include pausible clocking by Yun and Donohue [52] or, in a framework where a global reference clock is still preserved, latency-insensitive design by Carloni et al. [46]. A multi-clock extension of the Esterel language [53] has been proposed for the description of such systems. A more radical approach to the hardware implementation of a synchronous specification is desynchronization [54], where the clock subsystem is entirely removed and replaced with asynchronous handshake logic. The advantages of such implementations are those of asynchronous logic: smaller power consumption, average-case performance, and smaller electromagnetic interference.

At an implementation-independent level, several approaches propose solutions to various aspects of the problem of GALS implementation. The loosely time-triggered architectures (TTAs) of Benveniste et al. [39] define a sampling-based approach to (inter-process) FIFO construction. More importantly, Benveniste et al. [55] define semantics preservation — an abstract notion of correct GALS implementation of a synchronous specification (asynchronous communication is modeled here as message passing). Latency insensitivity ensures semantics preservation in a very simple, highly constrained way. Less constraining and higher-level conditions are the compositional criteria of finite flow-preservation of Talpin et al. [56,57] and of weak endo-isochrony of Potop-Butucaru et al. [58]. While finite-flow preservation focuses on checking equivalence through finite desynchronization protocols, weak endo-isochrony makes it possible to exploit the internal concurrency of synchronous systems in order to minimize signalization and to handle infinite behaviors.
References [1] Nicolas Halbwachs. Synchronous programming of reactive systems. In Computer Aided Verification (CAV’98). Kluwer Academic Publishers, 1998, pp. 1–16. [2] Gérard Berry. Real-time programming: general-purpose or special-purpose languages. In Information Processing 89, G. Ritter, Ed. Elsevier Science Publishers B.V. (North Holland), Amsterdam, 1989, pp. 11–17. [3] Albert Benveniste and Gérard Berry. The synchronous approach to reactive and real-time systems. Proceedings of the IEEE, 79: 1270–1282, 1991. [4] Albert Benveniste, Paul Caspi, Stephen Edwards, Nicolas Halbwachs, Paul Le Guernic, and Robert de Simone. Synchronous languages twelve years later. Proceedings of the IEEE. Special Issue on Embedded Systems, January 2003. [5] Gérard Berry and Laurent Cosserat. The Synchronous Programming Language Esterel and its Mathematical Semantics, Vol. 197 of Lecture Notes in Computer Science, 1984. [6] Frédéric Boussinot and Robert de Simone. The Esterel language. Proceedings of the IEEE, September 1991. [7] Gérard Berry and Georges Gonthier. The Esterel synchronous programming language: design, semantics, implementation. Science of Computer Programming, 19: 87–152, 1992. [8] Charles André. Representation and analysis of reactive behavior: a synchronous approach. In Computational Engineering in Systems Applications (CESA’96). IEEE-SMC, Lille, France, 1996, pp. 19–29. [9] Nicolas Halbwachs, Paul Caspi, and Pascal Raymond. The synchronous data-flow programming language Lustre. Proceedings of the IEEE, 79: 1991. [10] Albert Benveniste, Paul Le Guernic, and Christian Jacquemot. Synchronous programming with events and relations: the Signal language and its semantics. Science of Computer Programming, 16: 103–149, 1991. [11] Gérard Berry. The Constructive Semantics of Pure Esterel. Esterel Technologies, 1999. Electronic Version Available at http://www.esterel-technologies.com. [12] Tom Shiple, Gérard Berry, and Hervé Touati. Constructive analysis of cyclic circuits. 
In Proceedings of the International Design and Testing Conference (ITDC). Paris, 1996. [13] Ellen Sentovich, Kanwar Jit Singh, Luciano Lavagno, Cho Moon, Rajeev Murgai, Alexander Saldanha, Hamid Savoj, Paul Stephan, Robert Brayton, and Alberto Sangiovanni-Vincentelli. SIS: a system for sequential circuit synthesis. Memorandum UCB/ERL M92/41, UCB, ERL, 1992.
[14] Minxi Gao, Jie-Hong Jiang, Yunjian Jiang, Yinghua Li, Subarna Sinha, and Robert Brayton. MVSIS. In Proceedings of the International Workshop on Logic Synthesis (IWLS’01). Tahoe City, June 2001. [15] Ellen Sentovich, Horia Toma, and Gérard Berry. Latch optimization in circuits generated from high-level descriptions. In Proceedings of the International Conference on Computer-Aided Design (ICCAD’96), 1996. [16] Hervé Touati and Gérard Berry. Optimized controller synthesis using Esterel. In Proceedings of the International Workshop on Logic Synthesis (IWLS’93). Lake Tahoe, 1993. [17] Amar Bouali, Jean-Paul Marmorat, Robert de Simone, and Horia Toma. Verifying synchronous reactive systems programmed in esterel. In Proceedings of the FTRTFT’96, Vol. 1135 of Lecture Notes in Computer Science, 1996, pp. 463–466. [18] Amar Bouali. Xeve, an Esterel verification environment. In Proceedings of the Tenth International Conference on Computer Aided Verification (CAV’98), Vol. 1427 of Lecture Notes in Computer Science. UBC, Vancouver, Canada, June 1998. [19] Robert de Simone and Annie Ressouche. Compositional semantics of Esterel and verification by compositional reductions. In Proceedings of the CAV’94, Vol. 818 of Lecture Notes in Computer Science, 1994. [20] Laurent Arditi, Hédi Boufaïed, Arnaud Cavanié, and Vincent Stehlé. Coverage-Directed Generation of System-Level Test Cases for the Validation of a DSP System, Vol. 2021 of Lecture Notes in Computer Science. Springer-Verlag, Heidelberg, 2001. [21] Gérard Berry. Esterel on hardware. Philosophical Transactions of the Royal Society of London, Series A, 19: 87–152, 1992. [22] Daniel Weil, Valérie Bertin, Etienne Closse, Michel Poize, Patrick Vernier, and Jacques Pulou. Efficient compilation of Esterel for real-time embedded systems. In Proceedings of the CASES’00. San Jose, CA, 2000. [23] Stephen Edwards. An Esterel compiler for large control-dominated systems. 
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 21: 169–183, February 2002. [24] Dumitru Potop-Butucaru and Robert de Simone. Optimizations for faster execution of Esterel programs. In Formal Methods and Models for System Design, Rajesh Gupta, Paul Le Guernic, Sandeep Shukla, and Jean-Pierre Talpin, Eds. Kluwer, Dordrecht, 2004. [25] Steven Muchnick. Advanced Compiler Design and Implementation. Morgan Kaufmann Publishers, San Francisco, 1997. [26] Robert French, Monica Lam, Jeremy Levitt, and Kunle Olukotun. A general method for compiling event-driven simulations. In Proceedings of the 32nd Design Automation Conference (DAC’95). San Francisco, CA, 1995. [27] Jaejin Lee, David Padua, and Samuel Midkiff. Basic compiler algorithms for parallel programs. In Proceedings of the 7th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. Atlanta, GA, 1999. [28] Paul Le Guernic, Jean-Pierre Talpin, and Jean-Christophe Le Lann. Polychrony for system design. Journal of Circuits, Systems and Computers. Special Issue on Application-Specific Hardware Design, 12(3): 261–304, 2003. [29] Albert Benveniste, Paul Caspi, Luca Carloni, and Alberto Sangiovanni-Vincentelli. Heterogeneous reactive systems modeling and correct-by-construction deployment. In Embedded Software Conference (EMSOFT’03). Springer-Verlag, Heidelberg, October 2003. [30] Pascalin Amagbegnon, Loïc Besnard, and Paul Le Guernic. Implementation of the data-flow synchronous language Signal. In Conference on Programming Language Design and Implementation (PLDI’95). ACM Press, New York, 1995. [31] Florence Maraninchi and Lionel Morel. Arrays and contracts for the specification and analysis of regular systems. In Proceedings of the International Conference on Applications of Concurrency to System Design (ACSD’04). IEEE Press, 2004. [32] Jean-Louis Colaço and Marc Pouzet. Clocks as first class abstract types. In Proceedings of the EMSOFT’03, 2003.
[33] Jean-Louis Colaço, Alain Girault, Grégoire Hamon, and Marc Pouzet. Towards a higher-order synchronous data-flow language. In Proceedings of the EMSOFT’04, 2004. [34] David Nowak, Jean-Rene Beauvais, and Jean-Pierre Talpin. Co-inductive axiomatization of a synchronous language. In Proceedings of the International Conference on Theorem Proving in Higher-Order Logics, Vol. 1479 of Lecture Notes in Computer Science. Springer-Verlag, Heidelberg, 1998. [35] Eduardo Giménez. Un Calcul de Constructions Infinies et son Application à la Vérification des Systèmes Communicants. Ph.D. thesis, Laboratoire de l’Informatique du Parallélisme, Ecole Normale Supérieure de Lyon, December 1996. [36] Benjamin Werner. Une Théorie des Constructions Inductives. Ph.D. thesis, Université Paris VII, May 1994. [37] Mickael Kerboeuf, David Nowak, and Jean-Pierre Talpin. Specification and verification of a steam-boiler with Signal-Coq. In International Conference on Theorem Proving in Higher-Order Logics, Vol. 1869 of Lecture Notes in Computer Science. Springer-Verlag, Heidelberg, 2000. [38] Mickael Kerboeuf, David Nowak, and Jean-Pierre Talpin. Formal proof of a polychronous protocol for loosely time-triggered architectures. In International Conference on Formal Engineering Methods, Vol. 2885 of Lecture Notes in Computer Science. Springer-Verlag, Heidelberg, 2003. [39] Albert Benveniste, Paul Caspi, Paul Le Guernic, Hervé Marchand, Jean-Pierre Talpin, and Stavros Tripakis. A protocol for loosely time-triggered architectures. In Embedded Software Conference (EMSOFT’02), Vol. 2491 of Lecture Notes in Computer Science. Springer-Verlag, Heidelberg, October 2002. [40] Amir Pnueli, O. Shtrichman, and M. Siegel. Translation validation: from Signal to C. In Correct System Design: Recent Insights and Advances, Vol. 1710 of Lecture Notes in Computer Science. Springer-Verlag, Heidelberg, 2000. [41] Christine Paulin-Mohring. Circuits as streams in Coq: verification of a sequential multiplier.
In Types for Proofs and Programs, TYPES’95, Vol. 1158 of Lecture Notes in Computer Science, S. Berardi and M. Coppo, Eds. Springer-Verlag, Heidelberg, 1996. [42] Cécile Dumas Canovas and Paul Caspi. A PVS proof obligation generator for Lustre programs. In International Conference on Logic for Programming and Reasoning, Vol. 1955 of Lecture Notes in Artificial Intelligence. Springer-Verlag, Heidelberg, 2000. [43] Sylvain Boulme and Grégoire Hamon. Certifying synchrony for free. In Logic for Programming, Artificial Intelligence and Reasoning, Vol. 2250 of Lecture Notes in Artificial Intelligence. Springer-Verlag, Heidelberg, 2001. [44] Christophe Lavarenne, Omar Seghrouchni, Yves Sorel, and Michel Sorine. The SynDEx software environment for real-time distributed systems design and implementation. In Proceedings of the ECC’91. France, 1991. [45] Thierry Grandpierre, Christophe Lavarenne, and Yves Sorel. Optimized rapid prototyping for real-time embedded heterogeneous multiprocessors. In Proceedings of the 7th International Workshop on Hardware/Software Co-Design (CODES’99). Rome, 1999. [46] Luca Carloni, Ken McMillan, and Alberto Sangiovanni-Vincentelli. The theory of latency-insensitive design. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 20(9): 1059–1076, 2001. [47] Alberto Sangiovanni-Vincentelli, Luca Carloni, Fernando De Bernardinis, and Marco Sgroi. Benefits and challenges of platform-based design. In Proceedings of the Design Automation Conference (DAC’04), 2004. [48] George Logothetis and Klaus Schneider. Exact high-level WCET analysis of synchronous programs by symbolic state space exploration. In Proceedings of the DATE2003. IEEE Computer Society, Germany, 2003. [49] Etienne Closse, Michel Poize, Jacques Pulou, Joseph Sifakis, Patrick Venier, Daniel Weil, and Sergio Yovine. TAXYS: a tool for the development and verification of real-time embedded systems. In Proceedings of the CAV’01, Vol. 2102 of Lecture Notes in Computer Science. Springer-Verlag, Heidelberg, 2001.
[50] Hermann Kopetz. Real-Time Systems, Design Principles for Distributed Embedded Applications. Kluwer Academic Publishers, Dordrecht, 1997. [51] Paul Caspi, Alain Girault, and Daniel Pilaud. Automatic distribution of reactive systems for asynchronous networks of processors. IEEE Transactions on Software Engineering, 25: 416–427, 1999. [52] Kenneth Yun and Ryan Donohue. Pausible clocking: a first step toward heterogenous systems. In Proceedings of the International Conference on Computer Design (ICCD’96), 1996. [53] Gérard Berry and Ellen Sentovich. Multiclock Esterel. In Proceedings of the CHARME’01, Vol. 2144 of Lecture Notes in Computer Science, 2001. [54] Ivan Blunno, Jordi Cortadella, Alex Kondratyev, Luciano Lavagno, Kelvin Lwin, and Christos Sotiriou. Handshake protocols for de-synchronization. In Proceedings of the International Symposium on Asynchronous Circuits and Systems (ASYNC’04). Crete, Greece, 2004. [55] Albert Benveniste, Benoît Caillaud, and Paul Le Guernic. Compositionality in dataflow synchronous languages: specification and distributed code generation. Information and Computation, 163: 125–171, 2000. [56] Jean-Pierre Talpin, Paul Le Guernic, Sandeep Kumar Shukla, Frédéric Doucet, and Rajesh Gupta. Formal refinement checking in a system-level design methodology. Fundamenta Informaticae. IOS Press, Amsterdam, 2004. [57] Jean-Pierre Talpin and Paul Le Guernic. Algebraic theory for behavioral type inference. Formal Methods and Models for System Design, Chap. VIII, Kluwer Academic Press, Dordrecht, 2004. [58] Dumitru Potop-Butucaru, Benoît Caillaud, and Albert Benveniste. Concurrency in synchronous systems. In Proceedings of the International Conference on Applications of Concurrency to System Design (ACSD’04). IEEE Press, 2004.
9
Introduction to UML and the Modeling of Embedded Systems

Thomas Weigert
Motorola

Øystein Haugen and Birger Møller-Pedersen
University of Oslo

9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-1
9.2 UML Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-2
    Static System Structure • System Behavior • Execution Architecture • Embedded Systems and UML
9.3 Example — Automatic Teller Machine . . . . . . . . . . . . . . . . 9-7
    Domain Statement and Domain Model • Behavior Overview through a Use Case Diagram • Context Description by a Collaboration Diagram • Behavior Modeling with Interactions • Behavioral Modeling with State Machines • Validation • Generalizing Behavior • Hierarchical Decomposition • The Difference between the UML System and the Final System
9.4 A UML Profile for the Modeling of Embedded Systems . . . . . . 9-30
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-32
9.1 Introduction

Embedded systems have the following characteristics: most often, their software runs on a hardware platform dedicated to a specific task (e.g., a telephony switch). Naturally, the hardware imposes limits on what the software can do. Most embedded systems can be considered reactive — the system is sent a message and is supposed to give a response, which may involve a change in the state of the controlled hardware. Such software is usually real-time, but the performance requirements are more often statistical than absolute. Embedded systems are often required to handle many independent requests at the same time. Typically, this software is parallel and distributed. These characteristics make embedded systems expensive to develop.

In response to the constraints imposed by the underlying hardware, as well as to the concerns of managing parallel and distributed systems, developers of embedded systems embraced modeling techniques early. As a consequence, software and system modeling for embedded system development has matured considerably. For example, telecommunications equipment manufacturers and operators have had standardized notations to describe systems since as early as the late seventies: the first standardized notation for state diagrams was the ITU (then CCITT) Specification and Description Language (SDL), originally approved in 1976 [1–3]. It recognized
that a telecom system would be built of blocks (we might say "objects" today) with a well-defined boundary (today we might say an "interface") across which explicitly indicated signals (or messages) pass. Proprietary variations of sequence diagrams (often referred to as "message flow graphs" or "bounce diagrams") had long been in use in the telecom industry. The standardization of Message Sequence Charts (MSC) was triggered by a paper by Grabowski and Rudolph [4], leading to the first MSC recommendation in 1992 (the current version is [23]).

As specification techniques matured, notations to capture software and system models proliferated [5–8]. With the adoption of the UML specification [9] by the OMG, this situation changed. Users and tool vendors alike began to adopt the emerging UML notation as the primary means of visualizing, specifying, constructing, and documenting software artifacts. A major revision [10] has addressed shortcomings and made UML amenable to systems modeling as well as to the representation of embedded systems.

This chapter is meant to give an overview of how UML can be used for modeling embedded systems. Supplementary introductions can be found in Reference 11.¹ We begin with a terse overview of UML and discuss those features of UML suited to represent the characteristics of embedded systems outlined above. We will illustrate these key features using examples. We will also show how to use UML when describing a simple, easily understood embedded system; we begin by describing the domain statement and domain model of the example system, give an overview of the system behavior through a use case diagram, and use collaboration diagrams to establish the system context. We then show how to model various aspects of the system behavior using interactions and state machines. Interactions describe the system behavior as a set of partially ordered sequences of events that the system needs to exhibit.
State machines describe the system behavior as an automaton that must induce these event sequences. We illustrate a simple process of validating that the state machine descriptions indeed specify the same behavior as the interactions. We then continue the example by exhibiting UML features useful for representing real-life systems: generalization makes it possible to express variation in system behavior abstractly, and hierarchical decomposition makes large system descriptions more scalable and understandable. Finally, we outline characteristics of embedded systems that are abstracted away by a UML specification but need to be considered when deploying such systems. The chapter concludes by presenting a standardized UML profile (a specification language instantiated from the UML language family) suitable for the modeling of embedded systems.
9.2 UML Overview

The UML defines a number of diagrams to describe various aspects of a system:

• Class diagrams, composite structure diagrams, component diagrams, and object diagrams specify its static structure.
• Activity diagrams, interaction diagrams, state machine diagrams, and use case diagrams provide different mechanisms of describing its behavior.
• Deployment diagrams depict its implementation and execution architecture.
• Package diagrams define how a system specification is grouped into modular units as the basis for configuration control, storage, access control, and deployment.
9.2.1 Static System Structure

The static structure of a system describes the entities that exist in the system, their structural properties, and their relationships to other entities. Entities of a system are classified according to their features. Instances with common features are specified by a classifier. Classifiers may participate in generalization relationships to other classifiers; features and constraints specified for instances of the general classifier are implicitly specified for instances

¹ Part of the work reported in this chapter has been sponsored by the SARDAS (Securing Availability by Robust Design, Assessment and Specification) project funded by the Norwegian research council.
Introduction to UML and the Modeling of Embedded Systems
of a more specific classifier. The structural features of a classifier describe relationships that an instance of this classifier has to other instances: for example, an association declares that there may be runtime links between instances of the associated classifiers (when a structural feature at an end of an association is owned by one of the associated classifiers, this structural feature is referred to as an attribute). The dynamic nature of those links can be characterized as reference or composition. Taken jointly, the structural features of a classifier define the state of the specified instances.

Structured classifiers specify the behavior of an instance as the result of the cooperation of a set of instances (its parts) which interact with each other and with the environment of their container instance through well-defined communication channels (connectors). Ports serve to isolate a classifier from its environment by providing a well-defined point of interaction between the classifier and its environment (or between the classifier and its internal parts). A port specifies the services an instance provides (offers) to its environment as well as the services that an instance expects (requires) of its environment.

The runtime nature of an instance is described by the particular kind of classifier that specifies it: a class specifies instances that exist at runtime (objects) and have similar structure, behavior, and relationships. An association class specifies a relationship between instances of connected classifiers which has features of its own. An interface specifies a set of public features (a contract) that an instance of a classifier that implements the interface must possess (fulfill). A collaboration describes only the relevant aspects of the cooperation of a set of instances by abstracting those properties into roles that those instances may play (a role specifies the required set of features a participating instance must have).
A component is a structured class satisfying constraints such that its instances can be deployed on popular component technologies.

Entities in a modeled system are described by instance specifications, which show the values that correspond to each of the structural features of the classifier that specifies the entity. Values may be specified through a computation that yields the value (either composed from other values as an expression or given textually as an opaque expression), as literals of various types (such as Boolean, Integer, String, and unlimited Natural), as literals of enumerations, or as a reference to a specific instance. Instance specifications may describe an instance partially, by listing only its salient characteristics.

Dependencies describe relationships between model elements such that the semantics of a set of client elements depends on one or more supplier elements. Examples of dependencies are abstraction, realization, substitution, implementation, usage, and permission relationships. For example, one model element implements another if an instance of the former has all the properties specified by the latter; a model element uses another if the former requires the latter to function.
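The interface-as-contract idea and the implementation and usage dependencies described above have close analogues in ordinary object-oriented code. The sketch below is only an illustration (the class and method names are hypothetical, and Python's abc module stands in for UML interfaces): CashDispenser realizes a Dispenser contract, while the client ATM merely uses it.

```python
from abc import ABC, abstractmethod

class Dispenser(ABC):
    """UML-style interface: a contract listing required public features."""
    @abstractmethod
    def dispense(self, amount: int) -> bool: ...

class CashDispenser(Dispenser):
    """Implementation dependency: realizes the Dispenser contract."""
    def __init__(self, notes: int):
        self.notes = notes  # structural feature (attribute)

    def dispense(self, amount: int) -> bool:
        if amount <= self.notes:
            self.notes -= amount
            return True
        return False

class ATM:
    """Usage dependency: requires *some* Dispenser to function."""
    def __init__(self, dispenser: Dispenser):
        self.dispenser = dispenser

    def withdraw(self, amount: int) -> bool:
        return self.dispenser.dispense(amount)

atm = ATM(CashDispenser(notes=100))
assert atm.withdraw(40)       # succeeds; 60 notes remain
assert not atm.withdraw(70)   # fails; insufficient notes
```

Substituting a different Dispenser implementation leaves the ATM client unchanged, which is exactly the decoupling the UML interface notion is meant to capture.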
9.2.2 System Behavior

System behavior is the direct consequence of the actions of objects. Models of behavior describe how the states of these objects, as reflected by their structural features, change over time. The execution of a behavior is caused by events, whether invoked directly by an action or triggered indirectly. An executing behavior is performed by an object, while emergent behavior results from the interaction of one or more participant objects.

The UML supports a variety of mechanisms for specifying behavior, such as automata, Petri-net-like graphs, informal descriptions, or partially ordered sequences of events. The styles of behavioral specification differ in their expressive power, although the choice of specification style is often one of convenience and purpose; typically, the same kind of behavior could be described by any of the different mechanisms. They may specify behaviors either explicitly, by describing the observable events resulting from the execution of the behavior, or implicitly, by describing a machine that would induce these events.

When a behavior is invoked, argument values corresponding to its formal parameters are made available to the execution. A behavior executes within a context object and independently of any other behavior executions. When a behavior completes its execution, a value or set of values corresponding to its result parameters is returned. Behavior specifications may either describe the overall behavior of an instance, which is executed when the instance is created, or they may describe the behaviors executed when a behavioral feature is invoked.
In addition to structural features, classifiers may have behavioral features, which specify that an instance of the classifier will respond to a designated request (such as an operation call or a sent signal) by invoking a behavior.

Actions are the fundamental unit of behavior specification. An action takes a set of inputs and converts them into a set of outputs, though either or both sets may be empty. In addition, some actions modify the state of the system in which the action executes. Actions may perform operation calls and signal sends, receive events and reply to invocations, access (read and write) structural features of objects and temporary variables, create and destroy objects, and perform computations.

A state machine performs actions based on the current state of the machine and an event that triggers a transition to another state, where a state represents a situation during which some implicit, invariant condition holds. Each transition specifies a trigger upon which the transition will fire, provided that its guard conditions hold and the event is not deferred in this state. Possible triggers are the reception of a signal or the invocation of an operation, the change of a Boolean value to true, or the expiration of a predetermined deadline. Any action associated with the transition is then performed; the next event is processed only when all such actions have completed (run-to-completion semantics). In addition, actions may be performed upon entry to a state, while in a state, or upon exit from a state. Transitions may be broken into segments connecting so-called pseudostates: history, join, fork, junction, and choice vertices, as well as entry and exit points. State machines support two structuring mechanisms: submachine states allow factoring and reuse of common aspects of a state machine, while composite states partition the set of states into disjoint regions.
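The run-to-completion semantics described above can be illustrated operationally. The following sketch is not generated from any UML model and deliberately omits pseudostates, deferral, and entry/exit actions (the states, events, and guards are assumptions chosen to echo the ATM example introduced later): one event is dequeued at a time, a transition fires only if its guard holds, and its action runs to completion before the next event is processed. Events with no matching transition are simply discarded.

```python
from collections import deque

class StateMachine:
    def __init__(self, initial, transitions):
        # transitions: {(state, event): (guard, action, next_state)}
        self.state = initial
        self.transitions = transitions
        self.queue = deque()

    def dispatch(self, event):
        self.queue.append(event)
        # run-to-completion: each event is fully processed before the next
        while self.queue:
            ev = self.queue.popleft()
            key = (self.state, ev)
            if key in self.transitions:
                guard, action, nxt = self.transitions[key]
                if guard():
                    action()
                    self.state = nxt

# A fragment of PIN entry; every keystroke is recorded as the digit 1
# for simplicity (a real machine would carry the digit in the event).
digits = []
sm = StateMachine("Idle", {
    ("Idle", "card"):      (lambda: True, digits.clear, "EnterPIN"),
    ("EnterPIN", "digit"): (lambda: len(digits) < 4,
                            lambda: digits.append(1), "EnterPIN"),
    ("EnterPIN", "ok"):    (lambda: len(digits) == 4,
                            lambda: None, "Authenticated"),
})
sm.dispatch("card")
for _ in range(4):
    sm.dispatch("digit")
sm.dispatch("ok")
assert sm.state == "Authenticated"
```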
An interaction describes system behavior as a set of partially ordered sequences of events. The parts of the underlying classifier instance are the participants in an interaction; they are represented by lifelines. Events are depicted along the lifelines, ordered from top to bottom. Typical events are the sending and receiving of signals or operation calls, the invocation and termination of behaviors, and the creation or destruction of instances. Events on the same lifeline retain their order, while events on different lifelines may occur in any order (they are interleaved). However, coregions (and parallel merge) make the order of events on a lifeline irrelevant, and an order may be imposed on events on different lifelines through constraints and a general ordering mechanism. Communication between instances is shown by messages, which may specify both the content of the communication and its nature (synchronous versus asynchronous, signal versus call, etc.).

The compactness and conciseness of interactions are enhanced by combinators for fragments of event sequences: alternatives and options describe a choice in behavior; parallel merge, weak sequencing, and strict sequencing define the ordering of events, while critical regions prevent interleaving of events in a fragment altogether; loops indicate that a fragment is repeated a number of times. Sequence fragments may be prefixed by guards which prevent the execution of the fragment when false. Interaction references allow factoring and reuse of common interaction fragments. In addition, the high-level structure of interactions can be depicted in a flow-graph-like structure, showing how interaction fragments may follow each other while making choices, loops, or parallelism explicit in the flow graph. Interactions may also be used to assert behavioral properties of a system.
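The trace semantics of interactions can be made concrete with a small brute-force sketch (an illustration only, not part of UML). Two invariants govern valid traces: a message's send event precedes its receive event, and events on one lifeline keep their top-to-bottom order while events on different lifelines interleave freely. For two messages m1 and m2 sent in that order from lifeline A to lifeline B, the sketch enumerates every permutation of the four events and keeps the valid ones:

```python
from itertools import permutations

# Events as (lifeline, kind, message); "!" = send, "?" = receive.
events = [("A", "!", "m1"), ("A", "!", "m2"),
          ("B", "?", "m1"), ("B", "?", "m2")]

def valid(trace):
    # Message invariant: send before receive of the same message.
    for _, _, m in trace:
        if trace.index(("A", "!", m)) > trace.index(("B", "?", m)):
            return False
    # Lifeline invariant: per-lifeline order is preserved.
    for lifeline in ("A", "B"):
        projection = [m for l, _, m in trace if l == lifeline]
        if projection != ["m1", "m2"]:
            return False
    return True

traces = [t for t in permutations(events) if valid(t)]
# Exactly two interleavings satisfy both invariants:
#   !m1 !m2 ?m1 ?m2   and   !m1 ?m1 !m2 ?m2
assert len(traces) == 2
```

Even this tiny diagram denotes more than one trace, which is the point made in the text: a sequence diagram describes a set of traces, not a single one.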
Activities emphasize the sequence of actions, where actions are initiated because other actions finish executing, because objects and data become available, or because triggering events occur. Activities are actions connected by flows: a data flow routes values between actions and between parameters and actions; a control flow enables actions without the transmission of data. Activities support simple unstructured sequences of actions involving branching and joining; concurrent control and data flow; structured control flow through loops and conditionals; and exception handling.

The semantics of activities is gleaned from Petri nets: activities are interpreted as graphs of nodes (actions, flow-of-control operators, etc.) and edges (data and control flows). A node begins execution when specified conditions on its input tokens are satisfied; when a node begins execution, tokens are taken from the input edges; upon completion, tokens are offered on its output edges. Execution of the actions comprising an activity is constrained solely by flow relationships. If two actions are not ordered through flow relationships, they may execute concurrently, unless otherwise constrained. In addition, there are various loosely defined mechanisms, such as streaming outputs or token weights, to impose or vary the manner in which tokens flow through
nodes and along edges (with the intent to support the modeling of processes and workflows). Activities may be hierarchically structured in that actions may invoke behaviors, which may in turn be activities. Further, elements of an activity may be grouped into partitions. These do not affect the token flow but may, for example, be used to allocate each partition to a separate part of a structured classifier.

A use case is the specification of a set of actions performed by the system. Each use case represents a behavior that the system can perform in collaboration with one or more actors, without reference to its internal structure. Use cases are typically used for the description of system requirements, for the specification of requirements that the system imposes on its environment, or for the description of functionality offered by the system.
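The token-flow interpretation of activities sketched above can be simulated directly. The following is a simplified sketch under stated assumptions (one token per edge at most, sequential scheduling of enabled nodes, hypothetical node names): a node executes once every incoming edge carries a token, consumes those tokens, and offers tokens on its outgoing edges. A fork/join diamond shows that ordering comes solely from the flow relationships.

```python
def run_activity(nodes, edges, initial):
    """nodes: {name: action}; edges: list of (src, dst); initial: start nodes."""
    tokens = {e: 0 for e in edges}
    incoming = {n: [e for e in edges if e[1] == n] for n in nodes}
    outgoing = {n: [e for e in edges if e[0] == n] for n in nodes}
    ready = list(initial)
    order = []
    while ready:
        n = ready.pop(0)
        order.append(n)
        nodes[n]()                     # execute the action
        for e in outgoing[n]:          # offer tokens downstream
            tokens[e] += 1
        for m in nodes:                # enable nodes whose inputs are all filled
            if (m not in order and m not in ready and incoming[m]
                    and all(tokens[e] > 0 for e in incoming[m])):
                for e in incoming[m]:  # consume the input tokens
                    tokens[e] -= 1
                ready.append(m)
    return order

# Fork: A enables B and C (unordered relative to each other); join: D needs both.
log = run_activity(
    nodes={n: (lambda: None) for n in "ABCD"},
    edges=[("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")],
    initial=["A"],
)
assert log[0] == "A" and log[-1] == "D" and set(log) == {"A", "B", "C", "D"}
```

B and C are unordered by the flows and could run concurrently; only the join at D forces synchronization, mirroring the Petri-net reading in the text.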
9.2.3 Execution Architecture

The execution architecture of a system (the configuration of runtime processing resources and the processes and objects that execute on them) is specified in terms of a topology of nodes connected through communication paths, representing hardware devices or software execution environments. Deployment is the allocation of software artifacts (physical elements that are the result of a development process) to nodes.
9.2.4 Embedded Systems and UML

In Section 9.1, we discussed that embedded systems often run on dedicated hardware, are reactive, respond in real time with soft time constraints, and are parallel and distributed. Various aspects of UML cater well to modeling these system characteristics.

The UML describes system behavior independent of the constraints of a particular platform; a UML system can be viewed as being executed by a virtual UML machine. While this UML machine is not subject to the limitations of the actual hardware (see Section 9.3.9), it makes it possible to describe system behavior independent of these limitations or, for that matter, of special capabilities that the hardware may offer. It is therefore possible to choose the allocation of various aspects of the system functionality to different underlying processing elements based on system needs and architectural considerations. Functionality can be moved from one underlying system component to another, should the performance requirements of the system demand it. The allocation of functional elements to the underlying hardware can be shown in deployment diagrams. Any ancillary information required to deploy a functional element on a processing element can be expressed as part of the deployment description; the description of the functionality remains free of these implementation details.

A system is reactive when it responds to stimuli received from its environment during the lifetime of the system. The UML communication model is particularly well suited to describing system behavior as the result of interaction with the environment. The objects described by a UML specification may have behavioral features, referred to as operations and receptions, that are triggered by environmental stimuli (the receipt of operation calls or signals, respectively). State machines describe how the system responds to such triggers by executing actions in a state-dependent manner.
Interactions describe system behavior as a set of (partially ordered) event sequences which typically exhibit how the system responds to stimuli received from its environment.

Systems that are required to respond to environmental stimuli in real time need to be able to process multiple requests concurrently. The environment may present the system with multiple, often concurrent, requests. Handling one request may take a long time but must not delay responding to a different stimulus. The UML supports describing the system as composed of multiple, concurrently executing internal components (its parts). The overall system behavior is the result of the cooperation of these parts with each other and with the environment. Note that the decomposition into independently executing architectural components is necessary even when assuming infinitely fast processing speeds, as the handling of a given stimulus may require an unbounded amount of time, yet must not delay the handling of other requests. All behavioral descriptions afforded by UML are sensitive to the internal architecture
of a system as described by its composite structure. In addition, time constraints can be expressed for all behavioral elements. A UML system specification can state hard real-time constraints, but it cannot guarantee that the system will meet these constraints.

As described above, the system behavior is represented as the result of the cooperation of the internal parts of the system. The UML assumes that the parts of the system execute concurrently, and it makes no assumptions about their allocation to processing elements. Consequently, UML describes a system as fully parallel and distributed. However, the system functionality can be allocated to underlying processing elements using deployment diagrams, thus constraining the actual system implementation.

Of course, these features may also be useful when modeling other systems, but they are essential in representing embedded systems. The UML is a very rich specification language, as it aims to be applicable to any software or system application domain. Consequently, not all of its constructs are equally amenable to deployment in every possible application domain. Practitioners have proposed methodologies that focus on deploying UML in the embedded systems domain. For example, the ITU service description methodology (see Figure 9.1), as adapted to UML [12], identifies the following development stages: Stage 1 gives the description of the services expected of the system to be developed, from the user's point of view; Stage 2 describes the user–system interface and the interfaces between different service access points within the system; and Stage 3 gives the description of the detailed system components, as well as of the protocols and message formats. This methodology leverages the following UML diagrams:

• The Stage 1 descriptions treat the system as a single entity that provides services to the users. Use cases or use case maps [13] are used to describe the services provided by the system in terms of the
[FIGURE 9.1 ITU method for service description. The diagram shows three stages: Stage 1 (Service aspects) with Step 1.1 Service definition and description, Step 1.2 Static description of service, and Step 1.3 Dynamic description of service; Stage 2 (Functional network aspects) with Step 2.1 Derivation of a functional model, Step 2.2 Information flow diagrams, Step 2.3 Dynamic description of functional entity, Step 2.4 Functional entity actions, and Step 2.5 Allocation of functional entities; Stage 3 (Network implementation aspects) with Step 3.1 Protocols and formats and Step 3.2 Switching and service nodes.]
perception of the users as well as the users involved in a service. The dynamic information that is sent and received by the user is described by simple state machines.

• Stage 2 identifies the functional capabilities and the information flows needed to support the service as described in Stage 1. Composite structure diagrams describe the architecture of the functional entities comprising the system. Interaction diagrams specify the information flows between functional entities for both successful operation and error conditions. State machines are used to give the dynamic description of each functional entity. The actions performed by each functional entity are characterized in terms of pseudo-code or a formalized action language. In the final step at this stage, functional entities are allocated to underlying system components in composite structure diagrams.

• At Stage 3, for each system component, the functional requirements are defined using class diagrams, composite structure diagrams (for hierarchical decomposition), and state machines (for specification of behavior). In addition, each relationship between two functional entities located in different system components must be realized by protocol(s) supported between those system components. These protocols are typically described in the protocol definition language ASN.1 [14].

In the following, we shall leverage some of the key features of UML identified above (composite structure diagrams, sequence diagrams, and state machine diagrams) to illustrate, by example, the modeling of embedded systems using UML.
9.3 Example: Automatic Teller Machine

This chapter uses one running example, an automated teller machine (ATM), to illustrate how UML can be used to model an embedded system in a compact but readable way. While an ATM is not the most exciting example of an embedded system, its functionality and requirements are well understood.
9.3.1 Domain Statement and Domain Model

An ATM is a system with mechanical as well as electronic parts. Its purpose is to provide a bank user with cash, provided that the user can be authenticated and has adequate funds in the bank account.

• In order to be authenticated, the user presents a card to the ATM card reader and provides a Personal Identification Number (PIN) through the ATM keyboard (which may be a physical keypad, a touch-screen pad, or similar).
• The ATM is connected electronically, possibly through a network, to the bank so that the account status may be checked online.
• The ATM is refilled with cash notes regularly or when the number of specific notes falls below a predetermined limit.
• The ATM may provide foreign currency to the customer.

We begin by constructing a simple class model describing the domain of the ATM system; see Figure 9.2 and Figure 9.3. The domain model does not yet describe the classes that will be developed as part of the system; rather, it captures the essence of the system as described by the domain statement. We see a number of classes that we expect to appear in the final system as well as high-level relationships between these classes, based on the domain statement: an ATM is associated with users and a bank, and it consists of a keyboard, a card reader, a screen, and a cash dispenser. Users have a number of cards that give them access to accounts maintained by the bank. We may use multiplicities and role names on the relationships to document important characteristics of the domain. Note that, according to the domain model, each card gives access to exactly one account.
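The associations of the domain model can be paraphrased directly in code: a multiplicity of 1 becomes a scalar reference and a multiplicity of * becomes a collection. The sketch below is only an illustration (the constructor signatures and the sample values are assumptions; the class names and the myAccount role follow the diagram):

```python
class Account:
    def __init__(self, balance: int):
        self.balance = balance

class Card:
    """Each card gives access to exactly one account (multiplicity 1)."""
    def __init__(self, cid: str, my_account: Account):
        self.cid = cid
        self.my_account = my_account   # role name myAccount in the diagram

class Bank:
    def __init__(self):
        self.accounts: list[Account] = []   # one bank maintains many accounts

class User:
    def __init__(self):
        self.cards: list[Card] = []         # one user holds many cards

acct = Account(balance=500)
user = User()
user.cards.append(Card(cid="c-42", my_account=acct))
assert user.cards[0].my_account.balance == 500
```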
[FIGURE 9.2 Domain class model I. A class diagram relating User, ATM, Bank, Card, and Account; among the multiplicities shown, each Card is linked to exactly one Account, under the role name myAccount.]

[FIGURE 9.3 Domain model II. An ATM is composed of a CardReader, a Keyboard, a Screen, and a CashDispenser.]
[FIGURE 9.4 Use cases of the ATM. The ATM class contains the use cases Withdrawal, Authentication, CashRepository, and Currency, with «include» relationships to Authentication; the actors User and Bank participate.]
9.3.2 Behavior Overview through a Use Case Diagram

We rely on use cases in a very restricted way, simply to give an overview of the different services that external entities (users and other systems) expect from the ATM system. As we represent the system by means of a class (here ATM), we define the use cases as part of that class; see Figure 9.4. This provides a direct and simple way of specifying what the subject of the use cases is, namely the ATM. An alternative would be to define use cases as part of packages, combined with some informal indication of what the subject of the use cases is.
[FIGURE 9.5 The context of an ATM. The collaboration BankContext contains the parts :User, :ATM, and :Bank. The :User part is connected to :ATM through the ports User-reader, User-screen, User-keyboard, and User-cash; :ATM is connected to :Bank through the port ATM-bank.]
9.3.3 Context Description by a Collaboration Diagram

We use collaboration diagrams mainly for two purposes:

• To define the context for interactions between the system and entities in its environment.
• To identify the other entities that the system depends upon.

In this case only the ATM part will be developed, but it depends upon agreed interfaces to users and banks. While the interaction with the user is elaborated as part of the development of the ATM, the interaction with the bank may be given in advance.

Figure 9.5 shows the bank context of the ATM class. The context in which the ATM operates is comprised of two other parts: User and Bank. The parts are linked by connectors, which specify communication paths between the objects playing these parts. Note the difference between this model and the domain model: here we describe only the parts that interact, while classes such as Card and Account are not involved. In addition, we do not use the fact that ATMs consist of parts such as, for example, CardReader; instead, we use the port User-reader to indicate the connection to this internal part of the ATM, thus isolating the internal structure of the ATM from the environment.

This collaboration is used to specify a view of the cooperating entities only. It specifies the required features of the parts as well as the required communications between them. Any object that plays one of these parts must have at least the properties specified by the classifier that types the corresponding part of the collaboration. The system specified by the collaboration may have additional parts not shown, and the objects playing the parts may have additional properties. In addition, we can specify constraints on the number of objects that may play each part. In Figure 9.6, it is specified that there may be up to 10,000 users, up to 100 ATMs, and just one bank. The collaboration in Figure 9.6 specifies that the context in which the ATM works is comprised of three sets of objects.
Each part of a collaboration represents a set of objects, in this case a set of User objects, a set of ATM objects, and a set of Bank objects. The specified system will be made up of objects corresponding to each of these parts, as specified by the multiplicities of the parts, but there will not be any object corresponding to the BankContext as such.
9.3.4 Behavior Modeling with Interactions

In this section, we describe the services of the ATM based on the domain statement and the concepts and structure described above. We use sequence diagrams to describe the services of the system. A formalization of the approach to sequence diagram modeling shown here has been presented in Reference 15.
[FIGURE 9.6 Bank context with multiplicities on parts. As in Figure 9.5, but the parts carry multiplicities: :User [1..10,000], :ATM [1..100], and a single :Bank.]
[FIGURE 9.7 EnterPIN, a very simple sequence diagram. The lifelines :User, :ATM, and :Bank exchange the messages msg("Enter PIN") from the ATM to the user, four Digit messages from the user to the ATM, Code(cid, pin) from the ATM to the bank, and OK from the bank to the ATM; the diagram ends with the continuation PIN OK.]
9.3.4.1 A Simple Sequence Diagram for a Simple Situation

We first present a very simple sequence diagram describing how to enter a PIN on the ATM. We will later apply this simple behavior in more elaborate scenarios. EnterPIN is a sequence diagram that requires no structuring mechanisms. In Figure 9.7, we see that the communicating objects refer to the parts of the collaboration depicted in Figure 9.5 and Figure 9.6. The objects in an interaction are called lifelines, as their most significant asset is the vertical line to which the different messages are attached. Each lifeline represents exactly one object and not a set of objects as in the collaboration of Figure 9.6. The single object may be identified by a selector expression, but typically an interaction refers to an arbitrary object of a set, since the objects of the set are indistinguishable for most purposes. Most people, regardless of their competence in UML and computer science, will intuitively interpret EnterPIN correctly. First, the ATM will give a message to the user that a PIN needs to be provided.
We rely on a message type msg with a plain string literal as actual parameter. This is a very simplistic approach; in practice, such systems will use strings from a language-specific repository to accommodate different locales (the language may actually be selectable by the user during the interaction). The user knows that this number consists of four digits and presses those digits. The ATM combines those four digits into the pin code and transmits that code, together with the earlier collected card identification cid, to the bank for authentication. In this scenario, the bank returns a message OK to indicate that the pin and the card identification were accepted. At the bottom of the sequence diagram we have placed a continuation with the label PIN OK to indicate that further continuations could start with the assumption that card and pin were accepted. We shall look at continuations shortly.

The scenario in Figure 9.7 is obviously not the complete story, even regarding the simple matter of entering the PIN. We have given only one of many possible scenarios that a real system would have to take into account. We have considered only a case where the card and pin are actually accepted. But cards and pins are not always accepted, which is the whole purpose of authentication. To define a situation where the card and/or the pin are rejected, we could construct another scenario that would start out identically to that of Figure 9.7 but would end up with a rejection of the user. Then we could add yet another scenario where the user keyed in only three digits, or five, or entered a letter instead of a digit, and so on. We would quickly arrive at a large number of scenarios for the same small detail. The need for organizing and categorizing these scenarios is apparent. (For brevity, in this section we limit ourselves to describing that card and pin may be rejected as well as accepted.)
In Figure 9.8, we have introduced a combined fragment with the operator alt (alternative) to indicate that there are alternative scenarios. The upper operand ends in a continuation PIN NOK, and the lower is identical to that of Figure 9.7. Since the first messages are identical, it is more compact and more transparent to describe the difference only where it appears.

Messages are represented by the arrows attached to either side of a lifeline. In Figure 9.8, all the arrows have an open stick arrowhead, which means that these messages are asynchronous and have an associated signal that is transferred from the sender to the receiver. Real-time embedded systems typically rely on asynchronous messages; for simplicity we shall use only this kind of message in this chapter. For other
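Asynchronous signal exchange of the kind shown in these diagrams can be mimicked with per-object message queues: the sender continues immediately after enqueuing, and the receiver consumes signals in arrival order. The sketch below is an illustration only (the ActiveObject class and the signal names are assumptions, not a UML-defined mapping):

```python
from queue import Queue

class ActiveObject:
    def __init__(self, name):
        self.name = name
        self.inbox = Queue()

    def send(self, target, signal, *args):
        # asynchronous: enqueue and return without waiting for a reply
        target.inbox.put((signal, args))

    def consume(self):
        return self.inbox.get()

user, atm = ActiveObject("User"), ActiveObject("ATM")
for d in (1, 2, 3, 4):                    # the user keys in four digits
    user.send(atm, "Digit", d)
pin = tuple(atm.consume()[1][0] for _ in range(4))
assert pin == (1, 2, 3, 4)                # consumed in the order they were sent
```

This mirrors the observation in the text about Figure 9.8: sends and receives may interleave freely, but the digits are consumed in the order they were sent.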
[FIGURE 9.8 EnterPIN with alternatives. As in Figure 9.7, but after Code(cid, pin) an alt combined fragment offers two operands: the bank replies NOK, ending in the continuation PIN NOK, or the bank replies OK, ending in the continuation PIN OK.]
purposes, messages may also describe synchronous or asynchronous (remote) calls requiring a return message. A call is indicated by a filled arrowhead. Events are represented by the points where a message meets a lifeline. In other words, a message is the pair of a sending event and a receiving event. There are two important invariants about (simple) sequence diagrams:

• Message invariant: the sending event of a message must precede the receiving event of the same message.
• Lifeline invariant: events are ordered from top to bottom on a lifeline (with exceptions for events in different combined fragment operands).

The formal meaning of a sequence diagram is the set of event traces that it describes in the presence of the two invariants. Even simple diagrams often represent more traces than first meets the eye, especially because the different lifelines are independent. For example, the diagram shown in Figure 9.8 describes several traces, owing to the fact that the user keys in digits independently of the ATM consuming them. Thus, the sending events and receiving events may be arbitrarily interleaved, resulting in many different traces. Sometimes different traces represent significant behavioral differences, while in other cases, such as in our example diagram, there is little significance to the relative order of sending and receiving digits as long as the digits are consumed in the order they are sent, which is precisely what the diagram describes.

9.3.4.2 Factoring out General/Common Interactions

EnterPIN is a simple scenario that is not even mentioned in the use cases of Figure 9.4. We could have introduced it as being used by the Authentication use case. We have decided that the card holder has three tries to enter the PIN code correctly. This means that the interaction representing the entering of the PIN code will occur several times in the full interaction for authentication, as shown in Figure 9.9.
The scenario described by Authenticate is also quite intuitive. First, the user inserts a card and thus transfers the card identification cid to the ATM. Then the EnterPIN scenario takes place. The final loop shows a situation that only occurs if the card and PIN are not accepted. This is shown by the loop starting with the continuation PIN NOK. Within the loop the EnterPIN scenario will occur again. The loop may execute from 0 to 2 times, as shown by the parameters in the operator tab.
FIGURE 9.9 Interaction for authentication.
© 2006 by Taylor & Francis Group, LLC
Introduction to UML and the Modeling of Embedded Systems
The reference to EnterPIN is given by an interaction occurrence, as indicated by the symbol ref in a corner tab. Interaction occurrences make it possible to structure sequence diagrams, but they must not be confused with operation calls. They are sub-scenarios; they are not invoked by an executing entity. 9.3.4.3 Describing the Full Withdrawal Service Now we are ready to describe the full Withdrawal service using the Authenticate scenario as a sub-scenario, as shown in Figure 9.10. There are no new language features in this scenario apart from the obvious fact that combined fragments (in this case, alternatives) may be nested. We have assumed that the ATM will keep the user's card during the execution of the withdrawal service. The card is returned at the end of the sequence. Again, we do not show all possible scenarios that may occur when withdrawing money, but only some of the scenarios that are important and must be considered by the implementers. When sequence diagrams become complex, consisting of several lifelines and nested combined fragments, a designer may choose to utilize an interaction overview diagram. Such diagrams also describe interactions, but the detailed messages and lifelines are abstracted away, while the control structure is shown not as nested rectangles, but as a branching graph.
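The point that a ref is inclusion rather than invocation can be illustrated with a sketch (entirely my own construction, with simplified message names): expanding an interaction occurrence splices the traces of the referenced interaction into the enclosing one, with no call involved.

```python
# Sketch: expanding an interaction occurrence ("ref") is textual/semantic
# inclusion of the referenced interaction's traces, not an operation call.
enter_pin = [["GivePIN", "d1", "d2", "d3", "d4"]]  # one simplified trace

def expand(scenario, refs):
    """Return the traces of a scenario whose ('ref', name) steps are
    replaced by every trace of the referenced interaction."""
    traces = [[]]
    for step in scenario:
        if isinstance(step, tuple) and step[0] == "ref":
            traces = [t + sub for t in traces for sub in refs[step[1]]]
        else:
            traces = [t + [step] for t in traces]
    return traces

authenticate = ["Cardid", ("ref", "EnterPIN"), "Code"]
print(expand(authenticate, {"EnterPIN": enter_pin}))
```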
FIGURE 9.10 Withdrawal of native money.
FIGURE 9.11 Withdrawal as interaction overview diagram.
Figure 9.11 shows Withdrawal as an interaction overview diagram. Nodes in the graphs are either references to other interaction diagrams or inlined sequence diagrams. The graph notation is inherited from activity diagrams, but this diagram defines an interaction. The interaction overview diagram provides a high-level overview when the nodes are interaction occurrences, but on the other hand isolating every small piece of an interaction in a separate diagram leads to a loss of understanding, as there may be too many concepts to keep track of. The designer needs to find the appropriate balance of when to apply interaction overview diagrams instead of sequence diagrams. In Figure 9.11, we have chosen to isolate the interaction of specifying the withdrawal amount in a separate sequence diagram (not shown here), and kept the other trivial sequences as inline diagrams.
9.3.5 Behavioral Modeling with State Machines Interactions describe behavior as a sequence of events and may focus on a subset of possible behaviors only. Alternatively, behavior can be described by a state machine that induces those sequences of events.
FIGURE 9.15 Attributes of EnterPIN.
The attribute cid used when sending the signal Code(cid,PIN) is the one defined in the ATM state machine. 9.3.5.3 Withdrawal Withdrawal of funds requires that the user provides an amount and that the transaction is verified; see Figure 9.16. The attribute sa represents the selected amount that is to be verified against the account. This attribute is set by the GetAmount submachine. If the amount is too large, a message is given to the user, who may enter another amount. The operation sendMoney represents the actions needed in order to deliver the requested amount of funds (Figure 9.17). 9.3.5.4 Flexibility through Specialization To describe the state machine GetAmount, we start out with a simple version that merely lets the user select among predefined amounts (such as 100, 200, 300, 400, etc.). At any time during this selection, the user may cancel the selection; see Figure 9.18. We omit the details of selecting an amount, which is covered by the state machine SelectAmount. A more flexible ATM would give the user the opportunity to enter the desired amount, in addition to selecting among predefined amounts. We define FlexibleATM as a subclass of ATM and extend GetAmount in the subclass, as shown in Figure 9.19. By extending the state GetAmount, the GetAmount submachine of FlexibleATM maintains some of the behavior of the more general GetAmount submachine; see Figure 9.20. The extended state machine GetAmount adds the state EnterAmount and also adds transitions to and from this new state. In addition, it redefines the transition from the entry point again so that the target state is the added state EnterAmount. The inherited state SelectAmount is drawn with a dashed line. Although inherited, this state is shown in the extended state diagram in order to add the transition with the trigger otherAmount, leading to the new state EnterAmount. In general, states and transitions that are not shown are inherited from the general state machine.
Inherited states and transitions may be drawn using dashed lines in order to establish the context for extension and redefinition. The resulting state machine defined implicitly by the extended GetAmount state machine is depicted in Figure 9.21. This example illustrates the difference from state machines in earlier versions of UML. Without the structuring mechanisms recently introduced to UML, this state machine would have to be specified as shown in Figure 9.22.
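The extension mechanism can be mimicked in code (a rough sketch, using plain subclassing as a stand-in for UML's {extended} submachines; the transition table is my own simplification of Figures 9.18 through 9.21):

```python
# Rough sketch of the {extended} submachine mechanism via subclassing.
class GetAmount:
    def transitions(self):
        return {  # (state, trigger) -> target
            ("entry", "again"): "SelectAmount",
            ("SelectAmount", "amount"): "exit_ok",
            ("SelectAmount", "cancel"): "exit_cancelled",
        }

class FlexibleGetAmount(GetAmount):
    def transitions(self):
        t = dict(super().transitions())
        # Redefine the entry transition to target the new state, and add
        # transitions to and from EnterAmount; everything else is inherited.
        t[("entry", "again")] = "EnterAmount"
        t[("SelectAmount", "otherAmount")] = "EnterAmount"
        t[("EnterAmount", "amount")] = "exit_ok"
        t[("EnterAmount", "cancel")] = "exit_cancelled"
        return t

ext = FlexibleGetAmount().transitions()
print(ext[("SelectAmount", "amount")])  # exit_ok (inherited unchanged)
print(ext[("entry", "again")])          # EnterAmount (redefined)
```

As in the UML model, only the redefined or added transitions appear in the subclass; everything not mentioned is inherited from the general machine.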
FIGURE 9.16 The withdrawal sub-state machine.
FIGURE 9.17 Properties of withdrawal.
FIGURE 9.18 Simple version of GetAmount.
FIGURE 9.19 Flexible ATM as a specialization of ATM.
FIGURE 9.20 GetAmount extended.
This example also illustrates that the use of exit points gives us the ability to define actions on transitions where they really belong. Consider the transitions triggered by cancel (with state Cancelled as target). In Figure 9.22, this transition is a single transition crossing the boundary of the state GetAmount, with a single action. However, in Figure 9.21, where GetAmount is defined as a submachine state with entry and exit points, these transitions are composed of two (partial) actions: one inside GetAmount (and therefore with access to whatever attributes it may have) and one outside, in the scope of Withdrawal (and therefore with access to attributes of Withdrawal only). The inner action is performed before the outer action.
FIGURE 9.21 Resulting GetAmount.
FIGURE 9.22 Withdrawal with a composite state GetAmount.
9.3.6 Validation We have now presented (part of) the requirements on the ATM in the form of sequence diagrams and (part of) its design in the form of a structured state machine. The interesting question now is, of course, whether the design satisfies the requirements. We take the specification of the withdrawal service from Figure 9.10 and walk through the described behavior while checking the state machine design of the ATM, applying the following principles:
• Establish an initial alignment between the interaction and the state machine. What is the state of the state machine?
• Assume that the messages to the state machine are determined by the interaction.
• Check that the actions, and especially the output messages, resulting from the transitions of the state machine correspond with the event occurrences on the lifeline.
• Perform this test for all traces of the interaction.
This procedure of consistency checking can be automated provided that the model is sufficiently precise. In our example, informal text is used on several occasions to simplify the illustration, and this would obstruct automatic checking. It is, however, possible to define the interactions and the state machines such that automatic checking is feasible. To align Withdrawal, we assume that the beginning of the Withdrawal sequence corresponds to the Idle state in Figure 9.13. Withdrawal starts with the reference to Authenticate (Figure 9.9). Authenticate begins with the continuation Idle and, even though it is not a state invariant, it corresponds well with our chosen alignment to the state Idle in the state machine. Then the ATM receives the Cardid message and we look into the state machine to see whether this will trigger a transition to EnterPIN. Hopefully the submachine (see Figure 9.12) corresponds directly to the sub-scenario EnterPIN (see Figure 9.8).
The transition triggered by the Cardid message is continued inside the EnterPIN submachine and transmits msg("Give PIN") before it enters the state EnterDigit. We check the sequence diagram and find that this corresponds well. Then the sequence diagram continues to describe that the ATM receives four digits. In the state machine the PIN is built up and the loop will terminate before the fourth digit. At the fourth digit another transition will be triggered, a message will be sent to the bank, and the waitOK state will be entered. Depending on the answer from the bank, exit from the state will be either through the ok exit point or the nok exit point. Let us first assume that the sequence diagram enters the nok branch. In this case, the sequence diagram will end in the PIN NOK continuation and it will continue into the loop that starts with the same continuation. There we can see that the ATM transmits the msg("Try again!") message to the user. In the state machine, we have now left the EnterPIN submachine through the nok exit point and returned to the top-level ATM state machine. Since the loop has just been entered, the transition will send the appropriate message to the user and return to the EnterPIN state. This corresponds well with the situation within the sequence diagram, where we find a reference within the loop to EnterPIN. If the next time around we follow the ok branch, the state machine will transit through the ok exit point and enter the Service state. The sequence diagram has finished the Authenticate scenario and continues in the Withdrawal diagram within the alternative starting with the PIN OK continuation. Prompted by a "Select service!" message, the ATM will now receive a Withdrawal message, which in the state machine Service triggers a transition to the Withdrawal submachine. Within Withdrawal the transition continues into GetAmount, where the user is prompted with msg("Select Amount") and then the state SelectAmount is entered.
Without going into further detail, we conclude that the user is given the chance to provide an amount and this is input to the ATM. The state machine leaves GetAmount on this trigger and continues to send CheckAccount on the transition to VerifyTransaction within Withdrawal. An ok will cause the state machine to exit through the exit point ok after having transmitted money and a receipt. Finally, the card is ejected and the user takes
the card to return the state machine to the Idle state. This matches the sequence diagram, and we can conclude that, at least for these few traces, the ATM state machine corresponds to the requirements. There are more traces described in the sequence diagrams; an examination of these traces is left to the reader. On the other hand, there are also more traces induced by the state machine execution than are captured in the sequence diagrams. Sequence diagrams, as is the case for most property descriptions, are only partial and concentrate on the most important aspects of the system behavior.
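To suggest what an automated version of this walk-through could look like, here is a minimal sketch; the transition table and traces below are invented, drastic simplifications of the ATM model. It drives a state-machine model with the input messages of a trace and compares the machine's outputs with what the sequence diagram expects:

```python
# Minimal sketch of automated interaction/state-machine consistency checking.
machine = {  # (state, input) -> (next state, output or None)
    ("Idle", "Cardid"): ("EnterPIN", 'msg("Give PIN")'),
    ("EnterPIN", "PIN_ok"): ("Service", None),
    ("Service", "Withdrawal"): ("GetAmount", 'msg("Select amount")'),
}

def conforms(trace, state="Idle"):
    for inp, expected_out in trace:
        if (state, inp) not in machine:
            return False  # the machine cannot consume this message here
        state, out = machine[(state, inp)]
        if out != expected_out:
            return False  # output disagrees with the lifeline's events
    return True

good = [("Cardid", 'msg("Give PIN")'), ("PIN_ok", None)]
bad = [("Cardid", 'msg("Try again!")')]
print(conforms(good), conforms(bad))  # True False
```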
9.3.7 Generalizing Behavior To show how UML diagrams can be used to define generalization of behavior, we start by describing an ATM service where foreign currency can be obtained. 9.3.7.1 Generalizing with Interactions A scenario describing the withdrawal of foreign currency is shown in Figure 9.23; the similarities to the withdrawal of local currency as shown in Figure 9.10 are apparent. The challenge is to highlight the differences and to describe the commonalities only once. In Figure 9.23, we have also shown guards (interaction constraints) on the operands of the inner alternative combined fragment. A guard refers to data values accessible to the instance represented by the lifeline at the first event of the operand. We have stated the guards in informal text, since UML has not defined a concrete data language and the evaluation of the guard expressions is not a focus of this chapter. The retrieval of local and foreign currencies can be seen as two specializations of an abstract withdrawal service. The hierarchy of withdrawal services expressed through interactions as classifiers is shown in Figure 9.24. The utility interactions getAmount and giveMoney are defined locally to GenWithdrawal so that they can be redefined in specializations. The sequence diagram for GenWithdrawal is shown in Figure 9.25. The interactions for getting the amount and delivering the money have been separated out and replaced by references to local sub-scenarios. To achieve the same behavioral description as that of Figure 9.10, we need to redefine getAmount and giveMoney as shown in Figure 9.26. This figure depicts message gates, a feature that is new in UML sequence diagrams. In the general withdrawal service (Figure 9.25), the ok message enters the giveMoney interaction occurrence; that connection point is called an actual gate. The corresponding formal gate can be found in the definition of giveMoney in Figure 9.26, with the meaning that the message is routed as shown.
No event occurs when a message passes through a gate. Gates may be named explicitly, but implicit naming is often sufficient, such as when the identity of the gate is determined by the message name and the direction of the message. Gates are visible from the immediate outside of the interaction occurrence and from the inside of the definition diagram. In order to describe the behavior of currency purchase as first shown in Figure 9.23, we redefine getAmount and giveMoney as shown in Figure 9.27. 9.3.7.2 Generalizing with State Machines In order to express the interaction generalization given earlier, we similarly generalize the ATM state machine to express only the general behavior, that is, we only include behavior that is specific to obtaining the PIN code and to authorization, see Figure 9.28. We assume this state machine to be the state machine of the ATM class, and introduce two subclasses of ATM as shown in Figure 9.29. The state machines for the two subclasses of ATM, as shown in Figure 9.30, add the states Withdrawal and Currency, respectively, together with their transitions, and extend the Service state by the additional exit points Withdrawal and Currency, respectively.
FIGURE 9.23 Withdrawing foreign currency.
We will not delve into the details of the states Withdrawal and Currency, but it can readily be seen that the submachines SelectAmount and EnterAmount may in fact be the same for both as they merely model the behavior of obtaining the amount independently of the chosen currency. While the ordinary withdrawal service assumes this number is the amount in local currency, the CurrencyATM will convert this number from the foreign currency to local currency. Verification against the account is performed in local currency.
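The division of labor described above can be sketched as follows (the rate table and function are my own inventions; the chapter gives no actual conversion code): the ordinary ATM verifies the entered amount directly, while the CurrencyATM converts it to local currency first.

```python
# Sketch of the two ATM variants' handling of the entered amount.
RATES = {"USD": 2}  # hypothetical foreign-to-local exchange rate

def amount_to_verify(atm_kind, amount, currency=None):
    """Return the amount, in local currency, to check against the account."""
    if atm_kind == "CurrencyATM":
        return amount * RATES[currency]  # convert before the account check
    return amount                         # ordinary ATM: amount is local

print(amount_to_verify("ATM", 100))                 # 100
print(amount_to_verify("CurrencyATM", 100, "USD"))  # 200
```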
9.3.8 Hierarchical Decomposition Until now we have considered the ATM as one atomic entity even though we earlier hinted in the domain model that it was internally structured (Figure 9.3). We shall further explain the behavior of the ATM as the result of the interaction between the parts of the composite structure of the ATM.
FIGURE 9.24 Generalized withdrawal inheritance hierarchy.
9.3.8.1 Decomposition of Classes While the bank context was specified as a collaboration with a composite structure in terms of User, ATM, and Bank (Figure 9.6), the ATM itself is further specified as a class with a composite structure representing the fact that an ATM consists of a screen, a card reader, a keyboard, and a cash dispenser. The domain model had simply specified that an ATM consists of these parts; we now need to impose additional architectural constraints. For example, the user does not interact with the card reader in isolation; instead, the user interacts with the card reader as a part of the ATM. This is expressed by defining the ATM as a composite class with ports for each kind of interaction, and with parts that internally are connected to these ports, as shown in Figure 9.31. The oval shapes indicate behavior ports, which are ports through which the behavior of the ATM class, in this case the ATM state machine, communicates. Behavior directed at a behavior port is directed at the behavior of the containing class. Note that the types of the parts are not owned by the ATM class; they are simply used. We have not yet examined how the collection of classes of the ATM is organized, but typically one would define a package of classes related to the ATM; see Figure 9.32. 9.3.8.2 Decomposition of Lifelines We indicate that a lifeline is detailed in another sequence diagram by a reference in the lifeline header to that sequence diagram (see Figure 9.33). The reference clause indicates which sequence diagram defines the decomposition of the lifeline. Decomposition is relative to both the lifeline and the behavior in which the lifeline exists. Thus, for example, ATM_Withdrawal details the withdrawal service within the ATM (see Figure 9.34). There are strong syntactic requirements ensuring the correspondence between the constructs on the decomposed lifeline (in this case, the ATM in Withdrawal) and the referenced detail diagram (here, ATM_Withdrawal).
The series of fragments on the lifeline should have an exact corresponding global fragment in the referenced diagram. In this example, this condition holds: for every interaction occurrence covering ATM there is a corresponding interaction occurrence that covers all the lifelines of ATM in ATM_Withdrawal. The same holds for combined fragments — even nested fragments. The combined fragments in Figure 9.34 extend beyond the boundaries of the decomposition diagram to show that the semantics of this fragment takes into account the enclosing level. There are interaction references of two kinds. Interaction occurrences refer to sub-scenarios at the same level of hierarchical abstraction (such as when Authenticate is referenced from Withdrawal), while decomposition refers from one hierarchy level to the next (such as when ATM_Withdrawal is referenced from ATM in Withdrawal).
FIGURE 9.25 General withdrawal service.
These two kinds of reference must be consistent: if we start with Withdrawal (Figure 9.10) and follow the reference Authenticate (Figure 9.9) and then decompose ATM in Authenticate referencing ATM_Authenticate (Figure 9.35), we should reach the same diagram as if we decomposed ATM in Withdrawal to ATM_Withdrawal (Figure 9.34) and then followed the reference to ATM_Authenticate. Decomposition commutes; that is, the final diagram in such a chain of references represents the description of the behavioral intersection of the lifeline (here, the ATM) and the reference (in this case, Authenticate).
FIGURE 9.26 getAmount and giveMoney for withdrawing native money.
FIGURE 9.27 getAmount and giveMoney for purchasing currency.
9.3.9 The Difference between the UML System and the Final System A UML system defined by a set of interacting state machines can be viewed as being executed by a UML virtual machine. Most often the UML machine (or runtime system) has properties that the real system will not have. The following differences between the ideal UML machine and the final system have been discussed extensively by Bræk and Haugen [16]. 9.3.9.1 Processing Time In the definition of the ATM, we have not given any time constraints or made assumptions about the processing time of the system functionality. Such time constraints can be introduced by using the simple time model that is part of the common behavior of UML. The ideal UML machine has no limitations in processing capacity; neither is it constrained by finite processor speed or clock frequency. Thus we cannot know whether our ATM will actually fulfill the real requirements, which would include adequate response times for the user trying to obtain cash. 9.3.9.2 Errors and Noise UML system specifications may have logical errors, but they do not suffer from physical errors. It is assumed that whenever a message is sent, it will arrive at its destination. The UML machine does not stop without reason, and the contents of signals will not change. In the real world, however, such malfunctions do occur, and a good designer will have to cope with them during system development.
expect @new_irdy => {[..]*((not @irdy_rise) and (not change(frame_))); @data_phase_complete} @sys.pci_clk else dut_error("Error, IRDY# or FRAME# changed", "before current data phase completed.");
Here, suppose that the events (shown as @event) have been defined already. The shown expression specifies that whenever IRDY# is asserted (@new_irdy), de-assertion of IRDY# (@irdy_rise) or a change in FRAME# should not occur until the data phase completes (@data_phase_complete). The use of
3 A struct type basically corresponds to a class type in C++, that is, it allows method definitions along with data definitions. Since it is conceptually similar to other object-oriented languages, we omit the actual syntax.
Verification Languages
@sys.pci_clk denotes that the event pci_clk is used for sampling signals in evaluating the given TE. This feature is also useful for verifying designs with multiple clocks.
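For readers unfamiliar with e, the temporal expression above can be paraphrased as a hand-written cycle-based monitor (my own sketch, not generated code; each dict below stands for the events sampled in one pci_clk cycle, and the cycle data is invented):

```python
# Hand-rolled monitor mirroring the e temporal expression: once @new_irdy
# fires, a rise of IRDY# or any change of FRAME# is illegal until the data
# phase completes.
def check(cycles):
    active = False
    for c in cycles:
        if c.get("new_irdy"):
            active = True
        if active:
            if c.get("irdy_rise") or c.get("frame_changed"):
                return "Error, IRDY# or FRAME# changed before data phase completed."
            if c.get("data_phase_complete"):
                active = False
    return "ok"

good = [{"new_irdy": 1}, {}, {"data_phase_complete": 1}]
bad = [{"new_irdy": 1}, {"frame_changed": 1}]
print(check(good), check(bad))
```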
10.3.4 OpenVera and OVA OpenVera from Synopsys is another testbench language, similar to e in terms of functionality and similar to C++ in terms of syntax. Since conceptually OpenVera is very similar to e, we do not include testbench examples for OpenVera here. It has similar constructs for coverage, random stimulus generation, data packing, etc. OpenVera Assertions (OVA) is a standalone language, which is also part of the OpenVera suite [25]. OpenVera comes with a checker library (OVA IP), which is similar to OVL. OVA and OpenVera also have event definitions, repetition operators (*[..]), and sequencing, where different sequences can be combined to create more complex sequences using logical and repetition operators. Example 10.7 The following example shows the OVA description of the PCI specification from Example 10.3:

clock posedge clk {
  event chk: if (negedge stop_) then #[1..3] posedge frame_;
}
assert frame_chk : check(chk);
Note the specification of the sampling clock, the event, and the corresponding action. The implication operation (if–then) is similar to conditionals in programming languages, and # is the cycle delay operator.
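The #[1..3] window can likewise be paraphrased as a plain trace monitor (a sketch under invented list encodings of the signals, not OVA semantics verbatim): a falling edge on stop_ must be followed by a rising edge on frame_ within one to three cycles.

```python
# Sketch of the OVA check above as a trace monitor over sampled signals.
def frame_chk(stop_, frame_):
    ok = True
    for i in range(1, len(stop_)):
        if stop_[i - 1] == 1 and stop_[i] == 0:  # negedge stop_ at cycle i
            window = range(i + 1, min(i + 4, len(frame_)))
            if not any(frame_[j - 1] == 0 and frame_[j] == 1 for j in window):
                ok = False  # no posedge frame_ within 1..3 cycles
    return ok

print(frame_chk(stop_=[1, 0, 0, 0, 0], frame_=[0, 0, 1, 1, 1]))  # True
print(frame_chk(stop_=[1, 0, 0, 0, 0], frame_=[0, 0, 0, 0, 0]))  # False
```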
10.3.5 ForSpec ForSpec is a specification and modeling language developed at Intel [26]. The temporal logic underlying ForSpec is ForSpec Temporal Logic (FTL), which is composed of regular language operators and LTL-style temporal operators. ForSpec is aimed at an assume-guarantee verification paradigm, which is suitable for modular verification of large designs. The language also provides explicit support for multiple clocks, reset signals, and past-time temporal operators. Although none of these additional features increases the expressiveness of the basic language, they clearly ease the specification of properties in practice. Example templates for asynchronous set/reset using FTL are shown below:

accept(boolean_expr & clock) in formula_expr
reject(boolean_expr & clock) in formula_expr
Recently, some constructs of ForSpec have also been added to OVA Version 2.3. In particular, the concept of "temporal formulas" has been added; these can be composed by applying temporal operators to sequences. The supported temporal operators are: followed_by, triggers, until, wuntil, next, wnext [25,27], which can be expressed in terms of standard temporal operators (described in Section 10.2.4). Asynchronous set/reset has also been added.
4 Not yet supported by Synopsys at the time of this writing.
10.3.6 Property Specification Language Property Specification Language (PSL) is the language standard established by Accellera for formal property specification [3]. It originated from the language Sugar, developed at IBM. The first version of Sugar was based on CTL, and was aimed at easing the expression of CTL properties by users of the RuleBase model checker [28]. PSL is based on Sugar Version 2.0, where the default temporal logic is based on LTL, called the PSL/Sugar Foundation Language. The main advantage of LTL in assertion-based verification is that its
semantics, that is, evaluation on a single execution path, is more natural for simulation-based methods, where a single path is traced at a time. Though CTL is more efficient in terms of model-checking complexity, it is harder to support in simulation-based methods. For formal verification, CTL continues to be supported in PSL through an Optional Branching Extension (OBE). According to the Accellera standard, PSL assertions are declarative. Since many users prefer procedural assertions, nonstandard pragma-based assertions are also supported. PSL properties consist of four layers: Boolean layer, temporal layer, verification layer, and modeling layer. The bottom-most is the Boolean layer, which specifies the Boolean expressions that are combined using operators and constructs from the upper layers. The syntax of this layer depends on the hardware modeling language used to represent the design. Currently, PSL supports both Verilog and VHDL. (Sugar also supports the internal language EDL of RuleBase.) The temporal layer specifies the temporal properties, through the use of temporal operators and regular expression operators. The temporal operators (eventually, until, before, next) and some regular expression operators come in two forms. One is the strong form, indicated with an "!" suffix (e.g., eventually!), which specifies that the property must hold before the end of the simulation. The other, the weak form (without the "!"), fails only when the chance of the property ever holding vanishes completely. PSL/Sugar also has some predefined temporal constructs regarded as syntactic sugar, targeted at easing usage without adding expressiveness. The verification layer specifies the role of the property for the purpose of verification. This layer supports the keywords assert, assume, assume-guarantee, restrict, restrict-guarantee, cover, and fairness. The keyword assert indicates that the property should be checked as an assertion.
The keywords assume and assume-guarantee denote that the property should be used as an assumption, rather than checked as an assertion. The keywords restrict and restrict-guarantee can be used to force the design into a specified state. The keyword cover is used for coverage analysis, and fairness to specify fairness constraints for the DUV. This layer also provides support for organizing properties into modular units. Finally, the modeling layer provides support for modeling the environment of the DUV, in terms of the behavior of the design inputs, and the auxiliary variables and signals required for verification. Example 10.8 Going back to the same PCI specification, as shown in Examples 10.3 and 10.7, the property is written in PSL as follows: assert always (!stop_-> eventually! frame_);
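The strong/weak distinction is easiest to see on a finite trace. The following is a deliberately simplified model of the two forms of eventually (my own semantics sketch, not the full PSL standard): at end of simulation, an unfulfilled strong operator fails while an unfulfilled weak one passes.

```python
# Simplified finite-trace model of PSL's strong vs. weak "eventually".
def eventually(trace, cond, strong):
    if any(cond(s) for s in trace):
        return True
    return not strong  # still pending when simulation ends

trace = [0, 0, 0]  # the awaited signal is never asserted before end of sim
print(eventually(trace, lambda s: s == 1, strong=True))   # False: eventually!
print(eventually(trace, lambda s: s == 1, strong=False))  # True: weak form
```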
Note that the above specification is qualitative in terms of when the frame_ signal should be de-asserted, that is, the eventually! operator does not specify that it should do so within one to three cycles after stop_ is asserted. Example 10.9 Consider a property that states that a request-acknowledge sequence should be followed by exactly eight data transmissions, not necessarily consecutive. This can be expressed using a SERE (Sugar Extended Regular Expression) as follows: always {req;ack} |=> {start_trans;data[=8];end_trans}
In addition to the number of data transmissions, the property also specifies that start_trans is asserted exactly one cycle after ack is asserted (|=>). The notation [=8] signifies a (not necessarily consecutive) repetition operator, with parameter 8. Another SERE that is equivalent to data[=8] is shown as follows:

{!data[*];data;!data[*]}[*8]
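The claimed equivalence of data[=8] and {!data[*];data;!data[*]}[*8] can be spot-checked by translating both SEREs into ordinary regular expressions over per-cycle samples of data, with '1' meaning data was sampled high (this encoding is my own, not part of PSL):

```python
import random
import re

# Both SEREs as ordinary regexes over cycle samples.
lhs = re.compile(r"(0*1){8}0*")  # data[=8]: eight highs, arbitrarily spaced
rhs = re.compile(r"(0*10*){8}")  # eight copies of "lows, a high, lows"

random.seed(0)
for _ in range(1000):
    w = "".join(random.choice("01") for _ in range(random.randint(0, 20)))
    assert bool(lhs.fullmatch(w)) == bool(rhs.fullmatch(w)), w
print("both SEREs accept exactly the same sampled words")
```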
PSL/Sugar also allows the specification of a sampling clock for each property, or the use of a default clock (as in the previous examples). The abort operator, inspired by ForSpec, can be used to specify reset-related properties. To support modularity and reusability, SEREs can be named and used in other SEREs or properties. (This feature is similar to events in e and OVA.)

5 These assertions are escaped from simulators by defining them inside comments.
© 2006 by Taylor & Francis Group, LLC
Verification Languages
10-11
Example 10.10
Another PCI specification states that the least significant address bit AD[0] should never be 1 during a data transfer. This can be expressed in PSL as follows [17]:

sequence SERE_MEM_ADDR_PHASE = {frame_; !frame_ && mem_cmd};
property PCI_VALID_MEM_BURST_ENCODING =
  always {SERE_MEM_ADDR_PHASE} |-> {!ad[0]} abort !rst_ @(posedge clk);
assert PCI_VALID_MEM_BURST_ENCODING;
In this example, "|->" is a weak suffix implication operator, which denotes that whenever the first sequence holds, we should expect to see the second sequence. Note the definition of the named SERE using the keyword sequence, and its use within a property specification. The property is used as an assertion, to be checked by the verification tool.
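The overlapping reading of the suffix implication can also be sketched at the trace level. In the following Python fragment (an illustrative simplification, with sequences reduced to lists of signal names matched on consecutive cycles, and invented signal names), the consequent is required to start on the very cycle where the antecedent ends:

```python
def matches_at(trace, seq, i):
    """True if the consecutive-cycle sequence 'seq' matches starting at cycle i."""
    if i + len(seq) > len(trace):
        return False
    return all(sig in trace[i + k] for k, sig in enumerate(seq))

def suffix_implication(trace, antecedent, consequent):
    """{antecedent} |-> {consequent}: whenever the antecedent sequence
    matches, the consequent must match starting at the cycle where the
    antecedent ended (overlapping implication)."""
    for i in range(len(trace)):
        if matches_at(trace, antecedent, i):
            end = i + len(antecedent) - 1   # last cycle of the antecedent
            if not matches_at(trace, consequent, end):
                return False
    return True

trace = [{"frame"}, {"mem_cmd", "addr_ok"}, {"addr_ok"}, set()]
assert suffix_implication(trace, ["frame", "mem_cmd"], ["mem_cmd", "addr_ok"])
```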
10.4 Languages for Software Verification

In this section, we focus on verification efforts for software relevant to embedded system applications. (For more general software systems and software analysis techniques, see the overview articles [29,30] for useful pointers.) In comparison to the hardware design industry, which has been very active in standardization efforts for verification languages, there has been little such activity in the area of software systems. Instead, the efforts are targeted at taming the complexity of verification, largely by focusing on specific forms of correctness or specific applications. The correctness requirements for software are typically expressed in the form of Floyd–Hoare style assertions [31] (invariants and pre/postconditions of actions, represented in predicate logic) for functional verification, or temporal logic formulas for behavioral verification.
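As a concrete, much simplified illustration of Floyd–Hoare style annotations, the Python sketch below attaches a precondition and a postcondition to an ordinary function and checks them at runtime. The contract decorator and the integer square root example are invented for illustration; they are not taken from any of the tools cited in this section.

```python
def contract(pre, post):
    """Attach a Floyd-Hoare style pre/postcondition pair to a function.
    'pre' receives the arguments; 'post' receives the arguments and result."""
    def wrap(f):
        def checked(*args):
            assert pre(*args), "precondition violated"
            result = f(*args)
            assert post(*args, result), "postcondition violated"
            return result
        return checked
    return wrap

# Illustrative example: integer square root.
# Precondition: n >= 0.  Postcondition: r*r <= n < (r+1)*(r+1).
@contract(pre=lambda n: n >= 0,
          post=lambda n, r: r * r <= n < (r + 1) * (r + 1))
def isqrt(n):
    r = 0
    while (r + 1) * (r + 1) <= n:   # loop invariant: r*r <= n
        r += 1
    return r

print(isqrt(10))  # -> 3
```

Static tools such as those discussed next aim to discharge annotations of this kind at compile time, instead of checking them during execution.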
10.4.1 Programming Languages

Given the popularity of C/C++ and Java, verifying programs directly written in these languages is very attractive in principle. However, there are many challenging issues — handling of integer and floating-point data variables, handling of pointers (in C/C++), function/procedure calls, and object-oriented features such as classes, dynamic objects, and polymorphism.

10.4.1.1 Verification of C/C++ Programs

The VeriSoft project [32] at Lucent Technologies focused on guided search for deadlocks and violations of assertions expressed directly in C/C++, and has been used for verifying many communication protocols. The SLAM project [33] has been successfully used at Microsoft for proving the correctness of device drivers written in C. Designers use a special language called SLIC [34] to provide correctness requirements. These requirements are translated automatically into special procedures, which essentially implement automata for checking safety properties. These procedures are added to the given program, such that a bug exists if a statement labeled "error" can be reached in the modified program. Verification is performed by obtaining a finite-state abstract model, model checking it, and refining the abstraction if needed. Other similar efforts [35,36] have focused largely on the abstraction and refinement techniques so far. There has been relatively little effort in the development of a separate verification language.

10.4.1.2 Verification of Java Programs

There have been many efforts for verification of Java programs as well. The Extended Static Checker for Java (ESC/Java) [37] performs static checks that go beyond type checking. It automatically generates verification conditions for catching common programming errors (such as null dereferences and array bound errors) and synchronization errors (race conditions, deadlocks). These conditions are transparently checked by a back-end theorem prover.
It also uses a simple annotation language for the user to provide object invariants and requirements, which aid the theorem prover. The annotation language, called JML
(Java Modeling Language), is part of a broader effort [38,39]. JML is used to specify the behavior and syntactic interfaces in Java programs, through pre/postconditions and invariants for classes and methods. Its syntax is quite close to that of Java, thereby making it easy for programmers to learn. JML is used by a host of verification tools, which span the spectrum from dynamic runtime assertion checkers and unit testers to static theorem provers and invariant generators. It has been applied successfully in smartcard applications implemented using a dialect of Java called Java Card [40].

Example 10.11
Consider the following code [37], which shows Java code for a class definition of integer multisets:

class Bag {
  //@ invariant 0

Different directives act on properties to indicate their role — assert (check for violations), cover (observe and track when the property is exercised), and bind (attach an externally specified assertion to the code). Assertions are classified as immediate or concurrent. Immediate assertions are like assert statements in an imperative programming language, such as C/C++. They are executed immediately when encountered in the SystemVerilog code, following its event-based simulation semantics. On the other hand, concurrent assertions usually describe behavior that spans time, and they are executed in a special scheduling phase using sampled values. In both cases, the action taken after evaluating an assertion may include system tasks to control severity, such as "$error," "$fatal," and "$warning" [17]. Immediate assertions can be either declarative or procedural. Declarative assertions require enabling conditions to be explicitly provided by the user, and these conditions are monitored continuously on clock edges. In the case of procedural assertions, the enabling conditions are partially inferred from the code context.
This makes the procedural assertions easier to maintain when the code changes.

Example 10.15
Consider a simple parameterized property declaration, and its use as a concurrent assertion [17]:

property mutex (clk, reset_n, a, b);
  @(posedge clk) disable iff (reset_n) (!(a & b))
endproperty

assert_mutex: assert property (mutex (clk_a, master_reset_n, write_en, read_en));
The disable iff clause in the mutex property indicates that the property is disabled, that is, treated as if it holds, when the reset_n signal is active. The assertion checks that write_en and read_en cannot occur at the same time, provided master_reset_n is not active.
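The effect of such a concurrent assertion can be approximated by a cycle-by-cycle software monitor. The Python sketch below is illustrative only (the sampled-signal dictionaries and the returned messages are invented): it checks the mutual exclusion of the two enables on every sampled cycle and, mirroring the disable iff clause, treats the property as holding whenever reset is active.

```python
def check_mutex(cycles):
    """Each cycle is a dict of sampled signal values.  The property
    !(write_en & read_en) is checked on every cycle, but is disabled
    (treated as if it holds) while the reset signal is active."""
    for n, s in enumerate(cycles):
        if s.get("master_reset_n"):            # disable iff (reset_n)
            continue
        if s.get("write_en") and s.get("read_en"):
            return f"$error: write_en/read_en overlap in cycle {n}"
    return "ok"

cycles = [
    {"master_reset_n": 1, "write_en": 1, "read_en": 1},  # in reset: ignored
    {"master_reset_n": 0, "write_en": 1, "read_en": 0},
    {"master_reset_n": 0, "write_en": 0, "read_en": 1},
]
print(check_mutex(cycles))  # -> ok
```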
10.5.2 Domain-Specific System Languages

In this section, we describe some domain-specific languages for embedded system applications, and highlight the support they provide for verification. Unlike the system-level languages influenced by HDLs, there has been relatively little standardization effort aimed at their verification features.

10.5.2.1 Synchronous Languages

Esterel is a programming language used for the design of synchronous reactive systems [60]. The language provides features for describing the control aspects of parallel compositions of multiple processes, including elaborate clocking and exception-handling mechanisms. Typically, the control part of a reactive application is written in Esterel, which is combined with a functional part written in C. Esterel has a well-defined semantics, which is used by the associated compilers to automatically generate sequential C code for software modules, or Boolean gate-level netlists implementing FSMs for hardware modules. Standard simulation or model checking can be performed on the synthesized FSMs.
A central assumption in Esterel and other synchronous languages is the synchrony hypothesis, which assumes that a process can react infinitely fast to its environment. In practice, it is important to validate this assumption for the target machine. An example is the tool TAXYS [61], which checks this assumption by modeling the Esterel application and its environment as a real-time system. Two kinds of real-time constraints are specified as annotations in Esterel code — throughput constraints (which express the requirement that the system react fast enough for the given environment model), and deadline constraints (which express the maximum delay between a given input and output of the system). These constraints are checked by Kronos [62], a model checker for real-time systems.

Example 10.16
An example of annotated application code [61] in Esterel with deadline constraints is shown below:

1  loop
2    await A;     %{# Y = clock(last A) %}
3    call F( );   %{# Fmin < CPU < Fmax %}
4    %{# 0 < clock(last A) < d1 %}
5  end loop
6  ||
7  loop
8    await B;
9    call G( );   %{# Gmin < CPU < Gmax %}
10   %{# 0 < Y < d2 %}
11 end loop
In this example, it is required that the function F terminate within d1 time units after the arrival of event A, and also that the function G terminate within d2 time units after the arrival of the event A that was consumed by F. These constraints are specified as shown in Lines 4 and 10 (with Line 2), respectively. Note also that the estimated runtimes of functions F and G are provided in Lines 3 and 9, respectively.

10.5.2.2 Languages for Hybrid Systems

Many embedded system applications require interaction between digital and analog components, for example, automotive controllers, avionics systems, robotic systems, and manufacturing plant controllers. Since many such applications are also safety-critical, there has been a great deal of interest in the verification of such systems, called hybrid systems. The verification efforts can be broadly classified into those using classical control-theoretic methods and those using automata-based methods. A popular paradigm for control-theoretic methods is the MATLAB-based toolset, with the Simulink/Stateflow modeling language [63]. It provides standard control engineering components to model the continuous domain, and a Statechart-like language to model the discrete controller. MATLAB-based tools are used for analysis, optimization, and simulation of the continuous behavior specified in terms of differential equations. These tools have been used very successfully for applications dominated by continuous dynamics. However, the complex interaction between the continuous and discrete control components has not received much attention. Furthermore, the simulation semantics has not been related to a formal semantics in any standard way. The automata-based methods typically use discrete abstractions of the hybrid system in a way that preserves the properties of interest, typically expressed in temporal logic [64].
Many model checkers for handling these abstractions have been applied in industrial settings, for example, HyTech [65], UPPAAL [66], Kronos [62], and Charon [67]. The details of the discrete abstractions, and of the resulting automata models, are beyond the scope of this chapter. In most cases, the correctness requirement is to avoid a set of "bad" states, typically specified using the syntax of the modeling language itself. The model checkers perform an exact (or approximate) reachability analysis to (conservatively) check that all reachable states are safe.
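For a purely discrete abstraction, the reachability check just described amounts to an explicit-state graph search. The following Python sketch (with an invented toy transition relation and bad-state set) explores all reachable states and reports whether any of them is bad:

```python
from collections import deque

def all_reachable_safe(init, transitions, bad):
    """Explicit-state reachability: explore every state reachable from
    'init' through 'transitions' and report whether no 'bad' state is hit."""
    seen, frontier = set(init), deque(init)
    while frontier:
        s = frontier.popleft()
        if s in bad:
            return False          # a bad state is reachable
        for t in transitions.get(s, ()):
            if t not in seen:
                seen.add(t)
                frontier.append(t)
    return True

# Toy controller abstraction: 'overflow' is unreachable from 'idle'.
transitions = {"idle": ["fill"], "fill": ["drain", "idle"], "drain": ["idle"],
               "stuck": ["overflow"]}
assert all_reachable_safe(["idle"], transitions, {"overflow"})
assert not all_reachable_safe(["stuck"], transitions, {"overflow"})
```

Real hybrid-system model checkers work on symbolic representations of continuous state sets rather than an enumerated graph, but the safety question they answer has the same shape.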
There has also been work on translating subsets of the popular Simulink/Stateflow-based models to the formal automata-based models, using abstraction and model-checking techniques [68,69]. Recently, there has been some standardization effort in this domain — the Hybrid Systems Interchange Format (HSIF) standard is aimed at defining an interchange format for hybrid system models that can be shared between modeling and analysis tools. So far, the focus has been on representation of the system dynamics, which include both continuous and discrete behaviors. Support for verification-related features will potentially follow its wider adoption.
10.6 Conclusions

We have presented a tutorial on verification languages in industry practice for the verification of embedded systems and SoCs. We have described features that aid the development of testbenches and the specification of correctness properties, to be used by simulation-based as well as formal verification methods. With verification becoming a critical activity in the design cycle of such systems today, these languages are receiving a lot of attention, and several standardization efforts are underway.
Acknowledgments

The authors would like to thank Sharad Malik for valuable comments on the manuscript, and Rajiv Alur and Franjo Ivancic for helpful discussions on verification of hybrid systems.
References

[1] S. Edwards, Languages for Embedded Systems. In The Industrial Information Technology Handbook, R. Zurawski, Ed. CRC Press, Boca Raton, FL, 2004.
[2] A. Bunker, G. Gopalakrishnan, and S. McKee, Formal Hardware Specification Languages for Protocol Compliance Verification. ACM Transactions on Design Automation of Electronic Systems, 9: 1–32, 2004.
[3] Accellera Organization. http://www.accellera.org
[4] IEEE, Design Automation Standards Committee (DASC). http://www.dasc.org
[5] D.E. Thomas and P.R. Moorby, The Verilog Hardware Description Language. Kluwer Academic Publishers, Norwell, MA, 1991.
[6] D.R. Coelho, The VHDL Handbook. Kluwer Academic Publishers, Norwell, MA, 1989.
[7] J. Bergeron, Writing Testbenches: Functional Verification of HDL Models. Kluwer Academic Publishers, Dordrecht, 2003.
[8] E.M. Clarke, J. Wing et al., Formal Methods: State of the Art and Future Directions. ACM Computing Surveys, 28: 626–643, 1997.
[9] E.M. Clarke, O. Grumberg, and D. Peled, Model Checking. MIT Press, Cambridge, MA, 1999.
[10] K.L. McMillan, Symbolic Model Checking: An Approach to the State Explosion Problem. Kluwer Academic Publishers, Dordrecht, 1993.
[11] G.J. Holzmann, The Model Checker SPIN. IEEE Transactions on Software Engineering, 23: 279–295, 1997.
[12] M.J.C. Gordon, R. Milner, and C.P. Wadsworth, Edinburgh LCF: A Mechanized Logic of Computation, Vol. 78. Springer-Verlag, Heidelberg, 1979.
[13] R.S. Boyer and J.S. Moore, A Computational Logic Handbook. Academic Press, New York, 1988.
[14] S. Owre, J.M. Rushby, and N. Shankar, PVS: A Prototype Verification System. In Proceedings of the International Conference on Automated Deduction (CADE), Vol. 607 of Lecture Notes in Computer Science. Springer-Verlag, Saratoga, NY, 1992.
[15] R.E. Bryant, Graph-Based Algorithms for Boolean Function Manipulation. IEEE Transactions on Computers, C-35: 677–691, 1986.
[16] A. Biere, A. Cimatti, E.M. Clarke, and Y. Zhu, Symbolic Model Checking without BDDs. In Proceedings of the Workshop on Tools and Algorithms for Analysis and Construction of Systems (TACAS), Vol. 1579 of Lecture Notes in Computer Science. Springer-Verlag, London, UK, 1999.
[17] H. Foster, A. Krolnik, and D. Lacey, Assertion Based Design. Kluwer Academic Publishers, Dordrecht, 2003.
[18] M.R. Garey and D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman and Co., San Francisco, CA, 1979.
[19] C. Barrett, D. Dill, and J. Levitt, Validity Checking for Combinations of Theories with Equality. In Proceedings of Formal Methods in Computer Aided Design (FMCAD), Vol. 1166 of Lecture Notes in Computer Science. Springer-Verlag, 1996.
[20] R.E. Bryant, S. Lahiri, and S. Seshia, Modeling and Verifying Systems Using a Logic of Counter Arithmetic with Lambda Expressions and Uninterpreted Functions. In Proceedings of the Conference on Computer Aided Verification, Vol. 2404 of Lecture Notes in Computer Science. Springer-Verlag, 2002.
[21] A. Pnueli, The Temporal Logic of Programs. In Proceedings of the 18th IEEE Symposium on Foundations of Computer Science. IEEE Press, 1977, pp. 46–57.
[22] M. Vardi, Branching vs. Linear Time: Final Showdown. In Proceedings of Tools and Algorithms for Analysis and Construction of Systems (TACAS), Vol. 2031 of Lecture Notes in Computer Science. Springer-Verlag, London, UK, 2003.
[23] W. Thomas, Automata on Infinite Objects. In Handbook of Theoretical Computer Science, Vol. B. Elsevier and MIT Press, Cambridge, MA, 1990, pp. 133–191.
[24] Z. Kirshenbaum, Understanding the "e" Verification Languages, 2003. http://www.eetimes.com/story/OEG20030529S0072
[25] OpenVera Language Reference Manual: Assertions, Version 2.3, 2003. http://www.openvera.org
[26] R. Armoni, L. Fix, A. Flaisher, R. Gerth, B. Ginsburg, T. Kanza, and A. Landver, The ForSpec Temporal Logic: A New Temporal Property-Specification Language. In Proceedings of Tools and Algorithms for Analysis and Construction of Systems (TACAS), Vol. 2280 of Lecture Notes in Computer Science. Springer-Verlag, London, UK, 2001.
[27] Synopsys, OVA White Paper, 2003. http://www.openvera.org
[28] I. Beer, S. Ben-David, C. Eisner, D. Fisman, A. Gringauze, and Y. Rodeh, The Temporal Logic Sugar. In Proceedings of the International Conference on Computer Aided Verification, Vol. 2102 of Lecture Notes in Computer Science. Springer-Verlag, 2001.
[29] D. Craigen, S. Gerhart, and T. Ralston, Formal Methods Reality Check: Industrial Usage. IEEE Transactions on Software Engineering, 21: 90–98, 1995.
[30] D. Jackson and M. Rinard, Software Analysis: A Roadmap. In The Future of Software Engineering, A. Finkelstein, Ed. ACM Press, 2000.
[31] C.A.R. Hoare, An Axiomatic Basis for Computer Programming. Communications of the ACM, 12: 576–580, 1969.
[32] P. Godefroid, Model Checking for Programming Languages Using VeriSoft. In Proceedings of the ACM Symposium on Principles of Programming Languages. ACM Press, 1997.
[33] T. Ball and S. Rajamani, The SLAM Toolkit. In Proceedings of the Conference on Computer Aided Verification, Vol. 2102 of Lecture Notes in Computer Science. Springer-Verlag, London, UK, 2001.
[34] T. Ball and S. Rajamani, SLIC: A Specification Language for Interface Checking (of C). Microsoft Research Technical Report MSR-TR-2001-21, 2001.
[35] T.A. Henzinger, R. Jhala, R. Majumdar, and G. Sutre, Software Verification with Blast. In Proceedings of the 10th SPIN Workshop on Model Checking, Vol. 2648 of Lecture Notes in Computer Science. Springer-Verlag, Heidelberg, 2003.
[36] D. Kroening, E.M. Clarke, and K. Yorav, Behavioral Consistency of C and Verilog Programs Using Bounded Model Checking. In Proceedings of the ACM/IEEE Design Automation Conference. ACM Press, 2003.
[37] C. Flanagan, K.R.M. Leino, M. Lillibridge, G. Nelson, J. Saxe, and R. Stata, Extended Static Checking for Java. In Proceedings of the ACM Conference on Programming Language Design and Implementation (PLDI). ACM Press, 2002.
[38] G.T. Leavens, A.L. Baker, and C. Ruby, Preliminary Design of JML: A Behavioral Interface Specification Language for Java. Technical Report 98-06t, Iowa State University, Department of Computer Science, 2002.
[39] G.T. Leavens, K.R.M. Leino, E. Poll, C. Ruby, and B. Jacobs, JML: Notations and Tools Supporting Detailed Design in Java. In Proceedings of the Conference on Object-Oriented Programming, Systems, Languages, and Applications. ACM Press, 2000.
[40] E. Poll, J.v.d. Berg, and B. Jacobs, Specification of the JavaCard API in JML. In Proceedings of the Smart Card Research and Advanced Application Conference, 2000.
[41] J.C. Corbett, M.B. Dwyer, J. Hatcliff, S. Laubach, C.S. Pasareanu, Robby, and H. Zheng, Bandera: Extracting Finite-State Models from Java Source Code. In Proceedings of the International Conference on Software Engineering. IEEE Press, 2000.
[42] K. Havelund and T. Pressburger, Model Checking Java Programs Using Java PathFinder. International Journal on Software Tools for Technology Transfer (STTT), 2: 366–381, 2000.
[43] W. Visser, K. Havelund, G. Brat, and S. Park, Model Checking Programs. In Proceedings of the IEEE International Conference on Automated Software Engineering. IEEE Press, 2000.
[44] J.C. Corbett, M.B. Dwyer, J. Hatcliff, and Robby, A Language Framework for Expressing Checkable Properties of Dynamic Software. In Proceedings of the SPIN Software Model Checking Workshop, 2000.
[45] M.B. Dwyer, G. Avrunin, and J.C. Corbett, Patterns in Property Specifications for Finite-State Verification. In Proceedings of the International Conference on Software Engineering. IEEE Press, 1999.
[46] J. Rumbaugh, I. Jacobson, and G. Booch, The Unified Modeling Language User's Guide. Addison-Wesley, Reading, MA, 1999.
[47] G. Martin, UML for Embedded Systems Specification and Design: Motivation and Overview. In Proceedings of Design Automation and Test in Europe (DATE), 2002.
[48] B. Selic, The Real-Time UML Standard: Definition and Application. In Proceedings of Design Automation and Test in Europe (DATE), 2002.
[49] D. Harel, Statecharts: A Visual Formalism for Complex Systems. Science of Computer Programming, 8: 231–274, 1987.
[50] ITU, Message Sequence Chart: International Telecommunication Union-T Recommendation, 1999.
[51] A. Muscholl and D. Peled, From Finite State Communication Protocols to High-Level Message Sequence Charts. In Proceedings of the International Symposium on Mathematical Foundations of Computer Science, Vol. 2076 of Lecture Notes in Computer Science. Springer-Verlag, 2001.
[52] J. Warmer and A. Kleppe, The Object Constraint Language: Precise Modeling with UML. Addison-Wesley, Reading, MA, 2000.
[53] J. Ellsberger, D. Hogrefe, and A. Sarma, SDL: Formal Object-Oriented Language for Communication Systems. Prentice Hall, New York, 1997.
[54] V. Levin and H. Yenigun, SDLcheck: A Model Checking Tool. In Proceedings of the Conference on Computer Aided Verification, Vol. 2102 of Lecture Notes in Computer Science. Springer-Verlag, 2001.
[55] J.M. Spivey, The Z Notation: A Reference Manual. Prentice-Hall, New York, 1992.
[56] D. Jackson, Alloy: A Lightweight Object Modelling Notation. ACM Transactions on Software Engineering and Methodology (TOSEM), 11: 256–290, 2002.
[57] N. Ip and S. Swan, Using Transaction-Based Verification in SystemC, 2002. http://www.systemc.org/
[58] D.D. Gajski, J. Zhu, R. Doemer, A. Gerstlauer, and S. Zhao, SpecC: Specification Language and Methodology. Kluwer Academic Publishers, Dordrecht, 2000.
[59] Accellera, SystemVerilog 3.1: Accellera's Extensions to Verilog, 2003. http://www.eda.org/sv-ec/SystemVerilog_3.1_final.pdf
[60] G. Berry and G. Gonthier, The Esterel Synchronous Programming Language: Design, Semantics, Implementation. Science of Computer Programming, 19: 87–152, 1992.
[61] E. Closse, M. Poize, J. Pulou, J. Sifakis, P. Venier, D. Weil, and S. Yovine, TAXYS: A Tool for the Development and Verification of Real-Time Embedded Systems. In Proceedings of the Conference on Computer Aided Verification (CAV), 2001.
[62] C. Daws, A. Olivero, S. Tripakis, and S. Yovine, The Tool Kronos. In Hybrid Systems III: Verification and Control, Vol. 1066 of Lecture Notes in Computer Science. Springer-Verlag, New York, 1996.
[63] MATLAB/Simulink. http://www.mathworks.com
[64] R. Alur, T.A. Henzinger, G. Lafferriere, and G. Pappas, Discrete Abstractions of Hybrid Systems. Proceedings of the IEEE, 2000.
[65] T.A. Henzinger, P.-H. Ho, and H. Wong-Toi, HyTech: A Model Checker for Hybrid Systems. International Journal on Software Tools for Technology Transfer, 1: 110–122, 1997.
[66] K. Larsen, P. Pettersson, and W. Yi, UPPAAL in a Nutshell. International Journal on Software Tools for Technology Transfer, 1: 134–152, 1997.
[67] R. Alur, T. Dang, J. Esposito, R. Fierro, Y. Hur, F. Ivancic, V. Kumar, I. Lee, P. Mishra, G. Pappas, and O. Sokolsky, Hierarchical Hybrid Modeling of Embedded Systems. In Proceedings of EMSOFT '01: First Workshop on Embedded Software, Vol. 2211 of Lecture Notes in Computer Science. Springer-Verlag, 2001.
[68] B.I. Silva, K. Richeson, B.H. Krogh, and A. Chutinan, Modeling and Verification of Hybrid Dynamical Systems Using CheckMate. In Proceedings of Automation of Mixed Processes: Hybrid Dynamic Systems (ADPM), 2000.
[69] A. Tiwari, N. Shankar, and J.M. Rushby, Invisible Formal Methods for Embedded Control Systems. Proceedings of the IEEE, 91: 29–39, 2003.
Operating Systems and Quasi-Static Scheduling

11 Real-Time Embedded Operating Systems: Standards and Perspectives
   Ivan Cibrario Bertolotti

12 Real-Time Operating Systems: The Scheduling and Resource Management Aspects
   Giorgio C. Buttazzo

13 Quasi-Static Scheduling of Concurrent Specifications
   Alex Kondratyev, Luciano Lavagno, Claudio Passerone, and Yosinori Watanabe
11 Real-Time Embedded Operating Systems: Standards and Perspectives

Ivan Cibrario Bertolotti
IEIIT-CNR – Istituto di Elettronica e di Ingegneria dell'Informazione e delle Telecomunicazioni

11.1 Introduction ........ 11-1
11.2 Operating System Architecture and Functions ........ 11-3
     Overall System Architecture • Process and Thread Model • Processor Scheduling • Interprocess Synchronization and Communication • Network Support • Additional Functions
11.3 The POSIX Standard ........ 11-9
     Attribute Objects • Multithreading • Process and Thread Scheduling • Real-Time Signals and Asynchronous Events • Interprocess Synchronization and Communication • Thread-Specific Data • Memory Management • Asynchronous and List Directed Input and Output • Clocks and Timers • Cancellation
11.4 Real-Time, Open-Source Operating Systems ........ 11-23
11.5 Virtual-Machine Operating Systems ........ 11-25
     Related Work • Views of Processor State • Operating Principle • Virtualization by Instruction Emulation • Processor-Mode Change • Privileged Instruction Emulation • Exception Handling • Interrupt Handling • Trap Redirection • VMM Processor Scheduler
References ........ 11-33
11.1 Introduction

Informally speaking, a real-time computer system is a system where a computer senses events from the outside world and reacts to them; in such an environment, the timely availability of computation results is as important as their correctness. By contrast, the exact definition of embedded system is somewhat less clear. In general, an embedded system is a special-purpose computer system built into a larger device, and is usually not programmable by the end user.
The major areas of difference between general-purpose and embedded computer systems are cost, performance, and power consumption: embedded systems are often mass produced, so reducing their unit cost is an important design goal; in addition, mobile battery-powered embedded systems, such as cellular phones, have severe power budget constraints to enhance their battery life. Both these constraints have a profound impact on system performance, because they entail simplifying the overall hardware architecture, reducing clock speed, and keeping memory requirements to a minimum. Moreover, embedded computer systems often lack traditional peripheral devices, such as a disk drive, and interface with application-specific hardware instead. Another common requirement for embedded computer systems is some kind of real-time behavior; the strictness of this requirement varies with the application, but it is so common that, for example, many operating system vendors use the two terms interchangeably, and refer to their products either as "embedded operating systems" or as "real-time operating systems for embedded applications." In general, the term "embedded" is preferred when referring to smaller, uniprocessor computer systems, and "real-time" is generally used when referring to larger appliances, but today's rapid increase in the available computing power and hardware features of embedded systems blurs this distinction. Recent examples of real-time systems include many kinds of widespread computer systems, from large appliances such as phone switches to mass-market consumer products such as printers and digital cameras. Therefore, a real-time operating system must not only manage system resources and offer a well-defined set of services to application programs, like any other operating system, but must also provide guarantees about the timeliness of such services and honor them; that is, its behavior must be predictable.
Thus, for example, the maximum time the operating system will take to perform any service it offers must be known in advance. This proves to be a tight constraint, and implies that real-time does not mean "real fast": it often conflicts with other operating system goals, such as good resource utilization and the coexistence of real-time and nonreal-time jobs, and adds further complexity to the operating system's duties. Also, it is highly desirable that a real-time operating system optimize some operating system parameters, mainly context switch time and interrupt latency, which have a profound influence on the overall response time of the system to external events; moreover, in embedded systems, the operating system footprint, that is, its memory requirements, must be kept to a minimum to reduce costs. Last but not least, due to the increasing importance of open system architectures in software design, the operating system services should be made available to the real-time application through a standard application programming interface. This approach promotes code reuse, interoperability, and portability, and reduces software maintenance costs. The chapter is organized as follows: Section 11.2 gives a brief refresher on the main design and architectural issues of operating systems and on how these concepts have been put into practice. Section 11.3 discusses the main set of international standards concerning real-time operating systems and their application programming interface; this section also includes some notes on mechanisms seldom mentioned in operating system theory but of considerable practical relevance, namely real-time signals, asynchronous I/O operations, and timers. Next, Section 11.4 gives a short description of some widespread open-source real-time operating systems, another recent and promising source of novelty in embedded system software design.
In particular, since open-source operating systems have no purchasing cost and are inherently royalty free, their adoption can easily cut down the cost of an application. At the end of the chapter, Section 11.5 presents an overview of the operating principle and goals of a seldom-mentioned class of operating systems, namely operating systems based on virtual machines. Although they are perceived to be very difficult to implement, these operating systems look very promising for embedded applications in which distinct sets of applications, each with its own requirements in terms of real-time behavior and security, are executed on the same physical processor; hence, they are an active area of research.
11.2 Operating System Architecture and Functions

The main goal of this section is to give a brief overview of the architecture of operating systems of interest to real-time application developers, and of the functions they accomplish on behalf of the applications that run on them. See, for example, References 1 and 2 for more general information, and Reference 3 for an in-depth discussion of the internal architecture of the influential Unix operating system.
11.2.1 Overall System Architecture An operating system is a very complex piece of software; accordingly, its internal architecture can be built around several different designs. Some designs that have been tried in practice and are in common use are: Monolithic systems. This is the oldest design, but it is still popular for very small real-time executives intended for deeply embedded applications, and for the real-time portion of more complex systems, due to its simplicity and very low processor and memory overhead. In monolithic systems, the operating system as a whole runs in privileged mode, and the only internal structure is usually induced by the way operating system services are invoked: applications, running in user mode, request operating system services by executing a special trapping instruction, usually known as the system call instruction. This instruction brings the processor into privileged mode and transfers control to the system call dispatcher of the operating system. The system call dispatcher then determines which service must be carried out, and transfers control to the appropriate service procedure. Service procedures share a set of utility procedures, which implement generally useful functions on their behalf. Interrupt handling is done directly in the kernel for the most part, and interrupt handlers are not full-fledged processes. As a consequence, the interrupt handling overhead is very small, because there is no full task switching at interrupt arrival, but the interrupt handling code cannot invoke most system services, notably blocking synchronization primitives. Moreover, the operating system scheduler is disabled while interrupt handling is in progress, and only hardware prioritization of interrupt requests is in effect; hence, the interrupt handling code is implicitly executed at a priority higher than that of all other tasks in the system.
To further reduce processor overhead on small systems, it is also possible to run the application as a whole in supervisor mode. In this case, the application code can be bound with the operating system at link time and system calls become regular function calls. The interface between application code and operating system becomes much faster, because no user-mode state must be saved on system call invocation and no trap handling is needed. On the other hand, the overall control that the operating system can exercise on bad application behavior is greatly reduced, and debugging may become harder. In systems of this kind, it is usually impossible to upgrade individual software components, for example, an application module, without replacing the executable image as a whole and then rebooting the system. This constraint can be of concern in applications where software complexity demands the frequent replacement of modules, and no system downtime is allowed. Layered systems. A refinement and generalization of the monolithic system design consists of organizing the operating system as a hierarchy of layers at system design time. Each layer is built upon the services offered by the one below it and, in turn, offers a well-defined and usually richer set of services to the layer above it. Operating system interface and interrupt handling are implemented as in monolithic systems; hence, the corresponding overheads are very similar. Better structure and modularity make maintenance easier, both because the operating system code is easier to read and understand, and because the inner structure of a layer can be changed at will without interfering with other layers, provided the interlayer interface does not change.
Moreover, the modular structure of the operating system enables the fine-grained configuration of its capabilities, to tailor the operating system itself to its target platform and avoid wasting valuable memory space for operating system functions that are never used by the application. As a consequence, it is possible to enrich the operating system with many capabilities, for example, network support, without sacrificing
its ability to run on very small platforms when these features are not needed. A number of operating systems in use today evolved into this structure, often starting from a monolithic approach, and offer sophisticated build-time or link-time configuration tools. Microkernel systems. This design moves many operating system functions from the kernel up into operating system server processes running in user mode, leaving a minimal microkernel and reducing the amount of privileged operating system code to an absolute minimum. Applications request operating system services by sending a message to the appropriate operating system server and waiting for a reply. The main purpose of the microkernel is to handle the communication between applications and servers, to enforce an appropriate security policy on such communication, and to perform some critical operating system functions, such as accessing I/O device registers, that would be difficult, or inefficient, to do from user-mode processes. This kind of design makes the operating system easier to manage and maintain. Also, the message-passing interface between user processes and operating system components encourages modularity and enforces a clear and well-understood structure on operating system components. Moreover, the reliability of the operating system is increased: since the operating system servers run in user mode, if one of them fails some operating system functions will no longer be available, but the system will not crash; furthermore, the failed component can be restarted and replaced without shutting down the whole system. Last, the design is easily extended to distributed systems, where operating system functions are split across a set of distinct machines connected by a communication network. Systems of this kind are very promising in terms of performance, scalability, and fault tolerance, especially for large and complex real-time applications.
By contrast, making the message-passing communication mechanism efficient can be a critical issue, especially for distributed systems, and the system call invocation mechanism induces more overhead than in monolithic and layered systems. Interrupt requests are handled by transforming them into messages directed to the appropriate interrupt handling task as soon as possible: the interrupt handler proper runs in interrupt service mode and performs the minimum amount of work strictly required by the hardware, then synthesizes a message and sends it to an interrupt service task. In turn, the interrupt service task concludes interrupt handling running in user mode. Being an ordinary task, the interrupt service task can, at least in principle, invoke the full range of operating system services, including blocking synchronization primitives, and need not concern itself with excessive use of the interrupt service processor mode. On the other hand, the overhead related to interrupt handling increases, because the activation of the interrupt service task requires a task switch. Virtual machines. The internal architecture of operating systems based on virtual machines revolves around the basic observation that an operating system must perform two essential functions: multiprogramming and system services. Accordingly, those operating systems fully separate these functions and implement them as two distinct operating system components: a virtual machine monitor that runs in privileged mode, implements multiprogramming, and provides many virtual processors identical in all respects to the real processor it runs on, and one or more guest operating systems that run on the virtual processors and implement system services. Different virtual processors can run different operating systems, and they need not be aware of being run in a virtual machine.
In the oldest approach to virtual machine implementation, guest operating systems are given the illusion of running in privileged mode, but are instead constrained to operate in user mode; in this way, the virtual machine monitor is able to intercept all privileged instructions issued by the guest operating systems, check them against the security policy of the system, and then perform them on behalf of the guest operating system itself. Interrupt handling is implemented in a similar way: the virtual machine monitor catches all interrupt requests and then redirects them to the appropriate guest operating system handler, reverting to user mode in the process; thus, the virtual machine monitor can intercept all privileged instructions issued by the guest interrupt handler, and again check and perform them as appropriate.
The full separation of roles and the presence of a relatively small, centralized arbiter of all interactions between virtual machines has the advantage of making the enforcement of security policies easier. The isolation of virtual machines from each other also enhances reliability because, even if one virtual machine fails, it does not bring down the system as a whole. In addition, it is possible to run a distinct operating system in each virtual machine thus supporting, for example, the orderly coexistence between a real-time and a general-purpose operating system. By contrast, the perfect implementation of virtual machines requires hardware assistance, both to make it feasible and to be able to emulate privileged instructions with a reasonable degree of efficiency. A variant of this design adopts an interpretive approach, and allows the virtual machines to be different from the physical machine. For example, Java programs are compiled into byte-code instructions suitable for execution by an abstract Java virtual machine. On the target platform an interpreter executes the byte code on the physical processor, thus implementing the virtual machine. More sophisticated approaches to virtual machine implementation are also possible, the most common one being on-the-fly code generation, also known as just-in-time compilation.
11.2.2 Process and Thread Model A convenient and easy to understand way to design real-time software applications is to organize them as a set of cooperating sequential processes. A process is an activity, namely the activity of executing a program, and encompasses the program being executed, the data it operates on, and the current processor state, including the program counter and registers. In particular, each process has its own address space. In addition, it is often convenient to support multiple threads of control within the same process, sharing the same address space. Threads can be implemented for the most part in user mode, without the intervention of the operating system's kernel; moreover, when the processor is switched between threads, the address space remains the same and need not be switched. Both these facts make processor switching between threads very fast compared with switching between processes. On the other hand, since all threads within a process share the same address space, there can be only a very limited amount of protection among them with respect to memory access; hence, for example, a thread is allowed to corrupt another thread's data by mistake, and the operating system has no way to detect errors of this kind. As a consequence, many small operating systems for embedded applications only support threads, to keep overheads and hardware requirements to a minimum, while larger operating systems for more complex real-time applications offer the user a choice between a single- or multiple-process model to enhance the reliability of complex systems.
Another important operating system design issue is the choice between static and dynamic creation of processes and threads: some operating systems, usually oriented toward relatively simple embedded applications, only support static tasks, that is, all tasks in the system are known in advance and it is not possible to create and destroy tasks while the system is running; thus, the total number of tasks in the system stays constant for its whole life. Other operating systems allow tasks to be created and destroyed at runtime, by means of a system call. Dynamic task creation has the obvious advantage of making the application more flexible, but it increases the complexity of the operating system, because many operating system data structures, first of all the process table, must be allocated dynamically and their exact size cannot be known in advance. In addition, the application code requires a more sophisticated error-handling strategy, with its associated overheads, because it must be prepared to cope with the inability of the operating system to create a new task due to lack of resources.
11.2.3 Processor Scheduling The scheduler is one of the most important components of a real-time operating system, as it is responsible for deciding to which runnable threads the available processors must be assigned, and for how long. Among dynamic schedulers, that is, schedulers that perform scheduling computations at runtime, while the
application is running, several algorithms are in use and offer different tradeoffs between real-time predictability, implementation complexity, and overhead. Since the optimum compromise often depends on the application’s characteristics, most real-time operating systems support multiple scheduling policies simultaneously and the responsibility of a correct choice falls on the application programmer. The most common scheduling algorithms supported by real-time operating systems and specified by international standards are: First in, first out with priority classes. Under this algorithm, also known as fixed priority scheduling, there is a list of runnable threads for each priority level. When a processor is idle, the scheduler takes the runnable thread at the head of the highest-priority, nonempty thread list and runs it. When the scheduler preempts a running thread, because a higher-priority task has become runnable, the preempted thread becomes the head of the thread list for its priority; when a blocked thread becomes runnable again, it becomes the tail of the thread list for its priority. The first in, first out scheduler never changes thread priorities at runtime; hence, the priority assignment is fully static; a well-known approach to static priority assignment for periodic tasks is the rate monotonic policy, in which task priorities are inversely proportional to their periods. In order to ensure that none of the threads can monopolize the processor, when multiple, runnable threads share the same priority level, the basic algorithm is often enhanced with the additional constraint that, when a running thread has been executing for more than a maximum amount of time, the quantum, that thread is forcibly returned to the tail of its thread list and a new thread is selected for execution; this approach is known as round-robin scheduling. Earliest deadline first. The earliest deadline first scheduler assigns thread priorities dynamically. 
In particular, this scheduler always executes the thread with the nearest deadline. It can be shown that this algorithm is optimal for uniprocessor systems, and supports full processor utilization in all situations. However, its performance under overload can be poor, and dynamically updating thread priorities on the basis of their deadlines may be computationally expensive, especially when this scheduler is layered on a fixed-priority lower-level scheduler. Sporadic server. This scheduling algorithm was first introduced in Reference 4, where a thorough description of the algorithm can be found. The sporadic server algorithm is suitable for aperiodic event handling where, for timeliness, events must be handled at a certain, usually high, priority level, but lower-priority threads with real-time requirements could suffer from excessive preemption if that priority level were maintained indefinitely. It acts on the basis of two main scheduling parameters associated with each thread: the execution capacity and the replenishment period. Informally, the execution capacity of a thread represents the maximum amount of processor time that the thread is allowed to consume at high priority in a replenishment period. The execution capacity of a thread is preserved until an aperiodic request for that task occurs, thus making it runnable; then, thread execution depletes its execution capacity. The sporadic server algorithm replenishes the thread's execution capacity after some or all of its capacity is consumed by thread execution; the schedule for replenishing the execution capacity is based on the thread's replenishment period. Should the thread reach its processor usage upper limit, its execution capacity becomes zero and it is demoted to a lower priority level, thus avoiding excessive preemption against other threads.
When replenishments have restored the execution capacity of the thread above a certain threshold level, the scheduler promotes the thread to its original priority again.
11.2.4 Interprocess Synchronization and Communication An essential function of a multiprogrammed operating system is to allow processes to synchronize and exchange information; as a whole, these functions are known as InterProcess Communication (IPC). Many interprocess synchronization and communication mechanisms have been proposed and have been the object of
extensive theoretical study in the scientific literature. Among them, we recall: Semaphores. A semaphore, first introduced by Dijkstra in 1965, is a synchronization device with an integer value, on which the following two primitive, atomic operations are defined:
• The P operation, often called DOWN or WAIT, checks if the current value of the semaphore is greater than zero. If so, it decrements the value and returns to the caller; otherwise, the invoking process goes into the blocked state until another process performs a V on the same semaphore.
• The V operation, also called UP, POST, or SIGNAL, checks whether there is any process currently blocked on the semaphore. In this case, it wakes exactly one of them up, allowing it to complete its P; otherwise, it increments the value of the semaphore. The V operation never blocks.
Semaphores are a very low-level IPC mechanism; therefore, they have the obvious advantage of being simple to implement, at least on uniprocessor systems, and of having a very low overhead. By contrast, they are difficult to use, especially in complex applications. Also related to mutual exclusion with semaphores is the problem of priority inversion. Priority inversion occurs when a high-priority process is forced to wait for a lower-priority process to exit a critical region, a situation at odds with the concept of relative task priorities. Most real-time operating systems take this kind of blocking into account and implement several protocols to bound it; among these, we recall the priority inheritance and priority ceiling protocols. Monitors. To overcome the difficulties of semaphores, in 1974/1975, Hoare and Brinch Hansen introduced a higher-level synchronization mechanism, the monitor.
A monitor is a set of shared data structures together with the procedures that operate on them; the data structures are shared among all processes that can use the monitor, but the monitor effectively hides them; hence, the only way to access the data associated with the monitor is through the procedures in the monitor itself. In addition, all procedures in the same monitor are implicitly executed in mutual exclusion. Unlike semaphores, the responsibility of ensuring mutual exclusion falls on the compiler, not on the programmer, because monitors are a programming language construct, and the language compiler knows about them. Inside a monitor, condition variables can be used to wait for events. Two atomic operations are defined on a condition variable:
• The WAIT operation releases the monitor and blocks the invoking process until another process performs a SIGNAL on the same condition variable.
• The SIGNAL operation unblocks exactly one process waiting on the condition variable. Then, to ensure that mutual exclusion is preserved, the invoking process is either forced to leave the monitor immediately (Brinch Hansen's approach), or is blocked until the monitor becomes free again (Hoare's approach).
Message passing. Unlike all other IPC mechanisms described so far, message passing supports explicit data transfer between processes; hence, it does not require shared memory and lends itself well to being extended to distributed systems. This IPC method provides for two primitives:
• The SEND primitive sends a message to a given destination.
• Symmetrically, RECEIVE receives a message from a given source.
Many variants are possible on the exact semantics of these primitives; they mainly differ in the way messages are addressed and buffered. A commonplace addressing scheme is to give each process/thread in the system a unique address, and to send messages directly to processes.
Otherwise, a message can be addressed to a mailbox, a message container whose maximum capacity is usually specified upon creation; in this case, message source and destination addresses are mailbox addresses. When mailboxes are used, they also provide some amount of message buffering, that is, they hold messages that have been sent but have not been received yet. Moreover, a single task can own multiple mailboxes and use them to classify messages depending on their source or priority. A somewhat contrary
approach to message buffering, simpler to implement but less flexible, is the rendezvous strategy: the system performs no buffering. Hence, when using this scheme the sender and the receiver are forced to run in lockstep because the SEND does not complete until another process executes a matching RECEIVE and, conversely, the RECEIVE waits until a matching SEND is executed.
11.2.5 Network Support There are two basic approaches to implementing network support in a real-time operating system and offering it to applications:
• The POSIX standard (Portable Operating System Interface for Computing Environments) [5] specifies the socket paradigm for uniform access to any kind of network support, and many real-time operating systems provide it. Sockets, fully described in Reference 3, were first introduced in the "Berkeley Unix" operating system and are now available on virtually all general-purpose operating systems; as a consequence, most programmers are likely to be proficient with them. The main advantage of sockets is that they support in a uniform way any kind of communication network, protocol, naming convention, hardware, and so on. Semantics of communication and naming are captured by communication domains and socket types, both specified upon socket creation. For example, communication domains are used to distinguish between IPv4 and X.25 network environments, whereas the socket type determines whether communication will be stream-based or datagram-based and also implicitly selects which network protocol a socket will use. Additional socket characteristics can be set up after creation through abstract socket options; for example, socket options provide a uniform, implementation-independent way to set the amount of receive buffer space associated with a socket.
• Some operating systems, mostly focused on a specific class of embedded applications, offer network support through a less general, but richer and more efficient, application programming interface. For example, Reference 6 is an operating system specification oriented to automotive applications; it specifies a communication environment (OSEK/VDX COM) less general than sockets and oriented to real-time message-passing networks, such as the Controller Area Network (CAN).
In this case, for example, the application programming interface allows applications to easily set message filters and perform out-of-order receives, thus enhancing their timing behavior; both these functions are supported with difficulty by sockets, because they do not fit well with the general socket paradigm. In both cases, network device drivers are usually supplied by third-party hardware vendors and conform to a well-defined interface specified by the operating system vendor. The network software itself, although it is often bundled with the operating system and provided by the same vendor, can be obtained from third-party software houses, too. Often, these products are designed to run on a wide variety of hardware platforms and operating systems, and come with source code; hence, it is also possible to port them to custom operating systems developed in-house, and they can be extended and enhanced by the end user.
11.2.6 Additional Functions Even if real-time operating systems sometimes do not implement several major functions that are now commonplace in general-purpose operating systems, such as demand paging, swapping, and filesystem access, they must be concerned with other, less well-known functions that ensure or enhance system predictability, for example:
• Asynchronous, real-time signals and cancellation requests, to deal with unexpected events, such as software and hardware failures, and to gracefully degrade the system's performance should a processor overload occur.
• High-resolution clocks and timers, to give real-time processes an accurate notion of elapsed time. • Asynchronous I/O operations, to decouple real-time processes from the inherent unpredictability of many I/O devices.
11.3 The POSIX Standard The original version of the Portable Operating System Interface for Computing Environments, better known as "the POSIX standard," was first published between 1988 and 1990, and defines a standard way for applications to interface with the operating system. The set now includes over 30 individual standards, and covers a wide range of topics, from the definition of basic operating system services, such as process management, to specifications for testing the conformance of an operating system to the standard itself. Among these, of particular interest is the System Interfaces (XSH) Volume of IEEE Std 1003.1-2001 [5], which defines a standard operating system interface and environment, including real-time extensions. The standard contains definitions for system service functions and subroutines, language-specific system services for the C programming language, and notes on portability, error handling, and error recovery. This standard has been constantly evolving since it was first published in 1988; the latest developments have been crafted by a joint working group of members of the IEEE Portable Applications Standards Committee, members of The Open Group, and members of ISO/IEC Joint Technical Committee 1. The joint working group is known as the Austin Group, after the location of the inaugural meeting held at the IBM facility in Austin, Texas in September 1998. The Austin Group formally began its work in July 1999, after the signing of a formal agreement between The Open Group and the IEEE, with the main goal of revising, updating, and combining into a single document the following standards: ISO/IEC 9945-1, ISO/IEC 9945-2, IEEE Std 1003.1, IEEE Std 1003.2, and the Base Specifications of The Open Group Single UNIX Specification Version 2. For real-time systems, the latest version of IEEE Std 1003.1 [5] incorporates the real-time extension standards listed in Table 11.1.
Since embedded systems can have strong resource limitations, the IEEE Std 1003.13-1998 [7] profile standard groups functions from the standards mentioned above into units of functionality. Implementations can then choose the profile most suited to their needs and to the computing resources of their target platforms. Operating systems invariably experience a delay between the adoption of a standard and its implementation; hence, functions defined earlier are usually supported across a wider range of operating systems. For this reason, in this section we concentrate only on the functions that are both related to real-time software development and actually available on most real-time operating systems at the time of writing, including multithreading support. In addition, we assume that the set of functions common to both the POSIX and ISO C [8] standards is well known to readers; hence, we will not describe it. Table 11.2 summarizes the functional groups of IEEE Std 1003.1-2001 that will be discussed next.
11.3.1 Attribute Objects Attribute objects are a mechanism devised to support future standardization and portable extension of some entities specified by the POSIX standard, such as threads, mutual exclusion devices, and condition variables, without requiring that the functions operating on them be changed.

TABLE 11.1 Real-Time Extensions Incorporated into IEEE Std 1003.1-2001

Standard   Description
1003.1b    Basic real-time extensions; first published in 1993
1003.1c    Threads extensions; published in 1995
1003.1d    Additional real-time extensions; published in 1999
1003.1j    Advanced real-time extensions; published in 2000
TABLE 11.2 Basic Functional Groups of IEEE Std 1003.1-2001 (functional group: main functions)

Multiple threads: pthread_create, pthread_exit, pthread_join, pthread_detach, pthread_equal, pthread_self

Process and thread scheduling: sched_setscheduler, sched_getscheduler, sched_setparam, sched_getparam, pthread_setschedparam, pthread_getschedparam, pthread_setschedprio, pthread_attr_setschedpolicy, pthread_attr_getschedpolicy, pthread_attr_setschedparam, pthread_attr_getschedparam, sched_yield, sched_get_priority_max, sched_get_priority_min, sched_rr_get_interval

Real-time signals: sigqueue, pthread_kill, sigaction, sigaltstack, sigemptyset, sigfillset, sigaddset, sigdelset, sigismember, sigwait, sigwaitinfo, sigtimedwait

Interprocess synchronization and communication: mq_open, mq_close, mq_unlink, mq_send, mq_receive, mq_timedsend, mq_timedreceive, mq_notify, mq_getattr, mq_setattr, sem_init, sem_destroy, sem_open, sem_close, sem_unlink, sem_wait, sem_trywait, sem_timedwait, sem_post, sem_getvalue, pthread_mutex_destroy, pthread_mutex_init, pthread_mutex_lock, pthread_mutex_trylock, pthread_mutex_timedlock, pthread_mutex_unlock, pthread_cond_init, pthread_cond_destroy, pthread_cond_wait, pthread_cond_timedwait, pthread_cond_signal, pthread_cond_broadcast, shm_open, close, shm_unlink, mmap, munmap

Thread-specific data: pthread_key_create, pthread_getspecific, pthread_setspecific, pthread_key_delete

Memory management: mlock, mlockall, munlock, munlockall, mprotect

Asynchronous and list-directed I/O: aio_read, aio_write, lio_listio, aio_error, aio_return, aio_fsync, aio_suspend, aio_cancel

Clocks and timers: clock_gettime, clock_settime, clock_getres, timer_create, timer_delete, timer_getoverrun, timer_gettime, timer_settime

Cancellation: pthread_cancel, pthread_setcancelstate, pthread_setcanceltype, pthread_testcancel, pthread_cleanup_push, pthread_cleanup_pop
In addition, they provide for a clean isolation of the configurable aspects of said entities. For example, the stack address (i.e., the location in memory of the storage to be used for the thread's stack) is an important attribute of a thread, but it cannot be expressed portably and must be adjusted when the program is ported to a different architecture. The use of attribute objects allows the programmer to specify a thread's attributes in a single place, rather than spreading them across every instance of thread creation; moreover, the same set of attributes can be shared among multiple objects of the same kind, for example, to set up classes of threads with similar attributes. Figure 11.1 shows how attribute objects are created, manipulated, used to configure objects when creating them, and finally destroyed. As an example, the function and attribute names given in the figure are those used for threads, but the same general architecture is also used for the attributes of mutual exclusion devices (Section 11.3.5.3) and condition variables (Section 11.3.5.4). In order to be used, an attribute object must first be initialized by means of the pthread_attr_init function; this function also fills the attribute object with a default value for all attributes defined by the
© 2006 by Taylor & Francis Group, LLC
Real-Time Embedded Operating Systems: Standards and Perspectives
FIGURE 11.1 Attribute objects in POSIX. [Diagram: an attribute object holds named attributes (e.g., stackaddr = 0x24000); it is initialized with pthread_attr_init(), its individual attributes are read and written with pthread_attr_getstackaddr()/pthread_attr_setstackaddr(), it is used by pthread_create() to create one or more threads, and it is finally destroyed with pthread_attr_destroy().]
implementation. When it is no longer needed, an attribute object should be destroyed by invoking the pthread_attr_destroy function on it. An attribute object holds one or more named attributes; for each attribute, the standard specifies a pair of functions that the user can call to get and set the value of that attribute. For example, the pthread_attr_getstackaddr and pthread_attr_setstackaddr functions get and set the stack address attribute, respectively. After setting the individual attributes of an attribute object to the desired value, the attribute object can be used to configure one or more entities specified by the standard. Hence, for example, the thread attribute object being described can be used as an argument to the pthread_create function which, in turn, creates a thread with the given set of attributes. Last, it should be noted that the attribute objects are defined as opaque types, so they shall be accessed only by the functions just presented, and not by manipulating their representation directly, even if it is known, because doing this is not guaranteed to be portable across different implementations of the standard.
11.3.2 Multithreading

The multithreading capability specified by the POSIX standard includes functions to populate a process with new threads. In particular, the pthread_create function creates a new thread within a process and sets up a thread identifier for it, to be used to operate on the thread in the future. After creation, the thread immediately starts executing a function passed to pthread_create as an argument; it is also possible to pass an argument to that function in order, for example, to share the same function among multiple threads while still being able to distinguish them. The pthread_create function also takes an optional reference to an attribute object as argument. The attributes of a thread determine, for example, the size and location of the thread's stack and its scheduling parameters; the latter are described further in Section 11.3.3. A thread may terminate in three different ways:
• By returning from its main function
• By explicitly calling the pthread_exit function
• By accepting a cancellation request (Section 11.3.10)
In any case, the pthread_join function allows the calling thread to synchronize with, that is, wait for, the termination of another thread. When the thread finally terminates, this function also returns summary information to the caller about the reason for the termination. For example, if the target thread terminated itself by means of the pthread_exit function, pthread_join returns the status code that was passed to pthread_exit.
Embedded Systems Handbook
If this information is not needed, it is possible to save system resources by detaching a thread, either dynamically by means of the pthread_detach function, or statically by means of a thread's attribute. In this way, the storage associated with that thread can be reclaimed immediately when the thread terminates. Additional utility functions are provided to operate on thread identifiers; for example, the pthread_equal function checks whether two thread identifiers are equal (i.e., whether they refer to the same thread), and the pthread_self function returns the identifier of the calling thread.
11.3.3 Process and Thread Scheduling

Functions in this group allow the application to select a specific policy that the operating system must follow to schedule a particular process or thread, and to get and set the scheduling parameters associated with that process or thread. In particular, the sched_setscheduler function sets both the scheduling policy and the parameters associated with a process, and sched_getscheduler reads them back for examination. The sched_setparam and sched_getparam functions are somewhat more limited, because they set and get the scheduling parameters but not the policy. All these functions take a process identifier as argument, to uniquely identify a process. For threads, the pthread_setschedparam and pthread_getschedparam functions set and get the scheduling policy and parameters associated with a thread; pthread_setschedprio directly sets the scheduling priority of the given thread. All these functions take a thread identifier as argument and access the thread scheduling parameters dynamically; in other words, they can be used when the thread already exists in the system. On the other hand, the functions pthread_attr_setschedpolicy, pthread_attr_getschedpolicy, pthread_attr_setschedparam, and pthread_attr_getschedparam store and retrieve the scheduling policy and parameters of a thread into and from an attribute object, respectively; in turn, the attribute object can subsequently be used to create one or more threads. The general mechanism of attribute objects in the POSIX standard has been discussed in Section 11.3.1. An interesting and useful side effect of pthread_setschedprio is that, when the effect of the function is to lower the priority of the target thread, the thread is inserted at the head of the thread list of the new priority, instead of the tail.
Hence, this function provides a way for an application to temporarily raise its priority and then lower it again to its original value, without the undesired side effect of yielding to other threads of the same priority. This is necessary, for example, if the application is to implement its own strategies for bounding priority inversion. Last, the sched_yield function allows the invoking thread to voluntarily relinquish the CPU in favor of other threads of the same priority; the invoking thread is linked to the end of the thread list for its priority. In order to support the orderly coexistence of multiple scheduling policies, the conceptual scheduling model defined by the standard and depicted in Figure 11.2 assigns a global priority to all threads in the system and contains one ordered thread list for each priority; any runnable thread will be on the thread list for that thread's priority. When appropriate, the scheduler selects the thread at the head of the highest-priority, nonempty thread list to become a running thread, regardless of its associated policy; this thread is then removed from its thread list. When a running thread yields the CPU, either voluntarily or by preemption, it is returned to the thread list it belongs to. The purpose of a scheduling policy is then to determine how the operating system scheduler manages the thread lists, that is, how threads are moved between and within lists when they gain or lose access to the CPU. Associated with each scheduling policy is a priority range, which must span at least 32 distinct priority levels; all threads scheduled according to that policy must lie within that priority range, and priority ranges belonging to different policies can overlap in whole or in part. The sched_get_priority_min and
FIGURE 11.2 Processor scheduling in POSIX. [Diagram: multiple scheduling policies, each with its own local priority range, are mapped onto a single global priority range; the mapping is chosen at system configuration time. One ordered thread list exists for each global priority in the system, and each scheduling policy controls the placement of threads within its priority range. The scheduler selects the highest-priority thread; threads are returned to their thread list on yield or preemption.]
sched_get_priority_max functions return the allowable priority range for a given scheduling policy. The mapping between the multiple local priority ranges, one for each scheduling policy active in the system, and the single global priority range is usually performed by a simple relocation and is either fixed or programmable at system configuration time, depending on the operating system. In addition, operating systems may reserve some global priority levels, usually the higher ones, for interrupt handling. The standard defines the following three scheduling policies, whose algorithms have been briefly presented in Section 11.2.3:
• First in, first out (SCHED_FIFO)
• Round robin (SCHED_RR)
• Optionally, a variant of the sporadic server scheduler (SCHED_SPORADIC)
Most operating systems set the execution time limit of the round-robin scheduling policy statically, at system configuration time; the sched_rr_get_interval function returns the execution time limit set for a given process, and the standard provides no portable way to set it dynamically. A fourth scheduling policy, SCHED_OTHER, can be selected to denote that a thread no longer needs a specific real-time scheduling policy: general-purpose operating systems with real-time extensions usually revert to the default, non-real-time scheduler when this policy is selected. Moreover, each implementation is free to redefine the exact meaning of the SCHED_OTHER policy and can provide additional scheduling policies besides those required by the standard, but any application using them will no longer be fully portable.
11.3.4 Real-Time Signals and Asynchronous Events

Signals are a facility specified by the ISO C standard and widely available on most operating systems; they provide a mechanism to convey information to a process or thread when it is not necessarily waiting for input. The IEEE Std 1003.1-2001 further extends the signal mechanism to make it suitable for real-time handling of exceptional conditions and events that may occur asynchronously with respect to the notified process, for example:
• An error occurring during the execution of the process, for example, a memory reference through an invalid pointer.
• Various system and hardware failures, such as a power failure.
• Explicit generation of an event by another process; as a special case, a process can also trigger a signal directed to itself.
• Completion of an I/O operation started by the process in the past, and for which the process did not perform an explicit wait.
• Availability of data to be read from a message queue.
The signal mechanism has a significant historical heritage; in particular, it was first designed when multithreading was not yet in widespread use, and its interface and semantics have undergone many changes since their inception. Therefore, it owes most of its complexity to the need to maintain compatibility with the historical implementations of the mechanism made, for example, by the various flavors of the influential Unix operating system; however, in this section the compatibility interfaces will not be discussed, for the sake of clarity and conciseness. With respect to the ISO C signal behavior, the IEEE Std 1003.1-2001 specifies two main enhancements of interest to real-time programmers:
1.
In the ISO C standard, the various kinds of signals are identified by an integer number (often denoted by a symbolic constant in application code) and, when multiple signals of different kinds are pending, they are serviced in an unspecified order; the IEEE Std 1003.1-2001 continues to use signal numbers, but specifies that for a subset of their allowable range, between SIGRTMIN and SIGRTMAX, a priority hierarchy among signals is in effect, so that the lowest-numbered signal has the highest priority of service.
2. In the ISO C standard, there is no provision for signal queues; hence, when multiple signals of the same kind are raised before the target process has had a chance to handle them, all signals but the first are lost. The IEEE Std 1003.1-2001 specifies that the system must be able to keep track of multiple signals with the same number by enqueuing and servicing them in order. Moreover, it also adds the capability of conveying a limited amount of information (a union sigval, capable of holding either an integer or a pointer) with each signal, so that multiple signals with the same signal number can be distinguished from each other. The queueing policy is always FIFO and cannot be changed by the user.
As outlined above, each signal has a signal number associated with it, to identify its kind; for example, the signal associated with memory access violations has the number SIGSEGV. Figure 11.3 depicts the life of a signal from its generation up to its delivery. Depending on their kind and source, signals may be directed either to a specific thread in the process, or to the process as a whole; in the latter case, every thread belonging to the process is a candidate for the delivery of the signal, by the rules described later.
It should also be noted that for some kinds of events, the POSIX standard specifies that the notification can also be carried out by the execution of a handling function in a separate thread, if the application so chooses; this mechanism is simpler and clearer than the signal-based notification, but requires multithreading support on the system side.
FIGURE 11.3 Signal generation and delivery in POSIX. [Diagram: (1) a signal is generated by sigqueue(), pthread_kill(), or event notification, directed either to a specific thread or to the process as a whole; the process-level action set with sigaction() may ignore the signal completely. (2) For process-directed signals, a single target thread is chosen, honoring the per-thread signal masks (pthread_sigmask()) and explicit waits; the signal stays pending if there is no thread suitable for immediate delivery. (3) The action associated with the signal is executed: return of sigwait(), the default action (with process-level side effects), or a signal handler.]
11.3.4.1 Generation of a Signal

As outlined above, most signals are generated by the system rather than by an explicit action performed by a process. For these, the POSIX standard specifies that the decision of whether the signal must be directed to the process as a whole or to a specific thread within the process must be carried out at the time of generation, and must represent the source of the signal as closely as possible. In particular, if a signal is attributable to an action carried out by a specific thread, for example, a memory access violation, the signal shall be directed to that thread and not to the process. If such an attribution is either not possible or not meaningful, as is the case, for example, of the power failure signal, the signal shall be directed to the process. Besides various error conditions, an important source of signals generated by the system relates to asynchronous event notification; these signals are always directed to the process. As an example, Section 11.3.8 will describe the mechanism behind the notification of completion for asynchronous I/O operations. On the other hand, processes have the ability to synthesize signals by means of two main interfaces, depending on the target of the signal:
• The sigqueue function, given a process identifier and a signal number, generates a signal directed to that process; an additional argument, a union sigval, allows the caller to associate a limited amount of information with the signal, provided that the SA_SIGINFO flag is set for that signal number. Additional interfaces exist to generate a signal directed to a group of processes, for example, the killpg function; however, they have not been extended for real time, and hence do not have the ability to associate any additional information with the signal.
• The pthread_kill function generates a signal directed to a specific thread within the calling process, identified by its thread identifier.
It is not possible to generate a signal directed to a specific thread of another process.
11.3.4.2 Process-Level Action

For each kind of signal defined in the system, that is, for each valid signal number, processes may set up an action by means of the sigaction function; the action may be one of the following:
• Ignoring the signal completely
• A default action, performed by the operating system on behalf of the process, possibly with process-level side effects, such as the termination of the process itself
• The execution of a signal handling function specified by the programmer
In addition, the same function allows the caller to set zero or more flags associated with the signal number. Of the rather large set of flags specified by the POSIX standard, the following ones are of particular interest to real-time programmers:
• The SA_SIGINFO flag, when set, enables the association of a limited amount of information with each signal; this information will then be conveyed to the signaled process or thread. In addition, if the action associated with the signal is the execution of a user-specified signal handler, setting this flag extends the arguments passed to the signal handler to include additional information about the reason why the signal was generated and about the receiving thread's context that was interrupted when the signal was delivered.
• The SA_RESTART flag, when set, enables the automatic, transparent restart of interruptible system calls when a system call is interrupted by the signal. If this flag is clear, system calls that were interrupted by a signal fail with an error indication and must be explicitly restarted by the application, if appropriate.
• The SA_ONSTACK flag, when set, commands the switch of the process or thread to which the signal is delivered to an alternate stack for the execution of the signal handler; the sigaltstack function can be used to set up the alternate stack. If this flag is not set, the signal handler executes on the regular stack of the process or thread.
It should be noted that the setting of the action associated with each kind of signal takes place at the process level, that is, all threads within a process share the same set of actions; hence, for example, it is impossible to set two different signal handling functions (for two different threads) to be executed in response to the same kind of signal. Immediately after generation, the system checks the process-level action associated with the signal in the target process, and immediately discards the signal if that action is set to ignore it; otherwise, it proceeds to check whether the signal can be acted on immediately.

11.3.4.3 Signal Delivery and Acceptance

Provided that the action associated with the signal at the process level does not specify to ignore the signal in the first place, a signal can be either delivered to or accepted by a thread within the process. Unlike the action associated with each kind of signal discussed above, each thread has its own signal mask; by means of the signal mask, each thread can selectively block some kinds of signals from being delivered to it, depending on their signal number. The pthread_sigmask function allows the calling thread to examine or change (or both) its signal mask. A signal mask can be set up and manipulated by means of the following functions:
sigemptyset: initializes a signal mask so that all signals are excluded from the mask.
sigfillset: initializes a signal mask so that all signals are included in the mask.
sigaddset: given a signal mask and a signal number, adds the specified signal to the mask; it has no effect if the signal was already in the mask.
sigdelset: given a signal mask and a signal number, removes the specified signal from the mask; it has no effect if the signal was not in the mask.
sigismember: given a signal mask and a signal number, checks whether the signal belongs to the mask or not.
A signal can be delivered to a thread if and only if that thread does not block the signal; when a signal is successfully delivered to a thread, that thread executes the process-level action associated with the signal. On the other hand, a thread may perform an explicit wait for one or more kinds of signals, by means of the sigwait function; that function stops the execution of the calling thread until one of the signals passed as argument to sigwait is conveyed to the thread. When this occurs, the thread accepts the signal and continues past the sigwait function. Since the standard specifies that signals in the range from SIGRTMIN to SIGRTMAX are subject to a priority hierarchy, when multiple signals in this range are pending, sigwait shall consume the lowest-numbered one. It should also be noted that, for this mechanism to work correctly, the thread must block the signals that it wishes to accept by means of sigwait (through its signal mask); otherwise, signal delivery takes precedence. Two more powerful variants of the sigwait function exist: sigwaitinfo has an additional argument used to return further information about the signal just accepted, including the information associated with the signal when it was first generated; furthermore, sigtimedwait also allows the caller to specify the maximum amount of time that shall be spent waiting for a signal to arrive. The way in which the system selects a thread within a process to convey a signal to depends on where the signal is directed:
• If the signal is directed toward a specific thread, only that thread is a candidate for delivery or acceptance.
• If the signal is directed to a process as a whole, any thread belonging to that process is a candidate to receive the signal; in this case, the system selects exactly one thread within the process with the appropriate signal mask (for delivery), or performing a suitable sigwait (for acceptance).
If there is no suitable thread to convey the signal when it is first generated, the signal remains pending until its delivery or acceptance becomes possible, following the same rules outlined above, or until the process-level action associated with that kind of signal is changed and set to ignore it. In the latter case, the system discards the pending signal, along with all other pending signals of the same kind.
11.3.5 Interprocess Synchronization and Communication

The main interprocess synchronization and communication mechanisms offered by the standard are the semaphore and the message queue, both described in Section 11.2.4. The blocking synchronization primitives have nonblocking and timed counterparts, to enhance their real-time predictability. Moreover, multithreading support also adds mutual exclusion devices, condition variables, and other synchronization mechanisms. The scope of these mechanisms can be limited to threads belonging to the same process, to enhance their performance.

11.3.5.1 Message Queues

The mq_open function either creates or opens a message queue and connects it with the calling process; in the system, each message queue is uniquely identified by a name, like a file. This function returns a message queue descriptor that refers to and uniquely identifies the message queue; the descriptor must be passed to all other functions that operate on the message queue. Conversely, mq_close removes the association between the message queue descriptor and its message queue; as a result, the message queue descriptor is no longer valid after successful return from this function. Last, the mq_unlink function removes a message queue, provided no other processes reference it; if this is not the case, the removal is postponed until the reference count drops to zero. The number of elements that a message queue is able to buffer, and their maximum size, are constant for the lifetime of the message queue, and are set when the message queue is first created. The mq_send and mq_receive functions send and receive a message to and from a message queue, respectively. If the message cannot be immediately stored or retrieved (e.g., when mq_send is executed on a full message queue), these functions block as long as appropriate, unless the message queue was opened
with the nonblocking option set; in this case, these functions return immediately if they are unable to perform their job. The mq_timedsend and mq_timedreceive functions have the same behavior, but allow the caller to place an upper bound on the amount of time they may spend waiting. The standard allows a priority to be associated with each message, and specifies that the queueing policy of message queues must obey this priority so that, for example, mq_receive retrieves the highest-priority message currently stored in the queue. The mq_notify function allows the caller to arrange for the asynchronous notification of message arrival at an empty message queue, when the status of the queue transitions from empty to nonempty, according to the mechanism described in Section 11.3.4. The same function also allows the caller to remove a notification request it made previously. At any time, only a single process may be registered for notification by a message queue. The registration is removed implicitly when a notification is sent to the registered process, or when the process owning the registration explicitly removes it; in both cases, the message queue becomes available for a new registration. If both a notification request and an mq_receive call are pending on a given message queue, the latter takes precedence, that is, when a message arrives at the queue, it satisfies the mq_receive and no notification is sent. Last, the mq_getattr and mq_setattr functions allow the caller to get and set, respectively, some attributes of the message queue dynamically after creation; these attributes include, for example, the nonblocking flag just described, and may also include additional, implementation-specific flags.

11.3.5.2 Semaphores

Semaphores come in two flavors: unnamed and named. Unnamed semaphores are created by the sem_init function and must be shared among processes by means of the usual memory sharing mechanisms provided by the system.
On the other hand, named semaphores, created and accessed by the sem_open function, exist as named objects in the system, like the message queues described above, and can therefore be accessed by name. Both functions, when successful, associate the calling process with the semaphore and return a descriptor for it. Depending on the kind of semaphore, either the sem_destroy function (for unnamed semaphores) or the sem_close function (for named semaphores) must be used to remove the association between the calling process and a semaphore. For unnamed semaphores, the sem_destroy function also destroys the semaphore; instead, named semaphores must be removed from the system with a separate function, sem_unlink. For both kinds of semaphore, a set of functions implements the classic P and V primitives, namely:
• The sem_wait function performs a P operation on the semaphore; the sem_trywait and sem_timedwait functions perform the same function in polling mode and with a user-specified timeout, respectively.
• The sem_post function performs a V operation on the semaphore.
• The sem_getvalue function, which has no counterpart in the definition of semaphore found in the literature, returns the current value of a semaphore.

11.3.5.3 Mutexes

A mutex is a very specialized binary semaphore that can only be used to ensure mutual exclusion among multiple threads; it is therefore simpler and more efficient than a full-fledged semaphore. Optionally, it is possible to associate with each mutex a protocol to deal with priority inversion. The pthread_mutex_init function initializes a mutex and prepares it for use; it takes an attribute object as argument, working according to the general mechanism described in Section 11.3.1, which is useful to specify the attributes of the mutex, for example, the priority inversion protocol to be used for it.
When default mutex attributes are appropriate, a static initialization technique is also available; in particular, the macro PTHREAD_MUTEX_INITIALIZER can be used to initialize a mutex that the application has statically allocated.
In any case, the pthread_mutex_destroy function destroys a mutex. The following main functions operate on the mutex after creation:
• The pthread_mutex_lock function locks the mutex if it is free; otherwise, it blocks until the mutex becomes available and then locks it. The pthread_mutex_trylock function does the same, but returns to the caller without blocking if the lock cannot be acquired immediately; the pthread_mutex_timedlock function allows the caller to specify a maximum amount of time to be spent waiting for the lock to become available.
• The pthread_mutex_unlock function unlocks a mutex.
Additional functions are defined for particular flavors of mutexes; for example, the pthread_mutex_getprioceiling and pthread_mutex_setprioceiling functions allow the caller to get and set, respectively, the priority ceiling of a mutex, and make sense only if the priority ceiling protocol has been selected for the mutex by means of a suitable setting of its attributes.

11.3.5.4 Condition Variables

A set of condition variables, in concert with a mutex, can be used to implement a synchronization mechanism similar to the monitor, without requiring the notion of monitor to be known at the programming language level. A condition variable must be initialized before use by means of the pthread_cond_init function; this function takes an attribute object as argument, which can be used to configure the condition variable to be created, according to the general mechanism described in Section 11.3.1. When default attributes are appropriate, the macro PTHREAD_COND_INITIALIZER is available to initialize a condition variable that the application has statically allocated. Then, the mutex and the condition variables can be used as follows:
• Each procedure belonging to the monitor must be explicitly bracketed with a mutex lock at the beginning, and a mutex unlock at the end.
• To block on a condition variable, a thread must call the pthread_cond_wait function, giving both the condition variable and the mutex used to protect the procedures of the monitor as arguments. This function atomically unlocks the mutex and blocks the caller on the condition variable; the mutex will be reacquired when the thread is unblocked, and before returning from pthread_cond_wait. To avoid blocking for a (potentially) unbounded time, the pthread_cond_timedwait function allows the caller to specify the maximum amount of time that may be spent waiting for the condition variable to be signaled.
• Inside a procedure belonging to the monitor, the pthread_cond_signal function, taking a condition variable as argument, can be called to unblock at least one of the threads that are blocked on the specified condition variable; the call has no effect if no threads are blocked on the condition variable. The rather relaxed specification of unblocking at least one thread, instead of exactly one, has been adopted by the standard to simplify the implementation of condition variables on multiprocessor systems, and to make it more efficient, mainly because condition variables are often used as the building block of higher-level synchronization primitives.
• A variant of pthread_cond_signal, called pthread_cond_broadcast, is available to unblock all threads that are currently waiting on a condition variable. As before, this function has no effect if no threads are waiting on the condition variable.
When no longer needed, condition variables shall be destroyed by means of the pthread_cond_destroy function, to save system resources.

11.3.5.5 Shared Memory

Except for message queues, all the IPC mechanisms described so far provide only synchronization among threads and processes, not data sharing.
© 2006 by Taylor & Francis Group, LLC
11-20
Embedded Systems Handbook
Moreover, while all threads belonging to the same process share the same address space, so that they implicitly and inherently share all their global data, the same is not true for different processes; therefore, the POSIX standard specifies an interface to explicitly set up a shared memory object among multiple processes.

The shm_open function either creates a new shared memory object or opens an existing one, and associates it with a file descriptor, which is then returned to the caller. In the system, each shared memory object is uniquely identified by a name, like a file. After creation, the state of a shared memory object, in particular all data it contains, persists until the shared memory object is unlinked and all active references to it are removed. However, the standard does not specify whether a shared memory object remains valid after a reboot of the system or not.

Conversely, close removes the association between a file descriptor and the corresponding shared memory object. As a result, the file descriptor is no longer valid after successful return from this function. Last, the shm_unlink function removes a shared memory object, provided no other processes reference it; if this is not the case, the removal is postponed until the reference count drops to zero.

It should be noted that the association between a shared memory object and a file descriptor belonging to the calling process, performed by shm_open, does not map the shared memory into the address space of the process. In other words, merely opening a shared memory object does not make the shared data accessible to the process. In order to perform the mapping, the mmap function must be called; since the exact details of the address space structure may be unknown to, and uninteresting for, the programmer, the same function also provides the capability of automatically choosing a suitable portion of the caller's address space to place the mapping. The function munmap removes a mapping.
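The life cycle just described can be sketched as follows. This is a single-process illustration of the calls only; the object name "/demo_shm" is arbitrary, error handling is minimal, and on some systems the program must be linked with -lrt:

```c
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

/* Create a shared memory object, map it, use it, and tear it down. */
int demo(void)
{
    const char *name = "/demo_shm";
    int fd = shm_open(name, O_CREAT | O_RDWR, 0600); /* create or open */
    if (fd < 0) return -1;
    if (ftruncate(fd, sizeof(int)) < 0) return -1;   /* size the object */

    /* Merely opening the object does not make it accessible:
     * mmap establishes the mapping into the address space. */
    int *p = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
                  MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) return -1;
    close(fd);               /* the mapping survives the close */

    *p = 1234;               /* data visible to any process mapping it */
    int v = *p;

    munmap(p, sizeof(int));  /* remove the mapping */
    shm_unlink(name);        /* remove the object once unreferenced */
    return v;
}
```

In a real application, a second process would shm_open and mmap the same name to obtain a view of the same data.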
11.3.6 Thread-Specific Data

All threads belonging to the same process implicitly share the same address space, so that they have shared access to all their global data. As a consequence, only the information allocated on the thread's stack, such as function arguments and local variables, is private to each thread. On the other hand, it is often useful in practice to have data structures that are private to a single thread, but can be accessed globally by the code of that thread. The POSIX standard responds to this need by defining the concept of thread-specific data, of which Figure 11.4 depicts the general usage. The pthread_key_create function creates a thread-specific data key visible to, and shared by, all threads in the process. The key values provided by this function are opaque objects used to access thread-specific data. In particular, the pair of functions pthread_getspecific and pthread_setspecific take a key as argument and allow the caller to get and set, respectively, a pointer uniquely bound with the given
FIGURE 11.4 Thread-specific data in POSIX.
Real-Time Embedded Operating Systems: Standards and Perspectives
11-21
key, and private to the calling thread. The pointer bound to the key by pthread_setspecific persists for the life of the calling thread, unless it is replaced by a subsequent call to pthread_setspecific. An optional destructor function may be associated with each key when the key itself is created. When a thread exits, if a given key has a valid destructor, and the thread has a valid (i.e., not NULL) pointer associated with that key, the pointer is disassociated and set to NULL, and then the destructor is called with the previously associated pointer as argument. When it is no longer needed, a thread-specific data key should be deleted by invoking the pthread_key_delete function on it. It should be noted that, unlike in the previous case, this function does not invoke the destructor function associated with the key, so that it is the responsibility of the application to perform any cleanup actions for data structures related to the key being deleted.
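The per-thread binding can be illustrated with a short sketch; the code below is illustrative only (no destructor is registered, so no cleanup is needed):

```c
#include <pthread.h>
#include <stddef.h>

/* One key, shared by all threads; the pointer bound to it is private
 * to each thread. */
static pthread_key_t key;

static void *worker(void *arg)
{
    pthread_setspecific(key, arg);    /* bind a pointer for this thread */
    return pthread_getspecific(key);  /* reads back this thread's pointer */
}

int demo(void)
{
    if (pthread_key_create(&key, NULL) != 0)  /* NULL: no destructor */
        return -1;

    int a = 1, b = 2;
    pthread_t t1, t2;
    void *r1, *r2;
    pthread_create(&t1, NULL, worker, &a);
    pthread_create(&t2, NULL, worker, &b);
    pthread_join(t1, &r1);
    pthread_join(t2, &r2);

    pthread_key_delete(key);  /* note: does not invoke destructors */

    /* Each thread saw only its own value under the shared key. */
    return (*(int *)r1 == 1 && *(int *)r2 == 2) ? 0 : -1;
}
```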
11.3.7 Memory Management

The standard allows processes to lock parts or all of their address space in main memory by means of the mlock and mlockall functions; in addition, mlockall also allows the caller to demand that all of the pages that will become mapped into the address space of the process in the future be implicitly locked as well. The lock operation both forces the memory residence of the virtual memory pages involved and prevents them from being paged out in the future. This is vital in operating systems that support demand paging and must nevertheless support real-time processing, because the paging activity could introduce undue and highly unpredictable delays when a real-time process attempts to access a page that is currently not in main memory and must therefore be retrieved from secondary storage. When the lock is no longer needed, the process can invoke either the munlock or the munlockall function to release it and enable demand paging again.

Other memory management functions, such as the mmap function already described in Section 11.3.5.5, establish a mapping between the address space of the calling process and a memory object, possibly shared between multiple processes. The mapping facility is general enough that it can also be used to map other kinds of objects, such as files and devices, into the address space of a process, provided both the hardware and the operating system have this capability, which is not mandated by the standard. For example, once a file is mapped, a process can access it simply by reading or writing the data at the address range to which the file was mapped.

Finally, it is possible for a process to change the access protections of portions of its address space by means of the mprotect function; in this case, it is assumed that protections will be enforced by the hardware.
For example, to prevent inadvertent data corruption due to a software bug, one could protect critical data intended for read-only usage against write access.
11.3.8 Asynchronous and List Directed Input and Output

Many operating systems carry out I/O operations synchronously with respect to the process requesting them. Thus, for example, if a process invokes a file read operation, it stays blocked until the operating system has finished it, either successfully or unsuccessfully. As a side effect, any process can have at most one pending I/O operation at any given time. While this programming model is simple, intuitive, and perfectly adequate for general-purpose systems, it shows its limits in a real-time environment, namely:

• I/O device access timings can vary widely, especially when an error occurs; hence, it is not always wise to suspend the execution of a process until the operation completes, because this would introduce a source of unpredictability in the system.
• It is often desirable to start more than one I/O operation simultaneously, under the control of a single process, for example, to enhance system performance by exploiting I/O hardware parallelism.
To satisfy these requirements, the standard defines a set of functions to start one or more I/O requests, to be carried out in parallel with process execution, and whose completion status can be retrieved asynchronously by the requesting process.

Asynchronous and list-directed I/O functions revolve around the concept of the asynchronous I/O control block, struct aiocb; this structure holds all the information needed to describe an I/O operation, and contains members to:

• Specify the operation to be performed, read or write.
• Identify the file on which the operation must be carried out, by means of a file descriptor.
• Determine what portion of the file the operation will operate upon, by means of a file offset and a transfer length.
• Locate a data buffer in memory used to store the data read from, or to be written to, the file.
• Give a priority classification to the operation.
• Request the asynchronous notification of the completion of the operation, either by a signal or by the asynchronous execution of a function, as described in Section 11.3.4.

Then, the following functions are available:

• The aio_read and aio_write functions take an I/O control block as argument and schedule a read or a write operation, respectively; both return to the caller as soon as the request has been queued for execution.
• As an extension, the lio_listio function schedules a list of (possibly asynchronous) I/O requests, each described by an I/O control block, with a single function call.
• The aio_error and aio_return functions allow the caller to retrieve the error and status information associated with an I/O control block, after the corresponding I/O operation has completed.
• The aio_fsync function asynchronously forces all I/O operations associated with the file indicated by the I/O control block passed as argument, and currently queued, to the synchronized I/O completion state.
• The aio_suspend function can be used to block the calling thread until at least one of the I/O operations associated with a set of I/O control blocks passed as argument completes, or up to a maximum amount of time.
• The aio_cancel function cancels an I/O operation that has not been completed yet.
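A minimal asynchronous read can be sketched as follows. The file name is illustrative, error handling is reduced to the essentials, and on some systems the program must be linked with -lrt:

```c
#include <aio.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Queue an asynchronous read via an aiocb, wait for it with
 * aio_suspend, and collect its status with aio_return. */
int demo(void)
{
    /* Prepare a small file to read from. */
    int fd = open("aio_demo.txt", O_CREAT | O_TRUNC | O_RDWR, 0600);
    if (fd < 0) return -1;
    if (write(fd, "hello", 5) != 5) return -1;

    char buf[8] = {0};
    struct aiocb cb;
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;        /* file to operate on */
    cb.aio_offset = 0;         /* file offset */
    cb.aio_buf    = buf;       /* destination buffer */
    cb.aio_nbytes = 5;         /* transfer length */

    if (aio_read(&cb) != 0) return -1;   /* returns once queued */

    /* Block until this request completes (no timeout here). */
    const struct aiocb *list[1] = { &cb };
    aio_suspend(list, 1, NULL);

    int n = aio_return(&cb);   /* completion status: bytes transferred */
    close(fd);
    unlink("aio_demo.txt");
    return (n == 5 && strcmp(buf, "hello") == 0) ? 0 : -1;
}
```

Between aio_read and aio_suspend, the process is free to perform other work, or to queue further requests; that is the point of the interface.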
11.3.9 Clocks and Timers

Real-time applications very often rely on timing information to operate correctly; the POSIX standard specifies support for one or more timing bases, called clocks, of known resolution and whose value can be retrieved at will. In the system, each clock has its own unique identifier. The clock_gettime and clock_settime functions get and set the value of a clock, respectively, while the clock_getres function returns the resolution of a clock. Clock resolutions are implementation-defined and cannot be set by a process; some operating systems allow the clock resolution to be set at system generation or configuration time.

In addition, applications can set one or more per-process timers, using a specified clock as a timing base, by means of the timer_create function. Each timer has a current value and, optionally, a reload value associated with it. The operating system decrements the current value of timers according to their clock and, when a timer expires, it notifies the owning process with an asynchronous notification of timer expiration; as described in Section 11.3.4, the notification can be carried out either by a signal, or by awakening a thread belonging to the process. On timer expiration, the operating system also reloads the timer with its reload value, if it has been set, thus possibly realizing a repetitive timer.
When a timer is no longer needed, it shall be removed by means of the timer_delete function, which both stops the timer and frees all resources allocated to it. Since, due to scheduling or processor load constraints, a process could lose one or more notifications of expiration, the standard also specifies a way for applications to retrieve, by means of the timer_getoverrun function, the number of "missed" notifications, that is, the number of extra timer expirations that occurred between the time at which a given timer expired and the time when the notification associated with the expiration was eventually delivered to, or accepted by, the process. At any time, it is also possible to store a new value into, or retrieve the current value of, a timer by means of the timer_settime and timer_gettime functions, respectively.
11.3.10 Cancellation

Any thread may request the cancellation of another thread in the same process by means of the pthread_cancel function. Then, the target thread's cancelability state and type determine whether and when the cancellation takes effect. When the cancellation takes effect, the target thread is terminated. Each thread can atomically get and set its own way to react to a cancellation request by means of the pthread_setcancelstate and pthread_setcanceltype functions. In particular, three different settings are possible:

• The thread can ignore cancellation requests completely.
• The thread can accept the cancellation request immediately.
• The thread can be willing to accept cancellation requests only when its execution flow crosses a cancellation point. A cancellation point can be explicitly placed in the code by calling the pthread_testcancel function. Also, it should be remembered that many functions specified by the POSIX standard act as implicit cancellation points.

The choice of the most appropriate response to cancellation requests depends on the application, and is a trade-off between the desirable feature of really being able to cancel a thread, and the necessity of avoiding the cancellation of a thread while it is executing in a critical section of code, both to keep the guarded data structures consistent and to ensure that any IPC object associated with the critical section, for example, a mutex, is released appropriately; otherwise, the critical region would stay locked forever, likely inducing a deadlock in the system. As an aid to do this, the POSIX standard also specifies a mechanism that allows any thread to register a set of cleanup handlers on a stack, to be executed in LIFO order when the thread either exits voluntarily or accepts a cancellation request.
The pthread_cleanup_push and pthread_cleanup_pop functions push and pop a cleanup handler into and from the handler stack; the latter function also has the ability to execute the handler it is about to remove.
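Deferred cancellation with a cleanup handler can be sketched as follows; the 100 ms startup delay is an illustrative simplification, and the handler here merely sets a flag where a real one would, for example, unlock a mutex:

```c
#include <pthread.h>
#include <time.h>
#include <unistd.h>

static int cleaned = 0;

/* Cleanup handler: runs when the thread is cancelled (or exits). */
static void cleanup(void *arg)
{
    (void)arg;
    cleaned = 1;   /* e.g., release a mutex held by the thread */
}

static void *victim(void *arg)
{
    (void)arg;
    pthread_setcanceltype(PTHREAD_CANCEL_DEFERRED, NULL);
    pthread_cleanup_push(cleanup, NULL);
    while (1)
        sleep(1);              /* sleep() is an implicit cancellation point */
    pthread_cleanup_pop(0);    /* never reached; pairs with the push above */
    return NULL;
}

int demo(void)
{
    pthread_t t;
    void *res;
    pthread_create(&t, NULL, victim, NULL);

    struct timespec ts = { 0, 100000000 };  /* let the thread start */
    nanosleep(&ts, NULL);

    pthread_cancel(t);     /* request cancellation */
    pthread_join(t, &res); /* a cancelled thread yields PTHREAD_CANCELED */
    return (res == PTHREAD_CANCELED && cleaned) ? 0 : -1;
}
```

The cancellation request takes effect at the sleep call, and the pushed handler runs before the thread terminates, exactly the behavior described above.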
11.4 Real-Time, Open-Source Operating Systems

Although in the general-purpose operating system camp a handful of products dominates the market, there are more than 30 real-time operating systems available for use today, both commercial and experimental, and new ones are still being developed. This is due both to the large amount of ongoing research in the real-time area, and to the fact that real-time embedded applications are inherently less homogeneous than general-purpose applications, such as those found in office automation; hence, ad hoc operating system features are often needed. In addition, the computing power and the overall hardware architecture of embedded systems are much more varied than, for example, those of personal computers.

This section focuses on open-source real-time operating systems, a recent source of novelty in embedded system software design. Open-source operating systems are especially promising, for two main reasons:

1. The source code of an open-source real-time operating system can be used both to develop real-world applications, and for study and experimentation. Therefore, open-source operating
systems often implement the most advanced, state-of-the-art architectures and algorithms, because researchers can play with them at will and their work can immediately be reflected into real applications.
2. Open-source operating systems have no purchasing cost and are inherently royalty free, so their adoption can cut down the costs of an application. Moreover, one of the most well-known issues of open-source operating systems, namely the lack of official technical support, has recently found its way to a solution, with more and more consulting firms specializing in their support.

Among the open-source operating systems that have found their way into commercial products we recall:

eCos. The development of eCos [9] is coordinated by Red Hat Inc. and is based on a modular, layered real-time kernel. The most important innovation in eCos is its extensive configuration system, which operates at kernel build time with a large number of configuration points placed at source code level, and allows a very fine-grained adaptation of the kernel itself to both application needs and hardware characteristics. The output of the configuration process is an operating system library that can then be linked with the application code. Its application programming interface is compatible with the POSIX standard, but it does not support multiple processes with independent address spaces, even when a Memory Management Unit (MMU) is available.

µClinux. It is a stripped-down version of the well-known Linux operating system. The most interesting features of µClinux [10] are the ability to run on microcontrollers that lack an MMU and its small size, compared with a standard Linux kernel. As is, µClinux does not have any real-time capability, because it inherits its standard processor scheduler from Linux; however, both RT-Linux [11] and RTAI [12] real-time extensions are available for it.

RT-Linux and RTAI.
These are hard real-time-capable extensions to the Linux operating system; they are similar, and their architecture was first outlined in 1997 [11–13]. The main design feature of both RT-Linux and RTAI is the clear separation between the real-time and nonreal-time domains: in RT-Linux and RTAI, a small monolithic real-time kernel runs real-time tasks, and coexists with the Linux kernel. As a consequence, nonreal-time tasks running on the Linux kernel have the sophisticated services of a standard time-sharing operating system at their disposal, whereas real-time tasks operate in a protected, predictable, and low-latency environment. The real-time kernel performs first-level real-time scheduling and interrupt handling, and runs the Linux kernel as its lowest-priority task. In order to keep changes in the Linux kernel to an absolute minimum, the real-time kernel provides an emulation of the interrupt control hardware. In particular, any interrupt disable/enable request issued by the Linux kernel is not passed to the hardware, but is emulated in the real-time kernel instead; thus, for example, when Linux disables interrupts, the hardware interrupts actually stay enabled, and the real-time kernel queues and delays the delivery of any interrupt of interest to the Linux kernel. Real-time interrupts are not affected at all, and are handled as usual, without any performance penalty. To handle communication between real-time and nonreal-time tasks, RT-Linux and RTAI implement lock-free queues and shared memory. In this way, real-time applications can rely on Linux system services for nontime-critical operations, such as filesystem access and the graphical user interface.

RTEMS. The development of RTEMS [14] is coordinated by On-Line Applications Research Corporation, which also offers paid technical support.
Its application development environment is based on open-source GNU tools, and the system has a monolithic/layered architecture, commonly found in high-performance real-time executives. RTEMS complies with the POSIX 1003.1b application programming interface and supports multiple threads of execution, but does not implement a multiprocess environment with independent application address spaces. It also supports networking and a filesystem.
11.5 Virtual-Machine Operating Systems

According to its most general definition, a virtual machine is an abstract system, composed of one or more virtual processors and zero or more virtual devices. The implementation of a virtual machine can be carried out in a number of different ways, such as, for example, interpretation, (partial) just-in-time compilation, and hardware-assisted instruction-level emulation. Moreover, these techniques can be, and usually are, combined to obtain the best compromise between complexity of implementation and performance for a given class of applications.

In this section, we focus on perfect virtualization, that is, the implementation of virtual machines whose processors and I/O devices are identical in all respects to their counterparts in a physical machine, that is, the machine on which the virtualization software runs. The implementation of virtual machines is carried out by means of a peculiar kind of operating system kernel; hardware assistance keeps overheads to a minimum.

As described in Section 11.2.1, the internal architecture of an operating system based on virtual machines revolves around the basic observation that an operating system must perform two essential functions:

• Multiprogramming
• System services

Accordingly, these operating systems fully separate the two functions and implement them as two distinct operating system components:

• A virtual machine monitor
• A guest operating system
11.5.1 Related Work

A particularly interesting application of the layered approach to operating system design, virtual machines were first introduced by Meyer and Seawright in the experimental CP-40 and, later, CP-67 systems [15] on an IBM 360/67. This early system was peculiar because it provided each user with a virtual IBM 360/65 (not 67) including I/O devices. So, processor and I/O virtualization was perfect, but MMU virtualization was not attempted at all, and the virtual machines thus lacked an MMU. Later on, MMU virtualization was added by A. Auroux, and CP-67 evolved into a true IBM product, VM/370. Offspring of VM/370 are still in use today on IBM mainframes; for example, z/VM [16] runs on IBM zSeries mainframes and supports the execution of multiple z/OS and Linux operating system images, each in its own virtual machine. An extensive, early discussion of virtual machines and their properties can be found in References 17 and 18.

More recently, microcode support for virtual machines was added to the 680x0 microprocessor family in the transition between the 68000 and the 68010 [19]; in particular, the privilege level required to execute some instructions was raised to make processor virtualization feasible. Commercial virtualization products now available include, for example, the VMware virtualization software for the Intel Pentium architecture [20]. For example, in Reference 21 this product was used to give a prototype implementation of a virtual machine-based platform for trusted computing. More advanced attempts at virtualization in various forms can be found in References 22 and 23. In particular, Reference 22 discusses in detail the trade-off between "perfect" virtualization and efficiency.
11.5.2 Views of Processor State

The execution of each machine instruction both depends on, and affects, the internal processor state. In addition, the semantics of an instruction depend on the execution mode the processor was in
when the instruction itself was executed, since the processor mode directly determines the privilege level of the processor itself. For example:

• On the ARM V5 [24] processor, the execution of the ADD R13, R13, #1 instruction increments by one the contents of register R13.
• The outcome of the BEQ label instruction depends on the current value of the Z processor state flag, and conditionally updates the contents of the program counter.

The view that machine code has of the internal processor state depends on the mode the processor is running in. In particular, let us define two, somewhat simplified, views of the processor state:

User-mode view. It is the portion of the processor state that is accessible through machine instructions, with either read-only or read-write access rights, when the processor is in user mode. In other words, it is the portion of the processor state that can be accessed by unprivileged machine instructions.

Privileged-mode view. It is the portion of the processor state that is accessible through machine instructions, with either read-only or read-write access rights, when the processor is in privileged mode. It usually is a superset of the user-mode state and, if the processor supports a single privileged mode, coincides with full access to the processor state as a whole.

When the processor supports either privilege rings or multiple, independent privileged modes, the definition of privileged-mode view becomes more complicated, and involves either:

• A nested set of views when the processor supports privilege rings. In this case, the inner view, corresponding to the most privileged processor mode, encompasses the machine state as a whole with the most powerful access rights; outer, less privileged modes have more restricted views of the processor state.
Above, nested means that the outer view either has no visibility of a processor state item that is visible from the inner view, or that the outer view has less powerful access rights than the inner view on one or more processor state items.

• A collection of independent views when the processor supports multiple, independent privileged modes. In this case, it should be noted that the intersection among views can be, and usually is, not empty; for example, in the ARM V5 processor, user registers R0 through R7 are accessible from, and common to, all unprivileged and privileged processor modes. In addition, registers R8 through R12 are common to all but one processor mode.

It should also be noted that only the union of all views gives full access to the processor state: in general, no individual view can do the same, not even the view corresponding to the "most privileged" privileged mode, even if the processor specification contains such a hierarchical classification of privileged modes. Continuing the first example above, if the processor implements multiple, mode-dependent instances of register R13, the execution of the ADD instruction presented above in user mode will update the user-mode view of R13, but will not affect the view of the same register in any other mode.

As is customary, we define a process as the activity of executing a program; a process therefore encompasses both the program text and a view of the current processor state when the execution takes place. The latter includes the notion of execution progress that could be captured, for example, by a program counter.
11.5.3 Operating Principle

The most important concept behind processor virtualization is that a low-level system software component, the Virtual Machine Monitor (or VMM for short), also historically known as the Control Program (or CP), performs the following functions, likely with hardware assistance:

• It gives to a set of machine code programs, running either in user or privileged mode, their own, independent view of the processor state; this gives rise to a set of independent sequential processes.
• Each view is "correct" from the point of view of the corresponding program, in the sense that the view is indistinguishable from the view the program would have when run on a bare machine,
without a VMM in between. This requirement supports the illusion that each process runs on its own processor, and that processor is identical to the physical processor below the VMM.
• It is able to switch the processor among the processes mentioned above; in this way, the VMM implements multiprogramming. Switching the processor involves both a switch among possibly different program texts, and among distinct processor state views.

The key difference of the VMM approach with respect to traditional multiprogramming is that a traditional operating system confines user-written programs to run in user mode only, and to access privileged mode by means of a system call. Thus, each service request made by a user program traps to the operating system and switches the processor to privileged mode; the operating system, running in privileged mode, performs the service on behalf of the user program and then returns control to it, simultaneously bringing the processor back to user mode. A VMM system, instead, supports virtual processors that can run either in user or in privileged mode, just like the real, physical processor they mimic; the real processor must necessarily execute the processes inside the virtual machine — also called the virtual machine guest code — in user mode, to keep control of the system as a whole. We must therefore distinguish between:

• The real, physical processor mode, that is, the processor mode the physical processor is running in. At each instant, each physical processor in the system is characterized by its current processor mode.
• The virtual processor mode, that is, the processor mode each virtual processor is running in. At each instant, each virtual machine is characterized by its current processor mode, and it does not necessarily coincide with the physical processor mode, even when the virtual machine is being executed.
11.5.4 Virtualization by Instruction Emulation

The classic approach to processor virtualization is based on privileged instruction emulation; with this approach, the VMM maintains a virtual machine control block (or VMCB for short) for each virtual machine. Among other things, the VMCB holds the full processor state, both unprivileged and privileged, of a virtual processor; therefore, it contains state information belonging to, and accessible from, distinct views of the processor state, with different levels of privilege. When a virtual machine is being executed by a physical processor, the VMM transfers part of the VMCB into the physical processor state; when the VMM assigns the physical processor to another virtual machine, the physical processor state is transferred back into the VMCB.

It is important to notice that virtual machines are always executed with the physical processor in user mode, regardless of the virtual processor mode. Most virtual machine instructions are executed directly by the physical processor, with zero overhead; however, some instructions must be emulated by the VMM, and thus incur a trap-handling overhead. In particular:

1. Unprivileged instructions act on, and depend on, the current view of the processor state only, and are executed directly by the physical processor. Two subcases are possible, depending on the current virtual processor mode:
(a) Both the virtual and the physical processor are in user mode. In this case, virtual and physical instruction execution and their corresponding processor state views fully coincide, and no further manipulation of the processor state is necessary.
(b) The virtual processor is running in a privileged mode and the physical processor is in user mode. So, instruction execution acts on the user-mode view of the processor state, while the intended effect is to act on one of the privileged views.
To compensate for this, the VMM must update the contents of the user-mode view in the physical processor from the appropriate
portion of the VMCB whenever the virtual processor changes state. Even in this case, the overhead incurred during actual instruction execution is zero.

2. Privileged instructions act on one of the privileged views of the processor state. So, when the execution of a privileged instruction is attempted in physical user mode, the physical processor takes a trap to the VMM. In turn, the VMM must emulate either the trap or the trapped instruction, depending on the current virtual privilege level, and reflect the outcome of the emulation in the virtual processor state stored in the VMCB:
(a) If the virtual processor was in user mode when the trap occurred, the VMM must emulate the trap. Actual trap handling will be performed by the privileged software inside the virtual machine, in virtual privileged mode, because the emulation of the trap, among other things, switches the virtual processor into privileged mode. The virtual machine privileged software actually receives the emulated trap.
(b) If the virtual processor was in privileged mode, and the trap was triggered by lack of the required physical processor privilege level, the VMM must emulate the privileged instruction; in this case, the VMM itself performs trap handling, and the privileged software inside the virtual machine does not receive the trap at all. Instead, it sees the outcome of the emulated execution of the privileged instruction.
(c) If the virtual processor was in privileged mode, and the trap was triggered by other reasons, the VMM must emulate the trap; the actual trap handling, if any, will be performed by the privileged software inside the virtual machine, in virtual privileged mode. It should be noted that, in most simple processor architectures and operating systems, a trap occurring in a privileged mode is usually considered to be a fatal condition, and triggers the immediate shutdown of the operating system itself.
In the first and third cases above, the behavior of the virtual processor exactly matches the behavior of a physical processor in the same situation, except that the trap entry mechanism is emulated in software instead of being performed either in hardware or in microcode. In the second case, the overall behavior of the virtual processor still matches the behavior of a physical processor in the same situation, but the trap is kept invisible to the virtual machine guest software because, in this case, the trap is instrumental in allowing the VMM to properly catch and emulate the privileged instruction.

3. A third class of instructions includes unprivileged instructions whose outcome depends on a physical processor state item belonging to privileged processor state views only. This last class of instructions is anomalous and problematic from the point of view of processor virtualization, because these instructions allow a program to infer something about a processor state item that would not be accessible from its current view of the processor state. The presence of instructions of this kind hampers the privileged instruction emulation approach to processor virtualization just discussed, because that approach is based on the separation between physical processor state and virtual processor state, and enforces this separation by trapping (and then emulating in the virtual processor context) all instructions that try to access privileged processor state views. Instructions of this kind bypass the mechanism as a whole: they generate no trap, so the VMM is unable to detect and emulate them; instead, they take information directly from the physical processor state.
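As a concrete illustration, the three-way dispatch for privileged instruction traps described above can be sketched in a few lines of Python. This is a toy model, not code from any actual VMM: the VMCB class, the mode and trap names, and the emulate_* helpers are all invented for the example.

```python
USER, PRIV = "user", "privileged"

class VMCB:
    """Illustrative virtual machine control block."""
    def __init__(self):
        self.mode = USER           # current virtual processor mode
        self.trap_pending = None   # trap reflected back to the guest
        self.last_emulated = None  # last instruction emulated by the VMM

def emulate_trap(vmcb, trap):
    # Reflect the trap into the virtual machine: the guest's own
    # privileged software will handle it, in virtual privileged mode.
    vmcb.trap_pending = trap
    vmcb.mode = PRIV

def emulate_instruction(vmcb, insn):
    # Stand-in for the instruction interpreter; the outcome is
    # reflected in the VMCB only, and the guest never sees a trap.
    vmcb.last_emulated = insn

def on_physical_trap(vmcb, trap, insn):
    # Case (a): virtual user mode -> emulate the trap.
    if vmcb.mode == USER:
        emulate_trap(vmcb, trap)
    # Case (b): virtual privileged mode, trap caused only by the lack
    # of physical privilege -> emulate the privileged instruction.
    elif trap == "privileged_instruction":
        emulate_instruction(vmcb, insn)
    # Case (c): virtual privileged mode, any other cause -> emulate
    # the trap (often a fatal condition for the guest OS).
    else:
        emulate_trap(vmcb, trap)
```

The third, trapless class of instructions has no branch here by construction: since no physical trap occurs, on_physical_trap is never entered for them, which is exactly the problem the text points out.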
11.5.5 Processor-Mode Change

A change in the mode of a virtual processor may occur for several reasons, of which the main ones are:

• When the execution of an instruction triggers a trap; in this case, trap handling is synchronous with respect to the code being executed.
Real-Time Embedded Operating Systems: Standards and Perspectives
• When an interrupt request for the virtual machine comes in, and is accepted; the interrupt request is asynchronous with respect to the code being executed, but the hardware implicitly synchronizes interrupt handling with instruction boundaries.

In both cases, the VMM takes control of the physical processor and then implements the mode change by manipulating the virtual machine's VMCB and, if the virtual machine was being executed when the mode change was requested, a portion of the physical processor state. In particular, to implement the processor-mode change the VMM must perform the following actions:

• Save the portion of the physical processor state pertaining to the state view of the old processor mode into the VMCB; for example, if the processor was running in user mode, the user-mode registers currently loaded in the physical processor must be saved.

• Update the VMCB to reflect the effects of the mode change on the virtual processor state; for example, a system call instruction will likely modify the program counter of the virtual processor and load a new processor status word.

• Load the portion of the physical processor state pertaining to the state view of the new processor mode from the VMCB; for example, if the processor is being switched to a privileged mode, the privileged-mode registers must be transferred from the VMCB into the user-mode registers of the physical processor.

Notice that the mode of the registers saved in the VMCB and the accessibility mode of the physical processor registers into which they are loaded do not coincide.
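The three save/update/load actions can be sketched as follows. The VMCB layout, the register names, and the change_mode interface are illustrative assumptions, not taken from a real monitor; the point is only the order of operations and the fact that the physical register file stays in user mode throughout.

```python
class VMCB:
    """Illustrative VMCB holding one register view per processor mode."""
    def __init__(self):
        self.views = {"user":       {"r0": 0, "pc": 0, "psw": 0},
                      "privileged": {"r0": 0, "pc": 0, "psw": 0}}
        self.mode = "user"

class PhysicalCPU:
    """Only the user-mode register file is ever loaded, because the
    physical processor stays in user mode while a VM runs."""
    def __init__(self):
        self.regs = {"r0": 0, "pc": 0, "psw": 0}

def change_mode(cpu, vmcb, new_mode, new_pc, new_psw):
    # 1. Save the physical registers into the view of the OLD mode.
    vmcb.views[vmcb.mode] = dict(cpu.regs)
    # 2. Update the VMCB to reflect the mode change; e.g. a system
    #    call loads a new program counter and processor status word.
    vmcb.mode = new_mode
    vmcb.views[new_mode]["pc"] = new_pc
    vmcb.views[new_mode]["psw"] = new_psw
    # 3. Load the NEW mode's view into the physical (user-mode)
    #    registers; the mode of the saved registers and the mode of
    #    the registers they are loaded into deliberately differ.
    cpu.regs = dict(vmcb.views[new_mode])
```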
11.5.6 Privileged Instruction Emulation

To perform its duties, the VMM must be able to receive and handle traps on behalf of a virtual machine, and these traps can be triggered for a variety of reasons. When using the privileged instruction emulation approach to processor virtualization, the most common trap reason is the request to emulate a privileged instruction.

The VMM must perform privileged instruction emulation when a virtual processor attempts to execute a legal privileged instruction while in virtual privileged mode. In this case, the physical processor (running in physical user mode) takes a "privileged instruction" trap that would not have been taken if the processor were in privileged mode, as the virtual machine software expects it to be. The main steps of the instruction emulation sequence are:

• Save into the VMCB all registers in the view corresponding to the current virtual processor mode. This both "freezes" the virtual machine state for subsequent instruction emulation, and frees the physical processor state for VMM use.

• Locate and decode the instruction to be emulated in the virtual processor instruction stream. This operation may involve multiple steps because, for example, on superscalar or deeply pipelined architectures, the exact value of the program counter at the time of the trap may not be easy to compute.

• Switch the physical processor into the appropriate privileged mode for instruction emulation, that is, to the processor mode of the virtual processor. The trap handling mechanism of the physical processor always switches the processor into a privileged mode, but if the processor supports multiple privileged modes, that mode might not coincide with the actual privileged mode of the virtual processor.

• Emulate the instruction using the VMCB as the reference machine state for the emulation, and reflect its outcome into the VMCB itself.
Notice that the execution of a privileged instruction may update both the privileged and the unprivileged portion of the virtual processor state, so the VMCB as a whole is involved. Also, the execution of a privileged instruction may change the processor mode of the virtual processor.
• Update the virtual program counter in the VMCB to the next instruction in the instruction stream of the virtual processor.

• Restore the virtual processor state from the updated VMCB and return from the trap.

In the last step above, the virtual processor state can in principle be restored either:

• From the VMCB of the virtual machine that generated the trap in the first place, if the processor scheduler of the VMM is not invoked after instruction emulation; this is the case just described.

• From the VMCB of another virtual machine, if the processor scheduler of the VMM is invoked after instruction emulation.
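The whole emulation sequence might be condensed into a sketch like the one below. Here decode, emulate, and scheduler are placeholders for the VMM's instruction interpreter and processor scheduler, the dictionary-based VMCB and the fixed INSN_SIZE are assumptions made for the example, and the return value stands for the register view loaded back into the physical processor.

```python
INSN_SIZE = 4  # bytes per instruction; an assumption for the sketch

def handle_privileged_trap(cpu_regs, vmcb, decode, emulate, scheduler=None):
    """One pass of the privileged instruction emulation sequence."""
    # 1. Freeze the VM: save the current view into the VMCB.
    vmcb["views"][vmcb["mode"]] = dict(cpu_regs)
    view = vmcb["views"][vmcb["mode"]]
    # 2. Locate and decode the trapping instruction.
    insn = decode(view["pc"])
    # 3. (The physical trap mechanism already switched the physical
    #    processor into a privileged mode.)
    # 4. Emulate against the VMCB as the reference machine state.
    emulate(vmcb, insn)
    # 5. Advance the virtual program counter past the instruction
    #    (the emulated instruction itself may also have changed it).
    view["pc"] += INSN_SIZE
    # 6. Restore: from the same VMCB, or from the VMCB chosen by the
    #    VMM processor scheduler, and return from the trap.
    next_vmcb = scheduler() if scheduler else vmcb
    return dict(next_vmcb["views"][next_vmcb["mode"]])
```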
11.5.7 Exception Handling

When any synchronous exception other than a privileged instruction trap occurs in either virtual user or virtual privileged mode, the VMM, and not the guest operating system of the virtual machine, receives the trap in the first place. When the trap is not instrumental to the implementation of virtual machines, as happens in most cases, the VMM must simply emulate the trap mechanism itself inside the virtual machine, and appropriately update the VMCB to reflect the trap back to the privileged virtual machine code.

Another situation in which the VMM must simply propagate the trap is the occurrence of a privileged instruction trap when the virtual processor is in virtual user mode. This occurrence usually, but not always, indicates a bug in the guest software: an easy counterexample arises when another VMM is running inside the virtual machine.

A special case of exception is the one generated by the system call instruction, whenever it is implemented. From the point of view of the VMM, however, this kind of exception is handled exactly like all the others; only the interpretation given to the exception by the guest code running in the virtual machine is different.

It should also be noted that asynchronous exceptions, such as interrupt requests, must be handled in a different and more complex way, as described in the following section.
11.5.8 Interrupt Handling

We distinguish between three kinds of interrupt; each of them requires a different handling strategy by the VMM:

• Interrupts triggered by, and destined to, the VMM itself, for example, the VMM processor scheduler timeslice interrupt, and the VMM console interrupt; in this case, no virtual machine ever notices the interrupt.

• Interrupts destined to a single virtual machine, for example, a disk interrupt for a physical disk permanently assigned to a virtual machine.

• Interrupts synthesized by the VMM, either on its own initiative or as a consequence of another interrupt, and destined to a virtual machine, for example, a disk interrupt for a virtual disk emulated by the VMM, or a network interrupt for a virtual communication channel between virtual machines.

In the latter two cases, the general approach to interrupt handling is the same, and the delivery of an interrupt request to a virtual machine implies at least the following steps:

• If the processor was executing in a virtual machine, save the status of the current virtual machine into the corresponding VMCB; then, switch the processor onto the VMM context and stack, and select the most privileged processor mode. Otherwise, the processor was already executing the VMM; it already is in the VMM context and stack, and runs at the right privilege level. In both cases, after this phase, the current virtual machine context has been secured in its VMCB and the physical processor can freely be used by VMM code; this is also a good boundary for the transition between the portion of the VMM written in assembly code and the bulk of the VMM, written in a higher-level programming language.
• Determine the type of the interrupt request and to which virtual machine it must be dispatched; then, emulate in the corresponding VMCB the interrupt processing normally performed by the physical processor. An additional complication arises if the target virtual machine is the current virtual machine and the VMM was in active execution, that is, it was emulating an instruction on behalf of the virtual machine itself, when the request arrived. In this case, the simplest approach, which also adheres most closely to the behavior of the physical processor, is to defer interrupt emulation to the end of the current emulation sequence. To implement the deferred handling mechanism efficiently, some features of the physical processor, such as Asynchronous System Traps (ASTs) and deferrable software interrupts, may be useful; unfortunately, they are now uncommon on RISC (Reduced Instruction Set Computer) machines.

• Return either to the VMM or to the virtual machine code that was being executed when the interrupt request arrived. Notice that, at this point, no actual interrupt handling has taken place yet, and that some devices may require some limited intervention before returning from their interrupt handler, for example, to release their interrupt request line. In this case, it may be necessary to incorporate this low-level interrupt handling in the VMM directly, and at the same time ensure that it is idempotent when repeated by the virtual machine interrupt handler.

The opportunity and the relative advantages and disadvantages of invoking the VMM processor scheduler after each interrupt request will be discussed in Section 11.5.10.
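A minimal model of the deferral policy described above might look like this; the VM class and its fields are hypothetical names chosen for the example, and "emulating the interrupt entry" is reduced to appending to a pending list.

```python
class VM:
    def __init__(self, name):
        self.name = name
        self.pending = []       # interrupts already emulated into the VMCB
        self.deferred = []      # delivery postponed until emulation ends
        self.emulating = False  # is the VMM emulating for this VM right now?

def deliver(vm, irq):
    if vm.emulating:
        # Defer to the end of the current emulation sequence, as the
        # physical processor defers interrupts to instruction boundaries.
        vm.deferred.append(irq)
    else:
        # Emulate, in the VMCB, the interrupt entry the physical
        # processor would normally perform.
        vm.pending.append(irq)

def end_emulation(vm):
    # At the end of an emulation sequence, drain the deferred queue.
    vm.emulating = False
    while vm.deferred:
        deliver(vm, vm.deferred.pop(0))
```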
11.5.9 Trap Redirection

A problem common to privileged instruction emulation, exception handling, and interrupt handling is that the VMM must be able to intercept any exception the processor takes while executing on behalf of a virtual machine and direct it toward its own handler.

Most modern processors use a unified trap vector or dispatch table for all kinds of trap, exception, and interrupt. Each trap type has its own code, used as an index into the trap table to fetch the address in memory at which the corresponding trap handler starts. A slightly different approach is to execute the instruction inside the table directly (in turn, the instruction will usually be an unconditional jump), but the net effect is the same. A privileged register, usually called the trap table base register, gives the starting address of the table.

In either case, all vectors actually used by the physical processor when handling a trap reside in the privileged address space, and are accessed after the physical processor has been switched into an appropriate privileged mode. The VMM must have full control over these vectors, because it relies on them to intercept traps at runtime. On the other hand, the virtual machine guest code should be able to set up its own trap table, with any vectors it desires; the latter table resides in the virtually privileged address space of the virtual machine, and must be accessible to the virtual machine guest code in read and write mode. The contents of this table are not used by the physical processor, but by the VMM to compute the target address to which to redirect traps via emulation.
The simplest approach to accommodate these conflicting needs, when it is not possible to map the same virtual address to multiple, distinct physical addresses depending on the processor mode without software intervention (quite a common restriction on simple MMUs), is to reserve in the address space of each virtual machine a phantom page that is not currently in use by the guest code, grant read-only access to it only when the processor is in physical privileged mode to keep it invisible to the guest code, store the actual trap table there, and direct the processor to use it by setting the trap table base register appropriately.

The initialization of the actual trap table is performed by the VMM for each virtual machine during the initial instantiation of the virtual machine itself. Since the initialization is performed by the VMM (and not by the virtual machine), the read-only access restriction described above does not apply.
The VMM must then intercept any access made by the virtual machine guest code to the trap table base register, in order to properly locate the virtual trap table and be able to compute the target address to which to redirect traps. In other words, the availability of a trap table base register allows the guest code to set up its own, virtual trap table, and the VMM to set up the trap table obeyed by the physical processor, without resorting to virtual-to-physical address mapping functions that depend on the processor mode.

Two complementary approaches can be followed to determine the exact location of the phantom page in the address space of the virtual machine. In principle, the two approaches are equivalent; the difference between them lies in the compromise between ease of implementation, runtime overhead, and flexibility:

• If the characteristics of the guest code, in particular of the guest operating systems, are well known, the location of the phantom page can be fixed in advance. This is the simpler choice, and also has the least runtime overhead.

• If one does not want to make any assumption at all about the guest code, the location of the phantom page must be computed dynamically from the current contents of the page table set up by the guest code, and may change over time (e.g., when the guest code decides to map the location in which the phantom page currently resides for its own use). When the location of the phantom page changes, the VMM must update the trap table base register accordingly; moreover, it must ensure that the first-level trap handlers contain only position-independent code.
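The redirection computation itself is simple once the guest's trap table base is known. The sketch below assumes, purely for illustration, a flat guest memory modeled as a dictionary and a fixed vector entry size (VECTOR_BYTES); the physical trap table in the phantom page is taken for granted and only the VMM-side lookup into the guest's virtual table is shown.

```python
VECTOR_BYTES = 4  # size of one vector entry; an assumption

class GuestState:
    def __init__(self, memory):
        self.memory = memory         # guest memory, address -> word
        self.trap_table_base = None  # recorded via a trapped register write

def emulate_set_trap_table_base(guest, value):
    # The guest's write to the trap table base register traps to the
    # VMM, which records the virtual table's location instead of
    # letting the write reach the real register.
    guest.trap_table_base = value

def redirect_target(guest, trap_code):
    # Look the handler address up in the guest's own (virtual) trap
    # table; this is the address the VMM redirects the trap to.
    addr = guest.trap_table_base + trap_code * VECTOR_BYTES
    return guest.memory[addr]
```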
11.5.10 VMM Processor Scheduler

The main purpose of the VMM, with respect to processor virtualization, is to emulate the privileged instructions issued by virtual processors in virtual privileged mode. Code fragments implementing the emulation are usually short and often require atomic execution, above all with respect to interrupt requests directed to the same virtual machine. The processor scheduling activity carried out by the VMM by means of the VMM processor scheduler is tied to the emulation of privileged instructions quite naturally, because the handling of a privileged instruction exception is a convenient scheduling point.

On the other hand, the main role of the VMM in interrupt handling is to redirect each interrupt to the virtual machine(s) interested in it; this action, too, must be completed atomically with respect to instruction execution on the same virtual machine, as the physical processor itself would do. This suggests disabling the rescheduling of the physical processor if the processor was executing the VMM when the interrupt arrived, and delaying the rescheduling until the end of the current VMM emulation path. The advantage of this approach is twofold:

• The state of the VMM need never be saved when switching between different virtual machines.

• The VMM need not be reentrant with respect to itself, because a VMM/VMM context switch cannot occur.

By contrast, the processor allocation latency gets worse, because the maximum latency, not taking higher-priority activities into account, becomes the sum of:

• The longest emulation path in the VMM.

• The longest sequence of instructions in the VMM to be executed with interrupts disabled, due to synchronization constraints.

• The scheduling time.

• The virtual machine context save and restore time.
In a naive implementation, making the VMM nonpreemptable seems promising, because it is conceptually simple and imposes only a negligible performance penalty in the average case, if it is assumed that the occurrence of an interrupt that needs an immediate rescheduling while VMM execution is in progress is rare.
Also, some of the contributions to the processor allocation latency described earlier, mainly the length of instruction emulation paths in the VMM and the statistical distribution of the different instruction classes to be emulated in the instruction stream, will be better known only after some experimentation, because they also depend on the behavior of the guest operating systems and their applications. It must also be taken into account that making the VMM preemptable will likely bring additional performance penalties in the average case:

• The switch between virtual machines will become more complex, because their corresponding VMM state must be switched as well.

• The preemption of some VMM operations, for example, the propagation of an interrupt request, may be ill-defined if it occurs while a privileged instruction of the same virtual machine is being emulated.

So, at least for soft real-time applications, preemption of individual instruction emulation should be implemented in the VMM only if it is strictly necessary to satisfy the latency requirements of the application, and only after extensive experimentation.

From the point of view of the scheduling algorithms, at least in a naive implementation, a fixed-priority scheduler with a global priority level assigned to each virtual machine is deemed to be the best choice, because:

• It easily accommodates the common case in which nonreal-time tasks are confined under the control of a general-purpose operating system in a virtual machine, and real-time tasks either run under the control of a real-time operating system in another virtual machine, or each have a private virtual machine.

• The sensible selection of a more sophisticated scheduling algorithm can be accomplished only after extensive experimentation with the actual set of applications to be run, and when a detailed model of the real-time behavior of the application itself and of the devices it depends on is available.
• The choice of algorithms to be used when multiple, hierarchical schedulers are in effect in a real-time environment has not yet received extensive attention in the literature.
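In a naive sketch, such a fixed-priority scheduler over virtual machines reduces to picking the highest-priority ready VM. The VirtualMachine class and the larger-number-means-higher-priority convention are assumptions of the example, not part of any particular VMM design.

```python
class VirtualMachine:
    def __init__(self, name, priority, ready=True):
        self.name = name
        self.priority = priority  # one global, fixed priority per VM
        self.ready = ready

def schedule(vms):
    """Return the highest-priority ready virtual machine, or None if
    no virtual machine is ready to run."""
    ready = [vm for vm in vms if vm.ready]
    return max(ready, key=lambda vm: vm.priority) if ready else None
```

With this convention, a general-purpose OS confined in a low-priority VM never delays a ready real-time VM: the real-time VM is selected at every scheduling point as long as it is ready.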
References

[1] A. Silberschatz, P.B. Galvin, and G. Gagne. Applied Operating Systems Concepts. John Wiley & Sons, Hoboken, NJ, 1999.
[2] A.S. Tanenbaum and A.S. Woodhull. Operating Systems: Design and Implementation. Prentice Hall, Englewood Cliffs, NJ, 1997.
[3] M.K. McKusick, K. Bostic, M.J. Karels, and J.S. Quarterman. The Design and Implementation of the 4.4BSD Operating System. Addison-Wesley, Reading, MA, 1996.
[4] B. Sprunt, L. Sha, and J.P. Lehoczky. Aperiodic Task Scheduling for Hard Real-Time Systems. Real-Time Systems, 1, 27–60, 1989.
[5] IEEE Std 1003.1-2001. The Open Group Base Specifications Issue 6. The IEEE and The Open Group, 2001. Also available online at http://www.opengroup.org/.
[6] OSEK/VDX. OSEK/VDX Operating System Specification. Available online at http://www.osek-vdx.org/.
[7] IEEE Std 1003.13-1998. Standardized Application Environment Profile: POSIX Realtime Application Support (AEP). The IEEE, New York, 1998.
[8] ISO/IEC 9899:1999. Programming Languages: C. International Standards Organization, Geneva, 1999.
[9] Red Hat Inc. eCos User Guide. Available online at http://sources.redhat.com/ecos/.
[10] Arcturus Networks Inc. µClinux Documentation. Available online at http://www.uclinux.org/.
[11] FSMLabs, Inc. RTLinuxPro Frequently Asked Questions. Available online at http://www.rtlinux.org/.
[12] Politecnico di Milano, Dip. di Ingegneria Aerospaziale. The RTAI Manual. Available online at http://www.aero.polimi.it/~rtai/.
[13] M. Barabanov and V. Yodaiken. Real-Time Linux. Linux Journal, February 1997. Also available online at http://www.rtlinux.org/.
[14] On-Line Applications Research. RTEMS Documentation. Available online at http://www.rtems.com/.
[15] R.A. Meyer and L.H. Seawright. A Virtual Machine Time-Sharing System. IBM Systems Journal, 9, 199–218, 1970.
[16] International Business Machines Corp. z/VM General Information, GC24-5991-05. Also available online at http://www.ibm.com/.
[17] R. Goldberg. Architectural Principles for Virtual Computer Systems. Ph.D. thesis, Harvard University, 1972.
[18] R. Goldberg. Survey of Virtual Machine Research. IEEE Computer Magazine, 7, 34–45, 1974.
[19] Motorola Inc. M68000 Programmer's Reference Manual, M68000PM/AD, Rev. 1.
[20] VMware Inc. VMware GSX Server User's Manual. Also available online at http://www.vmware.com/.
[21] T. Garfinkel, B. Pfaff, J. Chow, M. Rosenblum, and D. Boneh. Terra: A Virtual Machine-Based Platform for Trusted Computing. In Proceedings of the 19th ACM Symposium on Operating Systems Principles, October 2003.
[22] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield. Xen and the Art of Virtualization. In Proceedings of the 19th ACM Symposium on Operating Systems Principles, October 2003.
[23] E. Bugnion, S. Devine, and M. Rosenblum. Disco: Running Commodity Operating Systems on Scalable Multiprocessors. In Proceedings of the 16th ACM Symposium on Operating Systems Principles, October 1997.
[24] D. Seal, Ed. ARM Architecture Reference Manual, 2nd ed. Addison-Wesley, Reading, MA, 2001.
12
Real-Time Operating Systems: The Scheduling and Resource Management Aspects

Giorgio C. Buttazzo
University of Pavia

12.1 Introduction
Achieving Predictability
12.2 Periodic Task Handling
Timeline Scheduling • Rate Monotonic Scheduling • Earliest Deadline First • Tasks with Deadlines Less than Periods
12.3 Aperiodic Task Handling
12.4 Protocols for Accessing Shared Resources
Priority Inheritance Protocol • Priority Ceiling Protocol • Schedulability Analysis
12.5 New Applications and Trends
12.6 Conclusions
References
12.1 Introduction

Often, people say that real-time systems must react fast to external events. Such a definition, however, is not precise, because processing speed does not provide any information on the actual capability of the system to react to events in a timely fashion. In fact, the effect of controller actions in a system can only be evaluated when considering the dynamic characteristics of the controlled environment. A more precise definition would say that a real-time system is a system in which performance depends not only on the correctness of the single controller actions, but also on the time at which actions are produced [1].

The main difference between a real-time task and a nonreal-time task is that a real-time task must complete within a given deadline. In other words, a deadline is the maximum time allowed for a computational process to finish its execution. In real-time applications, a result produced after its deadline is not only late, but can be dangerous. Depending on the consequences caused by a missed deadline, real-time activities can be classified into hard and soft tasks [2]. A real-time task is said to be hard if missing a deadline may have catastrophic consequences in the controlled system. A real-time task is said to be soft if missing a deadline
causes a performance degradation, but does not jeopardize correct system behavior. An operating system able to manage hard tasks is called a hard real-time system [3,4]. In general, hard real-time systems have to handle both hard and soft activities. In a control application, typical hard tasks include sensory data acquisition, detection of critical conditions, motor actuation, and action planning. Typical soft tasks include user command interpretation, keyboard input, message visualization, system status representation, and graphical activities.

The great interest in real-time systems is motivated by their growing diffusion in our society in several application fields, including chemical and nuclear power plants, flight control systems, traffic monitoring systems, telecommunication systems, automotive devices, industrial automation, military systems, space missions, and robotic systems. Despite this large application domain, most of today's real-time control systems are still designed using ad hoc techniques and heuristic approaches. Very often, control applications with stringent time constraints are implemented by writing large portions of code in assembly language, programming timers, writing low-level drivers for device handling, and manipulating task and interrupt priorities. Although the code produced by these techniques can be optimized to run very efficiently, this approach has several disadvantages.

First of all, the implementation of large and complex applications in assembly language is much more difficult and time consuming than high-level programming. Moreover, the efficiency of the code strongly depends on the programmer's ability. In addition, assembly code optimization makes a program more difficult to comprehend, complicating software maintenance. Finally, without the support of specific tools and methodologies for code and schedulability analysis, the verification of time constraints becomes practically impossible.
The major consequence of this state of affairs is that control software produced by empirical techniques can be highly unpredictable. If all critical time constraints cannot be verified a priori and the operating system does not include specific features for handling real-time tasks, the system apparently works well for a period of time, but may collapse in certain rare, but possible, situations. The consequences of a failure can sometimes be catastrophic and may injure people or cause serious damage to the environment. A trustworthy guarantee of system behavior under all possible operating conditions can only be achieved by adopting appropriate design methodologies and kernel mechanisms specifically developed for handling explicit timing constraints.
12.1.1 Achieving Predictability

The most important property of a real-time system is not high speed, but predictability. In a predictable system, we should be able to determine in advance whether all the computational activities can be completed within their timing constraints. The deterministic behavior of a system typically depends on several factors, ranging from the hardware architecture to the operating system, up to the programming language used to write the application.

Architectural features that have a major influence on task execution include interrupts, direct memory access (DMA), caches, and prefetching mechanisms. Although such features improve the average performance of the processor, they introduce nondeterministic behavior in process execution, prolonging the worst-case response times. Other factors that significantly affect task execution are due to the internal mechanisms used in the operating system, such as the scheduling algorithm, the synchronization mechanisms, the memory management policy, and the method used to handle I/O devices. The programming language also has an important impact on predictability, through the constructs it provides to handle the timing requirements specified for computational activities.
12.2 Periodic Task Handling

Periodic activities represent the major computational load in a real-time control system. For example, activities such as actuator regulation, signal acquisition, filtering, sensory data processing, action planning, and monitoring need to be executed with a frequency derived from the application requirements. A periodic task is characterized by an infinite sequence of instances, or jobs. Each job is characterized by a request time and a deadline. The request time r(k) of the kth job of a task represents the time at
which the task becomes ready for execution for the kth time. The interval of time between two consecutive request times is equal to the task period. The absolute deadline of the kth job, denoted with d(k), represents the time within which the job has to complete its execution, and r(k) < d(k) ≤ r(k + 1).
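These definitions translate directly into code. In the small sketch below, phase denotes the request time of the task's first job and rel_deadline the offset of each deadline from its request time; both names are conventions of the example, not notation used above.

```python
def request_time(phase, period, k):
    """r(k): request time of the kth job of a periodic task."""
    return phase + k * period

def absolute_deadline(phase, period, rel_deadline, k):
    """d(k) = r(k) + rel_deadline; the constraint
    r(k) < d(k) <= r(k+1) holds whenever 0 < rel_deadline <= period."""
    return request_time(phase, period, k) + rel_deadline
```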
12.2.1 Timeline Scheduling

Timeline Scheduling (TS), also known as a cyclic executive, is one of the most used approaches to handle periodic tasks in defense military systems and traffic control systems. The method involves dividing the temporal axis into slices of equal length, in which one or more tasks can be allocated for execution, in such a way as to respect the frequencies derived from the application requirements. A timer synchronizes the activation of the tasks at the beginning of each time slice.

In order to illustrate this method, consider the following example, in which three tasks, A, B, and C, need to be executed with a frequency of 40, 20, and 10 Hz, respectively. By analyzing the task periods, it is easy to verify that the optimal length for the time slice is 25 msec, which is the greatest common divisor of the periods. Hence, to meet the required frequencies, task A needs to be executed in every time slice, task B every two slices, and task C every four slices. A possible scheduling solution for this task set is illustrated in Figure 12.1.

The duration of the time slice is also called a minor cycle, whereas the minimum period after which the schedule repeats itself is called a major cycle. In general, the major cycle is equal to the least common multiple of all the periods (in the example it is equal to 100 msec). In order to guarantee a priori that a schedule is feasible on a particular processor, it is sufficient to know the task worst-case execution times and verify that the sum of the executions within each time slice is less than or equal to the minor cycle. In the example shown in Figure 12.1, if CA, CB, and CC denote the execution times of the tasks, it is sufficient to verify that

CA + CB ≤ 25 msec
CA + CC ≤ 25 msec

The major advantage of TS is its simplicity.
The method can be implemented by programming a timer to interrupt with a period equal to the minor cycle and by writing a main program that calls the tasks in the order given in the major cycle, inserting a time synchronization point at the beginning of each minor cycle. Since the task sequence is not decided by a scheduling algorithm in the kernel, but is triggered by the calls made by the main program, there are no context switches, so the runtime overhead is very low. Moreover, the sequence of tasks in the schedule is always the same, can be easily visualized, and is not affected by jitter (i.e., task start times and response times are not subject to large variations).

In spite of these advantages, TS has some problems. For example, it is very fragile during overload conditions. If a task does not terminate at the minor cycle boundary, we can either let it continue or abort it. In both cases, however, the system may enter a risky situation. In fact, if we leave the failing task in execution, it can cause a domino effect on the other tasks, breaking the entire schedule (timeline break). On the other hand, if the failing task is aborted, the system may be left in an inconsistent state, jeopardizing correct system behavior. Another big problem of the TS technique is its sensitivity to application changes. If updating a
FIGURE 12.1 Example of TS.
task requires an increase of its computation time or its activation frequency, the entire scheduling sequence may need to be reconstructed from scratch. Considering the previous example, if task B is updated to B′ and the code change is such that CA + CB′ > 25 msec, then we have to split B′ into two or more pieces to be allocated in the available intervals of the timeline. Changing the task frequencies may cause even more radical changes in the schedule. For example, if the frequency of task B changes from 20 to 25 Hz, the previous schedule is not valid any more, because the new minor cycle is equal to 5 msec and the new major cycle is equal to 200 msec. Finally, another limitation of TS is that it is difficult to handle aperiodic activities efficiently without changing the task sequence. The problems outlined above can be solved by using priority-based scheduling algorithms.
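To make the construction concrete, the Python sketch below (illustrative only, not code from this chapter) derives the minor cycle as the greatest common divisor of the periods and the major cycle as their least common multiple, then checks that the worst-case execution times allocated to each slice fit within the minor cycle. For simplicity it releases every task at each multiple of its period (no offsets), whereas a designer may stagger tasks across slices, as in Figure 12.1, to balance the load.

```python
from math import gcd
from functools import reduce

def lcm(a, b):
    return a * b // gcd(a, b)

def timeline_schedule(periods, wcets):
    """Build a timeline schedule: minor cycle = gcd of the periods,
    major cycle = lcm of the periods. Returns (minor, major, per-slice
    task lists), or None if some slice overflows the minor cycle."""
    minor = reduce(gcd, periods)
    major = reduce(lcm, periods)
    slices = []
    for s in range(major // minor):
        t = s * minor
        # a task is allocated to every slice starting at a multiple of its period
        ready = [i for i, T in enumerate(periods) if t % T == 0]
        if sum(wcets[i] for i in ready) > minor:
            return None  # timeline break: the slice is overloaded
        slices.append(ready)
    return minor, major, slices

# Tasks A, B, C at 40, 20, 10 Hz -> periods 25, 50, 100 msec
print(timeline_schedule([25, 50, 100], [5, 8, 12]))
# → (25, 100, [[0, 1, 2], [0], [0, 1], [0]])
print(timeline_schedule([25, 50, 100], [10, 10, 10]))  # first slice overflows
# → None
```

With the larger execution times the first slice would require 30 msec of work in a 25 msec slice, so the check fails, illustrating why each slice must be verified against the minor cycle.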
12.2.2 Rate Monotonic Scheduling

The rate monotonic (RM) algorithm assigns each task a priority directly proportional to its activation frequency, so that tasks with shorter periods have higher priority. Since the period is usually kept constant for a task, the RM algorithm implements a static priority assignment, in the sense that task priorities are decided at task creation and remain unchanged for the entire application run. RM is typically preemptive, although it can also be used in a non-preemptive mode. In 1973, Liu and Layland [5] showed that RM is optimal among all static scheduling algorithms, in the sense that if a task set is not schedulable by RM, then the task set cannot be feasibly scheduled by any other fixed priority assignment. Another important result proved by the same authors is that a set Γ = {τ1, . . . , τn} of n periodic tasks is schedulable by RM if

∑_{i=1}^{n} Ci/Ti ≤ n(2^{1/n} − 1)    (12.1)
where Ci and Ti represent the worst-case computation time and the period of task i, respectively. The quantity

U = ∑_{i=1}^{n} Ci/Ti

represents the processor utilization factor and denotes the fraction of time used by the processor to execute the entire task set. Table 12.1 shows the values of n(2^{1/n} − 1) for n from 1 to 10. As can be seen, the factor decreases with n and, for large n, it tends to the following limit value:

lim_{n→∞} n(2^{1/n} − 1) = ln 2 ≈ 0.69
TABLE 12.1 Maximum Processor Utilization for the RM Algorithm

n     Ulub
1     1.000
2     0.828
3     0.780
4     0.757
5     0.743
6     0.735
7     0.729
8     0.724
9     0.721
10    0.718
We note that the test by Liu and Layland only gives a sufficient condition for guaranteeing a feasible schedule under the RM algorithm. Hence, a task set can be schedulable by RM even though the utilization condition is not satisfied. Nevertheless, we can certainly state that a periodic task set cannot be feasibly scheduled by any algorithm if U > 1. A statistical study carried out by Lehoczky et al. [6] on randomly generated task sets showed that the utilization bound of the RM algorithm has an average value of 0.88, and becomes 1 for periodic tasks with harmonic period relations. Necessary and sufficient schedulability tests for RM have been proposed [6,10,11,29], but they have pseudo-polynomial complexity. Recently, Bini and Buttazzo derived a sufficient polynomial-time test, the Hyperbolic Bound [28], capable of accepting more task sets than the Liu and Layland test.

In spite of the limitation on the schedulability bound, which in most cases prevents full processor utilization, the RM algorithm is widely used in real-time applications, mainly for its simplicity. Moreover, being a static scheduling algorithm, it can be easily implemented on top of commercial operating systems, using a set of fixed priority levels. In overload conditions, the highest priority tasks are the least prone to missing their deadlines. For all these reasons, the Software Engineering Institute in Pittsburgh has prepared a user guide for the design and analysis of real-time systems based on the RM algorithm [7]. Since the RM algorithm is optimal among all fixed priority assignments, the schedulability bound can only be improved through a dynamic priority assignment.
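As an illustration (again a sketch, not code from the chapter), the following Python functions implement both the Liu and Layland utilization test (12.1) and the Hyperbolic Bound of Bini and Buttazzo, which accepts a task set if the product of the terms (Ci/Ti + 1) does not exceed 2:

```python
def rm_utilization_test(tasks):
    """Sufficient Liu-Layland test: sum(Ci/Ti) <= n(2^(1/n) - 1).
    tasks: list of (Ci, Ti) pairs."""
    n = len(tasks)
    U = sum(C / T for C, T in tasks)
    return U <= n * (2 ** (1 / n) - 1)

def hyperbolic_bound_test(tasks):
    """Sufficient Bini-Buttazzo test: prod(Ci/Ti + 1) <= 2."""
    p = 1.0
    for C, T in tasks:
        p *= C / T + 1
    return p <= 2.0

print(rm_utilization_test([(2, 4), (2, 6)]))   # → False (U ≈ 0.833 > 0.828)
print(hyperbolic_bound_test([(2, 4), (2, 6)])) # → True  (1.5 × 1.333… ≤ 2)
```

The second task set shows why the hyperbolic test is less pessimistic: its utilization 0.833 exceeds the Liu-Layland bound 0.828 for two tasks, yet the hyperbolic condition still holds.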
12.2.3 Earliest Deadline First

The earliest deadline first (EDF) algorithm selects, among the ready tasks, the task with the earliest absolute deadline. The EDF algorithm is typically preemptive, in the sense that a newly arrived task can preempt the running task if its absolute deadline is earlier. If the operating system does not support explicit timing constraints, EDF (like RM) can be implemented on a priority-based kernel, where priorities are dynamically assigned to tasks: a task receives the highest priority if its deadline is the earliest among those of the ready tasks, and the lowest priority if its deadline is the latest; that is, each task gets a priority inversely proportional to its absolute deadline. The EDF algorithm is more general than RM, since it can be used to schedule both periodic and aperiodic task sets, because the selection of a task is based on the value of its absolute deadline, which can be defined for both types of tasks. Typically, a periodic task that has completed its execution is suspended by the kernel until its next release, coincident with the end of the current period. Dertouzos [8] showed that EDF is optimal among all online algorithms, while Liu and Layland [5] proved that a set Γ = {τ1, . . . , τn} of n periodic tasks is schedulable by EDF if and only if

∑_{i=1}^{n} Ci/Ti ≤ 1
It is worth noting that the EDF schedulability condition is necessary and sufficient to guarantee a feasible schedule. This means that, if it is not satisfied, no algorithm is able to produce a feasible schedule for that task set. The dynamic priority assignment allows EDF to exploit the full processor capacity, reaching up to 100% utilization. When the task set has a utilization factor less than one, the residual fraction of time can be efficiently used to handle aperiodic requests activated by external events. In addition, compared with RM, EDF generates a lower number of context switches, thus causing less runtime overhead. On the other hand, RM is simpler to implement on a fixed priority kernel and is more predictable in overload situations, because higher priority tasks are less likely to miss their deadlines.
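A minimal tick-by-tick simulation (an illustrative Python sketch with integer parameters, not the chapter's algorithm) shows preemptive EDF in action over one hyperperiod, with deadlines equal to periods:

```python
def edf_schedulable(tasks):
    """Exact EDF test for implicit deadlines: total utilization <= 1."""
    return sum(C / T for C, T in tasks) <= 1

def edf_simulate(tasks, horizon):
    """Preemptive EDF simulation over [0, horizon), one time unit per tick.
    tasks: list of (C, T) with deadline = period. Returns True if no job
    misses its deadline (simulate over the hyperperiod for a full check)."""
    rem = [0] * len(tasks)   # remaining work of the current job of each task
    dl = [0] * len(tasks)    # absolute deadline of the current job
    for t in range(horizon):
        for i, (C, T) in enumerate(tasks):
            if t % T == 0:          # new job released
                if rem[i] > 0:      # previous job missed its deadline
                    return False
                rem[i], dl[i] = C, t + T
        ready = [i for i in range(len(tasks)) if rem[i] > 0]
        if ready:
            j = min(ready, key=lambda i: dl[i])  # earliest deadline first
            rem[j] -= 1
    return all(r == 0 for r in rem)

print(edf_simulate([(1, 4), (3, 6)], 12))  # U = 0.75 → True
print(edf_simulate([(2, 4), (4, 6)], 12))  # U ≈ 1.17 → False
```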
12.2.4 Tasks with Deadlines Less than Periods

Using RM or EDF, a periodic task can be executed at any time during its period. The only guarantee provided by the schedulability test is that each task will be able to complete its execution before the next
release time. In some real-time applications, however, some periodic tasks need to complete within an interval less than their period. The deadline monotonic (DM) algorithm, proposed by Leung and Whitehead [9], extends RM to handle tasks with a relative deadline less than or equal to their period. According to DM, at each instant the processor is assigned to the task with the shortest relative deadline. In priority-based kernels, this is equivalent to assigning each task a priority Pi inversely proportional to its relative deadline. With Di fixed for each task, DM is classified as a static scheduling algorithm. In recent years, several authors [6,10,11] independently proposed a necessary and sufficient test to verify the schedulability of a periodic task set. For example, the method proposed by Audsley et al. [10] involves computing the worst-case response time Ri of each periodic task, derived by summing its computation time and the interference caused by tasks with higher priority:

Ri = Ci + ∑_{k∈hp(i)} ⌈Ri/Tk⌉ Ck    (12.2)
where hp(i) denotes the set of tasks having priority higher than task i and ⌈x⌉ denotes the ceiling of a rational number, that is, the smallest integer greater than or equal to x. The equation above can be solved by an iterative approach, starting with Ri(0) = Ci and terminating when Ri(s) = Ri(s − 1). If Ri(s) > Di for some task, then the task set cannot be feasibly scheduled by DM. Under EDF, the schedulability analysis for periodic task sets with deadlines less than periods is based on the processor demand criterion, proposed by Baruah et al. [12]. According to this method, a task set is schedulable by EDF if and only if, in every interval of length L (starting at time 0), the overall computational demand is no greater than the available processing time, that is, if and only if

∀L > 0:  ∑_{i=1}^{n} ⌊(L + Ti − Di)/Ti⌋ Ci ≤ L    (12.3)
This test is computationally feasible because L needs to be checked only for values equal to task deadlines no larger than the least common multiple of the periods. A detailed analysis of EDF under several workload conditions has been presented by Stankovic, Ramamritham, Spuri, and Buttazzo [30].
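The iteration for Equation (12.2) can be sketched as follows (illustrative Python; tasks are assumed sorted by decreasing DM priority, so hp(i) is simply the prefix of the list):

```python
from math import ceil

def response_time(tasks, i):
    """Iterative solution of Ri = Ci + sum_{k in hp(i)} ceil(Ri/Tk) * Ck.
    tasks: list of (C, T, D) in decreasing priority order; hp(i) = tasks[:i].
    Returns the fixed point Ri, or None if Ri exceeds the deadline Di."""
    C, T, D = tasks[i]
    R = C                                  # Ri(0) = Ci
    while True:
        R_next = C + sum(ceil(R / Tk) * Ck for Ck, Tk, _ in tasks[:i])
        if R_next == R:                    # Ri(s) = Ri(s-1): converged
            return R
        if R_next > D:                     # deadline exceeded
            return None
        R = R_next

def dm_schedulable(tasks):
    return all(response_time(tasks, i) is not None for i in range(len(tasks)))

print(response_time([(1, 4, 4), (3, 6, 6)], 1))  # → 4 (≤ D2 = 6, schedulable)
```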
12.3 Aperiodic Task Handling

Although in a real-time system most acquisition and control tasks are periodic, there exist computational activities that must be executed only at the occurrence of external events (typically signaled through interrupts), which may arrive at irregular times. When the system must handle aperiodic requests of computation, we have to balance two conflicting interests: on the one hand, we would like to serve an event as soon as possible to improve system responsiveness; on the other hand, we do not want to jeopardize the schedulability of periodic tasks. If aperiodic activities are less critical than periodic tasks, then the objective of a scheduling algorithm should be to minimize their response time, while guaranteeing that all periodic tasks (although being delayed by the aperiodic service) complete their executions within their deadlines. If some aperiodic task has a hard deadline, we should try to guarantee its timely completion offline. Such a guarantee can only be provided by assuming that aperiodic requests, although arriving at irregular intervals, do not exceed a given maximum frequency, that is, they are separated by a minimum interarrival time. An aperiodic task characterized by a minimum interarrival time is called a sporadic task.

Let us consider an example in which an aperiodic job Ja of 3 units of time must be scheduled by RM along with two periodic tasks, having computation times C1 = 1, C2 = 3 and periods T1 = 4, T2 = 6, respectively. As shown in Figure 12.2, if the aperiodic request is serviced immediately (i.e., with a priority higher than that assigned to periodic tasks), then task τ2 will miss its deadline.

The simplest technique for managing aperiodic activities while preserving the guarantee for periodic tasks is to schedule them in background. This means that an aperiodic task executes only when the
FIGURE 12.2 Immediate service of an aperiodic task. Periodic tasks are scheduled by RM.
FIGURE 12.3 Background service of an aperiodic task. Periodic tasks are scheduled by RM.
processor is not busy with periodic tasks. The disadvantage of this solution is that, if the computational load due to periodic tasks is high, the residual time left for aperiodic execution can be insufficient for satisfying their deadlines. Considering the same task set as before, Figure 12.3 illustrates how job Ja is handled by a background service.

The response time of aperiodic tasks can be improved by handling them through a periodic server dedicated to their execution. Like any other periodic task, a server is characterized by a period Ts and an execution time Cs, called the server capacity (or budget). In general, the server is scheduled using the algorithm adopted for periodic tasks and, once activated, it starts serving the pending aperiodic requests within the limit of its current capacity. The order of service of the aperiodic requests is independent of the scheduling algorithm used for the periodic tasks, and it can be a function of the arrival time, computation time, or deadline. Over the years, several aperiodic service algorithms have been proposed in the real-time literature, differing in performance and complexity. Among the fixed priority algorithms we mention the Polling Server, the Deferrable Server [13,14], the Sporadic Server [15], and the Slack Stealer [16]. Among the servers using dynamic priorities (which are more efficient on average), we recall the Dynamic Sporadic Server [17,18], the Total Bandwidth Server [19], the Tunable Bandwidth Server [20], and the Constant Bandwidth Server [21]. In order to clarify the idea behind an aperiodic server, Figure 12.4 illustrates the schedule produced, under EDF, by a Dynamic Deferrable Server with capacity Cs = 1 and period Ts = 4. We note that, when the absolute deadline of the server is equal to the
FIGURE 12.4 Aperiodic service performed by a Dynamic Deferrable Server. Periodic tasks, including the server, are scheduled by EDF. Cs is the remaining budget available for Ja .
FIGURE 12.5 Optimal aperiodic service under EDF.
one of a periodic task, priority is given to the server in order to enhance aperiodic responsiveness. We also observe that the same task set would not be schedulable under a fixed priority system.

Although the response time achieved by a server is shorter than that achieved through the background service, it is not the minimum possible. The minimum response time can be obtained with an optimal server (TB∗) that assigns each aperiodic request the earliest possible deadline that still produces a feasible EDF schedule [20]. The schedule generated by the optimal TB∗ algorithm is illustrated in Figure 12.5, where the minimum response time for job Ja is equal to 5 units of time (obtained by assigning the job a deadline da = 7). As with all efficient solutions, the better performance is achieved at the price of a larger runtime overhead (due to the complexity of computing the minimum deadline). However, adopting a variant of the algorithm, called the Tunable Bandwidth Server [20], overhead cost and performance can be balanced in order to select the best service method for a given real-time system. An overview of the
FIGURE 12.6 Example of priority inversion.
most common aperiodic service algorithms (both under fixed and dynamic priorities) can be found in Reference 3.
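As an example of how simple a dynamic-priority server can be, the sketch below implements the deadline assignment rule of the Total Bandwidth Server [19] (illustrative Python, not code from the chapter): the kth request, arriving at time rk with execution time Ck, receives the absolute deadline dk = max(rk, dk−1) + Ck/Us, where Us is the bandwidth reserved for the server; the periodic tasks remain schedulable under EDF as long as Up + Us ≤ 1.

```python
class TotalBandwidthServer:
    """Total Bandwidth Server: assigns each aperiodic request a deadline
    d_k = max(r_k, d_{k-1}) + C_k / Us, then lets EDF schedule the request
    along with the periodic tasks."""
    def __init__(self, Us):
        self.Us = Us               # fraction of the processor reserved
        self.last_deadline = 0.0   # d_{k-1}, the previously assigned deadline

    def assign_deadline(self, arrival, wcet):
        d = max(arrival, self.last_deadline) + wcet / self.Us
        self.last_deadline = d
        return d

tbs = TotalBandwidthServer(Us=0.25)   # e.g., periodic load Up = 0.75
print(tbs.assign_deadline(2, 1))      # request at t=2 with C=1 → 6.0
print(tbs.assign_deadline(3, 0.5))    # back-to-back request    → 8.0
```

The second request arrives before the first deadline has expired, so its deadline is pushed after d1, which is exactly what keeps the aperiodic demand within the reserved bandwidth Us.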
12.4 Protocols for Accessing Shared Resources

When two or more tasks interact through shared resources (e.g., shared memory buffers), the direct use of classical synchronization mechanisms, such as semaphores or monitors, can cause a phenomenon known as priority inversion: a high priority task can be blocked by a low priority task for an unbounded interval of time. Such a blocking condition can create serious problems in safety-critical real-time systems, since it can cause deadlines to be missed.

For example, consider three tasks, τ1, τ2, and τ3, having decreasing priority (τ1 is the task with the highest priority), and assume that τ1 and τ3 share a data structure protected by a binary semaphore S. As shown in Figure 12.6, suppose that at time t1 task τ3 enters its critical section, holding semaphore S. During the execution of τ3, at time t2, assume τ1 becomes ready and preempts τ3. At time t3, when τ1 tries to access the shared resource, it is blocked on semaphore S, since the resource is used by τ3. Since τ1 is the highest priority task, we would expect it to be blocked for an interval no longer than the time needed by τ3 to complete its critical section. Unfortunately, however, the maximum blocking time for τ1 can become much larger. In fact, task τ3, while holding the resource, can be preempted by medium priority tasks (such as τ2), which will prolong the blocking interval of τ1 for their entire execution!

The situation illustrated in Figure 12.6 can be avoided by simply preventing preemption inside critical sections. This solution, however, is appropriate only for very short critical sections, because it could cause unnecessary delays for high priority tasks. For example, a low priority task inside a long critical section would prevent the execution of a high priority task, even though they do not share any resource.
A more efficient solution is to regulate access to shared resources through specific concurrency control protocols [22], designed to limit the priority inversion phenomenon.
12.4.1 Priority Inheritance Protocol

An elegant solution to the priority inversion phenomenon caused by mutual exclusion is offered by the priority inheritance protocol (PIP) [23]. Here, the problem is solved by dynamically modifying the priorities of the tasks that cause a blocking condition. In particular, when a task τa blocks on a shared resource, it transmits its priority to the task τb that is holding the resource. In this way, τb will execute its critical
FIGURE 12.7 Schedule produced using Priority Inheritance on the task set of Figure 12.6.
section with the priority of task τa. In general, τb inherits the highest priority among the tasks it blocks. Moreover, priority inheritance is transitive: if task τc blocks τb, which in turn blocks τa, then τc will inherit the priority of τa through τb.

Figure 12.7 illustrates how the schedule shown in Figure 12.6 changes when resources are accessed using the PIP. Until time t3 the system evolution is the same as the one shown in Figure 12.6. At time t3, the high priority task τ1 blocks after attempting to access the resource held by τ3 (direct blocking). In this case, however, the protocol imposes that τ3 inherit the maximum priority among the tasks blocked on that resource, thus it continues the execution of its critical section at the priority of τ1. Under these conditions, at time t4, task τ2 is not able to preempt τ3, hence it blocks until the resource is released (push-through blocking). In other words, although τ2 has a nominal priority greater than that of τ3, it cannot execute, because τ3 inherited the priority of τ1. At time t5, τ3 exits its critical section, releases the semaphore, and recovers its nominal priority. As a consequence, τ1 can proceed until its completion, which occurs at time t6. Only then can τ2 start executing.

The PIP has the following property [23]: given a task τ, if n is the number of lower priority tasks sharing a resource with a task of priority higher than or equal to that of τ, and m is the number of semaphores that could block τ, then τ can be blocked for at most the duration of min(n, m) critical sections. Although the PIP limits the priority inversion phenomenon, the maximum blocking time for high priority tasks can still be significant, due to possible chained blocking conditions. Moreover, deadlock can occur if semaphores are not properly used in nested critical sections.
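The inheritance bookkeeping can be sketched as follows (illustrative Python; a single lock, no nesting, and therefore no transitive inheritance — restoring the nominal priority on unlock is also a simplification of the full protocol):

```python
class PIPMutex:
    """Sketch of a lock with priority inheritance: when a task blocks,
    the holder inherits the blocked task's priority if it is higher.
    Tasks are dicts with 'prio' (current) and 'base' (nominal) priority."""
    def __init__(self):
        self.holder = None
        self.waiters = []

    def lock(self, task):
        if self.holder is None:
            self.holder = task
            return True                      # acquired immediately
        self.waiters.append(task)
        # the holder inherits the highest priority among the blocked tasks
        self.holder['prio'] = max(self.holder['prio'], task['prio'])
        return False                         # caller must block

    def unlock(self):
        # assumes the caller currently holds the lock
        self.holder['prio'] = self.holder['base']   # recover nominal priority
        if self.waiters:
            # hand the lock to the highest-priority waiter
            self.waiters.sort(key=lambda t: t['prio'], reverse=True)
            self.holder = self.waiters.pop(0)
        else:
            self.holder = None
        return self.holder

t3 = {'base': 1, 'prio': 1}   # low priority task holding the resource
t1 = {'base': 3, 'prio': 3}   # high priority task that will block
m = PIPMutex()
m.lock(t3)
m.lock(t1)                    # t1 blocks; t3 inherits priority 3
print(t3['prio'])             # → 3
m.unlock()                    # t3 recovers priority 1, lock passes to t1
print(t3['prio'])             # → 1
```

In the scenario of Figure 12.7, the inherited priority 3 is what prevents a medium priority task from preempting the holder inside its critical section.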
12.4.2 Priority Ceiling Protocol

The priority ceiling protocol (PCP) [23] provides a better solution to the priority inversion phenomenon, also avoiding chained blocking and deadlock conditions. The basic idea behind this protocol is to ensure that, whenever a task τ enters a critical section, its priority is the highest among those that can be inherited from all the lower priority tasks that are currently suspended in a critical section. If this condition is not satisfied, τ is blocked and the task that is blocking τ inherits τ's priority. This idea is implemented by assigning each semaphore a priority ceiling equal to the highest priority of the tasks using that semaphore. Then, a task τ is allowed to enter a critical section only if its priority is strictly greater than all priority
ceilings of the semaphores held by the other tasks. As with the PIP, the inheritance mechanism is transitive. The PCP, besides avoiding chained blocking and deadlocks, has the property that each task can be blocked for at most the duration of a single critical section.
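The PCP admission rule can be expressed in a few lines (illustrative Python; the ceilings and the set of semaphores locked by other tasks are passed in explicitly):

```python
def pcp_can_enter(task_prio, locked_by_others, ceilings):
    """PCP admission rule: a task may enter a critical section only if its
    priority is strictly greater than the ceilings of all semaphores
    currently locked by other tasks.
    locked_by_others: semaphores held by tasks other than the requester.
    ceilings[s]: highest priority among the tasks using semaphore s."""
    return all(task_prio > ceilings[s] for s in locked_by_others)

# S1 is used by tasks of priority 3 and 1 -> ceiling(S1) = 3
ceilings = {'S1': 3, 'S2': 2}
print(pcp_can_enter(2, {'S1'}, ceilings))  # → False: 2 is not > ceiling(S1) = 3
print(pcp_can_enter(3, set(), ceilings))   # → True: nothing locked by others
```

Note that the rule looks at the ceilings of semaphores held by other tasks, not at the semaphore being requested; this is precisely what rules out chained blocking and deadlock.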
12.4.3 Schedulability Analysis

The importance of the protocols for accessing shared resources in a real-time system derives from the fact that they can bound the maximum blocking time experienced by a task. This is essential for analyzing the schedulability of a set of real-time tasks interacting through shared buffers or any other non-preemptable resource, for example, a communication port or bus. To verify the schedulability of task τi using the processor utilization approach, we need to consider the utilization factor of task τi, the interference caused by the higher priority tasks, and the blocking time caused by lower priority tasks. If Bi is the maximum blocking time that can be experienced by task τi, then the sum of the utilization factors due to these three causes cannot exceed the least upper bound of the scheduling algorithm, that is:

∀i, 1 ≤ i ≤ n:  Ci/Ti + ∑_{k∈hp(i)} Ck/Tk + Bi/Ti ≤ i(2^{1/i} − 1)    (12.4)

where hp(i) denotes the set of tasks with priority higher than τi. The same test is valid for both the protocols described above, the only difference being the amount of blocking that each task may experience.
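Condition (12.4) can be checked per task as in the sketch below (illustrative Python; tasks are listed in decreasing priority order, each with its worst-case blocking time Bi, so hp(i) is the prefix of the list):

```python
def rm_test_with_blocking(tasks):
    """Check condition (12.4) for each task: its own utilization, plus the
    interference of higher-priority tasks, plus its blocking time Bi,
    must stay within the RM bound i(2^(1/i) - 1).
    tasks: list of (Ci, Ti, Bi) in decreasing priority order."""
    for i, (C, T, B) in enumerate(tasks, start=1):
        interference = sum(Ck / Tk for Ck, Tk, _ in tasks[:i - 1])
        if C / T + interference + B / T > i * (2 ** (1 / i) - 1):
            return False
    return True

# Task 1 may be blocked for 1 time unit by the lower priority task
print(rm_test_with_blocking([(1, 4, 1), (3, 9, 0)]))  # → True
print(rm_test_with_blocking([(2, 4, 2), (3, 9, 0)]))  # → False
```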
12.5 New Applications and Trends

In recent years, real-time system technology has been applied to several application domains where computational activities have less stringent timing constraints and occasional deadline misses are typically tolerated. Examples of such systems include monitoring systems, multimedia systems, flight simulators, and, in general, virtual reality games. In such applications, missing a deadline does not cause catastrophic effects on the system, but just a performance degradation. Hence, instead of requiring an absolute guarantee for the feasibility of the schedule, such systems demand an acceptable quality of service (QoS). It is worth observing that, since some timing constraints need to be handled anyway (although not critical ones), a non-real-time operating system, such as Linux or Windows, is not appropriate: first of all, such systems do not provide temporal isolation among tasks, thus a sporadic peak load on a task may negatively affect the execution of other tasks in the system. Furthermore, the lack of concurrency control mechanisms that prevent priority inversion makes these systems unsuitable for guaranteeing a desired QoS level. On the other hand, a hard real-time approach is also not well suited for supporting such applications, because resources would be wasted due to static allocation mechanisms and pessimistic design assumptions. Moreover, in many multimedia applications, tasks are characterized by highly variable execution times (consider, for instance, an MPEG player), thus providing precise estimations of task computation times is practically impossible, unless one uses overly pessimistic figures.

In order to provide efficient as well as predictable support for this type of real-time application, several new approaches and scheduling methodologies have been proposed. They increase the flexibility and the adaptability of a system to online variations.
For example, temporal protection mechanisms have been proposed to isolate task overruns and reduce reciprocal task interference [21,24]. Statistical analysis techniques have been introduced to provide a probabilistic guarantee aimed at improving system efficiency [21]. Other techniques have been devised to handle transient and permanent overload conditions in a controlled fashion, thus increasing the average computational load in the system. One method absorbs the overload by regularly aborting some jobs of a periodic task, without exceeding a maximum limit specified by the user through a QoS parameter describing the minimum number of jobs between two consecutive abortions [25,26]. Another technique handles overloads through a suitable variation of periods, managed to decrease the processor utilization up to a desired level [27].
12.6 Conclusions

This chapter surveyed some kernel methodologies aimed at enhancing the efficiency and the predictability of real-time control applications. In particular, it presented some scheduling algorithms and analysis techniques for periodic and aperiodic task sets. Two concurrency control protocols have been described for accessing shared resources in mutual exclusion while avoiding the priority inversion phenomenon. Each technique has the property of being analyzable, so that an offline guarantee can be provided for the feasibility of the schedule within the timing constraints imposed by the application. For soft real-time systems, such as multimedia systems or simulators, the hard real-time approach can be too rigid and inefficient, especially when the application tasks have highly variable computation times. In these cases, novel methodologies have been introduced to improve average resource exploitation. They are also able to guarantee a desired QoS level and control performance degradation during overload conditions.

In addition to research efforts aimed at providing solutions to more complex problems, a concrete increase in the reliability of future real-time systems can only be achieved if the mature methodologies are actually integrated into next-generation operating systems and languages, defining new standards for the development of real-time applications. At the same time, programmers and software engineers need to be educated about the appropriate use of the available technologies.
References

[1] J. Stankovic, Misconceptions About Real-Time Computing: A Serious Problem for Next-Generation Systems. IEEE Computer, 21(10), 10–19, 1988.
[2] J. Stankovic and K. Ramamritham, Tutorial on Hard Real-Time Systems, IEEE Computer Society Press, Washington, 1988.
[3] G.C. Buttazzo, Hard Real-Time Computing Systems: Predictable Scheduling Algorithms and Applications, Kluwer Academic Publishers, Boston, MA, 1997.
[4] J. Stankovic, M. Spuri, M. Di Natale, and G. Buttazzo, Implications of Classical Scheduling Results for Real-Time Systems. IEEE Computer, 28, 16–25, 1995.
[5] C.L. Liu and J.W. Layland, Scheduling Algorithms for Multiprogramming in a Hard Real-Time Environment. Journal of the ACM, 20, 40–61, 1973.
[6] J.P. Lehoczky, L. Sha, and Y. Ding, The Rate-Monotonic Scheduling Algorithm: Exact Characterization and Average Case Behaviour. In Proceedings of the IEEE Real-Time Systems Symposium, pp. 166–171, 1989.
[7] M.H. Klein et al., A Practitioner's Handbook for Real-Time Analysis: Guide to Rate Monotonic Analysis for Real-Time Systems, Kluwer Academic Publishers, Boston, MA, 1993.
[8] M.L. Dertouzos, Control Robotics: The Procedural Control of Physical Processes. Information Processing, Vol. 74, North-Holland, Amsterdam, 1974.
[9] J. Leung and J. Whitehead, On the Complexity of Fixed Priority Scheduling of Periodic Real-Time Tasks. Performance Evaluation, 2, 237–250, 1982.
[10] N.C. Audsley, A. Burns, M. Richardson, K. Tindell, and A. Wellings, Applying New Scheduling Theory to Static Priority Preemptive Scheduling. Software Engineering Journal, 8, 284–292, 1993.
[11] M. Joseph and P. Pandya, Finding Response Times in a Real-Time System. The Computer Journal, 29, 390–395, 1986.
[12] S.K. Baruah, R.R. Howell, and L.E. Rosier, Algorithms and Complexity Concerning the Preemptive Scheduling of Periodic Real-Time Tasks on One Processor. Real-Time Systems, 2, 301–324, 1990.
[13] J.P. Lehoczky, L. Sha, and J.K. Strosnider, Enhanced Aperiodic Responsiveness in Hard Real-Time Environments. In Proceedings of the IEEE Real-Time Systems Symposium, pp. 261–270, 1987.
[14] J.K. Strosnider, J.P. Lehoczky, and L. Sha, The Deferrable Server Algorithm for Enhanced Aperiodic Responsiveness in Hard Real-Time Environments. IEEE Transactions on Computers, 44, 1995.
[15] B. Sprunt, L. Sha, and J. Lehoczky, Aperiodic Task Scheduling for Hard Real-Time Systems. Journal of Real-Time Systems, 1, 27–60, 1989.
[16] J.P. Lehoczky and S. Ramos-Thuel, An Optimal Algorithm for Scheduling Soft-Aperiodic Tasks in Fixed-Priority Preemptive Systems. In Proceedings of the IEEE Real-Time Systems Symposium, 1992.
[17] T.M. Ghazalie and T.P. Baker, Aperiodic Servers in a Deadline Scheduling Environment. Real-Time Systems, 9(1), 31–67, 1995.
[18] M. Spuri and G.C. Buttazzo, Efficient Aperiodic Service under Earliest Deadline Scheduling. In Proceedings of the IEEE Real-Time Systems Symposium, San Juan, PR, December 1994.
[19] M. Spuri and G. Buttazzo, Scheduling Aperiodic Tasks in Dynamic Priority Systems. Real-Time Systems, 10(2), 179–210, 1996.
[20] G. Buttazzo and F. Sensini, Optimal Deadline Assignment for Scheduling Soft Aperiodic Tasks in Hard Real-Time Environments. IEEE Transactions on Computers, 48(10), 1035–1052, 1999.
[21] L. Abeni and G. Buttazzo, Integrating Multimedia Applications in Hard Real-Time Systems. In Proceedings of the IEEE Real-Time Systems Symposium, Madrid, Spain, December 1998.
[22] R. Rajkumar, Synchronization in Real-Time Systems: A Priority Inheritance Approach, Kluwer Academic Publishers, Boston, MA, 1991.
[23] L. Sha, R. Rajkumar, and J.P. Lehoczky, Priority Inheritance Protocols: An Approach to Real-Time Synchronization. IEEE Transactions on Computers, 39, 1175–1185, 1990.
[24] I. Stoica, H. Abdel-Wahab, K. Jeffay, S. Baruah, J.E. Gehrke, and G.C. Plaxton, A Proportional Share Resource Allocation Algorithm for Real-Time, Time-Shared Systems. In Proceedings of the IEEE Real-Time Systems Symposium, December 1996.
[25] G. Buttazzo and M. Caccamo, Minimizing Aperiodic Response Times in a Firm Real-Time Environment. IEEE Transactions on Software Engineering, 25, 22–32, 1999.
[26] G. Koren and D. Shasha, Skip-Over: Algorithms and Complexity for Overloaded Systems that Allow Skips. In Proceedings of the IEEE Real-Time Systems Symposium, 1995.
[27] G. Buttazzo, G. Lipari, M. Caccamo, and L. Abeni, Elastic Scheduling for Flexible Workload Management. IEEE Transactions on Computers, 51, 289–302, 2002.
[28] E. Bini, G.C. Buttazzo, and G.M. Buttazzo, A Hyperbolic Bound for the Rate Monotonic Algorithm. In Proceedings of the 13th Euromicro Conference on Real-Time Systems, Delft, The Netherlands, pp. 59–66, June 2001.
[29] E. Bini and G.C. Buttazzo, The Space of Rate Monotonic Schedulability. In Proceedings of the 23rd IEEE Real-Time Systems Symposium, Austin, TX, December 2002.
[30] J. Stankovic, K. Ramamritham, M. Spuri, and G. Buttazzo, Deadline Scheduling for Real-Time Systems, Kluwer Academic Publishers, Boston, MA, 1998.
© 2006 by Taylor & Francis Group, LLC
13 Quasi-Static Scheduling of Concurrent Specifications

Alex Kondratyev (Cadence Berkeley Laboratories and Politecnico di Torino)
Luciano Lavagno (Politecnico di Torino)
Claudio Passerone (Politecnico di Torino)
Yosinori Watanabe (Cadence Berkeley Laboratories)

13.1 Introduction (13-1): Quasi-Static Scheduling • A Simple Example
13.2 Overview of Related Work (13-4)
13.3 QSS for PNs (13-5): Definitions • Specification Model • Schedulability Analysis • Algorithmic Implementation
13.4 QSS for Boolean Dataflow (13-10): Definitions • Schedulability Analysis • Comparison to PN Model
13.5 Conclusions (13-14)
References (13-14)
13.1 Introduction

13.1.1 Quasi-Static Scheduling

The phenomenal growth of complexity and breadth of use of embedded systems can be managed only by raising the level of abstraction at which design activities start and most design space exploration occurs. This enables greater reuse potential, but requires significant tool support for efficient analysis, mapping, and synthesis. In this chapter we deal with techniques aimed at providing designers with efficient methods for uniprocessor software synthesis, from formal models that explicitly represent the available concurrency. These methods can be extended to multiprocessor support and to hardware synthesis; however, these advanced topics are outside the scope of this chapter. Concurrent specifications, such as dataflow networks [1], Kahn process networks [2], Communicating Sequential Processes [3], synchronous languages [4], and graphical state machines [5], are interesting because they expose the inherent parallelism in the application, which is much harder to recover a posteriori by optimizing compilers. In such a specification, the application is described as a set of processes that sequentially execute operations and communicate with each other. In considering an
implementation of the application, it is often necessary to analyze how these processes interact with each other. This analysis is used for evaluating how often a process will be invoked during an execution of the system, or how much memory will be required for implementing the communication between the processes.

Quasi-Static Scheduling (QSS) is a technique for finding sequences of operations to be executed across the processes that constitute a concurrent specification of the application. Several approaches have been proposed [6–11], which use certain mathematical models to abstract the specification and aim to compute graphs of finite size such that the sequences are given by traversing the graphs. We call the sequences of operations, or the graph which represents them, a schedule of the specification. The schedule is static in the sense that it statically commits to a particular execution order of the operations of the processes. In general, there exists more than one possible order of operations to be executed, with a different implementation cost for each. On the other hand, by committing to a particular sequence, a static schedule allows a more rigorous analysis of the interaction among the processes than dynamic schedules, because one can precisely observe how the operations from different processes are interleaved to constitute the system execution.

The reason to start from a concurrent specification is twofold. First of all, coarse-grained parallelism is very difficult to recover from a sequential specification, except in relatively simple cases (e.g., nested loops with affine memory accesses [12]). Second, parallel specifications offer a good model for system-level partitioning experiments, aimed at finding the best mixed hardware/software implementation on a complex SoC platform.
The reason to look for a sequencing of the originally concurrent operations is that we are considering embedded software implementations, for which the context switching implied by a multithreaded concurrent implementation would be very expensive and is unnecessary whenever concurrency can be resolved at compile time. This resolution is especially difficult if the specification involves data-dependent conditional constructs, such as an if-then-else with a data-dependent condition, because different sets of operations may be executed depending upon how the constructs are resolved. For such a specification, static scheduling produces in principle a sequence of operations for each possible way of resolving the constructs (in practice, these multiple sequences are collapsed as much as possible, in order to reduce code size). Note that these constructs are resolved based on data values, and therefore some of the resolutions may not happen at runtime in a particular execution of the system. The information about data values is not available to the static scheduling algorithm, because the latter runs at compile time. In this sense, scheduling for a specification with such constructs is called quasi-static: it is responsible for providing a sequence of operations to be executed for each possible runtime resolution of the data-dependent choices.

After a simple motivating example, we present an overview of some approaches proposed in the literature. In Section 13.2, we consider two questions that one is concerned with in QSS, and briefly describe how these questions are addressed in two different models that have been proposed in the literature. One of the models is Petri nets (PNs), and the other is Boolean Dataflow (BDF) graphs. They model a given concurrent specification in different ways, and thus the expressiveness of the models and the issues that need to be accounted for to solve the scheduling problem are different.
These two models and the issues in their scheduling approaches are presented in more detail in Sections 13.3 and 13.4, respectively.
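To make the idea concrete, the sketch below shows, in Python and purely for illustration (the approaches cited here generate C code), the kind of single sequential task a quasi-static scheduler aims to produce: two originally concurrent processes are inlined into one routine, the FIFO between them disappears, and a data-dependent choice (the `x > 0` test of Figure 13.2) selects at runtime between the two statically precomputed operation sequences. The function name `scheduled_task` is invented.

```python
# Hypothetical result of quasi-static scheduling (illustrative sketch only):
# producer and consumer are inlined into one sequential task, so no FIFO
# buffering or context switching between them is needed at runtime.
def scheduled_task(stream):
    out = []
    for x in stream:      # producer: read the next input value
        if x > 0:         # runtime resolution of the data-dependent choice
            y = x         # operation sequence for outcome "true"  (branch A)
        else:
            y = x * x     # operation sequence for outcome "false" (branch B)
        out.append(y)     # consumer: inlined right after the producer
    return out

print(scheduled_task([2, -3, 0]))  # [2, 9, 0]
```

Both branches are generated at compile time; only the selection between them is deferred to runtime, which is exactly the "quasi" in quasi-static.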
13.1.2 A Simple Example

Figure 13.1 illustrates how QSS works. In Figure 13.1(a), there are two processes, each with an associated sequential program. The one shown on the left reads a value from port DATA into variable d, computes a value for the variable D and writes it to the port PORT, and then goes back to the beginning. The other process reads a value for variable N from port START, and then iterates the body of the for-loop N times. For each iteration, it reads two values from port IN and assigns them to x[0] and x[1], respectively. Here, the third argument of the read function designates the number of data items to be read at a time.
FIGURE 13.2 The PN and BDF modeling approaches.
TABLE 13.1 Models for Embedded Systems and the Complexity of Scheduling Problems

                              SDF graph        BDF graph        PN, equal-choice   PN, general
Modeling data dependence      No               Yes              Yes                Yes
Bounded-length schedule       O(|Cycle_seq|)   O(|Cycle_seq|)   O(|Cycle_seq|)     O(|Cycle_seq|)
Bounded-memory schedule       O(|Cycle_seq|)   Undecidable      O(|Cycle_seq|)     Unknown
The complexity bounds in Table 13.1 are given in terms of the length of the sequence that brings the specification back to its initial state (called the cyclic sequence). Note, however, that the size of this cyclic sequence can be exponential in the size of the SDF graph. We will review scheduling approaches based on PNs and on BDF in more detail in Sections 13.3 and 13.4, respectively.
13.3 QSS for PNs

13.3.1 Definitions

A PN is defined by a tuple (P, T, F, M0), where P and T are the sets of places and transitions, respectively, and F is a function from (P × T) ∪ (T × P) to the nonnegative integers. A marking M is another function from P to the nonnegative integers, where M[p] denotes the number of tokens at p in M; M0 is the initial marking. A PN can be represented by a directed bipartite graph, where an edge (u, v) exists if F(u, v) is positive, in which case F(u, v) is called the weight of the edge. A transition t is enabled at a marking M if M[p] ≥ F(p, t) for all p of P. In this case, one may fire the transition at the marking, which yields a marking M′ given by M′[p] = M[p] − F(p, t) + F(t, p) for each p of P. In the sequel, M →t M′ denotes the fact that a transition t is enabled at a marking M and M′ is obtained by firing t at M. A transition t is said to be a source if F(p, t) = 0 for all p of P. A marking M′ is said to be reachable from M if there is a sequence of transitions fireable from M that leads to M′. The set of markings reachable from the initial marking is denoted by R(M0).
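The enabling and firing rules can be written down directly. The following Python fragment is an illustrative sketch (the class and the small example net are invented, not taken from the chapter's figures); `fire` implements M′[p] = M[p] − F(p, t) + F(t, p):

```python
class PetriNet:
    """Minimal weighted Petri net; a marking is a dict place -> token count."""
    def __init__(self, pre, post):
        self.pre = pre    # pre[t]:  dict place -> F(p, t)
        self.post = post  # post[t]: dict place -> F(t, p)

    def enabled(self, marking, t):
        # t is enabled at M iff M[p] >= F(p, t) for all places p
        return all(marking.get(p, 0) >= w for p, w in self.pre[t].items())

    def fire(self, marking, t):
        # M'[p] = M[p] - F(p, t) + F(t, p)
        if not self.enabled(marking, t):
            raise ValueError("transition not enabled: " + t)
        m = dict(marking)
        for p, w in self.pre[t].items():
            m[p] = m.get(p, 0) - w
        for p, w in self.post[t].items():
            m[p] = m.get(p, 0) + w
        return m

# Example: 'prod' writes two tokens into FIFO place 'q' per firing, while
# 'cons' reads one token at a time (arc weights model communication rates).
net = PetriNet(pre={"prod": {"idle": 1}, "cons": {"q": 1}},
               post={"prod": {"idle": 1, "q": 2}, "cons": {}})
m0 = {"idle": 1, "q": 0}
m1 = net.fire(m0, "prod")
print(m1["q"], net.enabled(m1, "cons"))  # 2 True
```

Note how a place such as `q` hides the data values of the FIFO and models only the presence of tokens, exactly the abstraction described for the specification model below.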
A transition system is a tuple A = (S, Σ, →, s_in), where S is a set of states, Σ is an alphabet of symbols, → ⊆ S × Σ × S is the transition relation, and s_in is the initial state. Given s ∈ S, t ∈ Σ is said to be fireable at s if there exists s′ ∈ S such that (s, t, s′) ∈ →, and we denote it with s →t s′. If Σ corresponds to the set of transitions T of a PN, a transition system can be used to represent a portion of the reachability space of that PN. This concept will be used in the definition of a schedule in Section 13.3.3. A key notion for defining schedules is that of equal conflict sets. A pair of transitions ti and tj is said to be in equal conflict if F(p, ti) = F(p, tj) for all p of P. These transitions are in conflict in the sense that ti is enabled at a given marking if and only if tj is enabled, and it is not possible to fire both transitions at the marking. Equal conflict is an equivalence relation defined on the set of transitions, and each equivalence class is called an equal conflict set (ECS). By definition, if one transition of an ECS is enabled at a given marking, all the other transitions of the ECS are also enabled; thus, we may say that the ECS itself is enabled at the marking. A place p is said to be a choice place if it has more than one successor transition. A choice place is equal choice (a generalization of free choice [16]) if all its successor transitions are in the same ECS. A PN is said to be equal choice if all its choice places are equal choice.
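Under this definition, the ECSs can be computed by simply grouping transitions whose input-arc weight vectors coincide. A small illustrative sketch (the place and transition names echo Figure 13.3, but the data structure itself is invented):

```python
from collections import defaultdict

def equal_conflict_sets(pre):
    # pre[t] maps each input place p of transition t to the weight F(p, t).
    # Transitions with identical input-arc weight vectors are grouped:
    #   t_i ~ t_j  iff  F(p, t_i) = F(p, t_j) for every place p.
    classes = defaultdict(list)
    for t, inputs in pre.items():
        key = tuple(sorted((p, w) for p, w in inputs.items() if w > 0))
        classes[key].append(t)
    return [sorted(ts) for ts in classes.values()]

# D and E both consume one token from the choice place p7, so they form one
# ECS; F reads from a different place and is an ECS by itself.
print(equal_conflict_sets({"D": {"p7": 1}, "E": {"p7": 1}, "F": {"p8": 1}}))
# [['D', 'E'], ['F']]
```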
13.3.2 Specification Model

A specification is given as a set of processes communicating via FIFOs, as described in Section 13.1.2. A PN is then used to model the specification by employing a specific abstraction. The PN for the example shown in Figure 13.1(a) is shown in Figure 13.3. Communication operations on ports and the computation operations of the processes are modeled by transitions. Places are used to represent both sequencing within processes (a single token models the program counter) and FIFO communication (the tokens model the presence of the data items, while hiding their values). For an input (output) port connected to the environment, a source (sink) transition is connected to the place for the port. Overall, the PN represents the control flow of the programs of the specification. Place p7 is a choice place, while the transitions D and E form an ECS, as defined in the previous section. This choice place models the data-dependent control construct given by the termination condition of the for-loop in the specification (Figure 13.1[a]). We note that the PN does not model the termination condition itself, that is, it abstracts away the condition on data with which control constructs are resolved.

An automatic procedure to produce this PN model has been developed for the C language, extended with constructs implementing the FIFO communication operations [8]. In addition to the read and write operations for FIFOs, the extended language supports another communication construct called select [17]. It realizes synchronization-dependent control, which determines the control flow of the program depending on the availability of data items on input ports. It accepts as
FIGURE 13.3 PN model for the example of Figure 13.1.
argument a set of input ports, and nondeterministically selects one of them that has data available. In case none of them has data, the process blocks until some is available. The select operations introduce nondeterminism into the specification, which some may consider a bad idea.² However, they are essential in order to model efficiently systems whose control flow depends on the availability of data at some ports. For example, when specifying a component of a multimedia application, parameters such as the coefficients used in a filtering algorithm are typically provided nondeterministically at a control port, since they depend on the quality and size of the images being processed, which can be dynamically changed by a user of the device or by a global quality-of-service manager. In this case, the process looks at the control port as well as the data ports; if new data are available at the control port, the process uses them for the filtering, otherwise it simply proceeds using the current values of the coefficients. Always polling for new coefficient values at every new image would lead to an unnecessary loss of performance [17].

An example of a specification including select is shown in Figure 13.4(a), which has two processes, two input ports (IN and COEF) and one output port (OUT). The processes communicate with each other through the channel DATA. The process Filter extracts data inserted by Get_Data, multiplies them by a coefficient and sends them to the environment through the port OUT, where the availability of the coefficient is tested by the select statement. Figure 13.4(b) shows a PN for this specification. In formally defining a schedule based on this PN model, it is necessary to clarify how the environment is modeled. This work, following the notation introduced in Reference 1, uses source transitions, as depicted by Tin and Tcoef in Figure 13.4(b).
However, the basic model of Reference 1 is extended by distinguishing between two types of inputs, and hence of source transitions. The types are called controllable and uncontrollable, respectively. The uncontrollable inputs are the stimuli to the system being scheduled, that is, the system execution takes place as a reaction to events provided by these inputs. The objective of the scheduling is then to find a sequence of operations to be executed in each such reaction. The scheduling problem is defined under the assumption that all the uncontrollable inputs are independent with respect to each other and with respect to the execution of the system. This means that the system cannot tell when the stimuli are provided by the environment or how they are related, and thus no such information can be assumed when schedules are sought. Therefore, a schedule must be designed so that when the system is ready to react to a stimulus from one uncontrollable input, it is also ready to react to a stimulus from any other uncontrollable input. In Figure 13.4, all the inputs are uncontrollable.

Controllable inputs, on the other hand, represent data from the environment that the system can acquire whenever it decides to do so. It follows that schedules can be sought under the assumption that if a read operation is executed on a controllable input, then the operation will always succeed in reading the specified amount of data without blocking the execution of the system. For example, in the specification given in Figure 13.1, the input to the port START was specified as uncontrollable, while the port DATA was given as controllable. This environment model covers the one used in dataflow models such as SDF and BDF, in which one input (the first one selected by the scheduler) can be considered uncontrollable, while all other inputs are controllable.
13.3.3 Schedulability Analysis

The problem to be solved is to find a schedule for a given PN and to generate software code that implements the schedule. A schedule is formally defined as a transition system that satisfies the following properties with respect to the given PN.

²An essential property of dataflow networks and Kahn process networks is the fact that the result of the computation is independent of the order in which the processes are scheduled for execution. This property no longer holds if the select construct is added.
FIGURE 13.8 BDF graph for the example of Figure 13.1.
storing the state graph. Even worse, it also significantly reduces the capability of pruning the explored reachability space based on different termination conditions. These conditions impose a partial order between states and avoid generating the reachability space beyond "ordered" states [6, 8]. For PNs the partial order is established purely by markings, while for BDF graphs it also requires considering the values of the Boolean streams in addition to the markings. Due to this, the state graphs of BDF have sparser ordering relations and are significantly larger. Hence we feel that for bounded-memory quasi-static schedulability analysis, the PN approach is simpler and more suitable, especially if the limitations of current translators from C to PNs are addressed.
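One simplified form of such a marking-based termination condition is to stop expanding any marking that componentwise covers an already visited one. The sketch below (Python, illustrative only; this is not the exact algorithm of [6, 8]) shows how this pruning makes the exploration of even an unbounded PN terminate:

```python
from collections import deque

def explore(pre, post, m0, max_states=1000):
    # Breadth-first reachability exploration with marking-order pruning: a
    # marking that componentwise covers (>=) an already visited, different
    # marking is an "ordered" state and is not expanded further.
    places = set(m0)
    for d in list(pre.values()) + list(post.values()):
        places |= set(d)
    def covers(m1, m2):
        return all(m1.get(p, 0) >= m2.get(p, 0) for p in places)
    seen, frontier = [], deque([m0])
    while frontier and len(seen) < max_states:
        m = frontier.popleft()
        if m in seen or any(covers(m, s) and m != s for s in seen):
            continue  # duplicate, or pruned beyond an "ordered" state
        seen.append(m)
        for t in pre:
            if all(m.get(p, 0) >= w for p, w in pre[t].items()):
                m2 = dict(m)
                for p, w in pre[t].items():
                    m2[p] = m2.get(p, 0) - w
                for p, w in post[t].items():
                    m2[p] = m2.get(p, 0) + w
                frontier.append(m2)
    return seen

# An unbounded net: firing t keeps accumulating tokens in q.  Pruning stops
# the exploration at the first covering marking instead of diverging.
states = explore(pre={"t": {"run": 1}}, post={"t": {"run": 1, "q": 1}},
                 m0={"run": 1, "q": 0})
print(len(states))  # 1
```

For BDF graphs the analogous check would also have to compare the values carried on the Boolean streams, which is why, as noted above, far fewer state pairs are ordered and the explored graphs grow much larger.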
13.5 Conclusions

This chapter described modeling methods and scheduling algorithms that bridge the gap between specification and implementation of reactive systems. From a specification given in terms of concurrent communicating processes, and by deriving intermediate representations based on PNs and dataflow graphs, one can (unfortunately not always) obtain a sequential schedule that can be efficiently implemented on a processor. Future work should consider better heuristics to find such schedules, since the problem is undecidable in general once data-dependent choices come into play. Furthermore, it would be interesting to extend this work by considering sequential and concurrent implementations on several resources (e.g., CPUs and custom datapaths) [19]. Another body of future research concerns the extension of the notion of schedule into the time domain, in order to cope with performance constraints; all the approaches considered in this chapter assume infinite processing speed with respect to the speed of the environment. For real-time applications one would need to extend the scheduling frameworks by explicitly annotating system events with delays, and by using timing-driven algorithms for schedule construction.
References

[1] E.A. Lee and D.G. Messerschmitt. Static scheduling of synchronous data flow graphs for digital signal processing. IEEE Transactions on Computers, C-36(1), 24–35, 1987.
[2] G. Kahn. The semantics of a simple language for parallel programming. In Proceedings of the IFIP Congress, August 1974.
[3] C.A.R. Hoare. Communicating Sequential Processes. International Series in Computer Science. Prentice Hall, Hertfordshire, 1985.
[4] N. Halbwachs. Synchronous Programming of Reactive Systems. Kluwer Academic Publishers, Boston, MA, 1993.
[5] D. Harel, H. Lachover, A. Naamad, A. Pnueli, M. Politi, R. Sherman, A. Shtull-Trauring, and M. Trakhtenbrot. STATEMATE: a working environment for the development of complex reactive systems. IEEE Transactions on Software Engineering, 16(4), 403–414, 1990.
[6] J. Buck. Scheduling Dynamic Dataflow Graphs with Bounded Memory Using the Token Flow Model. Ph.D. thesis, University of California, Berkeley, 1993.
[7] J.T. Buck. Static scheduling and code generation from dynamic dataflow graphs with integer valued control streams. In Proceedings of the 28th Asilomar Conference on Signals, Systems, and Computers, October 1994.
[8] J. Cortadella, A. Kondratyev, L. Lavagno, C. Passerone, and Y. Watanabe. Quasi-static scheduling of independent tasks for reactive systems. IEEE Transactions on Computer-Aided Design, 24(9), 2004.
[9] T.M. Parks. Bounded Scheduling of Process Networks. Ph.D. thesis, Department of EECS, University of California, Berkeley, 1995. Technical Report UCB/ERL 95/105.
[10] K. Strehl, L. Thiele, D. Ziegenbein, R. Ernst, et al. Scheduling hardware/software systems using symbolic techniques. In Proceedings of the International Workshop on Hardware/Software Codesign, 1999.
[11] P. Wauters, M. Engels, R. Lauwereins, and J.A. Peperstraete. Cyclo-dynamic dataflow. In Proceedings of the 4th EUROMICRO Workshop on Parallel and Distributed Processing, January 1996.
[12] T. Stefanov, C. Zissulescu, A. Turjan, B. Kienhuis, and E. Deprettere. System design using Kahn process networks: the Compaan/Laura approach. In Proceedings of the Design Automation and Test in Europe Conference, February 2004.
[13] G. Arrigoni, L. Duchini, L. Lavagno, C. Passerone, and Y. Watanabe. False path elimination in quasi-static scheduling. In Proceedings of the Design Automation and Test in Europe Conference, March 2002.
[14] F. Thoen, M. Cornero, G. Goossens, and H. De Man. Real-time multi-tasking in software synthesis for information processing systems. In Proceedings of the International System Synthesis Symposium, 1995.
[15] J. Esparza. Decidability and complexity of Petri net problems — an introduction. In Lectures on Petri Nets I: Basic Models, Advances in Petri Nets, Lecture Notes in Computer Science, Vol. 1491, Springer, 1998, pp. 374–428.
[16] T. Murata. Petri nets: properties, analysis, and applications. Proceedings of the IEEE, 77(4), 541–580, 1989.
[17] E.A. de Kock, G. Essink, W.J.M. Smits, P. van der Wolf, J.-Y. Brunel, W.M. Kruijtzer, P. Lieverse, and K.A. Vissers. YAPI: application modeling for signal processing systems. In Proceedings of the 37th Design Automation Conference, June 2000.
[18] J. Buck, S. Ha, E.A. Lee, and D.G. Messerschmitt. Ptolemy: a framework for simulating and prototyping heterogeneous systems. International Journal in Computer Simulation, 4(2), 1994.
[19] J. Cortadella, A. Kondratyev, L. Lavagno, A. Taubin, and Y. Watanabe. Quasi-static scheduling for concurrent architectures. Fundamenta Informaticae, 62, 171–196, 2004.
Timing and Performance Analysis

14 Determining Bounds on Execution Times
Reinhard Wilhelm

15 Performance Analysis of Distributed Embedded Systems
Lothar Thiele and Ernesto Wandeler
14 Determining Bounds on Execution Times

Reinhard Wilhelm (Universität des Saarlandes)

14.1 Introduction (14-2): Tool Architecture and Algorithm • Timing Anomalies • Contexts
14.2 Cache-Behavior Prediction (14-6): Cache Memories • Cache Semantics • Abstract Semantics
14.3 Pipeline Analysis (14-10): Simple Architectures without Timing Anomalies • Processors with Timing Anomalies • Algorithm Pipeline-Analysis • Pipeline Modeling • Formal Models of Abstract Pipelines • Pipeline States
14.4 Path Analysis Using Integer Linear Programming (14-17)
14.5 Other Ingredients (14-18): Value Analysis • Control Flow Specification and Analysis • Frontends for Executables
14.6 Related Work (14-19): A (Partly) Dynamic Method • Purely Static Methods
14.7 State of the Art and Future Extensions (14-20)
Acknowledgments (14-21)
References (14-21)
Run-time guarantees play an important role in the area of embedded systems and especially hard real-time systems. These systems are typically subject to stringent timing constraints, which often result from the interaction with the surrounding physical environment. It is essential that the computations are completed within their associated time bounds; otherwise severe damage may result, or the system may be unusable. Therefore, a schedulability analysis has to be performed which guarantees that all timing constraints will be met. Schedulability analyses require upper bounds for the execution times of all tasks in the system to be known. These bounds must be safe, that is, they may never underestimate the real execution time. Furthermore, they should be tight, that is, the overestimation should be as small as possible. In modern microprocessor architectures, caches, pipelines, and all kinds of speculation are key features for improving (average-case) performance. Unfortunately, they make the analysis of the timing behavior of instructions very difficult, since the execution time of an instruction depends on the execution history. A lack of precision in the predicted timing behavior may lead to a waste of hardware resources, which would have to be invested in order to meet the requirements. For products which are manufactured
in high quantities, for example for the automobile or telecommunications markets, this would result in intolerable expenses. The subject of this chapter is one particular approach, and the subtasks involved in it, for computing safe and precise bounds on the execution times of tasks in real-time systems.
14.1 Introduction

Hard real-time systems are subject to stringent timing constraints which are dictated by the surrounding physical environment. We assume that a real-time system consists of a number of tasks, which realize the required functionality. A schedulability analysis for this set of tasks on a given hardware platform has to be performed in order to guarantee that all the timing constraints of these tasks will be met ("timing validation"). Existing techniques for schedulability analysis require upper bounds for the execution times of all the system's tasks to be known. These upper bounds are commonly called the worst-case execution times (WCETs), a misnomer that causes a lot of confusion and will therefore not be adopted in this presentation. In analogy, lower bounds on the execution time have been named best-case execution times (BCETs). These upper bounds (and lower bounds) have to be safe, that is, they must never underestimate (overestimate) the real execution time. Furthermore, they should be tight, that is, the overestimation (underestimation) should be as small as possible.

Figure 14.1 depicts the most important concepts of our domain. The system shows a certain variation of execution times depending on the input data or different behavior of the environment. In general, the state space is too large to exhaustively explore all possible executions and so determine the exact worst-case and best-case execution times, WCET and BCET, respectively. Some abstraction of the system is necessary to make a timing analysis of the system feasible. These abstractions lose information, and thus are responsible for the distance between WCETs and upper bounds and between BCETs and lower bounds. How much is lost depends both on the methods used for timing analysis and on system properties, such as the hardware architecture and the cleanness of the software.
So, the two distances mentioned above, termed upper predictability and lower predictability, can be seen as a measure for the timing predictability of the system. Experience has shown that the two predictabilities can be quite different, cf. Reference 1. The methods used to determine upper bounds and lower bounds are the same; we will concentrate on the determination of upper bounds unless otherwise stated. Methods to compute sharp bounds for processors with fixed execution times for each instruction have long been established [2,3]. However, in modern microprocessor architectures caches, pipelines, and all kinds of speculation are key features for improving (average-case) performance. Caches are used to bridge the gap between processor speed and the access time of main memory. Pipelines enable acceleration by overlapping the executions of different instructions. The consequence is that the execution time of individual instructions, and thus the contribution of one execution of an instruction to the program's
FIGURE 14.1 Basic notions concerning timing analysis of systems.
FIGURE 14.2 Different paths through the execution of a multiply instruction. Unlabeled transitions take 1 cycle.
execution time can vary widely. The interval of execution times for one instruction is bounded by the execution times of the following two cases:

• The instruction goes "smoothly" through the pipeline: all loads hit the cache, no pipeline hazard happens, that is, all operands are ready, and no resource conflicts with other currently executing instructions exist.
• "Everything goes wrong": instruction and/or operand fetches miss the cache, resources needed by the instruction are occupied, etc.

Figure 14.2 shows the different paths through a multiply instruction of a PowerPC processor. The instruction-fetch phase may find the instruction in the cache (a cache hit), in which case it takes 1 cycle to load it. In the case of a cache miss, it may take something like 30 cycles to load the memory block containing the instruction into the cache. The instruction needs an arithmetic unit, which may be occupied by a preceding instruction; waiting for the unit to become free may take up to 19 cycles. This latency would not occur if the instruction fetch had missed the cache, because the cache-miss penalty of 30 cycles would have allowed any preceding instruction to terminate its arithmetic operation. The time it takes to multiply two operands depends on their size; for small operands, 1 cycle is enough, for larger ones, 3 cycles are needed. When the operation has finished, it has to be retired in the order it appeared in the instruction stream. The processor keeps a queue for instructions waiting to be retired; waiting for a place in this queue may take up to 6 cycles. On the dashed path, where the execution always takes the fast way, the overall execution time is 4 cycles. However, on the dotted path, where it always takes the slowest way, the overall execution time is 41 cycles. We will call any increase in execution time during an instruction's execution a timing accident and the number of cycles by which it increases the timing penalty of this accident.
Timing penalties for an instruction can add up to several hundred processor cycles. Whether the execution of an instruction encounters a timing accident depends on the execution state, for example, the contents of the cache(s) and the occupancy of other resources, and thus on the execution history. It is therefore obvious that the attempt to predict or exclude timing accidents needs information about the execution history. For certain classes of architectures, namely those without timing anomalies (cf. Section 14.1.2), excluding a timing accident always decreases the upper bound. However, for architectures with timing anomalies this assumption does not hold.
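A crude way to see the spread is to bracket an instruction's execution time by the shortest and longest paths through its per-stage alternatives. The latencies below are hypothetical round numbers in the spirit of Figure 14.2, not the exact PowerPC values:

```python
from itertools import product

# Hypothetical best/worst latency alternatives per pipeline stage (cycles).
stages = {
    "fetch":   [1, 30],  # I-cache hit vs. miss penalty
    "issue":   [1, 20],  # arithmetic unit free vs. occupied
    "execute": [1, 3],   # small vs. large operands
    "retire":  [1, 7],   # retire queue has room vs. is full
}

# Enumerate every combination of stage outcomes and sum the latencies.
times = [sum(path) for path in product(*stages.values())]
best, worst = min(times), max(times)
print(best, worst)  # 4 60
```

Note that treating the stages as independent, as this enumeration does, is exactly what the text warns about: on a real pipeline a cache miss can hide the arithmetic unit's occupancy, so summing per-stage worst cases may overestimate, and on architectures with timing anomalies a locally worst case need not even lead to the globally worst one.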
14.1.1 Tool Architecture and Algorithm

A more or less standard architecture for timing-analysis tools has emerged [4–6]. Figure 14.3 shows one instance of this architecture. A first phase, depicted on the left, predicts the behavior of processor
14-4
Embedded Systems Handbook
FIGURE 14.3 The architecture of the aiT timing-analysis tool. (Among the components shown is a CFG builder.)
components for the instructions of the program. It usually consists of a sequence of static program analyses of the program, which together allow safe upper bounds for the execution times of basic blocks to be derived. A second phase, the column on the right, computes an upper bound on the execution times over all possible paths of the program. This is realized by mapping the control flow of the program to an Integer Linear Program and solving it by appropriate methods. This architecture has been successfully used to determine precise upper bounds on the execution times of real-time programs running on processors used in embedded systems [1,7–10]. A commercially available tool, aiT by AbsInt, cf. http://www.absint.de/wcet.htm, was implemented and is used in the aeronautics and automotive industries.

The structure of the first phase, processor-behavior prediction, often called microarchitecture analysis, may vary depending on the complexity of the processor architecture. A first, modular approach would be the following:

1. Cache-behavior prediction determines statically and approximately the contents of caches at each program point. For each access to a memory block, it is checked whether the analysis can safely predict a cache hit. Information about cache contents can be forgotten after the cache analysis; only the miss/hit information is needed by the pipeline analysis.
2. Pipeline-behavior prediction analyzes how instructions pass through the pipeline, taking cache-hit or cache-miss information into account. The cache-miss penalty is assumed for all cases where a cache hit cannot be guaranteed. At the end of simulating one instruction, the pipeline analysis continues with only those states that show the locally maximal execution times; all others can be forgotten.
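The second phase computes a bound over all program paths. Real tools encode this as an Integer Linear Program over the control-flow graph (the IPET approach); for a tiny acyclic CFG, the same bound can be illustrated by exhaustive path enumeration. The following sketch is ours, with made-up block names and cycle bounds, and deliberately ignores loops (which the ILP handles via flow constraints):

```python
# Toy sketch of the path-analysis phase: combine per-block upper bounds
# (as produced by processor-behavior prediction) into a whole-program
# bound. Works only for acyclic CFGs; real tools solve an ILP instead.

def wcet_bound(cfg, block_bound, entry, exit):
    """cfg: dict block -> list of successor blocks (acyclic);
    block_bound: dict block -> upper bound on its execution time (cycles)."""
    if entry == exit:
        return block_bound[exit]
    return block_bound[entry] + max(
        wcet_bound(cfg, block_bound, succ, exit) for succ in cfg[entry])

# Hypothetical diamond CFG: A branches to B or C, both join at D.
cfg = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
bounds = {"A": 4, "B": 10, "C": 7, "D": 3}
print(wcet_bound(cfg, bounds, "A", "D"))  # 4 + 10 + 3 = 17
```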
14.1.2 Timing Anomalies

Unfortunately, this approach is not safe for many processor architectures. Most powerful microprocessors have so-called timing anomalies. Timing anomalies are counterintuitive influences of the (local) execution time of one instruction on the (global) execution time of the whole program [11]. The interaction of
© 2006 by Taylor & Francis Group, LLC
Determining Bounds on Execution Times
several processor features can be such that a locally faster execution of an instruction leads to a globally longer execution time of the whole program. For example, a cache miss usually contributes the cache-miss penalty to the execution time of a program. It was, however, observed for the MCF 5307 [12] that a cache miss may actually speed up program execution. Since the MCF 5307 has a unified cache and the fetch and execute pipelines are independent, the following can happen: a data access that is a cache hit is served directly from the cache. At the same time, the fetch pipeline fetches another instruction block from main memory, performing branch prediction and replacing two lines of data in the cache. These may be reused later on and cause two misses. If the data access had been a cache miss, the instruction fetch pipeline might not have fetched those two lines, because the execution pipeline may have resolved a misprediction before those lines were fetched.

The general case of a timing anomaly is the following. Different assumptions about the processor's execution state, for example, that the instruction is or is not in the instruction cache, result in a difference ΔTlocal of the execution time of the instruction between these two cases. Either assumption may lead to a difference ΔT of the global execution time compared to the other one. We say that a timing anomaly occurs if either

• ΔTlocal < 0, that is, the instruction executes faster, and ΔT < ΔTlocal, that is, the overall execution is accelerated by more than the acceleration of the instruction, or ΔT > 0, that is, the program runs longer than before; or
• ΔTlocal > 0, that is, the instruction takes longer to execute, and ΔT > ΔTlocal, that is, the overall execution is extended by more than the delay of the instruction, or ΔT < 0, that is, the overall execution of the program takes less time than before.

The case ΔTlocal < 0 ∧ ΔT > 0 is a critical case for our timing analysis.
It makes it impossible to use local worst cases for the calculation of the program’s execution time. The analysis has to follow all possible paths as will be explained in Section 14.3.
14.1.3 Contexts

The contribution of an individual instruction to the total execution time of a program may vary widely depending on the execution history. For example, the first iteration of a loop typically loads the caches, and later iterations profit from the loaded memory blocks being in the caches. In this case, the execution of an instruction in the first iteration encounters one or more cache misses and pays with the cache-miss penalty. Later executions, however, will be much faster because they hit the cache. A similar observation holds for dynamic branch predictors: they may need a few iterations until they stabilize and predict correctly. Therefore, precision is increased if instructions are considered in their control-flow context, that is, the way control reached them. Contexts are associated with basic blocks, that is, maximally long straight-line code sequences that can only be entered at the first instruction and left at the last. They indicate through which sequence of function calls and loop iterations control arrived at the basic block. Thus, when analyzing the cache behavior of a loop, precision can be increased by regarding the first iteration of the loop and all other iterations separately; more precisely, by unrolling the loop once and then analyzing the resulting code.¹

Definition 14.1 Let p be a program with set of functions P = {p1, p2, . . . , pn} and set of loops L = {l1, l2, . . . , lm}. A word c over the alphabet P ∪ (L × IN) is called a context for a basic block b, if b can be reached by calling the functions and iterating through the loops in the order given in c.

¹Actually,
this unrolling transformation need not really be performed; it can be incorporated into the iteration strategy of the analyzer. So, we talk of virtually unrolling the loops.
Even if all loops have static loop bounds and recursion is also bounded, there are in general too many contexts to consider exhaustively. A heuristic is used to keep relevant contexts apart and to summarize the rest conservatively, if their influence on the behavior of instructions does not significantly differ. Experience has shown [10] that a few first iterations and recursive calls are sufficient to "stabilize" the behavior information, as the above example indicates, and that the right differentiation of contexts is decisive for the precision of the prediction [13]. A particular choice of contexts transforms the call graph and the control-flow graph into a context-extended control-flow graph by virtually unrolling the loops and virtually inlining the functions as indicated by the contexts. The formal treatment of this concept is quite involved and shall not be given here; it can be found in Reference 14.
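The separation of a few first iterations from all later ones might be sketched as follows. All names here are ours, not the tool's; the real context construction (virtual inlining and virtual unrolling) is far more involved:

```python
# Minimal sketch of context formation: keep loop iterations 1..k apart
# and merge every later iteration into a single summarized context.

def context_key(call_string, iteration, k=1):
    """call_string: the sequence of functions/loops on the way to the
    block; iteration: the concrete loop iteration being analyzed."""
    it = str(iteration) if iteration <= k else f"{k + 1}+"
    return (tuple(call_string), it)

print(context_key(["main", "f", "loop1"], 1))  # (('main', 'f', 'loop1'), '1')
print(context_key(["main", "f", "loop1"], 7))  # (('main', 'f', 'loop1'), '2+')
```

With k = 1, the first iteration (which typically loads the cache) gets its own context, while all iterations from the second onward share one context, matching the observation above that behavior stabilizes after a few iterations.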
14.2 Cache-Behavior Prediction

Abstract Interpretation [15] is used to compute invariants about cache contents. How the behavior of programs on processor pipelines is predicted follows in Section 14.3.
14.2.1 Cache Memories

A cache can be characterized by three major parameters:

• Capacity is the number of bytes it may contain.
• Line size (also called block size) is the number of contiguous bytes that are transferred from memory on a cache miss. The cache can hold at most n = capacity/line size blocks.
• Associativity is the number of cache locations where a particular block may reside; n/associativity is the number of sets of the cache.

If a block can reside in any cache location, then the cache is called fully associative. If a block can reside in exactly one location, then it is called direct mapped. If a block can reside in exactly A locations, then the cache is called A-way set associative. The fully associative and the direct mapped caches are special cases of the A-way set associative cache with A = n and A = 1, respectively.

In the case of an associative cache, a cache line has to be selected for replacement when the cache is full and the processor requests further data. This is done according to a replacement strategy. Common strategies are LRU (Least Recently Used), FIFO (First In First Out), and random.

The set in which a memory block may reside in the cache is uniquely determined by the address of the memory block, that is, the behavior of the sets is independent of each other. The behavior of an A-way set associative cache is therefore completely described by the behavior of its n/A fully associative sets. This holds also for direct mapped caches, where A = 1. For the sake of space, we restrict our description to the semantics of fully associative caches with LRU replacement strategy. More complete descriptions that explicitly describe direct mapped and A-way set associative caches can be found in References 8 and 16.
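The parameter arithmetic above can be spelled out for a hypothetical cache configuration (the numbers are invented for illustration):

```python
# Deriving the derived cache parameters of Section 14.2.1 for a
# hypothetical 16 KB, 32-byte-line, 4-way set associative cache.

capacity, line_size, assoc = 16 * 1024, 32, 4
n = capacity // line_size        # number of blocks the cache can hold
num_sets = n // assoc            # number of independent fully associative sets

print(n, num_sets)  # 512 128
```

Each of the 128 sets then behaves like a small fully associative cache of A = 4 lines, which is why the analysis below can be given for fully associative caches only.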
14.2.2 Cache Semantics

In the following, we consider a (fully associative) cache as a set of cache lines L = {l1, . . . , ln} and the store as a set of memory blocks S = {s1, . . . , sm}. To indicate the absence of any memory block in a cache line, we introduce a new element I; S′ = S ∪ {I}.

Definition 14.2 (concrete cache state) A (concrete) cache state is a function c : L → S′. Cc denotes the set of all concrete cache states. The initial cache state cI maps all cache lines to I.

If c(li) = sy for a concrete cache state c, then i is the relative age of the memory block according to the LRU replacement strategy and not necessarily its physical position in the cache hardware.
FIGURE 14.4 Update of a concrete fully associative (sub-) cache. (Accessing s in the state [z, y, x, t] — a miss — yields [s, z, y, x]; accessing s in [z, s, x, t] — a hit — yields [s, z, x, t]. Ages increase from "young" to "old".)
The update function describes the effect on the cache of referencing a block in memory. The referenced memory block sx moves into l1 if it was in the cache already. All memory blocks in the cache that had been used more recently than sx increase their relative age by one, that is, they are shifted by one position to the next cache line. If the referenced memory block was not yet in the cache, it is loaded into l1 after all memory blocks in the cache have been shifted and the “oldest,” that is, least recently used memory block, has been removed from the cache if the cache was full. Definition 14.3 (cache update) A cache update function U : Cc × S → Cc determines the new cache state for a given cache state and a referenced memory block. Updates of fully associative caches with LRU replacement strategy are pictured as in Figure 14.4. 14.2.2.1 Control Flow Representation We represent programs by control flow graphs consisting of nodes and typed edges. The nodes represent basic blocks. A basic block is a sequence (of fragments) of instructions in which control flow enters at the beginning and leaves at the end without halt or possibility of branching except at the end. For cache analysis, it is most convenient to have one memory reference per control flow node. Therefore, our nodes may represent the different fragments of machine instructions that access memory. For nonprecisely determined addresses of data references, one can use a set of possibly referenced memory blocks. We assume that for each basic block, the sequence of references to memory is known (This is appropriate for instruction caches and can be too restricted for data caches and combined caches. See References 7 and 16 for weaker restrictions.), that is, there exists a mapping from control flow nodes to sequences of memory blocks: L : V → S ∗ . We can describe the effect of such a sequence on a cache with the help of the update function U . 
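The update function of Definition 14.3 might be sketched as follows for a fully associative LRU cache. The list-based representation (youngest block first) is our own choice for illustration:

```python
# Sketch of the concrete update function U of Definition 14.3.
# A cache state is a list of memory blocks ordered by relative age:
# index 0 is the youngest line l1; None stands for the empty element I.

def update(cache, block):
    """Return the new cache state after referencing `block`."""
    # The referenced block becomes youngest; all blocks that were more
    # recently used than it age by one position.
    new = [block] + [b for b in cache if b != block]
    return new[:len(cache)]  # on a miss in a full cache, evict the oldest

# The two cases of Figure 14.4:
print(update(["z", "y", "x", "t"], "s"))  # miss: ['s', 'z', 'y', 'x']
print(update(["z", "s", "x", "t"], "s"))  # hit:  ['s', 'z', 'x', 't']
```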
Therefore, we extend U to sequences of memory references by sequential composition: U(c, ⟨sx1, . . . , sxy⟩) = U(. . . U(U(c, sx1), sx2) . . . , sxy). The cache state for a path (k1, . . . , kp) in the control flow graph is given by applying U to the initial cache state cI and the concatenation of all sequences of memory references along the path: U(cI, L(k1) . . . L(kp)).

The Collecting Semantics of a program gathers at each program point the set of all execution states which the program may encounter at this point during some execution. A semantics on which to base a cache analysis has to model cache contents as part of the execution state. One could thus compute the collecting semantics and project the execution states onto their cache components to obtain the set of all possible cache contents for a given program point. However, the collecting semantics is in general not computable. Instead, one restricts the standard semantics to only those program constructs which involve the cache, that is, memory references; only they have an effect on the cache as modelled by the cache update function U. This coarser semantics may execute program paths which are not executable in the start
semantics. Therefore, the Collecting Cache Semantics of a program computes a superset of the set of all concrete cache states occurring at each program point.

Definition 14.4 (Collecting Cache Semantics) The Collecting Cache Semantics of a program is

Ccoll(p) = {U(cI, L(k1) . . . L(kn)) | (k1, . . . , kn) path in the CFG leading to p}

This collecting semantics would be computable, although often of enormous size. Therefore, another step abstracts it into a compact representation, so-called abstract cache states. Note that any information drawn from the abstract cache states allows safe deductions about sets of concrete cache states, that is, only precision may be lost in this two-step process; correctness is guaranteed.
14.2.3 Abstract Semantics

The specification of a program analysis consists of the specification of an abstract domain and of the abstract semantic functions, mostly called transfer functions. The least upper bound operator of the domain combines information when control flow merges.

We present two analyses. The must analysis determines a set of memory blocks that are in the cache at a given program point whenever execution reaches this point. The may analysis determines all memory blocks that may be in the cache at a given program point. The latter analysis is used to determine the absence of a memory block from the cache. The analyses are used to compute a categorization for each memory reference describing its cache behavior. The categories are described in Table 14.1.

The domains for our abstract interpretations consist of abstract cache states.

Definition 14.5 (abstract cache state) An abstract cache state ĉ : L → 2^S′ maps cache lines to sets of memory blocks. Ĉ denotes the set of all abstract cache states.

The position of a line in an abstract cache will, as in the case of concrete caches, denote the relative age of the corresponding memory blocks. Note, however, that the domains of abstract cache states will have different partial orders and that the interpretation of abstract cache states will be different in the different analyses.

The following functions relate the concrete and the abstract domains. An extraction function, extr, maps a concrete cache state to an abstract cache state. The abstraction function, abstr, maps sets of concrete cache states to their best representation in the domain of abstract cache states; it is induced by the extraction function. The concretization function, concr, maps an abstract cache state to the set of all concrete cache states represented by it. It allows abstract cache states to be interpreted and is often induced by the abstraction function, cf. Reference 17.
Definition 14.6 (extraction, abstraction, concretization functions) The extraction function extr : Cc → Ĉ forms singleton sets from the images of the concrete cache states it is applied to, that is, extr(c)(li) = {sx} if c(li) = sx. The abstraction function abstr : 2^Cc → Ĉ is defined by

abstr(C) = ⊔{extr(c) | c ∈ C}

The concretization function concr : Ĉ → 2^Cc is defined by

concr(ĉ) = {c | extr(c) ⊑ ĉ}

TABLE 14.1 Categorizations of Memory References and Memory Blocks

Category         Abbreviation   Meaning
Always hit       ah             The memory reference will always result in a cache hit.
Always miss      am             The memory reference will always result in a cache miss.
Not classified   nc             The memory reference could neither be classified as ah nor am.
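The categorization of Table 14.1 follows directly from the results of the two analyses. A minimal sketch (with sets of blocks standing in for the abstract cache states at a program point):

```python
# Deriving the Table 14.1 category of a reference to block s from the
# must analysis (blocks guaranteed in cache) and the may analysis
# (blocks possibly in cache) at that program point.

def categorize(s, must_blocks, may_blocks):
    """Return 'ah', 'am', or 'nc' for a reference to block s."""
    if s in must_blocks:
        return "ah"              # always hit: guaranteed in cache
    if s not in may_blocks:
        return "am"              # always miss: provably absent
    return "nc"                  # not classified

print(categorize("a", {"a", "d"}, {"a", "c", "d"}))  # ah
print(categorize("b", {"a", "d"}, {"a", "c", "d"}))  # am
print(categorize("c", {"a", "d"}, {"a", "c", "d"}))  # nc
```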
© 2006 by Taylor & Francis Group, LLC
Determining Bounds on Execution Times
14-9
So much for the commonalities of all the domains to be designed. Note that all the constructions are parameterized in ⊑ and ⊔. The transfer functions, the abstract cache update functions, all denoted Û, describe the effects of a control flow node on an element of the abstract domain. They are composed of two parts:

1. "Refreshing" the accessed memory block, that is, inserting it into the youngest cache line.
2. "Aging" some other memory blocks already in the abstract cache.

14.2.3.1 Termination of the Analyses

There are only a finite number of cache lines, and for each program a finite number of memory blocks. This means that the domain of abstract cache states ĉ : L → 2^S′ is finite. Hence, every ascending chain is finite. Additionally, the abstract cache update functions Û are monotonic. This guarantees that all the analyses terminate.

14.2.3.2 Must Analysis

As explained above, the must analysis determines a set of memory blocks that are in the cache at a given program point whenever execution reaches this point. Good information, in the sense of being valuable for the prediction of cache hits, is the knowledge that a memory block is in this set; the bigger the set, the better. As we will see, additional information will even tell how long it will at least stay in the cache. This is connected to the "age" of a memory block. Therefore, the partial order on the must-domain is as follows. Take an abstract cache state ĉ. Above ĉ in the domain, that is, less precise, are states where memory blocks from ĉ are either missing or are older than in ĉ. Therefore, the ⊔-operator applied to two abstract cache states ĉ1 and ĉ2 produces a state ĉ containing only those memory blocks contained in both and gives them the maximum of their ages in ĉ1 and ĉ2 (see Figure 14.5).
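The "intersection + maximal age" join might be sketched as follows, using a list of sets of blocks (index = relative age) as the abstract cache state; the representation is ours:

```python
# Sketch of the must-analysis join: keep only blocks present in both
# abstract states, at the maximum (oldest) of their two ages.

def join_must(c1, c2):
    def age(c, s):  # position (relative age) of block s in state c
        return next(i for i, line in enumerate(c) if s in line)
    common = set().union(*c1) & set().union(*c2)
    out = [set() for _ in c1]
    for s in common:
        out[max(age(c1, s), age(c2, s))].add(s)
    return out

# The combination shown in Figure 14.5:
c1 = [{"c"}, {"e"}, {"a"}, {"d"}]
c2 = [{"a"}, set(), {"c", "f"}, {"d"}]
assert join_must(c1, c2) == [set(), set(), {"a", "c"}, {"d"}]
```

Blocks e and f drop out because they are not in both states; a and c survive at age 2, the older of their two ages, and d keeps age 3.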
FIGURE 14.5 Combination for must analysis. (Joining [{c}, {e}, {a}, {d}] and [{a}, { }, {c, f}, {d}] by "intersection + maximal age" yields [{ }, { }, {a, c}, {d}].)

The positions of the memory blocks in the abstract cache state are thus upper bounds of the ages of the memory blocks in the concrete caches occurring in the collecting cache semantics. Concretization of an abstract cache state ĉ produces the set of all concrete cache states which contain all the memory blocks contained in ĉ with ages not older than in ĉ. Cache lines not filled by these are filled with other memory blocks.

We use the abstract cache update function depicted in Figure 14.6; let us argue its correctness. The following theorem formulates the soundness of the must-cache analysis.

Theorem 14.1 Let n be a program point, ĉin the abstract cache state at the entry to n, and s a memory line in ĉin with age k. Then

(i) for each 1 ≤ k ≤ A there are at most k memory lines in lines 1, 2, . . . , k, and
(ii) on all paths to n, s is in the cache with age at most k.

The solution of the must analysis problem is interpreted as follows: let ĉ be an abstract cache state at some program point. If sx ∈ ĉ(li) for a cache line li, then sx will definitely be in the cache whenever execution reaches this program point. A reference to sx is categorized as always hit (ah). There is even a stronger interpretation of the fact that sx ∈ ĉ(li): sx will stay in the cache at least for the next n − i
FIGURE 14.6 Update of an abstract fully associative (sub-) cache. (Accessing s in the abstract state [{x}, { }, {s, t}, {y}] yields [{s}, {x}, {t}, {y}]. Ages increase from "young" to "old".)

FIGURE 14.7 Combination for may analysis. (Joining [{c}, {e}, {a}, {d}] and [{a}, {c, f}, { }, {d}] by "union + minimal age" yields [{a, c}, {e, f}, { }, {d}].)
references to memory blocks that are not in the cache or are older than the memory blocks in ĉ, whereby sa is older than sb means: ∃li, lj : sa ∈ ĉ(li), sb ∈ ĉ(lj), i > j.

14.2.3.3 May Analysis

To determine if a memory block sx will never be in the cache, we compute the complementary information, that is, sets of memory blocks that may be in the cache. "Good" information here is that a memory block is not in this set, because this memory block can then be classified as definitely not in the cache whenever execution reaches the given program point. Thus, the smaller the sets, the better. Additionally, older blocks will reach the desired situation of being removed from the cache faster than younger ones. Therefore, the partial order on this domain is as follows. Take some abstract cache state ĉ. Above ĉ in the domain, that is, less precise, are those states which contain additional memory blocks or where memory blocks from ĉ are younger than in ĉ. Therefore, the ⊔-operator applied to two abstract cache states ĉ1 and ĉ2 produces a state ĉ containing those memory blocks contained in ĉ1 or ĉ2 and gives them the minimum of their ages in ĉ1 and ĉ2 (see Figure 14.7). The positions of the memory blocks in the abstract cache state are thus lower bounds of the ages of the memory blocks in the concrete caches occurring in the collecting cache semantics.

The solution of the may analysis problem is interpreted as follows: the fact that sx is in the abstract cache ĉ means that sx may be in the cache during some execution when the program point is reached. If sx is not in ĉ(li) for any li, then it will definitely not be in the cache on any execution. A reference to sx is then categorized as always miss (am).
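The dual "union + minimal age" join might be sketched analogously to the must case, with the same list-of-sets representation (ours, for illustration):

```python
# Sketch of the may-analysis join: keep every block present in either
# abstract state, at the minimum (youngest) of its ages.

def join_may(c1, c2):
    def age(c, s):  # relative age of s in state c, or None if absent
        return next((i for i, line in enumerate(c) if s in line), None)
    out = [set() for _ in c1]
    for s in set().union(*c1) | set().union(*c2):
        ages = [a for a in (age(c1, s), age(c2, s)) if a is not None]
        out[min(ages)].add(s)
    return out

# The combination shown in Figure 14.7:
c1 = [{"c"}, {"e"}, {"a"}, {"d"}]
c2 = [{"a"}, {"c", "f"}, set(), {"d"}]
assert join_may(c1, c2) == [{"a", "c"}, {"e", "f"}, set(), {"d"}]
```

Every block survives the join, each at the youngest age it has in either state, so that the result is a safe lower bound on ages: a block can only be evicted later than the abstract state suggests, never earlier.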
14.3 Pipeline Analysis

Pipeline analysis attempts to find out how instructions move through the pipeline; in particular, it determines how many cycles they spend in the pipeline. This largely depends on the timing accidents the instructions suffer. Timing accidents during pipelined execution can be of several kinds. Cache misses during instruction or data load stall the pipeline for as many cycles as the cache-miss penalty indicates. Functional units that an instruction needs may be occupied. Queues into which the instruction may have to be moved may be full, and prefetch queues, from which instructions have to be loaded, may be empty. The bus needed for a pipeline phase may be occupied by a different phase of another instruction.
Again, for architectures without timing anomalies we can use a simplified picture, in which the task is to find out which timing accidents can be safely excluded, because each excluded accident allows the bound for the execution time to be decreased. Accidents that cannot be safely excluded are assumed to happen. A cache analysis as described in Section 14.2 has annotated the instructions with cache-hit information. This information is used to exclude pipeline stalls at instruction or data fetches.

We will explain pipeline analysis in a number of steps, starting with concrete-pipeline execution. A pipeline goes through a number of pipeline phases and consumes a number of cycles when it executes a sequence of instructions; in general, a different number of cycles for different initial execution states. The execution of the instructions in the sequence overlaps in the instruction pipeline as far as the data dependences between instructions permit and the pipeline conditions are satisfied. Each execution of a sequence of instructions starting in some initial state produces one trace, that is, a sequence of execution states. The length of the trace is the number of cycles this execution takes. Thus, concrete execution can be viewed as applying a function

function exec(b : basic block, s : pipeline state) t : trace

that executes the instruction sequence of basic block b starting in concrete pipeline state s, producing a trace t of concrete states. last(t) is the final state when executing b; it is the initial state for the successor block to be executed next.

So far, we talked about concrete execution on a concrete pipeline. Pipeline analysis regards abstract execution of sequences of instructions on abstract (models of) pipelines. The execution of programs on abstract pipelines produces abstract traces, that is, sequences of abstract states, where some information contained in the concrete states may be missing.
There are several types of missing information:

• The cache analysis in general has incomplete information about cache contents.
• The latency of an arithmetic operation, if it depends on the operand sizes, may be unknown. It influences the occupancy of pipeline units.
• The state of a dynamic branch predictor changes over the iterations of a loop and may be unknown for a particular iteration.
• Data dependences cannot safely be excluded, because the effective addresses of operands are not always statically known.
14.3.1 Simple Architectures without Timing Anomalies

In a first step, we assume a simple processor architecture with in-order execution and without timing anomalies, that is, an architecture where local worst cases contribute to the program's global execution time, cf. Section 14.1.2. For such an architecture, it is safe to assume the local worst case for unknown information and to add the corresponding timing penalty. For example, the cache-miss penalty has to be added for the fetch of an instruction in the two cases that a cache miss is predicted or that neither a cache miss nor a cache hit can be predicted. The result of the abstract execution of an instruction sequence for a given initial abstract state is again one trace, however, possibly of a greater length. Because worst cases were assumed for all uncertainties, this number of cycles is a safe upper bound for all executions of the basic block starting in concrete states represented by this initial abstract state.

The algorithm for pipeline analysis is quite simple. It uses a function

function exêc(b : cache-annotated basic block, ŝ : abstract pipeline state) t̂ : abstract trace

that executes the instruction sequence of basic block b, annotated with cache information, starting in the abstract pipeline state ŝ and producing a trace t̂ of abstract states. This function is applied to each basic block b in each of its contexts and the empty pipeline state ŝ0 corresponding to a flushed pipeline. Therefore, a linear traversal of the cache-annotated context-extended
Basic-Block Graph suffices. The result is a trace for the instruction sequence of the block, whose length is an upper bound for the execution time of the block in this context. Note that it still makes sense to analyze a basic block in several contexts, because the cache information for them may be quite different.

Note that this algorithm is simple and efficient, but not necessarily very precise. Starting with a flushed pipeline at the beginning of the basic block is safe, but it ignores the potential overlap between consecutive basic blocks. A more precise algorithm is possible. The problem lies with basic blocks having several predecessor blocks: which of their final states should be selected as the initial state of the successor block? A first solution works with sets of states for each pair of basic block and context. One analysis of the basic block in the context is then performed for each of the initial states. The resulting set of final states is passed on to the successor blocks, and the maximum of the trace lengths is taken as the upper bound for this basic block in this context. A second solution works with a single state per basic block and context and combines the set of predecessor final states conservatively into the initial state of the successor.
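The local-worst-case assumption of this simple scheme can be illustrated with made-up latencies: wherever the cache analysis could not guarantee a hit for an instruction fetch, the miss penalty is charged. This is only a toy model of the per-block bound, not the aiT algorithm:

```python
# Toy per-block bound for an architecture without timing anomalies:
# charge the cache-miss penalty whenever a hit cannot be guaranteed.

MISS_PENALTY = 30  # hypothetical cycles, cf. the PowerPC example

def block_bound(instructions):
    """instructions: list of (base_cycles, fetch_category) pairs, where
    fetch_category is 'ah', 'am', or 'nc' from the cache analysis."""
    cycles = 0
    for base, cat in instructions:
        cycles += base
        if cat != "ah":          # miss predicted, or not classifiable
            cycles += MISS_PENALTY
    return cycles

# Three instructions; only the second has a guaranteed I-cache hit.
print(block_bound([(1, "nc"), (3, "ah"), (1, "am")]))  # 31 + 3 + 31 = 65
```

Note how an `nc` fetch is charged exactly like an `am` fetch: uncertainty is resolved toward the local worst case, which is only sound in the absence of timing anomalies.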
14.3.2 Processors with Timing Anomalies

In the next step, we assume more complex processors, including those with out-of-order execution. They typically have timing anomalies: our assumption above, that local worst cases contribute worst-case times to the global execution time, is no longer valid. This forces us to consider several paths wherever uncertainty in the abstract execution state does not allow a decision between several successor states. Note that the absence of information turns the deterministic concrete pipeline into an abstract pipeline that is nondeterministic. This situation is depicted in Figure 14.8. It demonstrates two cases of missing information in the abstract state. First, the abstract state lacks the information whether the instruction is in the I-cache. Pipeline analysis has to follow both cases at an instruction fetch, because it could turn out that the I-cache miss, in fact, is not the global worst case. Second, the abstract state does not contain information about the size of the operands; we also have to follow both paths. The dashed paths have to be explored to obtain the execution times for this instruction. Depending on the architecture, we may be able to conservatively assume the case of large operands and suppress some paths.

The algorithm has to combine cache and pipeline analysis because of the interference between the two, which actually is the reason for the existence of timing anomalies. For the cache analysis, it uses the abstract cache states discussed in Section 14.2. For the pipeline part, it uses analysis states, which are sets of abstract pipeline states, that is, sets of states of the abstract pipeline. The question arises whether
FIGURE 14.8 Different paths through the execution of a multiply instruction. (The figure shows the pipeline stages Fetch, Issue, Execute, and Retire and the decisions I-cache miss?, Unit occupied?, Multicycle?, and Pending instructions?, with overall execution times ranging from 4 to 41 cycles.) Decisions inside the boxes cannot be deterministically taken based on the abstract execution state because of missing information.
an abstract cache state is to be combined with an analysis state ŝs as a whole or with each of the abstract pipeline states in ŝs individually. So, there could be one abstract cache state for ŝs representing the concrete cache contents for all abstract pipeline states in ŝs, or there could be one abstract cache state per abstract pipeline state in ŝs. The first choice saves memory during the analysis, but loses precision: different pipeline states may cause different memory accesses and thus different cache contents, which have to be merged into the one abstract state, thereby losing information. The second choice is more precise but requires more memory during the analysis. We choose the second alternative and thus define a new domain of analysis states Â of the following type:

Â = 2^(Ŝ × Ĉ)    (14.1)

where

Ŝ = set of abstract pipeline states    (14.2)
Ĉ = set of abstract cache states    (14.3)
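The domain of Equations (14.1)–(14.3) and its nondeterministic cycle step might be sketched as follows. The pipeline and cache representations here are opaque placeholders of our own invention, not the tool's:

```python
# Sketch of the analysis-state domain: an analysis state is a set of
# (abstract pipeline state, abstract cache state) pairs; one abstract
# cycle step may split a pair into several successors when information
# is missing.

def step_analysis_state(a_hat, step_pair):
    """step_pair maps one (pipeline, cache) pair to the set of its
    successor pairs (more than one in case of uncertainty)."""
    out = set()
    for pair in a_hat:
        out |= step_pair(pair)
    return out

# Hypothetical uncertainty: an unclassified fetch splits into the
# I-cache-hit and I-cache-miss cases, as in Figure 14.8.
def demo_step(pair):
    pipeline, cache = pair
    if pipeline == "fetch?":
        return {("fetch-hit", cache), ("fetch-miss", cache)}
    return {(pipeline, cache)}

print(sorted(step_analysis_state({("fetch?", "c0")}, demo_step)))
# [('fetch-hit', 'c0'), ('fetch-miss', 'c0')]
```

Pairing a cache state with each pipeline state keeps the two analyses consistent: the cache contents seen by one speculative pipeline path are never merged with those of another.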
The algorithm again uses a new function

function exêc_c(b : basic block, â : analysis state) T̂ : set of abstract traces

which analyzes a basic block b starting in an analysis state â consisting of pairs of abstract pipeline states and abstract cache states. As a result, it produces a set of abstract traces. The algorithm is as follows.
14.3.3 Algorithm Pipeline-Analysis

Perform a fixpoint iteration over the context-extended basic-block graph: for each basic block b in each of its contexts c, and for the initial analysis state â, compute exêc_c(b, â), yielding a set of traces {t̂1, t̂2, ..., t̂m}. max({|t̂1|, |t̂2|, ..., |t̂m|}) is the bound for this basic block in this context. The set of output states {last(t̂1), last(t̂2), ..., last(t̂m)} is passed on to the successor block(s) in context c as initial states. Basic blocks (in some context) having more than one predecessor receive the union of the sets of output states as initial states.

The abstraction we use as analysis states is a set of abstract pipeline states, since the number of possible pipeline states for one instruction is not too big. Hence, our abstraction computes an upper bound to the collecting semantics. The abstract update for an analysis state â is thus the application of the concrete update on each abstract pipeline state in â, extended with the possibility of multiple successor states in case of uncertainties.

Figure 14.9 shows the possible pipeline states for a basic block in this example. Such pictures are produced by the aiT tool upon special demand. The large dark grey boxes correspond to the instructions of the basic block, and the smaller rectangles in them stand for individual pipeline states. Their cycle-wise evolution is indicated by the strokes connecting them. Each layer in the trees corresponds to one CPU cycle. Branches in the trees are caused by conditions that could not be statically evaluated, for example, a memory access with unknown address in the presence of memory areas with different access times. On the other hand, two pipeline states fall together when the details they differ in leave the pipeline. This happens, for instance, at the end of the second instruction, reducing the number of states from four to three.
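The fixpoint iteration can be sketched as follows. This is an illustrative skeleton, not the aiT implementation: contexts are omitted, exêc_c is passed in as a parameter, a trace is a tuple of per-cycle states, and its length stands in for |t̂|.

```python
def pipeline_analysis(blocks, succs, entry, exec_c, init_states):
    """blocks: iterable of block ids; succs: block id -> successor ids;
    exec_c(block, analysis_state) -> set of abstract traces.
    Returns a per-block upper bound (maximum trace length, in cycles)."""
    in_states = {b: set() for b in blocks}
    in_states[entry] = set(init_states)
    bounds = {b: 0 for b in blocks}
    worklist = [entry]
    while worklist:
        b = worklist.pop()
        traces = exec_c(b, frozenset(in_states[b]))
        bounds[b] = max(len(t) for t in traces)   # bound for this block
        out = {t[-1] for t in traces}             # output states
        for s in succs.get(b, ()):
            if not out <= in_states[s]:           # union at joins
                in_states[s] |= out               # grows monotonically
                worklist.append(s)                # re-analyze successor
    return bounds
```

Termination follows because the sets of input states grow monotonically within a finite state space.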
The update function belonging to an edge (ν, ν′) of the control-flow graph updates each abstract pipeline state separately. When the bus unit is updated, the pipeline state may split into several successor states with different cache states. The initial analysis state is a set of empty pipeline states plus a cache that represents a cache with unknown content. There can be multiple concrete pipeline states in the initial states, since the alignment of the internal to the external clock of the processor is not known in the beginning and every possibility (aligned, one cycle apart, etc.) has to be considered. Thus, prefetching must start from scratch, but pending bus requests are ignored. To obtain correct results, they must be taken into account by adding a fixed penalty to the calculated upper bounds.

FIGURE 14.9 Possible pipeline states in a basic block.
14.3.4 Pipeline Modeling

The basis for pipeline analysis is a model of an abstract version of the processor pipeline, which is conservative with respect to the timing behavior, that is, times predicted by the abstract pipeline must never be lower than those observed in concrete executions. Some terminology is needed to avoid confusion. Processors have concrete pipelines, which may be described in some formal language, for example, VHDL. If this is the case, there exists a formal model of the pipeline. Our abstraction step, by which we eliminate many components of a concrete pipeline that are not relevant for the timing behavior, leads us to an abstract pipeline. This may again be described in a formal language, for example, VHDL, and thus have a formal model. Deriving an abstract pipeline is a complex task. It is demonstrated here for the Motorola ColdFire, a processor quite popular in the aeronautics and submarine industries. The presentation closely follows that of Reference 18.²

14.3.4.1 The ColdFire MCF 5307 Pipeline

The pipeline of the ColdFire MCF 5307 consists of a fetch pipeline that fetches instructions from memory (or the cache), and an execution pipeline that executes instructions, cf. Figure 14.10. Fetch and execution pipelines are connected and, as far as speed is concerned, decoupled by a FIFO instruction buffer that can hold at most 8 instructions. The MCF 5307 accesses memory through a bus hierarchy. The fast pipelined K-bus connects the cache and an internal 4-KB SRAM area to the pipeline. Accesses to this bus are performed by the IC1/IC2 and the AGEX and DSOC stages of the pipeline. On the next level, the M-Bus connects the K-Bus to the internal peripherals. This bus runs at the external bus frequency, while the K-Bus is clocked with the faster internal core clock. The M-Bus connects to the external bus, which accesses off-chip peripherals and memory.
The fetch pipeline performs branch prediction in the IED stage, redirecting fetching long before the branch reaches the execution stages. The fetch pipeline is stalled if the instruction buffer is full, or if the execution pipeline needs the bus for a memory access. All these stalls cause the pipeline to wait for one cycle, after which the stall condition is checked again. The fetch pipeline is also stalled if the memory block to be fetched is not in the cache (cache miss). The pipeline must wait until the memory block is loaded into the cache and forwarded to the pipeline. The instructions that are already in the later stages of the fetch pipeline are forwarded to the instruction buffer.

²The model of the abstract pipeline of the MCF 5307 has been derived by hand. A computer-supported derivation would have been preferable. Ways to develop this are the subject of current research.
[The figure shows the Instruction Fetch Pipeline (IFP) with the stages IAG (Instruction Address Generation), IC1 (Instruction Fetch Cycle 1), IC2 (Instruction Fetch Cycle 2), IED (Instruction Early Decode), and IB (FIFO Instruction Buffer), feeding the Operand Execution Pipeline (OEP) with the stages DSOC (Decode & Select, Operand Fetch) and AGEX (Address Generation, Execute), connected to 32-bit address and data buses.]
FIGURE 14.10 The pipeline of the Motorola ColdFire 5307 processor.
The execution pipeline finishes the decoding of instructions, evaluates their operands, and executes the instructions. Each kind of operation follows a fixed schedule. This schedule determines how many cycles the operation needs and in which cycles memory is accessed.³ The execution time varies between 2 cycles and several dozen cycles. Pipelining admits a maximum overlap of 1 cycle between consecutive instructions: the last cycle of each instruction may overlap with the first of the next one. In this first cycle, no memory access and no control-flow alteration happen. Thus, cache and pipeline cannot be affected by two different instructions in the same cycle. The execution of an instruction is delayed if memory accesses lead to cache misses. Misaligned accesses lead to small time penalties of 1 to 3 cycles. Store operations are delayed if the distance to the previous store operation is less than 2 cycles. (This does not hold if the previous store operation was issued by a MOVEM instruction.) The start of the next instruction is delayed if the instruction buffer is empty.
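The store-separation rule, for instance, boils down to a tiny predicate. This is an illustrative reformulation, not code from a real pipeline model:

```python
def store_delay(current_cycle, last_store_cycle):
    """Extra wait cycles for a store issued at current_cycle when the
    previous (non-MOVEM) store was issued at last_store_cycle: stores
    must be at least 2 cycles apart."""
    gap = current_cycle - last_store_cycle
    return max(0, 2 - gap)
```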
14.3.5 Formal Models of Abstract Pipelines

An abstract pipeline can be seen as a big finite state machine, which makes a transition on every clock cycle. The states of the abstract pipeline, although greatly simplified, still contain all the timing-relevant information of the processor. The number of transitions it takes from the beginning of the execution of an instruction until its end gives the execution time of that instruction. The abstract pipeline, although greatly reduced by leaving out irrelevant components, is still a really big finite state machine, but it has structure. Its states can be naturally decomposed into components according to the architecture. This makes it easier to specify, verify, and implement a model of an abstract pipeline. In the formal approach presented here, an abstract pipeline state consists of several units with inner states that communicate with one another and with the memory via signals, and evolve cycle-wise according to their inner state and the signals received. Thus, the means of decomposition are units and signals. Signals may be instantaneous, meaning that they are received in the same cycle as they are sent, or delayed, meaning that they are received one cycle after they have been sent. Signals may carry data, for example, a fetch address. Note that these signals are only part of the formal pipeline model. They may or may not correspond to real hardware signals. The instantaneous signals between units are used to transport information between the units. The state transitions are coded in the evolution rules local to each unit.

Figure 14.11 shows the formal pipeline model for the ColdFire MCF 5307. It consists of the following units: IAG (instruction address generation), IC1 (instruction fetch cycle 1), IC2 (instruction fetch cycle 2), IED (instruction early decode), IB (instruction buffer), EX (execution unit), and SST (store stall timer). In addition, there is a bus unit modeling the buses that connect the CPU, the static RAM, the cache, and the main memory. The signals between these units are shown as arrows. Most units directly correspond to a stage in the real pipeline. However, the SST unit is used to model the fact that two stores must be separated by at least two clock cycles. It is implemented as a (virtual) counter. The two stages of the execution pipeline are modeled by a single stage, EX, because instructions can only overlap by one cycle. The inner states and emitted signals of the units evolve in each cycle. The complexity of this state update varies from unit to unit. It can be as simple as a small table, mapping pending signals and inner state to a new state and signals to be emitted, for example, for the IAG unit and the IC1 unit. It can be much more complicated if multiple dependencies have to be considered, for example, the instruction reconstruction and branch prediction in the IED stage. In this case, the evolution is formulated in pseudocode. Full details on the model can be found in Reference 19.

³In fact, there are some instructions like MOVEM whose execution schedule depends on the value of an argument given as an immediate constant. These instructions can be taken into account by special means.

FIGURE 14.11 Abstract model of the Motorola ColdFire 5307 processor. [The figure shows the units IAG, IC1, IC2, IED, IB, EX, and SST together with the bus unit, connected by signals such as set(a)/stop, addr(a), fetch(a), await(a), code(a), instr, next, start, read(A)/write(A), data/hold, store, cancel, wait, hold, and put(a).]
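The unit/signal decomposition can be mimicked directly in code. The sketch below is illustrative only (the unit behaviors and signal names are made up): units are updated once per cycle in dependency order, so instantaneous signals become visible within the same cycle, while delayed signals are buffered for the next one.

```python
def cycle(units, order, state, signals):
    """One clock cycle of the formal model. 'units' maps a unit name to
    a function (inner_state, received_signal) ->
    (new_inner_state, instantaneous_out, delayed_out), where each *_out
    is a list of (target_unit, signal) pairs. Returns the new state and
    the delayed signals pending for the next cycle."""
    instant, delayed = dict(signals), {}
    for u in order:                      # dependency order of the units
        state[u], inst_out, del_out = units[u](state[u], instant.get(u))
        for tgt, sig in inst_out:        # received in this very cycle
            instant[tgt] = sig
        for tgt, sig in del_out:         # received in the next cycle
            delayed[tgt] = sig
    return state, delayed
```

A two-unit toy model (an address generator feeding a fetch unit) already exercises both signal kinds; real units like IED would carry much richer inner state.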
14.3.6 Pipeline States

Abstract pipeline states are formed by combining the inner states of IAG, IC1, IC2, IED, IB, EX, SST, and the bus unit, plus additional entries for pending signals, into one overall state. This overall state evolves from one cycle to the next. Practically, the evolution of the overall pipeline state can be implemented by updating the functional units one by one in an order that respects the dependencies introduced by input signals and the generation of these signals.

14.3.6.1 Update Function for Pipeline States

For pipeline modeling, one needs a function that describes the evolution of the concrete pipeline state while traveling along an edge (ν, ν′) of the control-flow graph. This function can be obtained by iterating the cycle-wise update function of the previous paragraph. An initial concrete pipeline state at ν has an empty execution unit EX. It is updated until an instruction is sent from IB to EX. Updating of the concrete pipeline state continues, using the knowledge that the successor instruction is ν′, until EX has become empty again. The number of cycles needed from the beginning until this point can be taken as the time needed for the transition from ν to ν′ for this concrete pipeline state.
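Iterating a given cycle-wise update function until EX empties yields the per-edge transition time. The sketch assumes a state represented as a dict with an "EX" entry (remaining occupancy) and is purely illustrative:

```python
def edge_time(state, v, v_next, cycle_update, max_cycles=10_000):
    """Cycles needed for the transition (v, v_next), starting from a
    concrete pipeline state whose execution unit EX is empty."""
    cycles = 0
    # Phase 1: update until the instruction of v enters EX.
    while not state["EX"]:
        state = cycle_update(state, next_instr=v)
        cycles += 1
        if cycles > max_cycles:
            raise RuntimeError("no progress in pipeline model")
    # Phase 2: continue, knowing the successor is v_next, until EX empties.
    while state["EX"]:
        state = cycle_update(state, next_instr=v_next)
        cycles += 1
        if cycles > max_cycles:
            raise RuntimeError("no progress in pipeline model")
    return cycles, state
```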
14.4 Path Analysis Using Integer Linear Programming

The structure of a program and the set of program paths can be mapped to an ILP in a very natural way. A set of constraints describes the control flow of the program. Solving these constraints yields very precise results [5]. However, precision requirements demand analyzing basic blocks in different contexts, that is, differently depending on how control reached them. This makes the control flow quite complex, so that the mapping to an ILP may be very complex [14].

A problem formulated as an ILP consists of two parts: the cost function and constraints on the variables used in the cost function. Our cost function represents the number of CPU cycles; correspondingly, it has to be maximized. Each variable in the cost function represents the execution count of one basic block of the program and is weighted by the execution time of that basic block. Additionally, variables are used corresponding to the traversal counts of the edges in the control flow graph, see Figure 14.12. The integer constraints describing how often basic blocks are executed relative to each other can be automatically generated from the control flow graph (Figure 14.13). However, additional information about the program provided by the user is usually needed, as the problem of finding the worst-case program path is unsolvable in the general case. Loop and recursion bounds cannot always be inferred automatically and must therefore be provided by the user.

The ILP approach for program path analysis has the advantage that users are able to describe in precise terms virtually anything they know about the program by adding integer constraints. The system first generates the obvious constraints automatically and then adds user-supplied constraints to tighten the WCET bounds.
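A minimal instance makes the cost function and the flow constraints concrete. All block times and the loop bound below are invented; a real tool hands the maximization to an ILP solver, while here the tiny search space (parameterized by the loop's body count) is enumerated by brute force:

```python
# CFG: entry -> test -> (body -> test)* -> exit. The analysis supplies
# per-block times (in cycles); the user supplies the loop bound.
times = {"entry": 1, "test": 2, "body": 5, "exit": 1}
LOOP_BOUND = 10   # user-supplied constraint: cnt(body) <= 10

def wcet_bound():
    best = 0
    for body in range(LOOP_BOUND + 1):
        # Execution counts satisfying flow preservation: the loop test
        # executes once more than the body; entry and exit run once.
        cnt = {"entry": 1, "test": body + 1, "body": body, "exit": 1}
        # Cost function: sum of time(v) * cnt(v), to be maximized.
        best = max(best, sum(times[v] * cnt[v] for v in cnt))
    return best
```

Without the user-supplied loop bound, the cost function would be unbounded, which is exactly why such annotations are required.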
FIGURE 14.12 A program snippet, the corresponding control flow graph, and the ILP variables generated. [The snippet is "if v1 then v2 else v3 fi; v4"; the generated variables are an execution count cnt(v) for each vertex v1, ..., v4 and a traversal count trav(e) for each edge e1, ..., e6.]

FIGURE 14.13 Control flow joins and splits and flow-preservation laws. [For a vertex v with incoming edges e1, ..., en and outgoing edges e′1, ..., e′m, the flow-preservation law reads: Σ_{i=1..n} trav(ei) = cnt(v) = Σ_{i=1..m} trav(e′i).]
14.5 Other Ingredients

14.5.1 Value Analysis

A static method for data-cache behavior prediction needs to know the effective memory addresses of data in order to determine where a memory access goes. However, effective addresses are only available at run time. Interval analysis, as described by Cousot and Halbwachs [20], can help here. It can compute intervals for address-valued objects like registers and variables. An interval computed for such an object at some program point bounds the set of potential values the object may have when program execution reaches this program point. Such an analysis, called value analysis in aiT, has been shown to be able to determine many effective addresses in disciplined code statically [10].
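The core of such an interval domain fits in a few lines; the array base, element size, and index range below are invented for illustration:

```python
def add(a, b):
    """Abstract addition of two intervals (lo, hi):
    the transfer function for address arithmetic."""
    return (a[0] + b[0], a[1] + b[1])

def join(a, b):
    """Merge two intervals at a control-flow join."""
    return (min(a[0], b[0]), max(a[1], b[1]))

# A register holding the base of an (assumed) array at 0x1000 with
# 4-byte elements and an index known to lie in [0, 9]:
base = (0x1000, 0x1000)
offset = (0 * 4, 9 * 4)
addr = add(base, offset)   # every access falls into [0x1000, 0x1024]
```

Knowing that every access falls into a bounded address range is what lets the cache analysis classify the access instead of invalidating the whole abstract cache.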
14.5.2 Control Flow Specification and Analysis

Any information about the possible flow of control of the program may increase the precision of the subsequent analyses. Control flow analysis may attempt to exclude infeasible paths, determine execution frequencies of paths or the relation between execution frequencies of different paths or subpaths, etc. The purpose of control flow analysis is to determine the dynamic behavior of the program. This includes information about what functions are called and with which arguments, how many times loops iterate, if there are dependencies between successive if-statements, etc. The main focus of flow analysis has been the determination of loop bounds, since the bounding of loops is a necessary step in order to find an execution time bound for a program.

Control-flow analysis can be performed manually or automatically. Automatic analyses have been based on various techniques, like symbolic execution, abstract interpretation, and pattern recognition on parse trees. The best precision is achieved by using interprocedural analysis techniques, but this has to be traded off against the extra computation time and memory required. All automatic techniques allow a user to complement the results and guide the analysis using manual annotations, since this is sometimes necessary in order to obtain reasonable results.

Since flow analysis is in general performed separately from path analysis, it does not know the execution times of individual program statements and must thus generate a safe (over)approximation including all possible program executions. The path analysis will later select, from the set of possible program paths, the path that corresponds to the upper bound, using the time information computed by processor-behavior prediction.
© 2006 by Taylor & Francis Group, LLC
Determining Bounds on Execution Times
14-19
Control flow specification is preferably done on the source level. Concepts based on source-level constructs are used in References 6 and 21.
14.5.3 Frontends for Executables

Any reasonably precise timing analysis takes fully linked executable programs as input. Source programs do not contain information about program and data allocation, which is essential for the described methods to predict the cache behavior. Executables must be analyzed to reconstruct the original control flow of the program. This may be a difficult task depending on the instruction set of the processor and the code generation of the compiler used. A generic approach to this problem is described in References 14, 22, and 23.
14.6 Related Work

It is not possible in general to obtain upper bounds on running times for programs; otherwise, one could solve the halting problem. However, real-time systems only use a restricted form of programming, which guarantees that programs always terminate: recursion is not allowed (or explicitly bounded) and the maximal iteration counts of loops are known in advance. A worst-case running time of a program could easily be determined if the worst-case input for the program were known, which is in general not the case. The alternative, to execute the program with all possible inputs, is often prohibitively expensive. As a consequence, approximations of the worst-case execution time are determined. Two classes of methods to obtain bounds can be distinguished:

• Dynamic methods employ real program executions to obtain approximations. These approximations are unsafe, as they only compute the maximum over a subset of all executions.
• Static methods only need the program itself, possibly extended with some additional information (like loop bounds).
14.6.1 A (Partly) Dynamic Method

A traditional method, still used in industry, combines measurement and static methods. Here, small snippets of code are measured for their execution time, then a safety margin is applied, and the results for the code pieces are combined according to the structure of the whole task. For example, if a task first executes a snippet A and then a snippet B, the resulting time is that measured for A, t_A, added to that measured for B, t_B: t = t_A + t_B. This reduces the amount of measurement that has to be done, as code snippets tend to be reused a lot in control software and only the distinct snippets need to be measured. It adds, however, the need for an argument about the correctness of the composition step applied to the measured snippet times. This typically relies on certain implicit assumptions about the worst-case initial execution state for these measurements. For example, the snippets are measured with an empty cache at the beginning of the measurement under the assumption that this is the worst-case cache state. In Reference 19 it is shown that this assumption can be wrong. The problem of unknown worst-case input exists for this method as well, and it is still infeasible to measure execution times for all input values.
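In numbers (all of them invented), the composition step looks as follows; the result is only trustworthy if the implicit worst-case-initial-state assumption behind the measurements actually holds:

```python
t_A = 120            # measured worst case of snippet A, in cycles
t_B = 200            # measured worst case of snippet B, in cycles
margin = 1.2         # 20% safety margin applied to the sum

t = margin * (t_A + t_B)   # composed estimate for the sequence "A; B"
```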
14.6.2 Purely Static Methods

14.6.2.1 The Timing-Schema Approach

In the timing-schemata approach [24], bounds for the execution times of a composed statement are computed from the bounds of its constituents. One timing schema is given for each type of statement. The basis is the known times of the atomic statements. These are assumed to be constant and either available from a manual or computed in a preceding phase. A bound for the whole program is obtained by combining results according to the structure of the program.
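The schemata can be written down as a small recursive function over an abstract syntax tree. The representation and the atomic times are invented for illustration:

```python
def bound(stmt, atomic):
    """Upper bound via timing schemata; 'atomic' maps atomic statements
    to their (assumed constant) execution times."""
    kind = stmt[0]
    if kind == "atom":                       # atomic statement
        return atomic[stmt[1]]
    if kind == "seq":                        # S1; S2
        return bound(stmt[1], atomic) + bound(stmt[2], atomic)
    if kind == "if":                         # if c then S1 else S2
        return atomic[stmt[1]] + max(bound(stmt[2], atomic),
                                     bound(stmt[3], atomic))
    if kind == "loop":                       # loop with known bound n
        n, cond, body = stmt[1], stmt[2], stmt[3]
        return n * (atomic[cond] + bound(body, atomic))
    raise ValueError(f"unknown statement kind: {kind}")
```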
The precision can be very bad because of some implicit assumptions underlying this method. Timing schemata assume compositionality of bounds for execution times, that is, they compute bounds for execution times of composed constructs from already computed bounds of the constituents. However, as we have seen, the execution times of the constituents depend heavily on the execution history.

14.6.2.2 Symbolic Simulation

Another static method simulates the execution of the program on an abstract model of the processor. The simulation is performed without input; the simulator thus has to be capable of dealing with partly unknown execution states. This method combines flow analysis, processor-behavior prediction, and path analysis in one integrated phase [25,26]. One problem with this approach is that the analysis time is proportional to the actual execution time of the program, with a usually large factor for doing a simulation.

14.6.2.3 WCET Determination by ILP

Li, Malik, and Wolfe proposed an ILP-based approach to WCET determination [27–30]. Cache and pipeline behavior prediction are formulated as a single linear program. The i960KB is investigated, a 32-bit microprocessor with a 512-byte direct-mapped instruction cache and a fairly simple pipeline. Only structural hazards need to be modeled, thus keeping the complexity of the integer linear program moderate compared with the expected complexity of a model for a modern microprocessor. Variable execution times, branch prediction, and instruction prefetching are not considered at all. Using this approach for superscalar pipelines does not seem very promising, considering the analysis times reported in one of the articles.

One of the severe problems is the exponential increase of the size of the ILP in the number of competing l-blocks. l-blocks are maximally long contiguous sequences of instructions in a basic block mapped to the same cache set.
Two l-blocks mapped to the same cache set compete if they do not have the same address tag. For a fixed cache architecture, the number of competing l-blocks grows linearly with the size of the program. Differentiation by contexts, absolutely necessary to achieve precision, increases this number further. Thus, the size of the ILP is exponential in the size of the program. Even though the problem is claimed to be a network-flow problem, the size of the ILP kills the approach. Growing associativity of the cache increases the number of competing l-blocks; thus, increasing cache-architecture complexity also works against this approach. Nonetheless, their method of modeling the control flow as an ILP, the so-called Implicit Path Enumeration, is elegant and can be efficient if the size of the ILP is kept small. It has been adopted by many groups working in this area.

14.6.2.4 Timing Analysis by Static Program Analysis

The method described in this chapter uses a sequence of static program analyses for determining the program's control flow and its data accesses and for predicting the processor's behavior for the given program. An early approach to timing analysis using data-flow analysis methods can be found in References 31 and 32. Jakob Engblom showed how to precompute parts of a timing analyzer to speed up the actual timing analysis for architectures without timing anomalies [33]. Reference 34 gives an overview of existing tools for timing analysis, both commercially available tools and academic prototypes.
14.7 State of the Art and Future Extensions

The timing-analysis technology described in this chapter is realized in the aiT tool and is used in the aeronautics and automotive industries. Several benchmarks have shown that the precision of the predicted upper bounds is on the order of 10% [10]. To obtain such precision, however, requires competent users, since the available knowledge about the program's control flow may be difficult to specify.
The computational effort is high, but acceptable. Future optimizations will reduce this effort. As often in static program analysis, there is a trade-off between precision and effort. Precision can be reduced if the effort is intolerable. The only real drawback of the described technology is the huge effort for producing abstract processor models. Work is under way to support this activity through transformations on the VHDL level.
Acknowledgments

Many former students have worked on different parts of the method presented in this chapter and have together built a timing-analysis tool satisfying industrial requirements. Christian Ferdinand studied cache analysis and showed that precise information about cache contents can be obtained. Stephan Thesing, together with Reinhold Heckmann and Marc Langenbach, developed methods to model abstract processors. Stephan went through the pains of implementing several abstract models for real-life processors such as the ColdFire MCF 5307 and the PPC 755. I owe him my thanks for help with the presentation of pipeline analysis. Henrik Theiling contributed the preprocessor technology for the analysis of executables and the translation of complex control flow to integer linear programs. Many thanks to him for his contribution to the path analysis section. Michael Schmidt implemented powerful versions of value analysis. Reinhold Heckmann managed to model even very complex cache architectures. Florian Martin implemented the program-analysis generator, PAG, which is the basis for many of the program analyses.
References

[1] Reinhold Heckmann, Marc Langenbach, Stephan Thesing, and Reinhard Wilhelm. The influence of processor architecture on the design and the results of WCET tools. Proceedings of the IEEE, 91: 1038–1054, 2003.
[2] P. Puschner and Ch. Koza. Calculating the maximum execution time of real-time programs. Real-Time Systems, 1: 159–176, 1989.
[3] Chang Yun Park and Alan C. Shaw. Experiments with a program timing tool based on source-level timing schema. IEEE Computer, 24: 48–57, 1991.
[4] Christopher A. Healy, David B. Whalley, and Marion G. Harmon. Integrating the timing analysis of pipelining and instruction caching. In Proceedings of the IEEE Real-Time Systems Symposium, December 1995, pp. 288–297.
[5] Henrik Theiling, Christian Ferdinand, and Reinhard Wilhelm. Fast and precise WCET prediction by separated cache and path analyses. Real-Time Systems, 18: 157–179, 2000.
[6] Andreas Ermedahl. A Modular Tool Architecture for Worst-Case Execution Time Analysis. Ph.D. thesis, Uppsala University, Uppsala, Sweden, 2003.
[7] Martin Alt, Christian Ferdinand, Florian Martin, and Reinhard Wilhelm. Cache behavior prediction by abstract interpretation. In Proceedings of SAS'96, Static Analysis Symposium, Vol. 1145 of Lecture Notes in Computer Science, Springer-Verlag, Heidelberg, 1996, pp. 52–66.
[8] Christian Ferdinand, Florian Martin, and Reinhard Wilhelm. Cache behavior prediction by abstract interpretation. Science of Computer Programming, 35: 163–189, 1999.
[9] C. Ferdinand, R. Heckmann, M. Langenbach, F. Martin, M. Schmidt, H. Theiling, S. Thesing, and R. Wilhelm. Reliable and precise WCET determination for a real-life processor. In Proceedings of the First International Workshop on Embedded Software, Vol. 2211 of Lecture Notes in Computer Science, Springer-Verlag, London, 2001, pp. 469–485.
[10] Stephan Thesing, Jean Souyris, Reinhold Heckmann, Famantanantsoa Randimbivololona, Marc Langenbach, Reinhard Wilhelm, and Christian Ferdinand. An abstract interpretation-based timing validation of hard real-time avionics software systems. In Proceedings of the 2003 International Conference on Dependable Systems and Networks (DSN 2003), IEEE Computer Society, Washington, 2003, pp. 625–632.
[11] Thomas Lundqvist and Per Stenström. Timing anomalies in dynamically scheduled microprocessors. In Proceedings of the 20th IEEE Real-Time Systems Symposium, December 1999, pp. 12–21.
[12] T. Reps, M. Sagiv, and R. Wilhelm. Shape analysis and applications. In Y.N. Srikant and Priti Shankar, Eds., The Compiler Design Handbook: Optimizations and Machine Code Generation, CRC Press, Boca Raton, FL, 2002, pp. 175–217.
[13] Florian Martin, Martin Alt, Reinhard Wilhelm, and Christian Ferdinand. Analysis of loops. In Proceedings of the International Conference on Compiler Construction (CC'98), Vol. 1383 of Lecture Notes in Computer Science, Springer-Verlag, Heidelberg, 1998, pp. 80–94.
[14] Henrik Theiling. Control Flow Graphs for Real-Time Systems Analysis. Ph.D. thesis, Universität des Saarlandes, Saarbrücken, Germany, 2002.
[15] Patrick Cousot and Radhia Cousot. Abstract interpretation: a unified lattice model for static analysis of programs by construction or approximation of fixpoints. In Proceedings of the 4th ACM Symposium on Principles of Programming Languages, Los Angeles, CA, 1977, pp. 238–252.
[16] Christian Ferdinand. Cache Behavior Prediction for Real-Time Systems. Ph.D. thesis, Universität des Saarlandes, Saarbrücken, 1997.
[17] Flemming Nielson, Hanne Riis Nielson, and Chris Hankin. Principles of Program Analysis. Springer-Verlag, Heidelberg, 1999.
[18] Marc Langenbach, Stephan Thesing, and Reinhold Heckmann. Pipeline modelling for timing analysis. In Manuel V. Hermenegildo and German Puebla, Eds., Static Analysis Symposium SAS 2002, Vol. 2477 of Lecture Notes in Computer Science, Springer-Verlag, Heidelberg, 2002, pp. 294–309.
[19] Stephan Thesing. Safe and Precise WCET Determination by Abstract Interpretation of Pipeline Models. Ph.D. thesis, Saarland University, Saarbrücken, 2004.
[20] Patrick Cousot and Nicolas Halbwachs. Automatic discovery of linear restraints among variables of a program. In Proceedings of the 5th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, Tucson, AZ, ACM Press, New York, 1978, pp. 84–96.
[21] Andreas Ermedahl and Jan Gustafsson. Deriving annotations for tight calculation of execution time. In Proceedings of Euro-Par, 1997, pp. 1298–1307.
[22] Henrik Theiling. Extracting safe and precise control flow from binaries. In Proceedings of the Seventh International Conference on Real-Time Systems and Applications, IEEE Computer Society, 2000, pp. 23–30.
[23] Henrik Theiling. Generating decision trees for decoding binaries. In ACM SIGPLAN 2001 Workshop on Languages, Compilers, and Tools for Embedded Systems, 2001, pp. 112–120.
[24] Alan C. Shaw. Reasoning about time in higher-level language software. IEEE Transactions on Software Engineering, 15: 875–889, 1989.
[25] Thomas Lundqvist and Per Stenström. An integrated path and timing analysis method based on cycle-level symbolic execution. Real-Time Systems, 17: 183–207, 1999.
[26] Thomas Lundqvist. A WCET Analysis Method for Pipelined Microprocessors with Cache Memories. Ph.D. thesis, Department of Computer Engineering, Chalmers University of Technology, Sweden, 2002.
[27] Yau-Tsun Steven Li and Sharad Malik. Performance analysis of embedded software using implicit path enumeration. In Proceedings of the 32nd ACM/IEEE Design Automation Conference, June 1995, pp. 456–461.
[28] Yau-Tsun Steven Li, Sharad Malik, and Andrew Wolfe. Efficient microarchitecture modeling and path analysis for real-time software. In Proceedings of the IEEE Real-Time Systems Symposium, December 1995, pp. 298–307.
[29] Yau-Tsun Steven Li, Sharad Malik, and Andrew Wolfe. Performance estimation of embedded software with instruction cache modeling. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, November 1995, pp. 380–387.
© 2006 by Taylor & Francis Group, LLC
15 Performance Analysis of Distributed Embedded Systems

Lothar Thiele and Ernesto Wandeler
Swiss Federal Institute of Technology

15.1 Performance Analysis ............ 15-1
Distributed Embedded Systems • Basic Terms • Role in the Design Process • Requirements
15.2 Approaches to Performance Analysis ............ 15-6
Simulation-Based Methods • Holistic Scheduling Analysis • Compositional Methods
15.3 The Performance Network Approach ............ 15-11
Performance Network • Variability Characterization • Resource Sharing and Analysis • Concluding Remarks
References ............ 15-17
15.1 Performance Analysis

15.1.1 Distributed Embedded Systems

An embedded system is a special-purpose information processing system that is closely integrated into its environment. It is usually dedicated to a certain application domain, and knowledge about the system behavior at design time can be used to minimize resources while maximizing predictability. The embedding into a technical environment and the constraints imposed by a particular application domain very often lead to heterogeneous and distributed implementations. In this case, systems are composed of hardware components that communicate via some interconnection network. The functional and nonfunctional properties of the whole system depend not only on the computations inside the various nodes but also on the interaction of the various data streams on the common communication media. In contrast to multiprocessor or parallel computing platforms, the individual computing nodes have a high degree of independence and usually communicate via message passing. It is particularly difficult to maintain global state and workload information, as the local processing nodes usually make independent scheduling and resource access decisions. In addition, the dedication to an application domain very often leads to heterogeneous distributed implementations, where each node is specialized to its local environment and/or its functionality. For example, in an automotive application one may find nodes (usually called embedded control units) that contain a communication controller, a CPU, memory, and I/O interfaces. But depending on the particular
task of a node, it may contain additional digital signal processors (DSPs), different kinds of CPUs and interfaces, and different memory capacities. The same observation holds for the interconnection networks. They may be composed of several interconnected smaller sub-networks, each one with its own communication protocol and topology. For example, in automotive applications we may find Controller Area Networks (CAN), time-triggered protocols (TTP) such as TTCAN, or hybrid protocols such as FlexRay. The complexity of a design is particularly high if the computation nodes responsible for a single application are distributed across several networks. In this case, critical information may flow through several sub-networks and connecting gateways before it reaches its destination. More recently, the architectural concepts of heterogeneity, distribution, and parallelism described earlier can be observed at several layers of granularity. The term system-on-a-chip refers to the implementation of sub-systems on a single device that contains a collection of (digital or analog) interfaces, buses, memory, and heterogeneous computing resources such as FPGAs, CPUs, controllers, and DSPs. These individual components are connected using "networks-on-chip" that can be regarded as dedicated interconnection networks involving adapted protocols, bridges, or gateways. Based on this assessment, it becomes obvious that heterogeneous and distributed embedded systems are inherently difficult to design and to analyze. In many cases, not only the availability, the safety, and the correctness of the computations of the whole embedded system are of major concern, but also the timeliness of the results. One cause for end-to-end timing constraints is the fact that embedded systems are frequently connected to a physical environment through sensors and actuators.
Typically, embedded systems are reactive systems that are in continuous interaction with their environment, and they must execute at a pace determined by that environment. Examples are automatic control tasks, manufacturing systems, mechatronic systems, automotive/air/space applications, radio receivers and transmitters, and signal processing tasks in general. Also in the case of multimedia and content production, missing audio or video samples need to be avoided under all circumstances. As a result, many embedded systems must meet real-time constraints, that is, they must react to stimuli within the time interval dictated by the environment. A real-time constraint is called hard if not meeting it could result in a catastrophic failure of the system, and it is called soft otherwise. As a consequence, time-predictability in the strong sense cannot be guaranteed using statistical arguments. Finally, let us give an example that shows part of the complexity in the performance and timing analysis of distributed embedded systems. The example, adapted from Reference 1, is particularly simple in order to point out one source of difficulties, namely the interaction of event streams on a communication resource (Figure 15.1).
FIGURE 15.1 Interference of two applications on a shared communication resource.
The application A1 consists of a sensor that periodically sends bursts of data to the CPU, which stores them in the memory using a task P1. These data are processed by the CPU using a task P2, with a worst-case execution time (WCET) and a best-case execution time (BCET). The processed data are transmitted via the shared bus to a hardware input/output device that is running task P3. We suppose that the CPU uses a preemptive fixed-priority scheduling policy, where P1 has the highest priority. The maximal workload on the CPU is obtained when P2 continuously uses the WCET and when the sensor simultaneously submits data. There is a second streaming application A2 that receives real-time data in equidistant packets via the Input interface. The Input interface runs task P4 to send the data to a DSP for processing with task P5. The processed packets are then transferred to a playout buffer, and task P6 periodically removes packets from the buffer, for example, for playback. We suppose that the bus uses a FCFS (first-come-first-served) scheme for arbitration. As the bus transactions from the applications A1 and A2 interfere on the common bus, there will be jitter in the packet stream received by the DSP that eventually may lead to an undesirable buffer overflow or underflow. It is now interesting to note that the worst-case situation in terms of jitter occurs if the processing in A1 uses its BCET, as this leads to a blocking of the bus for a long time period. Therefore, the worst-case situation for the CPU load leads to a best case for the bus, and vice versa. In more realistic situations, there will be simultaneous resource sharing on the computing and communication resources, there may be different protocols and scheduling policies on these resources, there may be a distributed architecture using interconnected sub-networks, and there may be additional nondeterminism caused by unknown input patterns and data.
It is the purpose of performance analysis to determine the timing and memory properties of such systems.
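The counterintuitive role of the BCET can be illustrated with a small simulation sketch. All task parameters, burst sizes, and durations below are invented for illustration and are not the chapter's exact numbers; the point is only that an early burst (P2 at its BCET) disturbs A2's packet stream more than a late one.

```python
# Illustrative sketch of the A1/A2 interference: a FCFS bus serves
# A1's burst of transactions (released once P2 finishes) and A2's
# periodic packets. All numbers are invented.

def fcfs_bus(requests):
    """Serve (arrival, duration, tag) requests in FCFS order and
    return the completion times grouped by tag."""
    done, t = {}, 0.0
    for arrival, duration, tag in sorted(requests):
        t = max(t, arrival) + duration        # bus busy until t
        done.setdefault(tag, []).append(t)
    return done

def a2_jitter(p2_exec_time):
    # A1: once P2 finishes (after p2_exec_time), the CPU emits a
    # burst of five bus transactions of 2 time units each.
    a1 = [(p2_exec_time, 2.0, "A1")] * 5
    # A2: one packet every 3 time units, 1 time unit of bus time each.
    a2 = [(3.0 * k, 1.0, "A2") for k in range(10)]
    completion = fcfs_bus(a1 + a2)["A2"]
    delays = [c - 3.0 * k for k, c in enumerate(completion)]
    return max(delays) - min(delays)          # jitter seen by the DSP

# P2 at its BCET releases the burst early and produces more jitter
# for A2 than P2 at its WCET does:
print(a2_jitter(2.0), a2_jitter(9.0))
```

Running the sketch shows a larger A2 jitter for the shorter P2 execution time, mirroring the "best case for the CPU is the worst case for the bus" effect described above.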
15.1.2 Basic Terms

As a starting point for the analysis of timing and performance of embedded systems, it is very useful to clarify a few basic terms. Very often, the timing behavior of an embedded system can be described by the time interval between a specified pair of events. For example, the instantiation of a task, the occurrence of a sensor input, or the arrival of a packet could be a start event. Such events will be denoted as arrival events. Similarly, the finishing of an application or a part of it can again be modeled as an event, denoted as a finishing event. In the case of a distributed system, the physical location of the finishing event may not be equal to that of the corresponding arrival event, and the processing may require the execution of a sequence or set of tasks and the use of distributed computing and communication resources. In this case, we talk about end-to-end timing constraints. Note that not all pairs of events in a system are necessarily critical, that is, have deadline requirements. An embedded system processes the data associated with arrival events. The timing of computations and communications within the embedded system may depend on the input data (because of data-dependent behavior of tasks) and on the arrival pattern. In the case of a conservative resource sharing strategy, such as the time-triggered architecture (TTA), the interference between these tasks is removed by applying a static sharing strategy. If the use of shared resources is controlled by dynamic policies, all activities may interact with each other and the timing properties influence each other. As shown in Section 15.1.1, it is necessary to distinguish between the following terms:

• Worst case and best case. The worst case and the best case are the maximal and minimal time interval between the arrival and finishing events under all admissible system and environment states.
The execution time may vary largely, owing to different input data and interference between concurrent system activities.
• Upper and lower bounds. Upper and lower bounds are quantities that bound the worst-case and best-case behavior. These quantities are usually computed offline, that is, not during the runtime of the system.
• Statistical measures. Instead of computing bounds on the worst- and best-case behavior, one may also determine a statistical characterization of the runtime behavior of the system, for example, expected values, variances, and quantiles.
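The distinction between bounds and statistical measures can be made concrete with a small sketch. The execution-time model below is entirely invented: it has one rare slow input that the random test harness happens never to exercise, mimicking insufficient coverage. The observed maximum is a statistical measure of the runtime behavior, not an upper bound.

```python
# Sketch: an observed maximum from testing is not an upper bound.
# The hypothetical task below has a single slow corner-case input;
# the random test harness never draws it (poor coverage), while an
# exhaustive analysis (feasible here, the input space is tiny) yields
# the true worst case.
import random

def exec_time(x):
    # Invented execution-time model with one slow corner case.
    return 100 if x == 4095 else 10 + (x % 7)

true_wcet = max(exec_time(x) for x in range(4096))   # exhaustive bound
random.seed(1)
samples = [exec_time(random.randrange(4095))         # 4095 never drawn
           for _ in range(1000)]
observed_max = max(samples)

print(true_wcet, observed_max)   # the observed maximum underestimates
```

For realistic systems the input space is far too large to enumerate, which is why static analysis methods are needed to obtain safe bounds.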
In the case of real-time systems, we are particularly interested in upper and lower bounds. They are used to verify statically whether the system meets its timing requirements, for example, deadlines. In contrast to the end-to-end timing properties, the term performance is less well defined. Usually, it refers to a mixture of the achievable deadline, the delay of events or packets, and the number of events that can be processed per time unit (throughput). There is a close relation between the delay of individual packets or events, the necessary memory in the embedded system, and the throughput, that is, the required memory is proportional to the product of throughput and delay. Therefore, we will concentrate on the delay and memory properties in this chapter. Several methods exist to determine or approximate the above quantities: besides analytic methods based on formal models, one may also consider simulation, emulation, or implementation. All the latter possibilities should be used with care, as only a finite set of initial states, environment behaviors, and execution traces can be considered. As is well known, the corner cases that lead to a WCET or BCET are usually not known, and thus incorrect results may be obtained. The huge state space of realistic system architectures makes it highly improbable that the critical instances of the execution can be determined without the help of analytical methods. In order to understand the requirements for performance analysis methods in distributed embedded systems, we will classify possible causes for a large difference between the worst case and best case or between the upper and lower bounds:

• Nondeterminism and interference. Let us suppose that there is only limited knowledge about the environment of the embedded system, for example, about the time when external events arrive or about their input data.
In addition, there is interference of computation and communication on shared resources such as the CPU, memory, bus, or network. Then, we will say that the timing properties are nondeterministic with respect to the available information. Therefore, there will be a difference between the worst-case and the best-case behavior as well as between the associated bounds. For example, the execution time of a task may depend on its input data. Another example is the communication of data packets on a bus in the case of unknown interference.
• Limited analyzability. If there is complete knowledge about the whole system, then the behavior of the system is determined. Nevertheless, because of the system complexity, there may be no feasible way of determining close upper and lower bounds on the worst- and best-case timing, respectively.

As a result of this discussion, we understand that methods to analyze the performance of distributed embedded systems must be (1) correct, in that they determine valid upper and lower bounds, and (2) accurate, in that the determined bounds are close to the actual worst case and best case. In contrast to other chapters of the handbook, we will concentrate on the interaction between the task level of an embedded system and the distributed operation. We suppose that the whole application is partitioned into tasks and threads. The task level therefore refers to operating system issues such as scheduling, memory management, and arbitration of shared resources. In addition, we are faced with applications that run on distributed resources. The corresponding layer contains methods of distributed scheduling and networking. On this level of abstraction we are interested in end-to-end timing and performance properties.
15.1.3 Role in the Design Process

One of the major challenges in the design process of embedded systems is to estimate essential characteristics of the final implementation early in the design. This can help in making important design decisions before investing too much time in detailed implementations. Typical questions faced by a designer during a system-level design process are: Which functions should be implemented in hardware and which in software (partitioning)? Which hardware components should be chosen (allocation)? How should the different functions be mapped onto the chosen hardware (binding)? Do the system-level timing properties meet the design requirements? What are the different bus utilizations, and which bus or processor acts
FIGURE 15.2 Relation between design space exploration and performance analysis.
as a bottleneck? There are also questions related to the on-chip memory requirements and off-chip memory bandwidth. Typically, the performance analysis or estimation is part of the design space exploration, where different implementation choices are investigated in order to determine the appropriate design trade-offs between the different conflicting objectives; for an overview see Reference 2. Following Figure 15.2, the estimation of system properties in an early design phase is an essential part of the design space exploration. Different choices of the underlying system architecture, the mapping of the applications onto this architecture, and the chosen scheduling and arbitration schemes need to be evaluated in terms of the different quality criteria. In order to achieve acceptable design times, though, there is a need for automatic or semiautomatic (interactive) exploration methods. As a result, there are additional requirements for performance analysis if it is used for design space exploration, namely (1) simple reconfigurability with respect to architecture, mapping, and resource sharing policies, (2) a short analysis time in order to be able to test many different choices in a reasonable time frame, and (3) the possibility to cope with incomplete design information, as typically the lower layers are not yet designed or implemented. Even if design space exploration as described is not a part of the chosen design methodology, performance analysis is often part of the development process of software and hardware. In embedded system design, the functional correctness is validated after each major design step using simulation or formal methods.
If there are nonfunctional constraints such as deadline or throughput requirements, they need to be validated as well, and all aspects of the design representation related to performance become "first-class citizens." Finally, performance analysis of the whole embedded system may be done after completion of the design, in particular if the system is operated under hard real-time conditions, where timing failures lead to a catastrophic situation. As mentioned earlier, performance simulation is not appropriate in this case because the critical instances and test patterns are not known in general.
15.1.4 Requirements

Based on the discussion above, one can list some of the requirements that a methodology for performance analysis of distributed embedded systems must satisfy:

• Correctness. The results of the analysis should be correct, that is, there exist no reachable system states and feasible reactions of the system environment such that the calculated bounds are violated.
• Accuracy. The lower and upper bounds determined by the performance analysis should be close to the actual worst- and best-case timing properties.
• Embedding into the design process. The underlying performance model should be sufficiently general to allow the representation of the application (which possibly uses different specification mechanisms), of the environment (periodic, aperiodic, bursty, different event types), of the mapping including the resource sharing strategies (preemption, priorities, time-triggered), and of the hardware platform. The method should integrate seamlessly into the functional specification and design methodology.
• Short analysis time. Especially if the performance analysis is part of a design space exploration, a short analysis time is important. In addition, the underlying model should allow for reconfigurability in terms of application, mapping, and hardware platform.

As distributed systems are heterogeneous in terms of the underlying execution platform, the diverse concurrently running applications, and the different scheduling and arbitration policies used, modularity is a key requirement for any performance analysis method. We can distinguish between several composition properties:

• Process composition. Often, events need to be processed by several consecutive application tasks. In this case, the performance analysis method should be modular in terms of this functional composition.
• Scheduling composition. Within one implementation, different scheduling methods can be combined, even within one computing resource (hierarchical scheduling); the same property holds for the scheduling and arbitration of communication resources.
• Resource composition. A system implementation can consist of different heterogeneous computing and communication resources. It should be possible to compose them in a similar way as processes and scheduling methods.
• Building components. It should be possible to combine processes, their associated scheduling methods, and architecture elements into components.
This way, one could associate a performance component with a combined hardware/operating system/software module of the implementation, one that exposes the performance requirements but hides internal implementation details. It should be mentioned that none of the approaches known to date is able to satisfy all of the above-mentioned criteria. On the other hand, depending on the application domain and the chosen design approach, not all of the requirements are equally important. Section 15.2 summarizes some of the available methods, and in Section 15.3 one method is described in more detail.
15.2 Approaches to Performance Analysis

In this survey, we select just a few representative and promising approaches that have been proposed for the performance analysis of distributed embedded systems.
15.2.1 Simulation-Based Methods

Currently, the performance estimation of embedded systems is mainly done using simulation or trace-based simulation. Examples of available approaches and software support are provided by the SystemC initiative, see for example References 3 and 4, which is supported by tools from companies such as Cadence (nc-systemc) and Synopsys (System Studio). In simulation-based methods, many dynamic and complex interactions can be taken into account, whereas analytic methods usually have to stick to a restrictive underlying model and suffer from limited scope. In addition, there is the possibility to match the level of abstraction in the representation of time to the required degree of accuracy. These levels of abstraction range from cycle-accurate models, for example those used in the simulation of processors [5], up to networks of discrete-event components that can be modeled in SystemC. In order to determine timing properties of an embedded system, a simulation framework not only has to consider the functional behavior but also requires a concept of time and a way of taking into
FIGURE 15.3 A hybrid method for performance estimation, based on simulation and analytic methods.
account properties of the execution platform, of the mapping between functional computation and communication processes and elements of the underlying hardware, and of resource sharing policies (as usually implemented in the operating system or directly in hardware). This additional complexity leads to higher computation times, and performance estimation quickly becomes a bottleneck in the design. Besides, there is a substantial set-up effort necessary if the mapping of the application to the underlying hardware platform changes, for example, in order to perform a design space exploration. The fundamental problem of simulation-based approaches to performance estimation is insufficient corner-case coverage. As shown in the example in Figure 15.1, the sub-system corner case (high computation time of A1) does not lead to the system corner case (small computation time of A1). Designers must provide a set of appropriate simulation stimuli in order to cover all the corner cases that exist in the distributed embedded system. Failures of embedded systems very often relate to timing anomalies that happen infrequently and are therefore almost impossible to discover by simulation. In general, simulation provides estimates of the average system performance but does not yield worst-case results and cannot determine whether the system satisfies required timing constraints. The approach taken by Lahiri et al. [6] combines performance simulation and analysis in a hybrid trace-based methodology. It is intended to fill the gap between pure simulation, which may be too slow to be used in a design space exploration cycle, and analytic methods, which are often too restricted in scope and not accurate enough. The approach as described concentrates on communication aspects of a distributed embedded system. The performance estimation is partitioned into several stages, see Figure 15.3:

• Stage 1. An initial cosimulation of the whole distributed system is performed.
The simulation not only covers functional aspects (processing of data) but also captures the communication in an abstract manner, that is, in the form of events, tokens, or abstract data transfers. The resulting set of traces covers essential characteristics of computation and communication but no longer contains data information. Here, resource sharing, such as different arbitration schemes and access conflicts, is not yet taken into account. The output of this step is a timing-inaccurate system execution trace.
• Stage 2. The traces from stage 1 are transformed into an initial Communication Analysis Graph (CAG). One can omit unnecessary details (values of the data communicated, where only the size might be important, etc.), and bursts of computation/communication events might be clustered by identifying only the start and end times of these bursts.
• Stage 3. A communication topology is chosen, the mapping of the abstract communications to paths in the communication architecture (network, bus, point-to-point links) is specified, and finally, the corresponding arbitration protocols are chosen.
• Stage 4. In the analytic part of the whole methodology, the CAG from stage 2 is transformed and refined using the information from stage 3. It now captures the computation, communication, and synchronization as seen on the target system. To this end, the initial CAG is augmented to incorporate the various latencies and additional computations introduced by moving from an abstract communication model to an actual one.
The resulting CAG can then be analyzed in order to estimate the system performance, determine critical paths, and collect various statistics about the computation and communication components. The above approach still suffers from several disadvantages. All traces are the result of a simulation, and the coverage of corner cases is still limited. The underlying representation is a complete execution of the application in the form of a graph, which may be of prohibitive size. The effects of the transformations applied in order to (1) reduce the size of the CAG and (2) incorporate the concrete communication architecture are not formally specified. Therefore, it is not clear what the final analysis results represent. Finally, because of the separation between the functional simulation and the nonfunctional analysis, no feedback is possible. For example, a buffer overflow caused by a sporadic communication overload may lead to a difference in the functional behavior. Nevertheless, the described approach blends two important approaches to performance estimation, namely simulation and analytic methods, and makes use of the best properties of both worlds.
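The flavor of the analytic step can be sketched on a toy graph. The node names, durations, and dependencies below are invented and far simpler than the CAGs of Reference 6; the sketch merely shows how an end-to-end latency estimate falls out of a longest-path (critical-path) computation over computation and communication segments.

```python
# Sketch: critical-path analysis on a toy Communication Analysis
# Graph. Nodes model computation or communication segments (durations
# in abstract time units); edges model dependencies. All values are
# invented for illustration.

durations = {"sense": 2, "store": 1, "process": 8, "dma": 5,
             "bus": 3, "io": 1}
successors = {"sense": ["store"], "store": ["process", "dma"],
              "process": ["bus"], "dma": ["bus"], "bus": ["io"],
              "io": []}

def end_to_end_latency(durations, successors):
    finish = {}                         # earliest finish time per node
    def finish_time(node):
        if node not in finish:
            preds = [p for p, succ in successors.items() if node in succ]
            start = max((finish_time(p) for p in preds), default=0)
            finish[node] = start + durations[node]
        return finish[node]
    return max(finish_time(n) for n in durations)

print(end_to_end_latency(durations, successors))
```

Here the "process" branch dominates the "dma" branch, so the critical path runs through it; collecting statistics per node would identify "process" as the candidate bottleneck.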
15.2.2 Holistic Scheduling Analysis

There is a large body of formal methods available for scheduling of shared computing resources, for example, fixed-priority, rate-monotonic, and earliest-deadline-first scheduling, time-triggered policies like TDMA or round-robin, and static cyclic scheduling. From the WCETs of the individual tasks, the arrival pattern of activations, and the particular scheduling strategy, one can in many cases analyze the schedulability and worst-case response times, see for example Reference 7. Many different application models and event patterns have been investigated, such as sporadic, periodic, jitter, and bursts. A large number of commercial tools exist that, following this "one-model approach," allow the analysis of quantities such as resource load and response times. In a similar way, network protocols are increasingly supported by analysis and optimization tools. The classical scheduling theory has been extended toward distributed systems, where the application is executed on several computing nodes and the timing properties of the communication between these nodes cannot be neglected. The seminal work of Tindell and Clark [8] combined fixed-priority preemptive scheduling at the computation nodes with TDMA scheduling on the interconnecting bus. These results are based on two major achievements:

• The communication system (in this case, the bus) was handled in a similar way as the computing nodes. Because of this integration of process and communication scheduling, the method was called a holistic approach to the performance analysis of distributed real-time systems.
• The second contribution was the analysis of the influence of the release jitter on the response time, where the release jitter denotes the worst-case time difference between the arrival (or activation) of a process and its release (making it available to the processor). Finally, the release jitter has been linked to the message delay induced by the communication system.
This work was improved in terms of accuracy by Yen and Wolf [9] by taking into account correlations between the arrivals of triggering events. In the meantime, many extensions and applications have been published along the same line of thought. Other combinations of scheduling and arbitration policies have been investigated, such as CAN [10] and, more recently, the FlexRay protocol [11]. The latter extension opens the holistic scheduling methodology to mixed event-triggered and time-triggered systems, where the processing and communication are driven by the occurrence of events or the advance of time, respectively. Nevertheless, it must be noted that the holistic approach does not scale to general distributed architectures, in that for every new kind of application structure, resource sharing policy, or combination thereof, a new analysis needs to be developed. In general, the model complexity grows with the size of the system and the number of different scheduling techniques. In addition, the method is restricted to the classical models of task arrival patterns, such as periodic or periodic with jitter.
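The building block of this style of analysis, for a single fixed-priority processor, is the classical response-time recurrence with release jitter. The sketch below implements that recurrence; the task set is invented, the analysis assumes it is schedulable, and the formulation follows the standard textbook form rather than any specific paper's notation.

```python
# Sketch: fixed-priority response-time analysis with release jitter,
# the kind of per-resource analysis that holistic approaches compose.
# Each task is (C, T, J) = WCET, period, release jitter, listed in
# decreasing priority order. The task set is invented; convergence of
# the fixed-point iteration assumes the set is schedulable.
from math import ceil

def response_times(tasks):
    result = []
    for i, (C, T, J) in enumerate(tasks):
        w = C                              # busy-window fixed point
        while True:
            # Interference from all higher-priority tasks, whose
            # release jitter Jh increases the number of preemptions.
            w_next = C + sum(ceil((w + Jh) / Th) * Ch
                             for (Ch, Th, Jh) in tasks[:i])
            if w_next == w:
                break
            w = w_next
        result.append(w + J)               # worst-case response time
    return result

print(response_times([(1, 4, 0), (2, 8, 1), (3, 20, 0)]))
```

In a holistic analysis, the response time of a message on the bus feeds back as the release jitter of the receiving task, and the equations for all nodes and the bus are iterated together until they stabilize.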
15.2.3 Compositional Methods

Three main problems arise in the case of complex distributed embedded systems. First, the architecture of such systems, as already mentioned, is highly heterogeneous: the different architectural components are designed assuming different input event models and use different arbitration and resource sharing strategies. This makes any kind of compositional performance analysis difficult. Second, applications very often rely on a high degree of concurrency. Therefore, there are multiple control threads, which additionally complicate timing analysis. And third, we cannot expect that an embedded system only needs to process periodic events where a fixed number of bytes is associated with each event. If, for example, the event stream represents a sampled voice signal, then after several coding, processing, and communication steps, the amount of data per event as well as the timing may have changed substantially. In addition, stream-based systems often also have to process other event streams that are sporadic or bursty, for example, they have to react to external events or deal with best-effort traffic for coding, transcription, or encryption. There are only a few approaches available that can handle such complex interactions. One approach is based on a unifying model of different event patterns in the form of arrival curves, as known from the networking domain, see References 12 and 13. The proposed real-time calculus (RTC) represents the resources and their processing or communication capabilities in a compatible manner and therefore allows for modular hierarchical scheduling and arbitration for distributed embedded systems. This approach will be explained in more detail in Section 15.3. Richter et al. propose in References 1, 14, and 15 a method that is based on classical real-time scheduling results. They combine different well-known abstractions of task arrival patterns and provide additional interfaces between them.
The approach is based on the following principles:

• The main goal is to make use of the very successful results in real-time scheduling, in particular for sharing a single processor or a single communication link, see for example, References 7 and 16. For a large class of scheduling and arbitration policies and a set of arrival patterns (periodic, periodic with jitter, sporadic, and bursty), upper and lower bounds on the response time can be determined, that is, on the time difference between the arrival of a task and its finishing time. The abstraction of an application task therefore consists of a triggering event stream with a certain arrival pattern and the task's WCET and BCET on the resource. Several tasks can be mapped onto a single resource; together with the scheduling policy, one can then obtain for each task the associated lower and upper bound on the response time. Communication over shared buses can be handled in a similar way.

• The application model is a simple concatenation of several tasks. The end-to-end delay can then be obtained by adding the individual contributions of the tasks; the necessary buffer memory can simply be computed taking into account the initial arrival pattern.

• Obviously, the approach is feasible only if the arrival patterns fit the few basic models for which results on computing response-time bounds are available. In order to overcome this limitation, two types of interfaces are defined: (a) EMIF. Event Model Interfaces are used in the performance analysis only. They perform a type conversion between certain arrival patterns, that is, they change the mathematical representation of event streams. (b) EAF. Event Adaptation Functions need to be used in cases where no EMIF exists. Here, the hardware/software implementation must be changed in order to make the system analyzable, for example, by adding playout buffers at appropriate locations.
In addition, a new set of six arrival patterns was defined in Reference 1 which is more suitable for the proposed type conversion using EMIF and EAF, see Figure 15.4. In Figure 15.5, the example of Figure 15.1 is extended by adding to the tasks P1 to P6 appropriate arrival patterns (event stream abstractions) and EMIF/EAF interfaces. For example, suppose that the available analysis method for the bus arbitration scheme requires “periodic with jitter” as its input model. As the transformation from “periodic with burst” requires an EAF, the implementation must be changed to accommodate a buffer that smooths the bursts. From “periodic” to “periodic with jitter,”
© 2006 by Taylor & Francis Group, LLC
Embedded Systems Handbook
[Figure: three timing diagrams of event streams — “periodic” (tᵢ₊₁ − tᵢ = T), “periodic with jitter” (tᵢ = i·T + φᵢ + φ₀ with 0 ≤ φᵢ ≤ J; J > T is admissible), and “periodic with burst” (as with jitter, but with a minimal event distance tᵢ₊₁ − tᵢ ≥ d).]
FIGURE 15.4 Some arrival patterns of tasks that can be used to characterize properties of event streams in Reference 1. T, J, and d denote the period, jitter, and minimal interarrival time, respectively. φ₀ denotes a constant phase shift.
[Figure: the tasks P1–P6 of Figure 15.1, annotated with arrival patterns (“sporadic,” “periodic,” “periodic w/jitter,” “periodic w/burst”) and connected through EMIF blocks, an EAF, and a buffer between the bus communications C1, C2 and the processing tasks.]
FIGURE 15.5 Example of event stream interfaces for the example in Figure 15.1.
one can construct a lossless EMIF simply by setting the jitter J = 0. There is another interface between communication C1 and task P3 that converts the bursty output of the bus into a sporadic model. Now one can apply performance analysis methods to all of the components. As a result, one may determine the minimal buffer size and an appropriate scheduling policy for the DSP such that no overflow or underflow occurs.

Several extensions have been worked out, for example, in order to deal with cyclic nonfunctional dependencies and to generalize the application model. Nevertheless, when compared against the requirements for a modular performance analysis, the approach has some inherent drawbacks. EAFs are caused by the limited class of supported event models and the available analysis methods: the analysis method enforces a change in the implementation. Furthermore, the approach is not modular in terms of the resources, as their service is not modeled explicitly. For example, if several scheduling policies need to be combined on one resource (hierarchical scheduling), then for each new combination an appropriate analysis method must be developed. In this respect, the approach suffers from the same problem as the “holistic approach” described earlier. In addition, one is bound to the classical arrival patterns, which are not sufficient in the case of stream processing applications. Other event models need to be converted with a loss in accuracy (EMIF) or the implementation must be changed (EAF).
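To make the two interface types concrete, the sketch below models a few arrival patterns as small records and implements two of the conversions discussed above: the lossless EMIF from “periodic” to “periodic with jitter” (J = 0) and a lossy conversion from “periodic with burst” to “sporadic” that keeps only the minimal event distance d. The class names and representation are illustrative assumptions, not the notation of Reference 1.

```python
from dataclasses import dataclass

@dataclass
class Periodic:          # t_{i+1} - t_i = T
    T: float

@dataclass
class PeriodicJitter:    # t_i = i*T + phi_i + phi_0, 0 <= phi_i <= J
    T: float
    J: float

@dataclass
class PeriodicBurst:     # jittery, with minimal event distance d
    T: float
    J: float
    d: float

@dataclass
class Sporadic:          # only a minimal interarrival time is known
    d: float

def emif_periodic_to_jitter(m: Periodic) -> PeriodicJitter:
    """Lossless EMIF: a periodic stream is a jittery stream with J = 0."""
    return PeriodicJitter(T=m.T, J=0.0)

def emif_burst_to_sporadic(m: PeriodicBurst) -> Sporadic:
    """Lossy EMIF: only the minimal event distance d survives."""
    return Sporadic(d=m.d)

converted = emif_periodic_to_jitter(Periodic(T=10.0))  # PeriodicJitter(T=10.0, J=0.0)
```

An EAF, by contrast, cannot be expressed as a pure representation change — it stands for an actual hardware/software modification (e.g., a playout buffer), so no conversion function appears here for it.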
15.3 The Performance Network Approach

This section describes an approach to the performance analysis of embedded systems that is influenced by the worst-case analysis of communication networks. The network calculus as described in Reference 17 is based on Reference 18 and uses (min,+)-algebra to formulate the necessary operations. The network calculus is a promising analysis methodology as it is designed to be modular in various respects and as the representation of event (or packet) streams is not restricted to the few classes mentioned in Section 15.2.3. In References 12 and 19, the method has been extended to the RTC in order to deal with distributed embedded systems by combining computation and communication. Because of the detailed modeling of the capabilities of the shared computing and communication resources as well as of the event streams, a high accuracy can be achieved, see Reference 20. The following sections explain the basic approach.

Note that the main performance analysis method is not bound to the use of the RTC. Instead, any suitable abstraction of event streams and resource characterization is possible; only the actual computations done within the components of the performance network need to be changed appropriately.
15.3.1 Performance Network

In functional specification and verification, the given application is usually decomposed into components that communicate via event interfaces. The properties of the whole system are investigated by combining the behavior of the components. This kind of representation is common in the design of complex embedded systems and is supported by many tools and standards, for example, UML. It would be highly desirable if performance analysis followed the same line of thinking, as it could then be integrated easily into the usual design methodology. Considering the discussion given earlier, we can identify two major additions that are necessary:

• Abstraction. Performance analysis is interested in making statements about the timing behavior not just for one specific input characterization but for a larger class of possible environments. Therefore, the concrete event streams that flow between the components must be represented in an abstract way. As an example, we have seen their characterization as “periodic” or “sporadic with jitter.” In the same way, the nonfunctional properties of the application and the resource sharing mechanisms must be modeled appropriately.

• Resource modeling. In comparison to functional validation, we need to model the resource capabilities and how they are changed by the workload of tasks or communication. Therefore, contrary to the approaches described before, we will model the resources explicitly as “first class citizens” of the approach.

As an example of a performance network, let us look again at the simple example from Figure 15.1 and Figure 15.5. Figure 15.6 shows a corresponding performance network. Because of the simplicity of the example, not all the modeling possibilities can be shown.
On the left-hand side, you see the abstract inputs which model the sources of the event streams that trigger the tasks of the application: “Timer” represents the periodic instantiation of the task that reads out the buffer for playback, “Sensor” models the periodic bursty events from the sensor, and “RT data” denotes the real-time data arriving in equidistant packets via the Input interface. The associated abstract event streams are transformed by the performance components. At the top, you can see the resource modules that model the service of the shared resources, for example, the Input, CPU, Bus, and I/O components. The abstract resource streams (vertical direction) interact with the event streams in the performance modules and performance components. The resource interfaces at the bottom represent the remaining resource service that is available to other applications that may run on the execution platform.

The performance components represent (1) the way in which the timing properties of input event streams are transformed into timing properties of output event streams and (2) the transformation of the resources. Of course, these components can be hierarchically grouped into larger components. The way in which the
[Figure: resource modules (Input, CPU, Bus, DSP, I/O) along the top; abstract inputs (Timer, Sensor, RT data) on the left; performance components P1–P6, C1, and C2 connected by abstract event streams (horizontal) and abstract resource streams (vertical); resource interfaces at the bottom.]
FIGURE 15.6 A simple performance network related to the example in Figure 15.1.
performance components are grouped and their transfer functions reflect the resource sharing strategy. For example, P1 and P2 are connected serially in terms of the resource stream and therefore model a fixed-priority scheme with the high priority assigned to task P1. If the bus implements an FCFS strategy or a time-triggered protocol (TTP), the transfer function of C1/C2 needs to be determined such that the abstract representations of the event and resource streams are correctly transformed.
15.3.2 Variability Characterization

The timing characterization of event and resource streams is based on Variability Characterization Curves (VCCs), which substantially generalize the classical representations such as sporadic or periodic. As the event streams propagate through the distributed architecture, their timing properties become increasingly complex, and the standard patterns cannot model them with appropriate accuracy.

The event streams are described using arrival curves ᾱᵘ(Δ), ᾱˡ(Δ) ∈ ℝ≥0, Δ ∈ ℝ≥0, which provide upper and lower bounds on the number of events in any time interval of length Δ. In particular, there are at most ᾱᵘ(Δ) and at least ᾱˡ(Δ) events within the time interval [t, t + Δ) for all t ≥ 0. In a similar way, the resource streams are characterized using service functions βᵘ(Δ), βˡ(Δ) ∈ ℝ≥0, Δ ∈ ℝ≥0, which provide upper and lower bounds on the available service in any time interval of length Δ. The unit of service depends on the kind of the shared resource, for example, instructions (computation) or bytes (communication).

Note that as defined above, the VCCs ᾱᵘ(Δ) and ᾱˡ(Δ) are expressed in terms of events (marked by the bar on their symbol), while the VCCs βᵘ(Δ) and βˡ(Δ) are expressed in terms of workload/service. A method to transform event-based VCCs into workload/resource-based VCCs and vice versa is presented later in this chapter. All calculations and transformations presented here are valid both with only event-based or with only workload/resource-based VCCs, but in this chapter mainly the event-based formulation is used.

Figure 15.7 shows arrival curves that specify the basic classical models shown in Figure 15.4. Note that in the case of sporadic patterns, the lower arrival curves are 0. In a similar way, Figure 15.8 shows a service curve of a simple TDMA bus access with period T, bandwidth b, and slot interval τ.
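The defining property of the arrival curves — at most ᾱᵘ(Δ) and at least ᾱˡ(Δ) events in any half-open window [t, t + Δ) — can be checked directly against a concrete trace. The sketch below uses the curves of a strictly periodic stream, ᾱᵘ(Δ) = ⌈Δ/T⌉ and ᾱˡ(Δ) = ⌊Δ/T⌋; the trace and the spot-checked windows are made-up illustrations.

```python
import math

T = 5.0
events = [i * T for i in range(40)]          # strictly periodic trace

def alpha_u(delta):                          # upper arrival curve
    return math.ceil(delta / T)

def alpha_l(delta):                          # lower arrival curve
    return math.floor(delta / T)

def count(t, delta):                         # events in [t, t + delta)
    return sum(1 for e in events if t <= e < t + delta)

# spot-check the defining property on a few interior windows
for t in [0.0, 1.3, 2.5, 7.0]:
    for delta in [3.0, 5.0, 12.5]:
        assert alpha_l(delta) <= count(t, delta) <= alpha_u(delta)
```

For the sporadic patterns mentioned above, the same check degenerates on the lower side, since ᾱˡ(Δ) = 0 for every Δ.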
[Figure: upper and lower arrival curves ᾱᵘ, ᾱˡ over Δ for the “periodic,” “periodic w/jitter,” and “periodic w/burst” patterns; the staircase breakpoints lie at T, 2T for the periodic case, at T − J, T + J, 2T − J, 2T + J for the jittery case, and additionally depend on d in the bursty case.]
FIGURE 15.7 Basic arrival functions related to the patterns described in Figure 15.4.
[Figure: upper and lower service curves βᵘ, βˡ over Δ for a TDMA slot of length τ repeating with period T; within a slot the curves grow with slope equal to the bandwidth b and remain flat outside it.]
FIGURE 15.8 Example of a service curve that describes a simple TDMA protocol.
Note that arrival curves can be approximated by piecewise linear functions. Moreover, there are of course finite representations of the arrival and service curves, for example, by decomposing them into an irregular initial part and a periodic part. Where do the arrival and service functions come from, for example, those characterizing a processor (CPU in Figure 15.6) or an abstract input (Sensor in Figure 15.6)?

• Pattern. In some cases, the pattern of the event or resource stream is known, for example, bursty, periodic, sporadic, or TDMA. In this case, the functions can be constructed analytically, see for example, Figure 15.7 and Figure 15.8.

• Trace. In the case of unknown arrival or service patterns, one may use a set of traces and compute their envelope. This can be done easily by using a sliding window of size Δ and determining the maximum and minimum number of events (or amount of service) within the window.

• Data sheets. In other cases, one can derive the curves by deriving bounds from the characteristics of the generating device (for arrival curves) or of the hardware component (for service curves).

The performance components transform abstract event and resource streams. But so far, the arrival curve is defined in terms of events per time interval, whereas the service curve is given in terms of service per time interval. One possibility to bridge this gap is the concept of workload curves, which connect the number of successive events in an event stream with the maximal or minimal workload associated with them. They capture the variability in execution demands. The upper and lower workload curves γᵘ(e), γˡ(e) ∈ ℝ≥0 denote the maximal and minimal workload on a specific resource for any sequence of e consecutive events.
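The trace-based method in the second bullet can be sketched directly: slide a window of length Δ over the event timestamps and record the extreme counts. The code below is a simplification — it only places windows at (and just after) event times, which suffices for this example but is a coarse grid in general; the bursty trace is invented for illustration.

```python
def arrival_envelope(timestamps, delta):
    """Empirical arrival-curve envelope of an event trace: slide a window
    of length delta over the trace and return (max, min) event counts,
    considering interior windows only."""
    t_end = timestamps[-1]
    eps = 1e-9
    # candidate window starts: at each event and just after each event
    starts = [t + s for t in timestamps for s in (0.0, eps)
              if t + s + delta <= t_end]
    counts = [sum(1 for e in timestamps if t0 <= e < t0 + delta)
              for t0 in starts]
    return max(counts), min(counts)

# bursty trace: pairs of events 1 time unit apart, one burst every 10 units
trace = [b + o for b in range(0, 40, 10) for o in (0, 1)]   # [0, 1, 10, 11, ...]
upper2, lower2 = arrival_envelope(trace, 2.0)     # (2, 0)
upper10, lower10 = arrival_envelope(trace, 10.0)  # (2, 2)
```

Evaluating the envelope for a set of Δ values yields sampled versions of ᾱᵘ and ᾱˡ; the same sliding-window idea applies to service traces.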
If we have these curves available, then we can easily determine upper and lower bounds on the workload that an event stream imposes on a resource in any time interval of length Δ as αᵘ(Δ) = γᵘ(ᾱᵘ(Δ)) and αˡ(Δ) = γˡ(ᾱˡ(Δ)), respectively. Analogously, β̄ᵘ(Δ) = (γˡ)⁻¹(βᵘ(Δ)) and β̄ˡ(Δ) = (γᵘ)⁻¹(βˡ(Δ)). As in the case of the arrival and service curves, the question arises where the workload curves can come from. A selection of possibilities is given below:

• WCET and BCET. The simplest possibility is to (1) assume that each event of an event stream triggers the same task and (2) that this task has a given WCET and BCET determined by other
[Figure: left — workload curves γᵘ, γˡ of a task with WCET = 4 and BCET = 3, i.e., γᵘ(e) = 4e and γˡ(e) = 3e; right — a task refined into subtasks modeled as a finite state machine whose transitions carry individual workloads, yielding tighter workload curves.]
FIGURE 15.9 Two examples of modeling the relation between incoming events and the associated workload on a resource. The left-hand side shows a simple modeling in terms of the WCET and BCET of the task triggered by an event. The right-hand side models the workload generated by a task through a finite state machine. The workload curves can be constructed by considering the maximum or minimum weight paths with e transitions.
methods. An example of an associated workload curve is given in Figure 15.9. The same holds for communication events.

• Application modeling. The above method models the fact that not all events lead to the same execution load (or number of bits) by simply using upper and lower bounds on the execution time. The accuracy of this approach can be substantially improved if characteristics of the application are taken into account, for example, by (1) distinguishing between different event types, each one triggering a different task, and (2) modeling that it is not possible for many consecutive events to all have the WCET (or BCET). This way, one can model correlations in event streams, see Reference 21. The right-hand side of Figure 15.9 represents a simple example where a task is refined into a set of subtasks. At each incoming event, a subtask generates the associated workload and the program branches to one of its successors.

• Trace. As in the case of arrival curves, we can use a given trace and record the workload associated with each event, for example, by simulation. Based on this information, we can easily compute the upper and lower envelopes.

A more fine-grained modeling of an application is possible as well, for example, by taking into account different event types in event streams, see Reference 22. By the same approach, it is also possible to model more complex task models, for example, a task with different production and consumption rates of events or tasks with several event inputs, see Reference 23. Moreover, the same modeling holds for the load on communication links of the execution platform. In order to construct a performance network according to Figure 15.6, we still need to take into account the resource sharing strategy.
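For the simplest model in the first bullet — the one on the left of Figure 15.9, with WCET = 4 and BCET = 3 — the workload curves are just γᵘ(e) = e·WCET and γˡ(e) = e·BCET, and the event-to-workload transformation αᵘ(Δ) = γᵘ(ᾱᵘ(Δ)), αˡ(Δ) = γˡ(ᾱˡ(Δ)) is a composition of functions. A minimal sketch (the period T = 10 of the triggering stream is an assumed example value):

```python
import math

WCET, BCET = 4, 3          # cycles per event, as in Figure 15.9 (left)
T = 10.0                   # assumed period of the triggering event stream

def gamma_u(e): return e * WCET      # max workload of e consecutive events
def gamma_l(e): return e * BCET      # min workload of e consecutive events

def alpha_bar_u(delta): return math.ceil(delta / T)    # events per interval
def alpha_bar_l(delta): return math.floor(delta / T)

def alpha_u(delta):        # upper bound on demanded workload in [t, t+delta)
    return gamma_u(alpha_bar_u(delta))

def alpha_l(delta):        # lower bound on demanded workload
    return gamma_l(alpha_bar_l(delta))
```

The FSM-based refinement on the right of Figure 15.9 would replace the linear γᵘ, γˡ with the maximum- and minimum-weight paths of e transitions through the subtask graph.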
15.3.3 Resource Sharing and Analysis

In Figure 15.6, we see, for example, that the performance modules associated with tasks P1 and P2 are connected serially. This way, we can model a preemptive fixed-priority resource sharing strategy, as P2 only gets the CPU resource that is left after the workload of P1 has been served. Other resource sharing strategies can be modeled as well, see for example, Figure 15.10, where in addition a proportional share policy is modeled on the left. In this case, a fixed portion of the available resource (computation or communication) is associated with each task. Further sharing strategies are possible as well, such as FCFS, see Reference 17.

In the same Figure 15.10, we also see how the workload characterization described in the last section is used to transform the incoming arrival curve into a representation in terms of the workload for a resource. After the transformation of the incoming stream by a block called RTC, the inverse workload transformation may be applied in order to characterize the stream again by means of events per time interval. This way, the performance modules can be freely combined, as their input and output representations match.
[Figure: left — a proportional share component that splits the incoming resource stream [βˡ, βᵘ] into fixed shares (Share) and recombines the remainder (Sum); right — a fixed priority component in which a performance module transforms [αˡ, αᵘ] into [αˡ′, αᵘ′] via the workload transformation γ, an RTC block, and the inverse transformation γ⁻¹, passing the remaining service [βˡ′, βᵘ′] on.]
FIGURE 15.10 Two examples of resource sharing strategies and their model in the RTC.
[Figure: input streams are stored in buffers (queues) and served by the resource according to its resource sharing strategy.]
FIGURE 15.11 Functional model of resource sharing on computation and communication resources.
We still need to describe how a single workload stream and a resource stream interact on a resource. The underlying model and analysis very much depend on the underlying execution platform. As a common example, we suppose that the events (or data packets) corresponding to a single stream are stored in a queue before being processed, see Figure 15.11. The same model is used for computation as well as for communication resources. It matches well the common structure of operating systems, where ready tasks are lined up until the processor is assigned to one of them. Events belonging to one stream are processed in an FCFS manner, whereas the order between different streams depends on the particular resource sharing strategy. Following this model, one can derive the equations that describe the transformation of arrival and service curves by an RTC module according to Figure 15.10, see for example, Reference 13:
ᾱᵘ′ = [(ᾱᵘ ⊗ β̄ᵘ) ⊘ β̄ˡ] ∧ β̄ᵘ
ᾱˡ′ = [(ᾱˡ ⊘ β̄ᵘ) ⊗ β̄ˡ] ∧ β̄ˡ
β̄ᵘ′ = (β̄ᵘ − ᾱˡ) ⊘̄ 0
β̄ˡ′ = (β̄ˡ − ᾱᵘ) ⊗̄ 0
Following Reference 24, the operators used are called min-plus/max-plus convolutions,

(f ⊗ g)(t) = inf_{0 ≤ u ≤ t} {f(t − u) + g(u)}
(f ⊗̄ g)(t) = sup_{0 ≤ u ≤ t} {f(t − u) + g(u)}
[Figure: the arrival curve ᾱᵘ of a stream is transformed along a chain with service curves β_1ˡ, …, β_Nˡ and workload curves γ_1ᵘ, …, γ_Nᵘ; the delay and backlog appear as the maximal horizontal and vertical distance, over all Δ, between ᾱᵘ and the accumulated service curve β̄ˡ.]
FIGURE 15.12 Representation of the delay and accumulated buffer space computation in a performance network.
and min-plus/max-plus deconvolutions,

(f ⊘ g)(t) = sup_{u ≥ 0} {f(t + u) − g(u)}
(f ⊘̄ g)(t) = inf_{u ≥ 0} {f(t + u) − g(u)}
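On curves sampled at the integer points of a finite horizon, these four operators reduce to simple loops. The sketch below is an illustrative discretized implementation (boundary handling at the end of the horizon is simplified), not a production RTC tool:

```python
def minplus_conv(f, g):
    """(f ⊗ g)(t) = inf_{0<=u<=t} f(t-u) + g(u), on curves sampled at
    t = 0, 1, ..., n-1 (two lists of equal length n)."""
    n = len(f)
    return [min(f[t - u] + g[u] for u in range(t + 1)) for t in range(n)]

def maxplus_conv(f, g):
    """(f ⊗̄ g)(t) = sup_{0<=u<=t} f(t-u) + g(u)."""
    n = len(f)
    return [max(f[t - u] + g[u] for u in range(t + 1)) for t in range(n)]

def minplus_deconv(f, g):
    """(f ⊘ g)(t) = sup_{u>=0} f(t+u) - g(u); u is truncated to the
    samples still inside the finite horizon."""
    n = len(f)
    return [max(f[t + u] - g[u] for u in range(n - t)) for t in range(n)]

def maxplus_deconv(f, g):
    """(f ⊘̄ g)(t) = inf_{u>=0} f(t+u) - g(u), truncated likewise."""
    n = len(f)
    return [min(f[t + u] - g[u] for u in range(n - t)) for t in range(n)]

f = [0, 1, 2, 3]            # e.g., a sampled arrival curve
g = [0, 2, 4, 6]            # e.g., a sampled service curve
conv = minplus_conv(f, g)   # [0, 1, 2, 3]
```

Chaining these helpers in the pattern of the four RTC equations above gives the outgoing ᾱᵘ′, ᾱˡ′, β̄ᵘ′, β̄ˡ′ of a performance module on the sampled grid.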
Further, ∧ denotes the minimum operator, (f ∧ g)(t) = min{f(t), g(t)}.

Using these equations, the workload curves, and the characterizations of the input event and resource streams, we can now determine the characterizations of all event and resource streams in a performance network such as the one in Figure 15.6. From the resulting arrival curves (leaving the network on the right-hand side) and service curves (at the bottom), we can compute all the relevant information, such as the average resource loads, the end-to-end delays, and the necessary buffer spaces of the event and packet queues, see Figure 15.11. If the performance network contains cycles, then fixed-point iterations are necessary.

As an example, let us suppose that the upper input arrival curve of an event stream is ᾱᵘ(Δ). Moreover, the stream is processed by a sequence of N modules according to the right-hand side of Figure 15.10, with incoming service curves β_iˡ(Δ), 1 ≤ i ≤ N, and workload curves γ_iᵘ(e). Then we can determine the maximal end-to-end delay and accumulated buffer space for this stream according to (see Reference 12)

γ_i⁻¹(W) = sup{e ≥ 0 : γ_iᵘ(e) ≤ W}  for all 1 ≤ i ≤ N
β̄_iˡ(Δ) = γ_i⁻¹(β_iˡ(Δ))  for all 1 ≤ i ≤ N
β̄ˡ(Δ) = β̄_1ˡ(Δ) ⊗ β̄_2ˡ(Δ) ⊗ · · · ⊗ β̄_Nˡ(Δ)
delay ≤ sup_{Δ ≥ 0} inf{τ ≥ 0 : ᾱᵘ(Δ) ≤ β̄ˡ(Δ + τ)}
backlog ≤ sup_{Δ ≥ 0} {ᾱᵘ(Δ) − β̄ˡ(Δ)}
The curve γ_i⁻¹(W) denotes the pseudo-inverse of a workload curve, that is, it yields the minimum number of events that are guaranteed to be processed if the service W is available. Therefore, β̄_iˡ(Δ) is the minimal available service in terms of events per time interval. It has been shown in Reference 17 that the delay and backlog are determined by the accumulated service β̄ˡ(Δ), which can be obtained as the convolution of all individual services. The delay and backlog can then be interpreted as the maximal horizontal and vertical distance, respectively, between the arrival curve and the accumulated service curve, see Figure 15.12. All of the above computations can be implemented efficiently if appropriate representations of the variability characterization curves are used, for example, piecewise linear, discrete points, or periodic.
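The horizontal/vertical-distance interpretation can be sketched numerically: for each Δ on a grid, search for the smallest τ with ᾱᵘ(Δ) ≤ β̄ˡ(Δ + τ), and take the largest ᾱᵘ(Δ) − β̄ˡ(Δ). The two curves below are invented for illustration (a bursty arrival curve against a slower event-based service with an initial blackout), and a finite evaluation horizon stands in for the supremum over all Δ:

```python
import math

def delay_bound(alpha_u, beta_l, horizon):
    """Maximal horizontal distance between the upper arrival curve and the
    lower (event-based) service curve, evaluated on an integer grid."""
    worst = 0
    for d in range(horizon + 1):
        demand = alpha_u(d)
        tau = 0
        while beta_l(d + tau) < demand:   # smallest tau serving the demand
            tau += 1
        worst = max(worst, tau)
    return worst

def backlog_bound(alpha_u, beta_l, horizon):
    """Maximal vertical distance between the two curves."""
    return max(alpha_u(d) - beta_l(d) for d in range(horizon + 1))

# hypothetical curves: one initial burst event plus one event per 4 time
# units, against a service of one event per 2 time units after a blackout of 2
au = lambda d: math.ceil(d / 4) + 1
bl = lambda d: max(0, (d - 2) // 2)

d_max = delay_bound(au, bl, 40)       # maximal horizontal distance
b_max = backlog_bound(au, bl, 40)     # maximal vertical distance
```

Since the service rate here exceeds the long-run arrival rate, both suprema are attained for small Δ and the finite horizon is sufficient for this example.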
15.3.4 Concluding Remarks

Because of the modularity of the performance network, one can easily analyze a large number of different mapping and resource sharing strategies for design space exploration. Applications can be extended by
adding tasks and performance modules. Moreover, different subsystems can use different kinds of resource sharing without sacrificing the performance analysis.

Of particular interest is the possibility to build a performance component for a combined hardware–software system that describes the performance properties of a whole subsystem. This way, a subcontractor can deliver a hardware/software/operating system module that already contains part of the application. The system house can then integrate the performance components of the subsystems in order to validate the performance of the whole system; to this end, it does not need to know the details of the subsystem implementations. In addition, a system house can also add an application to the subsystems. Using the resource interfaces that characterize the remaining available service of the subsystems, its timing correctness can easily be verified.

On the one hand, the performance network approach is correct in the sense that it yields upper and lower bounds on quantities such as end-to-end delay and buffer space. On the other hand, it is a worst-case approach that covers all possible corner cases independent of their probability. Even if the deviations from simulation results can be small, see for example, Reference 25, in many cases one is also interested in the average-case behavior of distributed embedded systems. Therefore, performance analysis methods such as those described in this chapter can be considered complementary to the existing simulation-based validation methods. Furthermore, any automated or semiautomated exploration of different design alternatives (design space exploration) could be separated into multiple stages, each having a different level of abstraction.
It would then be appropriate to use an analytical performance evaluation framework, such as those described in this chapter, during the initial stages and resort to simulation only when a relatively small set of potential architectures is identified.
References

[1] K. Richter, D. Ziegenbein, M. Jersak, and R. Ernst. Model composition for scheduling analysis in platform design. In Proceedings of the 39th Design Automation Conference (DAC). ACM Press, New Orleans, LA, June 2002.
[2] Lothar Thiele, Simon Künzli, and Eckart Zitzler. A modular design space exploration framework for embedded systems. IEE Proceedings Computers and Digital Techniques, Special Issue on Embedded Microelectronic Systems, 2004.
[3] SystemC homepage. http://www.systemc.org.
[4] T. Grötker, S. Liao, G. Martin, and S. Swan. System Design with SystemC. Kluwer Academic Publishers, Boston, May 2002.
[5] Doug Burger and Todd M. Austin. The SimpleScalar tool set, version 2.0. SIGARCH Computer Architecture News, 25, 1997, 13–25.
[6] K. Lahiri, A. Raghunathan, and S. Dey. System-level performance analysis for designing on-chip communication architectures. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 20, 2001, 768–783.
[7] G.C. Buttazzo. Hard Real-Time Computing Systems: Predictable Scheduling Algorithms and Applications. Kluwer Academic Publishers, Boston, 1997.
[8] K. Tindell and J. Clark. Holistic schedulability analysis for distributed hard real-time systems. Microprocessing and Microprogramming — Euromicro Journal, Special Issue on Parallel Embedded Real-Time Systems, 40, 1994, 117–134.
[9] T. Yen and W. Wolf. Performance estimation for real-time distributed embedded systems. IEEE Transactions on Parallel and Distributed Systems, 9, 1998, 1125–1136.
[10] K. Tindell, A. Burns, and A.J. Wellings. Calculating controller area network (CAN) message response times. Control Engineering Practice, 3, 1995, 1163–1169.
[11] T. Pop, P. Eles, and Z. Peng. Holistic scheduling and analysis of mixed time/event triggered distributed embedded systems. In Proceedings of the International Symposium on Hardware–Software Codesign (CODES). ACM Press, May 2002, pp. 187–192.
[12] L. Thiele, S. Chakraborty, M. Gries, A. Maxiaguine, and J. Greutert. Embedded software in network processors — models and algorithms. In Proceedings of the 1st Workshop on Embedded Software (EMSOFT), Lake Tahoe, CA, Vol. 2211 of Lecture Notes in Computer Science. Springer-Verlag, Heidelberg, 2001, pp. 416–434.
[13] L. Thiele, S. Chakraborty, M. Gries, and S. Künzli. A framework for evaluating design tradeoffs in packet processing architectures. In Proceedings of the 39th Design Automation Conference (DAC). ACM Press, New Orleans, LA, June 2002, pp. 880–885.
[14] Kai Richter, Marek Jersak, and Rolf Ernst. A formal approach to MpSoC performance verification. IEEE Computer, 36, 2003, 60–67.
[15] K. Richter and R. Ernst. Model interfaces for heterogeneous system analysis. In Proceedings of the 6th Design, Automation and Test in Europe (DATE). IEEE, Munich, Germany, March 2002, pp. 506–513.
[16] J.A. Stankovic, M. Spuri, K. Ramamritham, and G.C. Buttazzo. Deadline Scheduling for Real-Time Systems: EDF and Related Algorithms. Kluwer International Series in Engineering and Computer Science, Vol. 460. Kluwer Academic Publishers, Dordrecht, 1998.
[17] J.-Y. Le Boudec and P. Thiran. Network Calculus — A Theory of Deterministic Queuing Systems for the Internet, Vol. 2050 of Lecture Notes in Computer Science. Springer-Verlag, Heidelberg, 2001.
[18] R.L. Cruz. A calculus for network delay, part I: network elements in isolation. IEEE Transactions on Information Theory, 37, 1991, 114–131.
[19] L. Thiele, S. Chakraborty, and M. Naedele. Real-time calculus for scheduling hard real-time systems. In Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS), Vol. 4. IEEE, 2000, pp. 101–104.
[20] S. Chakraborty, S. Künzli, L. Thiele, A. Herkersdorf, and P. Sagmeister. Performance evaluation of network processor architectures: combining simulation with analytical estimation. Computer Networks, 41, 2003, 641–665.
[21] Alexander Maxiaguine, Simon Künzli, and Lothar Thiele. Workload characterization model for tasks with variable execution demand. In Proceedings of Design, Automation and Test in Europe (DATE). IEEE Press, Paris, France, February 2004, pp. 1040–1045.
[22] Ernesto Wandeler, Alexander Maxiaguine, and Lothar Thiele. Quantitative characterization of event streams in analysis of hard real-time applications. In Proceedings of the 10th IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS). IEEE Computer Society, May 2004, pp. 450–459.
[23] Ernesto Wandeler and Lothar Thiele. Abstracting functionality for modular performance analysis of hard real-time systems. In Asia South Pacific Design Automation Conference (ASP-DAC). IEEE, January 2005.
[24] F. Baccelli, G. Cohen, G. Olsder, and J.-P. Quadrat. Synchronization and Linearity. John Wiley & Sons, New York, 1992.
[25] S. Chakraborty, S. Künzli, and L. Thiele. A general framework for analysing system properties in platform-based embedded system designs. In Proceedings of the 6th Design, Automation and Test in Europe (DATE). Munich, Germany, March 2003.
Power Aware Computing

16 Power Aware Embedded Computing
Margarida F. Jacome and Anand Ramachandran
16 Power Aware Embedded Computing

Margarida F. Jacome and Anand Ramachandran
University of Texas

16.1 Introduction .............................................. 16-1
16.2 Energy and Power Modeling ................................ 16-3
     Instruction- and Function-Level Models • Micro-Architectural Models • Memory and Bus Models • Battery Models
16.3 System/Application Level Optimizations ................... 16-7
16.4 Energy Efficient Processing Subsystems ................... 16-8
     Voltage and Frequency Scaling • Dynamic Resource Scaling • Processor Core Selection
16.5 Energy Efficient Memory Subsystems ....................... 16-11
     Cache Hierarchy Tuning • Novel Horizontal and Vertical Cache Partitioning Schemes • Dynamic Scaling of Memory Elements • Software-Controlled Memories, Scratch-Pad Memories • Improving Access Patterns to Off-Chip Memory • Special Purpose Memory Subsystems for Media Streaming • Code Compression • Interconnect Optimizations
16.6 Summary .................................................. 16-17
References .................................................... 16-17
16.1 Introduction

Embedded systems are pervasive in modern life. State-of-the-art embedded technology drives the ongoing revolution in consumer and communication electronics and underpins substantial innovation in many other domains, including medical instrumentation and process control [1]. The impact of embedded systems in well-established “traditional” industrial sectors, for example, the automotive industry, is also increasing at a fast pace [1,2]. Unfortunately, as Complementary Metal-Oxide Semiconductor (CMOS) technology rapidly scales, enabling the fabrication of ever faster and denser Integrated Circuits (ICs), the challenges that must be overcome to deliver each new generation of electronic products multiply. In the last few years, power dissipation has emerged as a major concern. In fact, projections of power density increases owing to CMOS scaling clearly indicate that this is one of the fundamental problems that will ultimately preclude further scaling [3,4]. Although the power challenge is indeed considerable, much can be done to mitigate the deleterious effects of power dissipation, thus enabling performance and device density to be taken to truly unprecedented levels by the semiconductor industry throughout the next 10 to 15 years.
© 2006 by Taylor & Francis Group, LLC
Embedded Systems Handbook
Power density has a direct impact on packaging and cooling costs, and can also affect system reliability, owing to electromigration and hot-electron degradation effects. Thus, the ability to decrease power density, while offering similar performance and functionality, critically enhances the competitiveness of a product. Moreover, for battery operated portable systems, maximizing battery lifetime translates into maximizing duration of service, an objective of paramount importance for this class of products. Power is thus a primary figure of merit in contemporaneous embedded system design.

Digital CMOS circuits have two main types of power dissipation: dynamic and static. Dynamic power is dissipated when the circuit performs the function(s) it was designed for, for example, logic and arithmetic operations (computation), data retrieval, storage, and transport. Ultimately, all of this activity translates into switching of the logic states held on circuit nodes. Dynamic power dissipation is thus proportional to C · V_DD² · f · r, where C denotes the total circuit capacitance, V_DD and f denote the circuit supply voltage and clock frequency, respectively, and r denotes the fraction of transistors expected to switch at each clock cycle [5,6]. In other words, dynamic power dissipation is impacted to first order by circuit size/complexity, speed/rate, and switching activity. In contrast, static power dissipation is associated with preserving the logic state of circuit nodes between such switching activity, and is caused by subthreshold leakage mechanisms. Unfortunately, as device sizes shrink, the severity of leakage power is increasing at an alarming pace [3]. Clearly, the power problem must be addressed at all levels of the design hierarchy, from system to circuit, as well as through innovations in CMOS device technology [5,7,8].
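As a concrete illustration of this first-order model, the sketch below (with entirely hypothetical capacitance, voltage, frequency, and activity values) shows how the four factors combine, and why supply voltage, entering quadratically, is the most attractive knob:

```python
def dynamic_power(c_total, vdd, f_clk, activity):
    """First-order CMOS dynamic power: P = C * VDD^2 * f * r."""
    return c_total * vdd ** 2 * f_clk * activity

# Illustrative numbers: 1 nF total switched capacitance, 1.2 V supply,
# 200 MHz clock, 15% of nodes switching per cycle.
p_nominal = dynamic_power(1e-9, 1.2, 200e6, 0.15)   # ~43.2 mW
# Halving VDD alone cuts dynamic power by 4x (ignoring the slowdown
# it forces, discussed in Section 16.4.1):
p_scaled = dynamic_power(1e-9, 0.6, 200e6, 0.15)    # ~10.8 mW
```

In practice C, f, and r are themselves design variables, which is why the same formula reappears throughout this chapter in circuit-, bus-, and system-level guises.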
In this survey we provide a snapshot of the state of the art in system- and architecture-level design techniques and methodologies aimed at reducing both static and dynamic power dissipation. Since such techniques focus on the highest level of the design hierarchy, their potential benefits are immense. In particular, at this high level of abstraction, the specifics of each particular class of embedded applications can be considered as a whole and, as will be shown in our survey, such an ability is critical to designing power/energy efficient systems, that is, systems that spend energy strictly when and where it is needed. Broadly speaking, this requires a proper design and allocation of system resources, geared toward addressing critical performance bottlenecks in a power efficient way. Substantial power/energy savings can also be achieved through the implementation of adequate dynamic power management policies, for example, tracking instantaneous workloads (or levels of resource utilization) and "shutting down" idling/unused resources, so as to reduce leakage power, or "slowing down" under-utilized resources, so as to decrease dynamic power dissipation. These are clearly system level decisions/policies, in that their implementation typically impacts several architectural subsystems. Moreover, different decisions/policies may interfere or conflict with each other and, thus, assessing their overall effectiveness requires a system level (i.e., global) view of the problem. A typical embedded system architecture consists of a processing subsystem (including one or more processor cores, hardware accelerators, etc.), a memory subsystem, peripherals, and global and local interconnect structures (buses, bridges, crossbars, etc.). Figure 16.1 shows an abstract view of two such architecture instances.
Broadly speaking, system level design consists of defining the specific embedded system architecture to be used for a particular product, as well as defining how the target embedded application (implementing the required functionality/services) is to be mapped onto that architecture. Embedded systems come in many varieties and with many distinct design optimization goals and requirements. Even when two products provide the same basic functionality, say, video encoding/decoding, they may have fundamentally different characteristics, namely, different performance and quality-of-service requirements, one may be battery operated and the other not, etc. The implications of such product differentiation are of paramount importance when power/energy are considered. Clearly, the higher the system's required performance/speed (defined by metrics such as throughput, latency, bandwidth, response time, etc.), the higher its power dissipation will be. The key objective is thus to minimize the power dissipated to deliver the required level of performance [5,9]. The trade-offs, techniques, and optimizations required to develop such power aware or power efficient designs vary widely across the vast spectrum of embedded systems available in today's market, encompassing many complex decisions driven by system requirements as well as by intrinsic characteristics of the target applications [10].
Power Aware Embedded Computing
[Figure 16.1 here: two block diagrams. The simple architecture comprises an embedded processor core, a modem DSP core, memory, A/D and D/A converters, and I/O ports. The more complex one comprises a primary VLIW embedded processor core, a master-control ASIP core, a sound-codec DSP core, a hardware accelerator (FFT, DCT, ...), a memory controller with RAM and Flash ROM, a host interface, A/D and D/A converters, and I/O ports.]

FIGURE 16.1 Illustrative examples of a simple and a more complex embedded system architecture.
Consider, for example, the design task of deciding on the number and type of processing elements to be instantiated on an embedded architecture, that is, defining its processing subsystem. Power, performance, cost, and time-to-market considerations dictate whether one should rely entirely on readily available processors (i.e., off-the-shelf microcontrollers, Digital Signal Processors (DSPs), and/or general-purpose RISC cores), or should also consider custom execution engines, namely, Application Specific Instruction set Processors (ASIPs), possibly reconfigurable, and/or hardware accelerators (see Figure 16.1). Hardware/software partitioning is a critical step in this process [1]. It consists of deciding which of an application's segments/functions should be implemented in software (i.e., run on a processor core) and which (if any) should be implemented in hardware (i.e., execute on high performance, highly power efficient custom hardware accelerators). Naturally, hardware/software partitioning decisions should reflect the power/performance criticality of each such segment/function. Clearly, this is a complex multiobjective optimization problem defined on a huge design space that encompasses both hardware- and software-related decisions. To compound the problem, the performance and energy efficiency of an architecture's processing subsystem cannot be evaluated in isolation, since its effectiveness can be substantially impacted by the memory subsystem (i.e., the adopted memory hierarchy/organization) and by the interconnect structures supporting communication/data transfers between processing components and to/from the environment in which the system is embedded. Thus, decisions with respect to these other subsystems and components must be made concurrently and assessed jointly. Targeting up front a specific embedded system platform, that is, an architectural "subspace" relevant to a particular class of products/applications, can considerably reduce the design effort [11,12].
Still, the design space remains (typically) so complex that substantial design space exploration may be needed in order to identify power/energy efficient solutions for the specified performance levels. Since time-to-market is critical, methodologies to efficiently drive such an exploration, as well as fast simulators and low-complexity (yet good-fidelity) performance, power, and energy estimation models, are critical to aggressively exploiting effective power/energy driven optimizations within a reasonable time frame.

Our survey starts by providing an overview of state-of-the-art models and tools used to evaluate the goodness of individual system design points. We then discuss power management techniques and optimizations aimed at aggressively improving the power/energy efficiency of the various subsystems of an embedded system.
16.2 Energy and Power Modeling

This section discusses high-level modeling and power estimation techniques aimed at assisting system- and architecture-level design. It would be unrealistic to expect a high degree of accuracy in power estimates
produced during such an early design phase, since accurate power modeling requires detailed physical level information that may not yet be available. Moreover, highly accurate estimation tools (working with detailed circuit/layout-level information) would be too time consuming to allow for any reasonable degree of design space exploration [1,5,13]. Thus, practically speaking, power estimation during early design space exploration should aim at ensuring a high degree of fidelity rather than absolute accuracy. Specifically, the primary objective during this critical exploration phase is to assess the relative power efficiency of different candidate system architectures (populated with different hardware and/or software components), the relative effectiveness of alternative software implementations (of the same functionality), the relative effectiveness of different power management techniques, etc. Estimates that correctly expose such relative power trends across the design space region being explored provide the designer with the necessary information to guide the exploration process.
16.2.1 Instruction- and Function-Level Models

Instruction-level power models are used to assess the relative power/energy efficiency of different processors executing a given target embedded application, possibly with alternative memory subsystem configurations. Such models are thus instrumental during the definition of the main subsystems of an embedded architecture, as well as during hardware/software partitioning. Moreover, instruction-level power models can also be used to evaluate the relative effectiveness of different software implementations of the same embedded application, in the context of a specific embedded architecture/platform. In their most basic form, instruction-level power models simply assign a power cost to each assembly instruction (or class of assembly instructions) of a programmable processor. The overall energy consumed by a program running on the target processor is estimated by summing up the instruction costs over a dynamic execution trace that is representative of the application [14–17]. Instruction-level power models were first developed by experimentally measuring the current drawn by a processor while executing different instruction sequences [14]. During this first modeling effort, it was observed that the power cost of an instruction may actually depend on previous instructions. Accordingly, the instruction-level power models developed in Reference 14 include several inter-instruction effects. Later studies observed that, for certain processors, the power dissipation incurred by the hardware responsible for fetching, decoding, analyzing, and issuing instructions, and then routing and reordering results, was so high that a simpler model, one that only differentiates between instructions that access on-chip resources and those that go off-chip, would suffice [16].
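In its basic form, such a model reduces to a table lookup plus one pass over the trace. The sketch below uses hypothetical per-instruction energies and inter-instruction (circuit-state) overheads; all names and numbers are invented for illustration:

```python
# Hypothetical base energy per instruction class (nJ) and
# inter-instruction overheads for selected adjacent pairs (nJ).
BASE_COST = {"add": 1.0, "mul": 2.5, "load": 3.8, "store": 3.2}
PAIR_OVERHEAD = {("add", "mul"): 0.3, ("mul", "load"): 0.5}

def trace_energy(trace):
    """Sum per-instruction costs over a dynamic execution trace,
    adding the inter-instruction overhead of each adjacent pair."""
    energy = sum(BASE_COST[op] for op in trace)
    for prev, curr in zip(trace, trace[1:]):
        energy += PAIR_OVERHEAD.get((prev, curr), 0.0)
    return energy

e = trace_energy(["add", "mul", "load"])  # 1.0 + 2.5 + 3.8 + 0.3 + 0.5 = 8.1 nJ
```

The quadratic growth of the pair table with the instruction set size is precisely the scalability problem discussed next.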
Unfortunately, power estimation based on instruction-level models can still be prohibitively time consuming during early design space exploration, since it requires collecting and analyzing large instruction traces and, for many processors, considering a quadratically large number of inter-instruction effects. In order to "accelerate" estimation, coarser processor-specific function-level power models were later developed [18]. Such approaches are faster because they rely on the use of macromodels characterizing the average energy consumption of a library of functions/subroutines executing on a target processor [18]. The key challenge in this case is to devise macromodels that can properly quantify the power consumed by each subroutine of interest, as a function of "easily observable parameters." Thus, for example, a quadratic power model of the form an² + bn + c could be first tentatively selected for an insertion sort routine, where n denotes the number of elements to be sorted. Actual power dissipation then needs to be measured for a large number of experiments, run with different values of n. Finally, the values of the macromodel's coefficients a, b, and c are derived using regression analysis, and the overall accuracy of the resulting macromodel is assessed [18].

The "high-level" instruction- and function-level power models discussed so far allow designers to "quickly" assess a large number of candidate system architectures and alternative software implementations, so as to narrow the design space to a few promising alternatives. Once this initial broad exploration is concluded, power models for each of the architecture's main subsystems and components are needed, in order to support the detailed architectural design phase that follows.
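The coefficient-fitting step described above is an ordinary least-squares problem. The routine below fits the quadratic macromodel to hypothetical (here noise-free) measurements via the normal equations; in practice the energies would come from instrumented runs at different values of n:

```python
def fit_quadratic(ns, energies):
    """Least-squares fit of E(n) = a*n**2 + b*n + c via normal equations."""
    rows = [[n * n, n, 1.0] for n in ns]
    # A = X^T X (3x3) and v = X^T y for design matrix X = [n^2, n, 1].
    A = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    v = [sum(r[i] * e for r, e in zip(rows, energies)) for i in range(3)]
    # Solve A x = v by Gaussian elimination with partial pivoting.
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        v[col], v[piv] = v[piv], v[col]
        for r in range(col + 1, 3):
            f = A[r][col] / A[col][col]
            for c in range(col, 3):
                A[r][c] -= f * A[col][c]
            v[r] -= f * v[col]
    x = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        x[r] = (v[r] - sum(A[r][c] * x[c] for c in range(r + 1, 3))) / A[r][r]
    return tuple(x)  # (a, b, c)

# Hypothetical measurements for an insertion-sort routine (energy in mJ):
sizes = [10, 20, 40, 80, 160]
energy = [0.02 * n * n + 0.5 * n + 3.0 for n in sizes]
a, b, c = fit_quadratic(sizes, energy)  # recovers a~0.02, b~0.5, c~3.0
```

Once fitted, evaluating the macromodel is a constant-time operation per subroutine call, which is what makes function-level estimation so much faster than trace-based instruction-level estimation.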
16.2.2 Micro-Architectural Models

Micro-architectural power models are critical to evaluating the impact of different processing subsystem choices on power consumption, as well as the effectiveness of different (micro-architecture level) power management techniques implemented on the various subsystems. In the late 1980s and early 1990s, cycle accurate (or, more precisely, cycle-by-cycle) simulators, such as Simplescalar [19], were developed to study the effect of architectural choices on the performance of general-purpose processors. Such simulators are in general very flexible, allowing designers/architects to explore the complex design space of contemporaneous processors. Namely, they include built-in parameters that can be used to specify the number and mix of functional units to be instantiated in the processor's datapath, the issue width of the machine, the size and associativity of the L1 and L2 caches, etc. By varying such parameters, designers can study the performance of different machine configurations for representative applications/benchmarks. As power consumption became more important, simulators to estimate dynamic power dissipation (e.g., Wattch [20], the Cai–Lim model [21], and Simplepower [22]) were later incorporated into these existing frameworks. Such an integration was performed seamlessly, by directly augmenting the "cycle-oriented" performance models for the various micro-architectural components with corresponding power models. Naturally, the overall accuracy of these simulation-based power estimation techniques is determined by the level of detail of the power models used for the micro-architecture's constituent components. For out-of-order RISC cores, for example, the power consumed in finding independent instructions to issue is a function of the number of instructions currently in the instruction queue and of the actual dependencies between such instructions.
Unfortunately, the use of detailed power models accurately capturing the impact of input and state data on the power dissipated by each component would prohibitively increase the already “long” micro-architectural simulation runtimes. Thus, most state-of-the-art simulators use very simple/straightforward empirical power models for datapath and control logic, and slightly more sophisticated models for regular structures such as caches [20]. In their simplest form, such models capture “typical” or “average” power dissipation for each individual micro-architectural component. Specifically, each time a given component is accessed/used during a simulation run, it is assumed that it dissipates its corresponding “average” power. Slightly more sophisticated power macromodels for datapath components have been proposed in References 23–28, and shown to improve accuracy with a relatively small impact on simulation time. So far, we have discussed power modeling of micro-architectural components, yet a substantial percentage of the overall power budget of a processor is actually spent on the global clock (up to 40 to 45% [5]). Thus, global clock power models must also be incorporated in these frameworks. The power dissipated on global clock distribution is impacted to first order by the number of pipeline registers (and thus by a processor’s pipeline depth) and by global and local wiring capacitances (and thus by a processor’s core area) [5]. Accordingly, different processor cores and/or different configurations of the same core may dissipate substantially different clock distribution power. Power estimates incorporating such numbers are thus critical during processor core selection and configuration. The component-level and clock distribution models discussed so far are used to estimate the dynamic power dissipation of a target micro-architecture. 
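In its simplest "average power per access" form, the simulator-side accounting described above amounts to charging a fixed energy each time a component is exercised, plus a per-cycle global clock charge. A minimal sketch, with invented component names and energies:

```python
# Hypothetical average energy per access for each component (nJ).
AVG_ACCESS_ENERGY = {"alu": 0.10, "regfile": 0.05,
                     "icache": 0.30, "dcache": 0.35}

def simulate_energy(cycle_accesses, clock_energy_per_cycle=0.25):
    """cycle_accesses: one list of accessed component names per cycle.
    Charges the global clock every cycle and each accessed component
    its 'average' per-access energy."""
    energy = 0.0
    for accesses in cycle_accesses:
        energy += clock_energy_per_cycle               # clock: every cycle
        energy += sum(AVG_ACCESS_ENERGY[c] for c in accesses)
    return energy

# Three cycles of a toy trace (fetch+issue, memory access, stall):
e = simulate_energy([["icache", "regfile", "alu"],
                     ["icache", "dcache"],
                     []])  # (0.25+0.45) + (0.25+0.65) + 0.25 = 1.85 nJ
```

Note that the stall cycle still pays the clock charge, reflecting the observation above that clock distribution is consumed regardless of useful work, unless clock gating is modeled explicitly.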
Yet, as mentioned earlier, static/leakage power dissipation is becoming a major concern, and thus, micro-architectural techniques aimed at reducing leakage power are increasingly relevant. Models to support early estimation of static power dissipation emerged along the same lines as those used for dynamic power dissipation. The “Butts–Sohi” model, which is one of the most influential static power models developed so far, quantifies static energy in CMOS circuits/components using a lumped parameter model that maps technology and design effects into corresponding characterizing parameters [29]. Specifically, static power dissipation is modeled as VDD · N · kdesign · Ileak , where VDD is the supply voltage and N denotes the number of transistors in the circuit. kdesign is the design dependent parameter — it captures “circuit style” related characteristics of a component, including average transistor aspect ratio, average number of transistors switched off during “normal/typical” component operation, etc. Finally, Ileak is the technology dependent parameter. It accounts for the impact
of threshold voltage, temperature, and other key parameters on leakage current, for a specific fabrication process. From a system designer's perspective, static power can be reduced by lowering the supply voltage (VDD), and/or by power supply gating or VDD gating (as opposed to clock gating) unused/idling devices (N). Integrating models for estimating static power dissipation into cycle-by-cycle simulators thus enables embedded system designers to analyze critical static power versus performance trade-offs enabled by "power aware" features available in contemporaneous processors, such as dynamic voltage scaling and selective datapath (re)configuration. An improved version of the "Butts–Sohi" model, called "HotLeakage," which can dynamically recalculate leakage currents (as temperature and voltage change owing to operating conditions and/or dynamic voltage scaling), has been integrated into the Simplescalar simulation framework, enabling such high-level trade-offs to be explored by embedded system designers [30].
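A minimal rendering of the Butts–Sohi lumped model, with entirely illustrative parameter values, shows how the two system-level levers just mentioned (lowering VDD, and VDD-gating devices to shrink the effective N) enter the estimate:

```python
def static_power(vdd, n_transistors, k_design, i_leak):
    """Butts-Sohi lumped model: P_static = VDD * N * k_design * I_leak."""
    return vdd * n_transistors * k_design * i_leak

# Illustrative values: 1.2 V supply, a 2M-transistor component,
# design factor 0.1, normalized leakage current 5 nA per device.
p_leak = static_power(1.2, 2e6, 0.1, 5e-9)    # 1.2 mW
# VDD-gating half of the devices halves the effective N:
p_gated = static_power(1.2, 1e6, 0.1, 5e-9)   # 0.6 mW
```

In the full model, k_design is characterized per circuit style and I_leak per process corner; a HotLeakage-style refinement would additionally recompute i_leak as temperature and voltage evolve during simulation.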
16.2.3 Memory and Bus Models

Storage elements, such as caches, register files, queues, buffers, and tables constitute a substantial part of the power budget of contemporaneous embedded systems [31]. Fortunately, the high regularity of some such memory structures (e.g., caches) permits the use of simple, yet reasonably accurate power estimation techniques, relying on automatically synthesized "structural designs" for such components. The Cache Access and Cycle Time (CACTI) framework implements this synthesis-driven power estimation paradigm. Specifically, given a specific cache hierarchy configuration (defined by parameters such as cache size, associativity, and line size), as well as information on the minimum feature size of the target technology [32], it internally generates a coarse structural design for that cache configuration. It then derives delay and power estimates for that particular design, using parameterized built-in C models for the various constituent elements, namely, SRAM cells, row and column decoders, word and bit lines, precharge circuitry, etc. [33,34]. CACTI's synthesis algorithms used to generate the structural design of the memory hierarchy (which include defining the aspect ratio of memory blocks, the number of instantiated sub-banks, etc.) have been shown to consistently deliver reasonably "good" designs across a large range of cache hierarchy parameters [34]. CACTI can thus be used to "quickly" generate power estimates (starting from high-level architectural parameters) exhibiting reasonably good fidelity over a large region of the design space. During design space exploration, the designer may thus consider a number of alternative L1 and L2 cache configurations and use CACTI to obtain "access-based" power dissipation estimates for each such configuration with good fidelity.
Naturally, the memory access traces used by CACTI should be generated by a micro-architecture simulator (e.g., Simplescalar) working with a memory simulator (e.g., Dinero [35]), so that they reflect the bandwidth requirements of the embedded application of interest.

Buses are also a significant contributor to dynamic power dissipation [5,36]. The dynamic power dissipation on a bus is proportional to C · V_DD² · f_a, where C denotes the total capacitance of the bus (including metal wires and buffers), V_DD denotes the supply voltage, and f_a denotes the average switching frequency of the bus [36]. In this high-level model, the average switching frequency of the bus (f_a) is defined by the product of two terms, namely, the average number of bus transitions per word, and the bus frequency (given in bus words per second). The average number of bus transitions per word can be estimated by simulating sample programs and collecting the corresponding transition traces. Although this model is coarse, it may suffice during the early design phases under consideration.
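The bus model above is simple enough to sketch directly; the helper below extracts the average toggles per word from a hypothetical trace of bus words, and all electrical values are illustrative:

```python
def bus_power(c_bus, vdd, transitions_per_word, words_per_second):
    """Bus dynamic power ~ C * VDD^2 * f_a, where f_a is the average
    number of transitions per word times the bus word rate."""
    f_a = transitions_per_word * words_per_second
    return c_bus * vdd ** 2 * f_a

def avg_transitions(trace):
    """Average number of bit toggles between consecutive bus words."""
    toggles = [bin(a ^ b).count("1") for a, b in zip(trace, trace[1:])]
    return sum(toggles) / len(toggles)

# Hypothetical 4-word trace: (4 + 8 + 4) / 3 toggles per word.
t = avg_transitions([0x00, 0x0F, 0xF0, 0xFF])
p_bus = bus_power(10e-12, 1.2, t, 100e6)  # 10 pF bus at 100 Mwords/sec
```

The same transition-counting view also motivates low-power bus encodings, which reduce the transitions-per-word term without touching C, V_DD, or the word rate.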
16.2.4 Battery Models

The capacity of a battery is a nonlinear function of the current drawn from it, that is, if one increases the average current drawn from a battery by a factor of two, the "remaining" deliverable battery capacity, and thus its lifetime, decreases by more than half. Peukert's formula models such nonlinear behavior by defining the capacity of a battery as k/I^α, where k is a constant depending on the battery design, I is the
discharge current, and α quantifies the "nonideal" behavior of the battery [37].¹ More effective system-level trade-offs between quality/performance and duration of service can be implemented by taking such nonlinearity (also called the rate-capacity effect) into consideration. However, in order to properly evaluate the effectiveness of such techniques/trade-offs during system level design, adequate battery models and metrics are needed. Energy-delay product is a well-known metric used to assess the energy efficiency of a system. It basically quantifies a system's performance loss per unit gain in energy consumption. To emphasize the importance of accurately exploring key trade-offs between battery lifetime and system performance, a new metric, viz., battery-discharge delay product, has recently been proposed [38]. Task scheduling strategies and dynamic voltage scaling policies adopted for battery operated embedded systems should thus aim at minimizing battery-discharge delay product, rather than energy-delay product, since the former captures the important rate-capacity effect alluded to above, whereas energy-delay product is insensitive to it. Yet, a metric such as battery-discharge delay product requires the use of precise/detailed battery models. One such detailed battery model has recently been proposed, which can predict the remaining capacity of a rechargeable lithium-ion battery in terms of several critical factors, viz., discharge rate (current), battery output voltage, battery temperature, and cycle age (i.e., the number of times a battery has been charged and discharged) [39].
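Peukert's formula is simple enough to sketch directly. The design constant below is invented; α = 0.7 corresponds to a strongly nonideal battery [37]:

```python
def peukert_capacity(k, current, alpha):
    """Deliverable battery capacity ~ k / I**alpha (Peukert's formula)."""
    return k / current ** alpha

# Illustrative design constant k = 1000. For an ideal battery (alpha = 0),
# capacity does not depend on the discharge current:
ideal = peukert_capacity(1000.0, 2.0, 0.0)    # 1000.0
# For a strongly nonideal battery (alpha = 0.7), doubling the current
# from 1 A to 2 A shrinks the deliverable capacity...
cap_1a = peukert_capacity(1000.0, 1.0, 0.7)   # 1000.0
cap_2a = peukert_capacity(1000.0, 2.0, 0.7)   # ~615.6
# ...and the duration of service (capacity / current) by more than half:
life_1a, life_2a = cap_1a / 1.0, cap_2a / 2.0
```

This is the rate-capacity effect: service time falls super-linearly in the drawn current, which is exactly what the battery-discharge delay product metric is designed to capture and the energy-delay product misses.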
16.3 System/Application Level Optimizations

When designing an embedded system, in particular a battery powered one, it may be useful to explore different task implementations exhibiting different power/energy versus quality-of-service characteristics, so as to provide the system with the desired functionality while meeting cost, battery lifetime, and other critical requirements. Namely, one may be interested in trading off accuracy for energy savings on a handheld GPS system, or image quality for energy savings on an image decoder, etc. [40–42]. Such application level "tuning/optimizations" may be performed statically (i.e., one may use a single implementation for each task) or at runtime, under the control of a system level power manager. For example, if the battery level drops below a certain threshold, the power manager may drop some services and/or swap some tasks to less power hungry ("lower quality") software versions, so that the system can remain operational for a specified window of additional time. An interesting example of such dynamic power management was implemented for the Mars Pathfinder, an unmanned space robot that draws power from both a nonrechargeable battery and solar cells [43]. In this case, the power manager tracks the power available from the solar cells, ensuring that most of the robot's active work is done during daylight, since the solar energy cannot be stored for later use. In addition to dropping and/or swapping tasks, a system's dynamic power manager may also shut down or slow down subsystems/modules that are idling or under-utilized. The Advanced Configuration and Power Interface (ACPI) is a widely adopted standard that specifies an interface between an architecture's power managed subsystems/modules (e.g., display drivers, modems, hard disk drivers, processors, etc.) and the system's dynamic power manager.
It is assumed that the power managed subsystems have at least two power states (ACTIVE and STANDBY) and that one can dynamically switch between them. Using ACPI, one can implement virtually any power management policy, including fixed time-out, predictive shutdown, predictive wake-up, etc. [44]. Since the transition from one state to another may take a substantial amount of time and consume a nonnegligible amount of energy, the selection of a proper power management policy for the various subsystems is critical to achieving good energy savings with a negligible impact on performance. The so-called time-out policy, widely used in modern computers, simply switches a subsystem to STANDBY mode when the elapsed time after the last utilization reaches a given fixed threshold. Predictive shutdown uses the previous history of the subsystem to predict the next

¹For an "ideal" battery, that is, a battery whose capacity is independent of the way current is drawn, α = 0, while for a real battery α may be as high as 0.7 [37].
expected idle time and, based on that, decides whether or not the subsystem should be shut down. More sophisticated policies, using stochastic methods [40–42,45], can also be implemented, yet they are more complex, and thus the power spent running the associated dynamic power management algorithms may render them inadequate or inefficient for certain classes of systems. The system-level techniques discussed here act on each subsystem as a whole. Although the effectiveness of such techniques has been demonstrated across a wide variety of systems, finer-grained self-monitoring techniques, implemented at the subsystem level, can substantially add to these savings. Such techniques are discussed in the following sections.
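To make the fixed time-out policy concrete, the sketch below charges active power while waiting out the threshold, standby power afterward, and a wake-up penalty on the next use. All power and energy numbers are invented for illustration:

```python
# Hypothetical subsystem parameters: active power (W), standby
# power (W), and the energy cost of waking back up (J).
P_ACTIVE, P_STANDBY, E_WAKEUP = 1.0, 0.05, 0.8

def timeout_policy_energy(idle_gaps, threshold):
    """Energy spent over a trace of idle-gap lengths (seconds)
    under a fixed time-out policy with the given threshold."""
    energy = 0.0
    for gap in idle_gaps:
        if gap <= threshold:
            energy += gap * P_ACTIVE                 # never shut down
        else:
            energy += threshold * P_ACTIVE           # wait out the time-out,
            energy += (gap - threshold) * P_STANDBY  # then go to STANDBY,
            energy += E_WAKEUP                       # and pay wake-up later
    return energy

e = timeout_policy_energy([0.5, 3.0, 10.0], threshold=2.0)
# = 0.5 + (2.0 + 0.05 + 0.8) + (2.0 + 0.4 + 0.8) = 6.55 J
```

Sweeping the threshold over a representative idle-gap trace exposes the core trade-off: a short threshold wastes wake-up energy on short gaps, a long one wastes active power on long gaps. A predictive policy replaces the fixed threshold with a per-gap shutdown decision based on history.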
16.4 Energy Efficient Processing Subsystems

As mentioned earlier, hardware/software codesign methodologies partition the functionality of an embedded system into hardware and software components. Software components, executing on programmable microcontrollers or processors (either general purpose or application specific), are the preferred solution, since their use can substantially reduce design and manufacturing costs, as well as shorten time-to-market. Custom hardware components are typically used only when strictly necessary, namely, when an embedded system's power budget and/or performance constraints preclude the use of software. Accordingly, a large number of power aware processor families are available in today's market, providing a vast gamut of alternatives suiting the requirements/needs of most embedded systems [46,47]. In the sequel we discuss power aware features available in contemporaneous processors, as well as several power related issues relevant to processor core selection.
16.4.1 Voltage and Frequency Scaling

Dynamic power consumption in a processor (be it general purpose or application specific) can be decreased by reducing two of its key contributors, viz., supply voltage and clock frequency. In fact, since the power dissipated in a CMOS circuit is proportional to the square of the supply voltage, the most effective way to reduce power is to scale down the supply voltage. Note, however, that the propagation delay across a CMOS transistor is proportional to VDD/(VDD − VT)², where VDD is the supply voltage and VT is the threshold voltage. So, unfortunately, as the supply voltage decreases, the propagation delay increases, and so the clock frequency (i.e., speed) may need to be decreased as well [48]. Accordingly, many contemporaneous processor families, such as Intel's XScale [49], IBM's PowerPC 405LP [50], and Transmeta's Crusoe [51], offer dynamic voltage and frequency scaling features. For example, the Intel 80200 processor, which belongs to the XScale family of processors mentioned earlier, supports a software programmable clock frequency. Specifically, the voltage can be varied from 1.0 to 1.5 V, in small increments, with the frequency varying correspondingly from 200 to 733 MHz, in steps of 33/66 MHz. The simplest way to take advantage of these scaling features is to carefully identify the "smallest" supply voltage (and corresponding operating frequency) that guarantees that the target embedded application meets its timing constraints, and run the processor at that fixed setting. If the workload is reasonably constant throughout execution, this simple scheme may suffice. However, if the workload varies substantially during execution, more sophisticated techniques that dynamically adapt the processor's voltage and frequency to the varying workload can deliver more substantial power savings.
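The "fixed setting" scheme just described amounts to scanning a table of operating points for the slowest one that still meets the deadline. The table below is hypothetical (loosely patterned on the XScale-class range quoted above), as is the cycle budget:

```python
# Hypothetical operating points (frequency in MHz, voltage in V),
# listed in ascending frequency order.
OPERATING_POINTS = [(200, 1.0), (333, 1.1), (466, 1.2),
                    (600, 1.3), (733, 1.5)]

def lowest_feasible_setting(cycles, deadline_s):
    """Pick the slowest (hence lowest-voltage) operating point whose
    execution time for the given cycle budget meets the deadline."""
    for freq_mhz, vdd in OPERATING_POINTS:
        if cycles / (freq_mhz * 1e6) <= deadline_s:
            return freq_mhz, vdd
    return OPERATING_POINTS[-1]   # infeasible: best effort at full speed

# A 300M-cycle task with a 1 s deadline fits at the second point:
setting = lowest_feasible_setting(3e8, 1.0)   # (333, 1.1)
```

Since dynamic power scales with V² · f, running at (333 MHz, 1.1 V) rather than (733 MHz, 1.5 V) and idling is far cheaper, which is the essence of the "race versus pace" argument illustrated in Figure 16.2.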
Naturally, when designing such techniques it is critical to consider the cost of transitioning from one setting to another, that is, the delay and power consumption overheads incurred by each transition. For example, for the Intel 80200 processor, changing the processor frequency can take up to 20 µsec, while changing the voltage can take up to 1 msec [49]. Most processors developed for the mobile/portable market already support some form of built-in mechanism for voltage/frequency scaling. Intel's SpeedStep technology, for example, detects whether the system is currently plugged into a power outlet or running on a battery, and based on that, either runs the
[Figure 16.2 here: two power-versus-time plots sharing a common task deadline. Without scaling, the processor runs at high voltage and high frequency, finishing well before the deadline and leaving idle time; with scaling, it runs at low voltage and low frequency, finishing just at the deadline while dissipating far less power.]

FIGURE 16.2 Power consumption without and with dynamic voltage and frequency scaling.
processor at the highest voltage/frequency or switches it to a less power hungry mode. Transmeta's Crusoe processors offer a power manager called LongRun [51], which is implemented in the processor's firmware. LongRun relies on the historical utilization of the processor to guide clock rate selection: it increases the processor's clock frequency if the current utilization is high and decreases it if the utilization is low. More sophisticated/aggressive dynamic scaling techniques should vary the core's voltage/frequency based on some predictive strategy, while carefully monitoring performance, so as to ensure that it does not drop below a certain threshold and/or to ensure that task deadlines are not consistently missed [52–55] (see Figure 16.2). The use of such voltage/frequency scaling techniques must necessarily rely on adequate dynamic workload prediction and performance metrics, and thus requires the direct intervention of the operating system and/or of the applications themselves. Although more complex than the simple schemes discussed above, several such predictive techniques have been shown to deliver substantial gains for applications with well-defined task deadlines, for example, hard/soft real-time systems. Simple interval-based prediction schemes take the amount of idle time in a previous interval as a measure of the processor's utilization, use it to predict the utilization for the next time interval, and decide accordingly on the voltage/frequency settings to be used throughout that interval's duration. Naturally, many critical issues must be factored in when defining the duration of such an interval (or "prediction window"), including the overhead costs associated with switching voltage/frequency settings. While a prediction scheme based on a single interval may deliver substantial power gains with marginal loss in performance [52], looking exclusively at a single interval may not suffice for many applications.
Namely, the voltage/frequency setting may end up oscillating inefficiently between different levels [53]. Simple smoothing techniques, for example, using an exponential moving average of previous intervals, can be employed to mitigate the problem [53,56]. Finally, note that the benefits of dynamic voltage and frequency scaling are not limited to reducing dynamic power consumption. When the voltage/frequency of a battery operated system is lowered, the instantaneous current drawn by the processor decreases accordingly, leading to a more effective utilization of the battery capacity, and thus to an increased duration of service.
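An interval-based predictor with exponential-moving-average smoothing, as discussed above, can be sketched as follows. The frequency steps loosely mirror the 80200's 200 to 733 MHz range, but the intermediate step values, the utilization trace, and the smoothing factor are all hypothetical.

```python
# Hypothetical frequency steps loosely modeled on the Intel 80200 range.
FREQ_STEPS = [200, 333, 466, 600, 733]  # MHz

def pick_frequency(utilization, freq_steps=FREQ_STEPS):
    """Choose the slowest step that still covers the predicted demand."""
    demand = utilization * freq_steps[-1]
    for f in freq_steps:
        if f >= demand:
            return f
    return freq_steps[-1]

def dvfs_schedule(busy_fractions, alpha=0.5):
    """Interval-based DVFS: smooth each interval's observed utilization
    with an exponentially weighted moving average to avoid oscillation."""
    ema, chosen = None, []
    for u in busy_fractions:
        ema = u if ema is None else alpha * u + (1 - alpha) * ema
        chosen.append(pick_frequency(ema))
    return chosen

trace = [0.9, 0.2, 0.9, 0.2, 0.9, 0.2]      # oscillating workload
print(dvfs_schedule(trace, alpha=1.0))      # no smoothing: flip-flops
print(dvfs_schedule(trace, alpha=0.3))      # smoothed: settles mid-range
```

With alpha = 1.0 the schedule ping-pongs between the extreme settings every interval, paying the transition overhead each time; a smaller alpha damps the prediction and keeps the chosen frequencies in a narrower band.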
16.4.2 Dynamic Resource Scaling

Dynamic resource scaling refers to exploiting adaptive, fine-grained hardware resource reconfiguration techniques in order to improve power efficiency. Dynamic resource scaling requires enhancing the microarchitecture with the ability to selectively "disable" components, fully or partially, through either clock gating or VDD gating.2 The effectiveness of dynamic resource scaling is predicated on the fact that many applications have variable workloads, that is, execution phases with substantial Instruction-Level Parallelism (ILP) and other phases with much less inherent parallelism. Thus, by dynamically "scaling down" microarchitecture components during such "low activity" periods, substantial power savings can potentially be achieved.

2 Note that, in contrast to clock gating, all state information is lost when a circuit is VDD gated.
Techniques for reducing static power dissipation on processor cores can definitely exploit resource scaling features, once these become widely available on processors. Namely, under-utilized or idling resources can be partially or fully VDD gated, thus reducing leakage power. Several utilization-driven techniques have already been proposed that can selectively shut down functional units, segments of register files, and other datapath components when such conditions arise [57–60]. Naturally, the delay overhead incurred by power supply gating should be carefully factored into these techniques, so that the corresponding static energy savings are achieved with only a small degradation in performance. Dynamic energy consumption can also be reduced by dynamically "scaling down" power hungry microarchitecture components, for example, reducing the size of the issue window and/or reducing the effective width of the pipeline during periods of low "activity" (say, when low ILP code is being executed). Moreover, if one is able to scale down those microarchitecture components that define the critical path delay of the machine's pipeline (typically located in the rename and window access stages), additional opportunities for voltage scaling, and thus dynamic power savings, can be created. This is, thus, a very promising area still undergoing intensive research [61].
16.4.3 Processor Core Selection

The power aware techniques discussed so far can be broadly applied to general purpose as well as application specific processors. Naturally, the selection of the processor core(s) to be instantiated in the architecture's processing subsystem is also likely to have a substantial impact on overall power consumption, particularly when computation intensive embedded systems are considered. A plethora of programmable processing elements, including microcontrollers, general-purpose processors, DSPs, and ASIPs, addressing the specific needs of virtually every segment of the embedded systems market, is currently offered by vendors. As alluded to above, for computation intensive embedded systems with moderately high to stringent timing constraints, the selection of processor cores is particularly critical, since high performance usually implies high levels of power dissipation. For these systems, ASIPs and DSPs have the potential to be substantially more energy efficient than their general-purpose counterparts, yet their "specialized" nature poses significant compilation challenges3 [62]. In contrast, general-purpose processors are easier to compile to, being typically shipped with good optimizing compilers, as well as debuggers and other development tools. Unfortunately, their "generality" incurs a substantial power overhead. In particular, high-performance general-purpose processors require the use of power hungry hardware assists, including reservation stations, reorder buffers and rename logic, and complex branch prediction logic to alleviate control stalls. Still, their flexibility and high quality development tools are very attractive for systems with stringent time-to-market constraints, making them definitely relevant for embedded systems. IBM/Motorola's PowerPC family and the ARM family are examples of general-purpose processors enhanced with power aware features that are widely used in modern embedded systems.
A plethora of specialized/customizable processors is also offered by several vendors, including specialized media cores from Philips, Trimedia, MIPS, etc., DSP cores offered by Texas Instruments, StarCore, and Motorola, and customizable cores from Hewlett-Packard-STMicroelectronics and Tensilica. The Instruction Set Architectures (ISAs) and features offered on these specialized processors can vary substantially, since they are designed and optimized for different classes of applications. However, several of them, including TI's TMS320C6x family, HP-ST's Lx, the StarCore and Trimedia families, and Philips's Nexperia, use a "Very Long Instruction Word" (VLIW) [63,64] or "Explicitly Parallel Instruction Computing" (EPIC) [65] paradigm. One of the key differences between VLIW and superscalar architectures is that VLIW machines rely on the compiler to extract ILP, and then schedule and bind instructions to functional units statically, while their high performance superscalar counterparts use dedicated (power hungry) hardware to perform runtime dependence checking, instruction reordering, etc. Thus, in broad

3 Several DSP/ASIP vendors provide preoptimized assembly libraries to mitigate this problem.
terms, VLIW machines eliminate power hungry microarchitecture components by moving the corresponding functionality to the compiler.4 Moreover, wide VLIW machines are generally organized as a set of small clusters with local register files.5 Thus, in contrast to traditional superscalar machines, which rely on power hungry multiported monolithic register files, multicluster VLIW machines scale better with increasing issue widths, that is, dissipate less dynamic power and can work at faster clock rates. Yet, they are harder to compile to [66–72]. In summary, the VLIW paradigm works very well for many classes of embedded applications, as attested to by the large number of VLIW processors currently available in the market. However, it poses substantial compilation challenges, some of which are still undergoing active research [8,73,74]. Fortunately, many processing intensive embedded applications have only a few time critical loop kernels. Thus, only a very small percentage of the overall code needs to be actually subjected to the complex, time consuming compiler optimizations required by VLIW machines. In the case of media applications, for example, such loop kernels may represent as little as 3% of the overall program, and yet take up to 95% of the execution time [75]. When the performance requirements of an embedded system are extremely high, using a dedicated coprocessor aimed at accelerating the execution of time-critical kernels (under the control of a host computer) may be the only feasible solution. Imagine, one such programmable coprocessor, was designed to accelerate the execution of kernels of streaming media applications [76]. It can deliver up to 20 GFLOPS at a relatively low power cost (2 GFLOPS/W), yet such power efficiency does not come without a cost [76].
As expected, Imagine has a complex programming paradigm that requires extracting all time-critical kernels from the target application and carefully reprogramming them using Imagine's "stream-oriented" coding style, so that both data and instructions can be efficiently routed from the host to the accelerator. Programming Imagine thus requires a substantial effort, yet its power efficiency makes it very attractive for systems that demand such high levels of performance. Finally, at the highest end of the performance spectrum, one may need to consider using fully customized hardware accelerators. Comprehensive methodologies to design such accelerators are discussed in detail in Reference 77.
16.5 Energy Efficient Memory Subsystems

While processor speeds have been increasing at a very fast rate (∼60% a year), memory performance has increased at a comparatively modest rate (∼7% a year), leading to the well-known "processor-memory performance gap" [31]. In order to alleviate the memory access latency problem, modern processor designs use increasingly large on-chip caches, with up to 60% of the transistors dedicated to on-chip memory and support circuitry [31]. As a consequence, power dissipation in the memory subsystem accounts for a substantial fraction of the energy consumed by modern processors. A study targeting the StrongARM SA-110, a low-power processor widely used in embedded systems, revealed that more than 40% of the processor's power budget is taken up by on-chip data and instruction caches [78]. For high-performance general-purpose processors, this percentage is even higher, with up to 90% of the power budget consumed by memory elements and circuits aimed at alleviating the aforementioned memory bottleneck [31]. Power aware memory designs have thus received considerable attention in recent years.
16.5.1 Cache Hierarchy Tuning

The energy cost of accessing data/instructions from off-chip memories can be as much as two orders of magnitude higher than that of an access to on-chip memory [79]. By retaining instructions and data with high spatial and/or temporal locality on-chip, caches can substantially reduce the number of costly off-chip data transfers, thus leading to potentially quite substantial energy savings.

4 Since functional unit binding decisions are made by the compiler, VLIW code is larger than RISC/superscalar code. We will discuss techniques to address this problem later in our survey.
5 A cluster is a set of functional units connected to a local register file. Clusters communicate with each other through a dedicated interconnection network.
The "one-size-fits-all" nature of the general-purpose domain dictates that one should use large caches with a high degree of associativity, so as to try to ensure high hit rates (and thus low average memory access latencies) for as many applications as possible. Unfortunately, as one increases cache size and associativity, larger circuits and/or more circuits are activated on each access to the cache, leading to a corresponding increase in dynamic energy consumption. Clearly, in the context of embedded systems, one can do much better. Specifically, by carefully tuning/scaling the configuration of the cache hierarchy, so that it more efficiently matches the bandwidth requirements and access patterns of the target embedded application, one can essentially achieve memory access latencies similar to those delivered by "larger" (general purpose) memory subsystems, while substantially decreasing the average energy cost of such accesses. Since several of the cache hierarchy parameters exhibit conflicting trends, an aggressive design space exploration over many candidate cache configurations is typically required in order to properly tune the cache hierarchy of an embedded system. Figure 16.3 summarizes the results of one such design space exploration performed for a media application. Namely, the graph in Figure 16.3 plots the energy-delay product metric for a wide range of L1 and L2 on-chip D-cache configurations. The delay term is given by the number of cycles taken to process a representative data set from start to completion. The energy term accounts for the energy spent on data memory accesses for the particular execution trace.6 As can be seen, the design space is very complex, reflecting the conflicting trends alluded to above. For this case study, the best cache hierarchy configurations exhibit an energy-delay product that is about one order of magnitude better (i.e., lower) than that of the worst configurations.
Perhaps even more interesting is that some of the worst memory subsystem configurations use quite large caches with
FIGURE 16.3 Design space exploration: energy-delay product for various L1 and L2 D-cache configurations for a JPEG application running on an XScale-like processor core. [Bar plot of energy × delay (mJ × cycles × 100,000) over L1 configurations (4/16/32 KB, 32 B lines, 1/2-way, 1 K/4 K/16 K/32 K kill windows) and L2 configurations (8/16/32/64 KB, 64/128/256 B lines, 1/2-way, kill windows).]

6 Specifically, the energy term accounts for accesses to the L1 and L2 on-chip D-caches, and to main memory.
a high degree of associativity, clearly indicating that no substantial performance gains would be achieved (for this particular media application) by using such aggressively dimensioned memory subsystems. For many embedded systems/applications, the power efficiency of the memory subsystem can be improved even more aggressively, yet this requires the use of novel (nonstandard) memory system designs, as discussed in the sections that follow.
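The exploration described above can be illustrated with a toy analytical sweep. All per-access energies, miss rates, and latencies below are invented placeholders; a real study would obtain them from a cache simulator driven by an execution trace. The sketch still shows the characteristic effect: the largest configuration is not the energy-delay winner.

```python
from itertools import product

# Toy per-access energies (nJ) and miss rates -- invented numbers, not
# measurements.  Bigger caches miss less but cost more energy per access.
L1_SIZES = {4: (0.5, 0.20), 16: (0.9, 0.10), 32: (2.5, 0.06)}  # KB -> (nJ, miss)
L2_SIZES = {64: (2.0, 0.30), 256: (4.0, 0.12)}                 # KB -> (nJ, local miss)
DRAM_ENERGY, DRAM_CYCLES = 50.0, 100
L1_CYCLES, L2_CYCLES = 1, 10

def energy_delay(l1_kb, l2_kb, accesses=1_000_000):
    """Energy x delay product for one (L1, L2) configuration."""
    e1, m1 = L1_SIZES[l1_kb]
    e2, m2 = L2_SIZES[l2_kb]
    l2_refs = accesses * m1
    dram_refs = l2_refs * m2
    energy = accesses * e1 + l2_refs * e2 + dram_refs * DRAM_ENERGY
    cycles = accesses * L1_CYCLES + l2_refs * L2_CYCLES + dram_refs * DRAM_CYCLES
    return energy * cycles

best = min(product(L1_SIZES, L2_SIZES), key=lambda c: energy_delay(*c))
print("best (L1 KB, L2 KB):", best)
```

In this toy model the 16 KB L1 beats the 32 KB L1: its slightly higher miss rate is outweighed by the cheaper per-access energy, mirroring the conflicting trends the text describes.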
16.5.2 Novel Horizontal and Vertical Cache Partitioning Schemes

In recent years, several novel cache designs have been proposed to aggressively reduce the average dynamic energy consumption incurred by memory accesses. Energy efficiency is improved in these designs by taking direct advantage of specific characteristics of target classes of applications. The memory footprint of instructions and data in media applications, for example, tends to be very small, thus creating unique opportunities for energy savings [75]. Since streaming media applications are pervasive in today's portable electronics market, they have been a preferred application domain for validating the effectiveness of such novel cache designs. Vertical partitioning schemes [80–83], as the name suggests, introduce additional small buffers/caches before the first level of the "traditional" memory hierarchy. For applications with "small" working sets, this strategy can lead to considerable dynamic power savings. A concrete example of a vertical partitioning scheme is the filter cache [84], a very small cache placed in front of the standard L1 data cache. If the filter cache is properly dimensioned, dynamic energy consumption in the memory hierarchy can be substantially reduced, not only by accessing most of the data from the filter cache, but also by powering down (clock gating) the L1 D-cache to a STANDBY mode during periods of inactivity [84]. Although switching the L1 D-cache to STANDBY mode results in delay/energy penalties when there is a miss in the filter cache, it was observed that, for media applications, the energy-delay product improved quite significantly when the two techniques were combined. Predecoded instruction buffers [85] and loop buffers [86] are variants of the vertical partitioning scheme discussed above, applied instead to instruction caches (I-caches).
The key idea of the first scheme is to store recently used instructions in an instruction buffer, in decoded form, so as to reduce the average dynamic power spent on fetching and decoding instructions. The second scheme allows one to hold time-critical loop bodies (identified a priori by the compiler or by the programmer) in small, and thus energy efficient, dedicated loop buffers. Horizontal partitioning schemes refer to the placement of additional (small) buffers or caches at the same level as the L1 cache. For each memory reference, the appropriate (level one) cache to be accessed is determined by dedicated decoding circuitry residing between the processor core and the memory hierarchy. Naturally, the method used to partition data across the set of first level caches should ensure that the cache selection logic is simple, and thus that cache access times are not significantly affected. Region-based caches implement one such "horizontal partitioning" scheme, by adding two small 2 KB L1 D-caches to the first level of the memory hierarchy, one for stack and one for global data. This arrangement has also been shown to achieve substantial gains in dynamic energy consumption for streaming media applications, with a negligible impact on performance [87].
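A back-of-the-envelope model of the filter-cache idea described above, with invented per-access energies: hits in the tiny filter cache avoid probing the larger (and more expensive) L1, but a poor filter hit rate makes the scheme counterproductive.

```python
# Toy energy model for a filter cache in front of L1 (all numbers invented).
E_FILTER, E_L1, E_STANDBY_WAKE = 0.1, 1.0, 0.2   # nJ per event

def dynamic_energy(accesses, filter_hit_rate, use_filter=True):
    """Dynamic energy of the first memory level(s) for a stream of accesses."""
    if not use_filter:
        return accesses * E_L1
    hits = accesses * filter_hit_rate
    misses = accesses - hits
    # Every access probes the filter cache; misses also wake and probe L1.
    return accesses * E_FILTER + misses * (E_L1 + E_STANDBY_WAKE)

base = dynamic_energy(1_000_000, 0.0, use_filter=False)
with_filter = dynamic_energy(1_000_000, 0.85)
print(f"energy with filter cache: {with_filter / base:.0%} of baseline")
```

With an 85% filter hit rate the toy model spends under a third of the baseline energy; with a 0% hit rate it spends more than the baseline, which is why proper dimensioning of the filter cache matters.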
16.5.3 Dynamic Scaling of Memory Elements

With an increasing number of on-chip transistors being devoted to storage elements in modern processors, of which only a very small set is active at any point in time, static power dissipation is expected to soon become a key contributor to a processor's power budget. State-of-the-art techniques to reduce static power consumption in on-chip memories are based on the simple observation that, in general, data or instructions fetched into a given cache line have an immediate flurry of accesses during a "small" interval of time, followed by a relatively "long" period of time during which they are not used, before eventually being evicted to make way for new data/instructions [88,89]. If one can "guess" when that period starts, it is
possible to "switch off" (i.e., VDD gate) the corresponding cache lines without introducing extra cache misses, thereby saving static energy with no impact on performance [90,91]. Cache Decay was one of the earliest attempts to exploit such a "generational" memory usage behavior to decrease leakage power [90]. The original Cache Decay implementation used a simple policy that turned off cache lines after a fixed number of cycles (the decay interval) since the last access. Note that if the selected decay interval happens to be too small, cache lines are switched off prematurely, causing extra cache misses; if it is too large, opportunities for saving leakage energy are missed. Thus, when such a simple scheme is used, it is critical to tune the "fixed" decay interval very carefully, so that it adequately matches the access patterns of the embedded application of interest. Adaptive strategies that vary the decay interval at runtime, dynamically adjusting it to changing access patterns, have been proposed more recently, so as to enable the use of the cache decay principle across a wider range of applications [90,91]. Similar leakage energy reduction techniques have also been proposed for issue queues [59,60,92] and branch prediction tables [93]. Naturally, leakage energy reduction techniques for instruction/program caches are also very critical [94]. A technique has recently been proposed that monitors the performance of the instruction cache over time, and dynamically scales (via VDD gating) its size, so as to closely match the size of the working set of the application [94].
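The decay-interval trade-off just described can be illustrated with a toy single-line simulation (the access trace and the cycle counts are made up): a short interval saves leakage-on cycles but induces extra misses, while a long one does the opposite.

```python
# Toy single-line cache decay model: the line is turned off once
# `decay_interval` cycles pass without an access.
def decay_stats(access_times, decay_interval, total_cycles):
    """Return (extra misses caused by decay, cycles the line leaks power)."""
    extra_misses, on_cycles, last = 0, 0, None
    for t in access_times:
        if last is not None:
            gap = t - last
            if gap > decay_interval:
                on_cycles += decay_interval   # line decayed partway through gap
                extra_misses += 1             # next access misses, must refetch
            else:
                on_cycles += gap              # line stayed powered the whole gap
        last = t
    on_cycles += min(decay_interval, total_cycles - last)
    return extra_misses, on_cycles

trace = [0, 10, 20, 500, 510, 2000]   # bursts separated by long idle gaps
for interval in (50, 1000):
    misses, leak = decay_stats(trace, interval, total_cycles=4000)
    print(f"decay={interval}: extra misses={misses}, leakage-on cycles={leak}")
```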
16.5.4 Software-Controlled Memories, Scratch-Pad Memories

Most of the novel designs and/or techniques discussed so far require an application-driven tuning of several architecturally visible parameters. However, similar to more "traditional" cache hierarchies, the memory subsystem interface implemented in these novel designs still exposes a flat view of the memory hierarchy to the compiler/software. That is, the underlying details of the memory subsystem architecture are essentially transparent to both. Dynamic power dissipation incurred by accesses to basic memory modules occurs owing to switching activity in bit lines, word lines, and input and output lines. Traditional caches have additional switching overheads, owing to the circuitry (comparators, multiplexers, tags, etc.) needed to provide the "flat" memory interface alluded to above. Since the hardware assists necessary to support such a transparent view of the memory hierarchy are quite power hungry, additional energy saving opportunities can be created by relying more on the compiler (and less on dedicated hardware) to manage the memory subsystem. The use of software-controlled (rather than hardware-controlled) memory components is thus becoming increasingly prevalent in power aware embedded system design. Scratch-Pads are an example of such novel, software-controlled memories [95–100]. Scratch-Pads are essentially on-chip partitions of main memory directly managed by the compiler. Namely, decisions concerning data/instruction placement in on-chip Scratch-Pads are made statically by the compiler, rather than dynamically, using dedicated hardware circuitry. Therefore, these memories are much less complex, and thus less power hungry, than traditional caches. As one would expect, the ability to aggressively improve energy-delay efficiency through the use of Scratch-Pads is predicated on the quality of the decisions made by the compiler on the subset of data/instructions that are to be assigned to that limited memory space [95,101]. Several compiler-driven techniques have been proposed to identify the data/instructions that can be assigned to the Scratch-Pad most profitably, with frequency of use being one of the key selection criteria [102–104]. The Cool Cache architecture [105], also proposed for media applications, is a good example of a novel, power aware memory subsystem that relies on the use of software-controlled memories. It uses a small Scratch-Pad and a "software-controlled cache," each of which is implemented on a different on-chip SRAM. The program's scalars are mapped to the small (2 KB) Scratch-Pad [100].7 Nonscalar data is mapped to the software-controlled cache, and the compiler is responsible for translating virtual addresses to SRAM lines, using a small register lookup area. Even though cache misses are handled in software,

7 This size was found to be sufficient for most media applications.
thereby incurring substantial latency/energy penalties, the overall architecture has been shown to yield substantial energy-delay product improvements for media applications, when compared to traditional cache hierarchies [105]. The effectiveness of techniques such as the above is so pronounced that several embedded processors currently offer a variety of software-controlled memory blocks, including configurable Scratch-Pads (TI’s 320C6x [106]), lockable caches (Intel’s XScale [49] and Trimedia [107]) and stream buffers (Intel’s StrongARM [49]).
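Selecting which objects the compiler should place in a small Scratch-Pad is, in its simplest form, a 0/1 knapsack problem: maximize the number of accesses served from the cheap memory, subject to its capacity. The profile below (object names, sizes, and access counts) is entirely hypothetical.

```python
# Hypothetical profile: (object name, size in bytes, access count).
PROFILE = [("coeff_table", 512, 90_000),
           ("frame_hdr", 128, 40_000),
           ("huff_lut", 1024, 150_000),
           ("scratch_row", 768, 20_000)]

def allocate_scratchpad(profile, capacity):
    """0/1 knapsack over reachable byte totals: maximize accesses served
    from the Scratch-Pad without exceeding its capacity."""
    best = {0: (0, ())}                       # used bytes -> (accesses, chosen)
    for name, size, accesses in profile:
        for used, (gain, chosen) in list(best.items()):
            u = used + size
            if u <= capacity and best.get(u, (0,))[0] < gain + accesses:
                best[u] = (gain + accesses, chosen + (name,))
    return max(best.values())

gain, chosen = allocate_scratchpad(PROFILE, capacity=2048)
print(f"scratch-pad holds {chosen}, serving {gain} accesses")
```

Real Scratch-Pad allocators are more elaborate (they weigh energy per access and handle code as well as data), but frequency-of-use driven selection under a capacity budget is the core of the static placement decision the text describes.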
16.5.5 Improving Access Patterns to Off-Chip Memory

During the last few decades, there has been a substantial effort in the compiler domain aimed at minimizing the number of off-chip memory accesses incurred by optimized code, as well as at enabling the implementation of aggressive prefetching strategies. This includes devising compiler techniques to restructure, reorganize, and lay out data in off-chip memory, as well as techniques to properly reorder a program's memory access patterns [108–112]. Prefetching techniques have received considerable attention lately, particularly in the domain of embedded streaming media applications. Instruction and data prefetching techniques can be hardware- or software-driven [113–120]. Hardware-based data prefetching techniques try to dynamically predict when a given piece of data will be needed, so as to load it into cache (or into some dedicated on-chip buffer) before it is actually referenced by the application (i.e., explicitly required by a demand access) [114–116]. In contrast, software-based data prefetching techniques work by inserting prefetch instructions for selected data references at carefully chosen points in the program; such explicit prefetch instructions are executed by the processor to move data into cache [117–120]. It has been extensively demonstrated that, when properly used, prefetching techniques can substantially improve average memory access latencies [113–120]. Moreover, techniques that prefetch substantial chunks of data (rather than, say, a single cache line), possibly to a dedicated buffer, can also simultaneously decrease dynamic power dissipation [121]. Namely, when data is brought from off-chip memory in large bursts, energy-efficient burst/page access modes can be exploited more effectively.
Moreover, by prefetching large quantities of instructions/data, the average length of DRAM idle times is expected to increase, thus creating more profitable opportunities for the DRAM to be switched to a lower power mode [122–124]. Naturally, it is important to ensure that the overhead associated with the prefetching mechanism itself, as well as potential increases in static energy consumption owing to additional storage requirements, do not outweigh the benefits achieved from enabling more energy-efficient off-chip accesses [124].
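The effect of burst prefetching on DRAM power-mode opportunities can be sketched with a toy energy model (all power numbers, timings, and the mode-switch threshold are invented): fetching the same data in fewer, larger bursts consolidates idle time into gaps long enough to justify dropping the DRAM into its low-power mode.

```python
# Toy DRAM energy model (numbers invented): active/idle/low-power draw in mW,
# mode-switch cost in microseconds, and a minimum gap worth powering down for.
ACTIVE_MW, IDLE_MW, LOWPOWER_MW = 300, 100, 5
MODE_SWITCH_US, MIN_GAP_US = 10, 50
PER_LINE_US, TOTAL_DATA_LINES, PERIOD_US = 1, 1024, 4096

def dram_energy(lines_per_burst):
    """Energy (mW*us, i.e., nJ) to fetch all lines in bursts of a given size."""
    bursts = TOTAL_DATA_LINES // lines_per_burst
    busy = bursts * lines_per_burst * PER_LINE_US
    gap = (PERIOD_US - busy) / bursts            # idle time between bursts
    if gap >= MIN_GAP_US:                        # long gap: enter low-power mode
        idle_energy = bursts * (MODE_SWITCH_US * IDLE_MW
                                + (gap - MODE_SWITCH_US) * LOWPOWER_MW)
    else:                                        # short gap: stay in idle mode
        idle_energy = bursts * gap * IDLE_MW
    return busy * ACTIVE_MW + idle_energy

print("1-line fetches :", dram_energy(1), "nJ")
print("64-line bursts :", dram_energy(64), "nJ")
```

The active energy is identical in both cases; the savings come entirely from the idle periods, which only become exploitable once prefetching batches the traffic.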
16.5.6 Special Purpose Memory Subsystems for Media Streaming

As alluded to before, streaming media applications have been a preferred application domain for validating the effectiveness of many novel, power aware memory designs. Although the compiler is consistently given a more prominent role in the management of these novel memory subsystems, they require no fundamental changes to the adopted programming paradigm. Additional opportunities for energy savings can be unlocked by adopting a programming paradigm that directly exposes those elements of an application that should be considered by an optimizing compiler during performance versus power trade-off exploration. The two special purpose memory subsystems discussed below do precisely that, in the context of streaming media applications. Xtream-Fit is a special purpose data memory subsystem targeted to generic uni-processor embedded system platforms executing media applications [124]. Xtream-Fit's on-chip memory consists of a Scratch-Pad, to hold constants and scalars, and a novel software-controlled streaming memory, partitioned into regions, each of which holds one of the input or output streams used/produced by the target application. The use of software-controlled memories by Xtream-Fit ensures that dynamic energy consumption is low, while the region-based organization of the streaming memory enables the implementation of very simple
and yet effective shutdown policies to turn off different memory regions as the data they hold become "dead." Xtream-Fit's programming model is actually quite simple, requiring only a minor "reprogramming" effort. It simply requires organizing/partitioning the application code into a small set of processing and data transfer tasks. Data transfer tasks prefetch streaming media data (the amount required by the next set of processing tasks) into the streaming memory. The amount of prefetched data is explicitly exposed via a single customization parameter. By varying this single customization parameter, the compiler can aggressively minimize the energy-delay product, considering both dynamic and leakage power dissipated in on-chip and off-chip memories [124]. While Xtream-Fit provides sufficient memory bandwidth for generic uni-processor embedded media architectures, it cannot support the very high bandwidth requirements of high-performance media accelerators. For example, Imagine, the multicluster media accelerator alluded to previously, uses its own specialized memory hierarchy, consisting of a streaming memory, a 128 KB stream register file, and stream buffers and register files local to each of its eight clusters. Imagine's memory subsystem delivers a very high bandwidth (2.1 GB/sec) with very high energy efficiency, yet it requires the use of a specialized programming paradigm. Namely, data transfers to/from the host are controlled by a stream controller, and those between the stream register file and the functional units by a microcontroller, both of which have to be programmed separately, using Imagine's own "stream-oriented" programming style [125]. Systems that demand still higher performance and/or energy efficiency may require memory architectures fully customized to the target application. Comprehensive methodologies for designing high-performance memory architectures for custom hardware accelerators are discussed in detail in [36,126].
16.5.7 Code Compression

Code size affects both program storage requirements and off-chip memory bandwidth requirements, and can thus have a first order impact on the overall power consumed by an embedded system. Instruction compression schemes decrease both such requirements by storing frequently fetched/executed instruction sequences in main memory (i.e., off-chip) in an encoded/compressed form [127–129]. Naturally, when one such scheme is adopted, it is important to factor in the overhead incurred by the on-chip decoding circuitry, so that it does not outweigh the gains achieved on storage and interconnect elements. Furthermore, different approaches have been considered for storing such selected instruction sequences on-chip, in either compressed or decompressed form. On-chip storage of instructions in compressed form saves on-chip storage, yet instructions must be decoded every time they are executed, adding latency/power overheads. Instruction subsetting is an alternative instruction compression scheme, in which instructions that are not commonly used are discarded from the instruction set, thus enabling the "reduced" instruction set to be encoded using fewer bits [130]. The Thumb instruction set is a classic example of a compressed instruction set, featuring the most commonly used 32-bit ARM instructions compressed to a 16-bit wide format. Thumb instructions are decompressed transparently into full 32-bit ARM instructions in real time, with no performance loss.
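A minimal sketch of the dictionary-based flavor of instruction compression described above (the "instructions" here are placeholder strings, not a real ISA): the most frequently occurring instruction words are replaced by short dictionary indices, with an escape marker for everything else.

```python
from collections import Counter

def compress(instrs, dict_size=4):
    """Replace the `dict_size` most frequent instruction words by
    dictionary indices; store everything else raw behind an escape tag."""
    dictionary = [w for w, _ in Counter(instrs).most_common(dict_size)]
    out = []
    for w in instrs:
        if w in dictionary:
            out.append(("DICT", dictionary.index(w)))   # short index
        else:
            out.append(("RAW", w))                      # escape + full word
    return dictionary, out

def decompress(dictionary, stream):
    """Inverse transform; models the on-chip decoding circuitry."""
    return [dictionary[x] if tag == "DICT" else x for tag, x in stream]

prog = ["nop", "ld r1", "add r1,r2", "nop", "nop", "st r1", "nop", "ld r1"]
d, c = compress(prog, dict_size=2)
print("dictionary:", d)
```

In hardware, the dictionary indices would be encoded in far fewer bits than a full instruction word; the decompressor corresponds to the on-chip decoding circuitry whose overhead, as noted above, must not outweigh the storage and bandwidth gains.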
16.5.8 Interconnect Optimizations

Power dissipation in on- and off-chip interconnect structures is also a significant contributor to an embedded system's power budget [131]. A shared bus is a commonly used interconnect structure, as it offers a good trade-off between generality/simplicity and performance. Power consumption on the bus can be reduced by decreasing its supply voltage, capacitance, and/or switching activity. Bus splitting, for example, reduces bus capacitance by splitting long bus lines into smaller sections, with one section relaying the data to the next [132]. Power consumption in this approach is reduced at the expense of a small penalty in latency, incurred at each relay point. Bus switching activity, and thus dynamic power dissipation, can also be substantially reduced by using an appropriate bus encoding scheme [133–138].
© 2006 by Taylor & Francis Group, LLC
Power Aware Embedded Computing
16-17
Bus invert coding [133], for example, is a simple yet widely used coding scheme. The first step of bus invert coding is to compute the Hamming distance between the current bus value and the previous bus value. If this distance is greater than half the total number of bits, then the data value is transmitted in inverted form, along with an additional invert bit that allows the receiver to correctly interpret the data. Several other encoding schemes have been proposed, achieving lower switching activity at the expense of higher encoding and decoding complexity [134–138]. With the increasing adoption of System-on-Chip design methodologies for embedded systems, devising energy-delay efficient interconnect architectures for such large-scale systems is becoming increasingly critical and is still undergoing intensive research [5].
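The encoder and decoder for the bus invert scheme just described are simple enough to sketch directly (the function names are illustrative, not from [133]):

```python
def bus_invert_encode(value, prev_bus, width=8):
    """Encode `value` for transmission using bus-invert coding.

    Returns (bus_value, invert_bit): if more than half of the bus
    lines would toggle, the inverted value is sent instead and the
    invert bit is set so the receiver can undo the inversion.
    """
    mask = (1 << width) - 1
    hamming = bin((value ^ prev_bus) & mask).count("1")
    if hamming > width // 2:
        return (~value) & mask, 1
    return value & mask, 0

def bus_invert_decode(bus_value, invert_bit, width=8):
    """Receiver side: undo the inversion when the invert bit is set."""
    mask = (1 << width) - 1
    return (~bus_value) & mask if invert_bit else bus_value

# Sending 0x00 after 0xFF would toggle all eight data lines;
# bus-invert coding transmits 0xFF with the invert bit set instead,
# so only the invert line toggles (assuming it was previously 0).
encoded, inv = bus_invert_encode(0x00, prev_bus=0xFF)
assert (encoded, inv) == (0xFF, 1)
assert bus_invert_decode(encoded, inv) == 0x00
```

The scheme thus bounds the number of data-line transitions per transfer to at most half the bus width, at the cost of one extra wire and the small inversion logic at both ends.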
16.6 Summary

Design methodologies for today's embedded systems must necessarily treat power consumption as a primary figure of merit. At the system and architecture levels of design abstraction, power-aware embedded system design requires the availability of high-fidelity power estimation and simulation frameworks. Such frameworks are essential to enable designers to explore and evaluate, in reasonable time, the complex energy-delay trade-offs realized by different candidate architectures, subsystem realizations, and power management techniques, and thus quickly identify promising solutions for the target application of interest. The detailed system- and architecture-level design phases that follow should adequately combine coarse, system-level dynamic power management strategies with fine-grained, self-monitoring techniques, exploiting voltage and frequency scaling as well as advanced dynamic resource scaling and power-driven reconfiguration techniques.
References

[1] G. Micheli, R. Ernst, and W. Wolf, Eds. Readings in Hardware/Software Co-Design. Morgan Kaufmann Publishers, Norwell, MA, 2002. [2] F. Balarin, P. Giusto, A. Jurecska, C. Passerone, E. Sentovich, B. Tabbara, M. Chiodo, H. Hsieh, L. Lavagno, A.L. Sangiovanni-Vincentelli, and K. Suzuki. Hardware–Software Co-Design of Embedded Systems: The POLIS Approach. Kluwer Academic Publishers, Dordrecht, 1997. [3] S. Borkar. Design Challenges of Technology Scaling. IEEE Micro, 19: 23–29, 1999. [4] http://public.itrs.net/ [5] M. Pedram and J.M. Rabaey. Power Aware Design Methodologies. Kluwer Academic Publishers, Dordrecht, 2002. [6] R. Gonzalez, B. Gordon, and M. Horowitz. Supply and Threshold Voltage Scaling for Low Power CMOS. IEEE Journal of Solid-State Circuits, 32(8): 1210–1216, 1997. [7] A.P. Chandrakasan, S. Sheng, and R.W. Brodersen. Low-Power CMOS Digital Design. IEEE Journal of Solid-State Circuits, 27(4): 473–484, 1992. [8] A.A. Jerraya, S. Yoo, N. Wehn, and D. Verkest, Eds. Embedded Software for SoC. Kluwer Academic Publishers, Dordrecht, 2003. [9] A.P. Chandrakasan and R.W. Brodersen. Low Power Digital CMOS Design. Kluwer Academic Publishers, Dordrecht, 1995. [10] T.L. Martin, D.P. Siewiorek, A. Smailagic, M. Bosworth, M. Ettus, and J. Warren. A Case Study of a System-Level Approach to Power-Aware Computing. ACM Transactions on Embedded Computing Systems, Special Issue on Power-Aware Embedded Computing, 2(3): 255–276, 2003. [11] A.S. Vincentelli and G. Martin. A Vision for Embedded Systems: Platform-Based Design and Software Methodology. IEEE Design and Test of Computers, 18(6): 23–33, 2001. [12] J.M. Rabaey and A.S. Vincentelli. System-on-a-Chip — A Platform Perspective. In Keynote Presentation, Korean Semiconductor Conference, 2002. Available at http://bwrc.eecs.berkeley.edu/People/Faculty/jan/presentations/platformdesign.pdf
Embedded Systems Handbook
[13] J.T. Buck, S. Ha, E.A. Lee, and D.G. Messerschmitt. Ptolemy: A Framework for Simulating and Prototyping Heterogeneous Systems. International Journal of Computer Simulation, Special Issue on Simulation Software Development, 4: 155–182, 1994. [14] V. Tiwari, S. Malik, and A. Wolfe. Power Analysis of Embedded Software: A First Step Towards Software Power Minimization. IEEE Transactions on Very Large Scale Integration Systems, 2(4): 437–445, 1994. [15] P.M. Chau and S.R. Powell. Power Dissipation of VLSI Array Processing Systems. Journal of VLSI Signal Processing, 4(2–3): 199–212, 1992. [16] J. Russell and M. Jacome. Software Power Estimation and Optimization for High-Performance 32-bit Embedded Processors. In Proceedings of the International Conference on Computer Design, 1998, pp. 328–333. [17] C. Brandolese, W. Fornaciari, F. Salice, and D. Sciuto. An Instruction-Level Functionality-Based Energy Estimation Model for 32-bits Microprocessors. In Proceedings of the Design Automation Conference, 2000, pp. 346–351. [18] G. Qu, N. Kawabe, K. Usami, and M. Potkonjak. Function-Level Power Estimation Methodology for Microprocessors. In Proceedings of the Design Automation Conference, 2000, pp. 810–813. [19] D.C. Burger and T.M. Austin. The SimpleScalar Tool Set, Version 2.0. Computer Architecture News, 25(3): 13–25, 1997. [20] D. Brooks, V. Tiwari, and M. Martonosi. Wattch: A Framework for Architectural Level Power Analysis and Optimizations. In Proceedings of the International Symposium on Computer Architecture, 2000, pp. 83–94. [21] G. Cai and C.H. Lim. Architectural Level Power/Performance Optimization and Dynamic Power Estimation. In Cool Chips Tutorial, International Symposium on Microarchitecture, 1999. [22] W. Ye, N. Vijaykrishnan, M. Kandemir, and M. J. Irwin. The Design and Use of Simplepower: A Cycle-Accurate Energy Estimation Tool. In Proceedings of the Design Automation Conference, 2000, pp. 340–345. [23] G. Jochens, L. Kruse, E. Schmidt, and W. Nebel. 
A New Parameterizable Power Macro-Model for Datapath Components. In Proceedings of the Design Automation and Test in Europe, 1999. [24] A. Bogliolo, L. Benini, and G.D. Micheli. Regression-Based RTL Power Modeling. ACM Transactions on Design Automation of Electronic Systems, 5(3): 337–372, 2000. [25] S.A. Theoharis, C.E. Goutis, G. Theodoridis, and D. Soudris. Accurate Data Path Models for RT-Level Power Estimation. In Proceedings of the International Workshop on Power and Timing Modeling, Optimization and Simulation, 1998, pp. 213–222. [26] M. Khellah and M.I. Elmasry. Effective Capacitance Macro-Modelling for Architectural-Level Power Estimation. In Proceedings of the Eighth Great Lakes Symposium on VLSI, 1998, pp. 414–419. [27] Z. Chen, K. Roy, and E.K. Chong. Estimation of Power Dissipation Using a Novel Power Macromodeling Technique. IEEE Transactions on Computer Aided Design of Integrated Circuits and Systems, 19(11): 1363–1369, 2000. [28] R. Melhem and R. Graybill, Eds. Challenges for Architectural Level Power Modeling. In Power Aware Computing. Kluwer Academic Publishers, Dordrecht, 2001. [29] J.A. Butts and G.S. Sohi. A Static Power Model for Architects. In Proceedings of the International Symposium on Microarchitecture, 2000, pp. 191–201. [30] Y. Zhang, D. Parikh, K. Sankaranarayanan, K. Skadron, and M. Stan. HotLeakage: A TemperatureAware Model of Subthreshold and Gate Leakage for Architects. Technical report, Department of Computer Science, University of Virginia, 2003. [31] D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton, C. Kozyrakis, R. Thomas, and K. Yelick. A Case for Intelligent RAM. IEEE Micro, 17(2): 33–44, 1997. [32] S.J.E. Wilton and N.M. Jouppi. CACTI: An Enhanced Cache Access and Cycle Time Model. Technical report, Digital Equipment Corporation, Western Research Lab, 1996. [33] M. Kamble and K. Ghose. Analytical Energy Dissipation Models For Low Power Caches. 
In Proceedings of the International Symposium on Low Power Electronics and Design, 1997, pp. 143–148.
[34] G. Reinman and N.M. Jouppi. CACTI 2.0: An Integrated Cache Timing and Power Model. Technical report, Compaq Computer Corporation, Western Research Lab, 2001. [35] J. Edler and M.D. Hill. Dinero IV Trace-Driven Uniprocessor Cache Simulator, 1998. http://www.cs.wisc.edu/∼markhill/DineroIV/ [36] F. Catthoor, S. Wuytack, E. DeGreef, F. Balasa, L. Nachtergaele, and A. Vandecappelle. Custom Memory Management Methodology: Exploration of Memory Organization for Embedded Multimedia System Design. Kluwer Academic Publishers, Dordrecht, 1998. [37] T. Martin and D. Siewiorek. A Power Metric for Mobile Systems. In International Symposium on Lower Power Electronics and Design, 1996, pp. 37–42. [38] M. Pedram and Q. Wu. Battery-Powered Digital CMOS Design. IEEE Transactions on Very Large Scale Integration Systems, 10: 601–607, 2002. [39] P. Rong and M. Pedram. An Analytical Model for Predicting the Remaining Battery Capacity of Lithium-Ion Batteries. In Proceedings of the Design Automation and Test in Europe, 2003, pp. 11148–11149. [40] M. Srivastava, A. Chandrakasan, and R. Brodersen. Predictive System Shutdown and other Architectural Techniques for Energy Efficient Programmable Computation. IEEE Transactions on Very Large Scale Integration Systems, 4(1): 42–55, 1996. [41] Q. Qiu and M. Pedram. Dynamic Power Management Based on Continuous-Time Markov Decision Processes. In Proceedings of the Design Automation Conference, 1999, pp. 555–561. [42] T. Simunic, L. Benini, P. Glynn, and G. De Micheli. Dynamic Power Management of Portable Systems. In Proceedings of the International Conference on Mobile Computing and Networking, 2000, pp. 11–19. [43] J. Liu, P. Chou, N. Bagherzadeh, and F. Kurdahi. A Constraint-Based Application Model and Scheduling Techniques for Power-Aware Systems. In Proceedings of the International Conference on Hardware/Software Codesign, 2001, pp. 153–158. [44] http://www.acpi.info/ [45] T. Simunic, L. Benini, A. Acquaviva, P. Glynn, and G. De Micheli. 
Dynamic Voltage Scaling for Portable Systems. In Proceedings of the Design Automation Conference, 2001, pp. 524–529. [46] R. Gonzalez and M. Horowitz. Energy Dissipation in General Purpose Microprocessors. IEEE Journal of Solid-State Circuits, 31(9): 1277–1284, 1996. [47] T.D. Burd and R.W. Brodersen. Processor Design for Portable Systems. Journal of VLSI Signal Processing, 13(2–3): 203–221, 1996. [48] T. Ishihara and H. Yasuura. Voltage Scheduling Problem for Dynamically Variable Voltage Processors. In Proceedings of the International Symposium on Low Power Electronics and Design, 1998, pp. 197–202. [49] http://www.intel.com/ [50] http://www.ibm.com/ [51] http://www.transmeta.com/ [52] M. Weiser, B. Welch, A.J. Demers, and S. Shenker. Scheduling for Reduced CPU Energy. In Proceedings of the Symposium on Operating Systems Design and Implementation, 1994, pp. 13–23. [53] K. Govil, E. Chan, and H. Wasserman. Comparing Algorithms for Dynamic Speed-Setting of a Low-Power CPU. In Proceedings of the International Conference on Mobile Computing and Networking, 1995, pp. 13–25. [54] T. Pering, T. Burd, and R. Brodersen. The Simulation and Evaluation of Dynamic Voltage Scaling Algorithms. In Proceedings of the International Symposium on Low Power Electronics and Design, 1998, pp. 76–81. [55] T. Pering, T. Burd, and R. Brodersen. Voltage Scheduling in the lpARM Microprocessor System. In Proceedings of the International Symposium on Low Power Electronics and Design, 2000, pp. 96–101. [56] K. Flautner, S. Reinhardt, and T. Mudge. Automatic Performance Setting for Dynamic Voltage Scaling. ACM Journal of Wireless Networks, 8(5): 507–520, 2002.
[57] D. Brooks and M. Martonosi. Value-Based Clock Gating and Operation Packing: Dynamic Strategies for Improving Processor Power and Performance. ACM Transactions on Computer Systems, 18(2): 89–126, 2000. [58] S. Dropsho, V. Kursun, D.H. Albonesi, S. Dwarkadas, and E.G. Friedma. Managing Static Leakage Energy in Microprocessor Functional Units. In Proceedings of the International Symposium on Microarchitecture, 2002, pp. 321–332. [59] D. Ponomarev, G. Kucuk, and K. Ghose. Reducing Power Requirements of Instruction Scheduling Through Dynamic Allocation of Multiple Datapath Resources. In Proceedings of the International Symposium on Microarchitecture, 2001, pp. 90–101. [60] A. Buyuktosunoglu, D. Albonesi, P. Bose, P. Cook, and S. Schuster. Tradeoffs in Power-Efficient Issue Queue Design. In Proceedings of the International Symposium on Low Power Electronics and Design, 2002, pp. 184–189. [61] C.J. Hughes, J. Srinivasan, and S.V. Adve. Saving Energy with Architectural and Frequency Adaptations for Multimedia Applications. In Proceedings of the International Symposium on Microarchitecture, 2001, pp. 250–261. [62] M.F. Jacome and G. de Veciana. Design Challenges for New Application Specific Processors. IEEE Design and Test of Computers, Special Issue on System Design of Embedded Systems, 17(2): 50–60, 2000. [63] R.P. Colwell, R.P. Nix, J.J.O. Donnell, D.B. Papworth, and P.K. Rodman. A VLIW Architecture for a Trace Scheduling Compiler. IEEE Transactions on Computers, 37(8): 967–979, 1988. [64] G.R. Beck, D.W.L. Yen, and T.L. Anderson. The Cydra 5: Mini-Supercomputer: Architecture and Implementation. The Journal of Supercomputing, 7(1/2): 143–180, 1993. [65] M.S. Schlansker and B.R. Rau. EPIC: An Architecture for Instruction-Level Parallel Processors. Technical report HPL-99-111, Hewlewtt Packard Laboratories, 2000. [66] W.W. Hwu, R.E. Hank, D.M. Gallagher, S.A. Mahlke, D.M. Lavery, G.E. Haab, J.C. Gyllenhaal, and D.I. August. 
Compiler Technology for Future Microprocessors. Proceedings of the IEEE, 83(12): 1625–1640, 1995. [67] J. R. Ellis. Bulldog: A Compiler for VLIW Architectures. MIT Press, Cambridge, MA, 1985. [68] J. Dehnert and R. Towle. Compiling for the Cydra-5. Journal of Supercomputing, 7(1/2): 181–227, 1993. [69] C. Dulong, R. Krishnaiyer, D. Kulkarni, D. Lavery, W. Li, J. Ng, and D. Sehr. An Overview of the Intel IA-64 Compiler. Intel Technology Journal, Q4, 1999, pp. 1–15. [70] M.F. Jacome, G. de Veciana, and V. Lapinskii. Exploring Performance Tradeoffs for Clustered VLIW ASIPs. In Proceedings of the International Conference on Computer-Aided Design, 2000, pp. 504–510. [71] V. Lapinskii, M.F. Jacome, and G. de Veciana. Application-Specific Clustered VLIW Datapaths: Early Exploration on a Parameterized Design Space. IEEE Transactions on Computer Aided Design of Integrated Circuits and Systems, 21(8): 889–903, 2002. [72] S. Pillai and M.F. Jacome. Compiler-Directed ILP Extraction for Clustered VLIW/EPIC Machines: Predication, Speculation and Modulo Scheduling. In Proceedings of the Design Automation and Test in Europe, 2003, p. 10422. [73] P. Marwedel and G. Goosens, Eds. Code Generation for Embedded Processors. Kluwer Academic Publishers, Dordrecht, 1995. [74] C. Liem. Retargetable Compilers for Embedded Core Processors. Kluwer Academic Publishers, Dordrecht, 1997. [75] J. Fritts, W. Wolf, and B. Liu. Understanding Multimedia Application Characteristics for Designing Programmable Media Processors. In SPIE Photonics West, Media Processors, 1999, pp. 2–13. [76] B. Khailany, W.J. Dally, S. Rixner, U.J. Kapasi, P. Mattson, J. Namkoong, J.D. Owens, B. Towles, and A. Chang. Imagine: Media Processing with Streams. IEEE Micro, 21: 35–46, 2001.
[77] J. Rabaey, H. De Man, J. Vanhoof, G. Goossens, and F. Catthoor. CATHEDRAL-II : A Synthesis System for Multiprocessor DSP Systems. In Silicon Compilation. Addison-Wesley, Reading, MA, 1987. [78] J. Montanaro, R.T. Witek, K. Anne, A.J. Black, E.M. Cooper, D.W. Dobberpuhl, P.M. Donahue, J. Eno, A. Farell, G.W. Hoeppner, D. Kruckemyer, T.H. Lee, P. Lin, L. Madden, D. Murray, M. Pearce, S. Santhanam, K.J. Snyder, R. Stephany, and S.C. Thierauf. A 160 MHz 32b 0.5 W CMOS RISC Microprocessor. In Proceedings of the International Solid-State Circuits Conference, Digest of Technical Papers, 31(11): 1703–1714, 1996. [79] P. Hicks, M. Walnock, and R.M. Owens. Analysis of Power Consumption in Memory Hierarchies. In Proceedings of the International Symposium on Low Power Electronics and Design, 1997, pp. 239–242. [80] K. Ghose and M.B. Kamble. Reducing Power in Superscalar Processor Caches Using Subbanking, Multiple Line Buffers and Bit-Line Segmentation. In Proceedings of the International Symposium on Low Power Electronics and Design, 1999, pp. 70–75. [81] C.-L. Su and A.M. Despain. Cache Design Trade-Offs for Power and Performance Optimization: A Case Study. In Proceedings of the International Symposium on Low Power Electronics and Design, 1995, pp. 63–68. [82] J. Kin, M. Gupta, and W.H. Mangione-Smith. Filtering Memory References to Increase Energy Efficiency. IEEE Transactions on Computers, 49(1): 1–15, 2000. [83] A.H. Farrahi, G.E. Téllez, and M. Sarrafzadeh. Memory Segmentation to Exploit Sleep Mode Operation. In Proceedings of the Design Automation Conference, 1995, pp. 36–41. [84] J. Kin, M. Gupta, and W.H. Mangione-Smith. The Filter Cache: An Energy Efficient Memory Structure. In Proceedings of the International Symposium on Microarchitecture, 1997, pp. 184–193. [85] R.S. Bajwa, M. Hiraki, H. Kojima, D.J. Gorny, K. Nitta, A. Shridhar, K. Seki, and K. Sasaki. Instruction Buffering to Reduce Power in Processors for Signal Processing. 
IEEE Transactions on Very Large Scale Integration Systems, 5(4): 417–424, 1997. [86] L. Lee, B. Moyer, and J. Arends. Instruction Fetch Energy Reduction Using Loop Caches for Embedded Applications with Small Tight Loops. In Proceedings of the International Symposium on Low Power Electronics and Design, 1999, pp. 267–269. [87] H.-H. Lee and G. Tyson. Region-Based Caching: An Energy-Delay Efficient Memory Architecture for Embedded Processors. In Proceedings of the International Conference on Compilers, Architectures and Synthesis for Embedded Systems, 2000, pp. 120–127. [88] D.A. Wood, M.D. Hill, and R.E. Kessler. A Model for Estimating Trace-Sample Miss Ratios. In Proceedings of the SIGMETRICS Conference on Measurement and Modeling of Computer Systems, 1991, pp. 79–89. [89] D.C. Burger, J.R. Goodman, and A. Kagi. The Declining Effectiveness of Dynamic Caching for General-Purpose Microprocessors. University of Wisconsin-Madison Computer Sciences Technical report 1261, 1995. [90] S. Kaxiras, Z. Hu, and M. Martonosi. Cache Decay: Exploiting Generational Behavior to Reduce Cache Leakage Power. In Proceedings of the International Symposium on Computer Architecture, 2001, pp. 240–251. [91] H. Zhou, M.C. Toburen, E. Rotenberg, and T.M. Conte. Adaptive Mode Control: A Static-Power-Efficient Cache Design. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 2001, pp. 61–72. [92] D. Folegnani and A. Gonzalez. Energy-Effective Issue Logic. In Proceedings of the International Symposium on Computer Architecture, 2001, pp. 230–239. [93] Z. Hu, P. Juang, K. Skadron, D. Clark, and M. Martonosi. Applying Decay Strategies to Branch Predictors for Leakage Energy Savings. In Proceedings of the International Conference on Computer Design, 2002, pp. 442–445.
[94] S.-H. Yang, M.D. Powell, B. Falsafi, K. Roy, and T.N. Vijaykumar. An Integrated Circuit/Architecture Approach to Reducing Leakage in Deep-Submicron High-Performance I-Caches. In Proceedings of the High-Performance Computer Architecture, 2001, pp. 147–158. [95] P.R. Panda, N.D. Dutt, and A. Nicolau. Efficient Utilization of Scratch-Pad Memory in Embedded Processor Applications. In Proceedings of the European Design and Test Conference, 1997, pp. 7–11. [96] D. Chiou, P. Jain, S. Devadas, and L. Rudolph. Application-Specific Memory Management for Embedded Systems Using Software-Controlled Caches. In Proceedings of the Design Automation Conference, 2000, pp. 416–419. [97] L. Benini, A. Macii, and M. Poncino. A Recursive Algorithm for Low-Power Memory Partitioning. In Proceedings of the International Symposium on Low Power Electronics and Design, 2000, pp. 78–83. [98] R. Banakar, S. Steinke, B.-S. Lee, M. Balakrishnan, and P. Marwedel. Scratchpad Memory: A Design Alternative for Cache On-Chip Memory in Embedded Systems. In Proceedings of the International Workshop on Hardware/Software Codesign, 2002, pp. 73–78. [99] M. Kandemir, J. Ramanujam, M. Irwin, N. Vijaykrishnan, I. Kadayif, and A. Parikh. Dynamic Management of Scratch-Pad Memory Space. In Proceedings of the Design Automation Conference, 2001, pp. 690–695. [100] O.S. Unsal, Z. Wang, I. Koren, C.M. Krishna, and C.A. Moritz. On Memory Behavior of Scalars in Embedded Multimedia Systems. In Proceedings of the Workshop on Memory Performance Issues, Goteborg, Sweden, 2001. [101] P.R. Panda, F. Catthoor, N.D. Dutt, K. Danckaert, E. Brockmeyer, C. Kulkarni, A. Vandercappelle, and P.G. Kjeldsberg. Data and Memory Optimization Techniques for Embedded Systems. ACM Transactions on Design Automation of Electronic Systems, 6(2): 149–206, 2001. [102] J. Sjödin, B. Fröderberg, and T. Lindgren. Allocation of Global Data Objects in On-Chip RAM.
In Proceedings of the Workshop on Compiler and Architectural Support for Embedded Computer Systems, Washington DC, USA, 1998. [103] T. Ishihara and H. Yasuura. A Power Reduction Technique with Object Code Merging for Application Specific Embedded Processors. In Proceedings of the Design, Automation and Test in Europe, 2000, pp. 617–623. [104] S. Steinke, L. Wehmeyer, B.-S. Lee, and P. Marwedel. Assigning Program and Data Objects to Scratchpad for Energy Reduction. In Proceedings of the Design Automation and Test in Europe, 2002, pp. 409–417. [105] O.S. Unsal, R. Ashok, I. Koren, C.M. Krishna, and C.A. Moritz. Cool Cache: A Compiler-Enabled Energy Efficient Data Caching Framework for Embedded/Multimedia Processors. ACM Transactions on Embedded Computing Systems, Special Issue on Power-Aware Embedded Computing, 2(3): 373–392, 2003. [106] http://www.ti.com/ [107] http://www.trimedia.com/ [108] M.E. Wolf and M. Lam. A Data Locality Optimizing Algorithm. In Proceedings of the Conference on Programming Language Design and Implementation, 1991, pp. 30–44. [109] S. Carr, K.S. McKinley, and C. Tseng. Compiler Optimizations for Improving Data Locality. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems, 1994, pp. 252–262. [110] S. Coleman and K.S. McKinley. Tile Size Selection Using Cache Organization and Data Layout. In Proceedings of the Conference on Programming Language Design and Implementation, 1995. [111] M.J. Wolfe. High Performance Compilers for Parallel Computing. Addison-Wesley Publishers, Reading, MA, 1995, pp. 279–290. [112] M. Kandemir, J. Ramanujam, A. Choudhary, and P. Banerjee. A Layout-Conscious Iteration Space Transformation Technique. IEEE Transactions on Computers, 50(12): 1321–1335, 2001.
[113] N.P. Jouppi. Improving Direct-Mapped Cache Performance by the Addition of a Small FullyAssociative Cache and Prefetch Buffers. In Proceedings of the International Symposium on Computer Architecture, 1990, pp. 364–373. [114] T.F. Chen and J.L. Baer. Effective Hardware-Based Data Prefetching for High Performance Processors. IEEE Transactions on Computers, 44(5): 609–623, 1995. [115] J.W.C. Fu, J.H. Patel, and B.L. Janssens. Stride Directed Prefetching in Scalar Processor. In Proceedings of the International Symposium on Microarchitecture, 1992, pp. 102–110. [116] S.S. Pinter and A. Yoaz. A Hardware-Based Data Prefetching Technique for Superscalar Processors. In Proceedings of the International Symposium on Microarchitecture, 1996, pp. 214–225. [117] D. Callahan, K. Kennedy, and A. Porterfield. Software Prefetching. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems, 1991, pp. 40–52. [118] A.C. Klaiber and H.M. Levy. An Architecture for Software Controlled Data Prefetching. In Proceedings of the International Symposium on Computer Architecture, 1991, pp. 43–53. [119] T.C. Mowry, M.S. Lam, and A. Gupta. Design and Evaluation of a Compiler Algorithm for Prefetching. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems, 1992, pp. 62–73. [120] D.F. Zucker, R.B. Lee, and M.J. Flynn. Hardware and Software Cache Prefetching Techniques for MPEG Benchmarks. IEEE Transactions on Circuits and Systems for Video Technology, 10(5): 782–796, 2000. [121] Y. Choi and T. Kim. Memory Layout Technique for Variables Utilizing Efficient DRAM Access Modes in Embedded System Design. In Proceedings of the Design Automation Conference, 2003, pp. 881–886. [122] X. Fan, C.S. Ellis, and A.R. Lebeck. Memory Controller Policies for DRAM Power Management. In Proceedings of the International Symposium on Low Power Electronics and Design, 2001, pp. 129–134. 
[123] V. Delaluz, A. Sivasubramaniam, M. Kandemir, N. Vijaykrishnan, and M.J. Irwin. SchedulerBased DRAM Energy Management. In Proceedings of the Design Automation Conference, 2002, pp. 697–702. [124] A. Ramachandran and M. Jacome. Xtream-Fit: An Energy-Delay Efficient Data Memory Subsystem for Embedded Media Processing. In Proceedings of the Design Automation Conference, 2003, pp. 137–142. [125] P. Mattson. A Programming System for the Imagine Media Processor. PhD thesis, Stanford University, 2001. [126] P. Grun, N. Dutt, and A. Nicolau. Memory Architecture Exploration for Programmable Embedded Systems. Kluwer Academic Publishers, Dordrecht, 2003. [127] A. Wolfe and A. Chanin. Executing Compressed Programs on an Embedded RISC Architecture. In Proceedings of the International Symposium on Microarchitecture, 1992, pp. 81–91. [128] C. Lefurgy, P. Bird, I-C. Cheng, and T. Mudge. Improving Code Density Using Compression Techniques. In Proceedings of the International Symposium on Microarchitecture, 1997, pp. 194–203. [129] H. Lekatsas and W. Wolf. SAMC: A Code Compression Algorithm for Embedded Processors. IEEE Transactions on Computer Aided Design of Integrated Circuits and Systems, 18(12): 1689–1701, 1999. [130] W.E. Dougherty, D.J. Pursley, and D.E. Thomas. Instruction Subsetting: Trading Power for Programmability. In Proceedings of the International Workshop on Hardware/Software Codesign, 1998. [131] D. Sylvester and K. Keutzer. A Global Wiring Paradigm for Deep Submicron Design. IEEE Transactions on Computer Aided Design of Integrated Circuits and Systems, 19(2): 242–252, 2000. [132] C.-T. Hsieh and M. Pedram. Architectural Power Optimization by Bus Splitting. In Proceedings of the Conference on Design, Automation and Test in Europe, 2000, pp. 612–616.
[133] M.R. Stan and W.P. Burleson. Bus-Invert Coding for Low-Power I/O. IEEE Transactions on Very Large Scale Integration Systems, 3(1): 49–58, 1995. [134] H. Mehta, R.M. Owens, and M.J. Irwin. Some Issues in Gray Code Addressing. In Proceedings of the Sixth Great Lakes Symposium on VLSI, 1996, pp. 178–181. [135] L. Benini, G. De Micheli, E. Macii, M. Poncino, and S. Quez. System-Level Power Optimization of Special Purpose Applications — The Beach Solution. In Proceedings of the International Symposium on Low Power Electronics and Design, 1997, pp. 24–29. [136] L. Benini, G. De Micheli, E. Macii, D. Sciuto, and C. Silvano. Address Bus Encoding Techniques for System-Level Power Optimization. In Proceedings of the Design, Automation and Test in Europe, 1998, pp. 861–867. [137] P.R. Panda and N.D. Dutt. Low-Power Memory Mapping Through Reducing Address Bus Activity. IEEE Transactions on Very Large Scale Integration Systems, 7(3): 309–320, 1999. [138] N. Chang, K. Kim, and J. Cho. Bus Encoding for Low-Power High-Performance Memory Systems. In Proceedings of the Design Automation Conference, 2000, pp. 800–805.
Security in Embedded Systems

17 Design Issues in Secure Embedded Systems
A.G. Voyiatzis, A.G. Fragopoulos, and D.N. Serpanos
17 Design Issues in Secure Embedded Systems

A.G. Voyiatzis, A.G. Fragopoulos, and D.N. Serpanos
University of Patras

17.1 Introduction ..... 17-1
17.2 Security Parameters ..... 17-2
  Abilities of Attackers • Security Implementation Levels • Implementation Technology and Operational Environment
17.3 Security Constraints in Embedded Systems Design ..... 17-4
  Energy Considerations • Processing Power Limitations • Flexibility and Availability Requirements • Cost of Implementation
17.4 Design of Secure Embedded Systems ..... 17-7
  System Design Issues • Application Design Issues
17.5 Cryptography and Embedded Systems ..... 17-10
  Physical Security • Side-Channel Cryptanalysis • Side-Channel Implementations • Fault-Based Cryptanalysis • Passive Side-Channel Cryptanalysis • Countermeasures
17.6 Conclusions ..... 17-20
References ..... 17-20
17.1 Introduction

A computing system is typically considered an embedded system when it is a programmable device with limited resources (energy, memory, computation power, etc.) that serves one or a few applications and is embedded in a larger system. Their limited resources make embedded systems ineffective as general-purpose computing systems. However, they usually have to meet hard requirements, such as time deadlines and other real-time processing requirements. Embedded systems can be classified in two general categories: (1) standalone embedded systems, where all hardware and software components of the system are physically close and incorporated into a single device, for example, a Personal Digital Assistant (PDA) or a system in a washing machine or a fax machine, and there is no attachment to a network, and (2) distributed (networked) embedded systems, where several autonomous components — each one a standalone embedded system — communicate with each other over a network in order to deliver services or support an application. Several architectural and design considerations have driven the development of distributed embedded applications, such as the placement of processing power at the physical point where an event takes place, data reduction, etc. [1].
The increasing capabilities of embedded systems, combined with their decreasing cost, have enabled their adoption in a wide range of applications and services, from financial and personalized entertainment services to automotive and military applications in the field. Importantly, in addition to the typical requirements for responsiveness, reliability, availability, robustness, and extensibility, many conventional embedded systems and applications have significant security requirements. However, security is a resource-demanding function that needs special attention in embedded computing. Furthermore, the wide deployment of small devices in critical applications has triggered the development of new, strong attacks that exploit systemic characteristics, in contrast to traditional attacks, which focused on algorithmic characteristics because attackers were unable to experiment with the physical devices used in secure applications. Thus, the design of secure embedded systems requires special attention. In this chapter we provide an overview of security issues in embedded systems. Section 17.2 presents the parameters of security systems, while Section 17.3 describes the effect of security in the resource-constrained environment of embedded systems. Section 17.4 presents the main issues in the design of secure embedded systems. Finally, Section 17.5 covers in detail attacks on, and countermeasures for, cryptographic algorithm implementations in embedded systems, considering the critical role of cryptography and the novel systemic attacks developed due to the wide availability of embedded computing systems.
17.2 Security Parameters

Security is a generic term used to indicate several different requirements in computing systems. Depending on the system and its use, several security properties may have to be satisfied in each system and each operational environment. Overall, secure systems need to meet all or a subset of the following requirements [2,3]:

1. Confidentiality. Data stored in the system or transmitted from the system have to be protected from disclosure; this is usually achieved through data encryption.
2. Integrity. A mechanism to ensure that the data received in a communication are indeed the data that were transmitted.
3. Nonrepudiation. A mechanism to ensure that all entities (systems or applications) participating in a transaction cannot deny their actions in the transaction.
4. Availability. The system's ability to perform its primary functions and serve its legitimate users without disruption, under all conditions, including malicious attacks that aim to disrupt service, such as the well-known Denial of Service (DoS) attacks.
5. Authentication. The ability of the receiver of a message to identify the message sender.
6. Access control. The ability to ensure that only legitimate users may take part in a transaction and have access to system resources. To be effective, access control is typically used in conjunction with authentication.

These requirements are placed by the different parties involved in the development and use of computing systems, for example, vendors, application providers, and users.
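Two of these requirements, integrity and authentication, can be illustrated with a short sketch (not from this chapter) using Python's standard `hmac` module; the shared key and message contents below are hypothetical:

```python
import hmac
import hashlib

def protect(key: bytes, message: bytes) -> bytes:
    """Append an HMAC-SHA256 tag so the receiver can check integrity
    and authenticate the sender (both sides must share `key`)."""
    tag = hmac.new(key, message, hashlib.sha256).digest()
    return message + tag

def verify(key: bytes, packet: bytes) -> bytes:
    """Recompute the 32-byte tag and compare in constant time;
    raise ValueError if the packet was tampered with."""
    message, tag = packet[:-32], packet[-32:]
    expected = hmac.new(key, message, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("integrity/authentication check failed")
    return message

key = b"shared-secret"                      # hypothetical pre-shared key
packet = protect(key, b"sensor reading: 21.5")
assert verify(key, packet) == b"sensor reading: 21.5"
```

Note that this provides integrity and authentication only; confidentiality would additionally require encryption, and nonrepudiation would require asymmetric signatures, since a shared key lets either party forge the other's messages.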
For example, vendors need to protect the Intellectual Property (IP) that is embedded in the system, while end users want to be certain that the system will provide secure user identification (only authorized users may access the system and its applications, even if the system falls into the hands of malicious users) and high availability, that is, that the system will be available under all circumstances; content providers, in turn, are concerned about the protection of their IP, for example, ensuring that the data delivered through an application cannot be copied. Ravi et al. [3,4] have identified the parties participating in system and application development and use, as well as their security requirements. This classification enables us to identify several possible malicious users, depending on a party's point of view; for the hardware manufacturer, for example, even a legal end user of a portable device (e.g., a PDA or a mobile phone) is a potential malicious user. Considering the security requirements and the interested parties above, the design of a secure system requires identification and definition of the following parameters: (1) the abilities of the attackers, (2) the level at which security should be implemented, and (3) the implementation technology and operational environment.
Design Issues in Secure Embedded Systems
17.2.1 Abilities of Attackers

Malicious users can be classified into several categories, depending on their knowledge, equipment, etc. Abraham et al. [5] propose a classification into three categories, based on attackers' knowledge, their hardware and software equipment, and their funds:

1. Class I — clever outsiders. Very intelligent attackers, but not well funded and without sophisticated equipment. They do not have specific knowledge of the attacked system; basically, they try to exploit hardware vulnerabilities and software glitches.
2. Class II — knowledgeable insiders. Attackers with an outstanding technical background and education, using highly sophisticated equipment and, often, possessing inside information about the system under attack; such attackers include former employees who participated in the development cycle of the system.
3. Class III — funded organizations. Attackers who mostly work in teams and have excellent technical skills and theoretical background. They are well funded, have access to very advanced tools, and also have the capability to analyze the system, technically and theoretically, in order to develop highly sophisticated attacks. Such organizations could be well-organized educational foundations, government institutions, etc.
17.2.2 Security Implementation Levels

Security can be implemented at various system levels, ranging from protection of the physical system itself to application and network security. Clearly, different mechanisms and implementation technologies have to be used to implement security at different levels. In general, four levels of security are considered: (1) physical, (2) hardware, (3) software, and (4) network and protocol security.

Physical security mechanisms aim to protect systems from unauthorized physical access to the system itself. Protecting systems physically ensures data privacy as well as data and application integrity. According to US Federal Standard 1027, physical security mechanisms are considered successful when they ensure that a possible attack has a low probability of success and a high probability of the malicious attacker being tracked, within a reasonable time. The wide adoption of embedded computing systems in a variety of devices, such as smartcards, mobile devices, and sensor networks, as well as the ability to network them, for example, through the Internet or VPNs, has led to a revision and reconsideration of physical security. Weingart [6] surveys possible attacks and countermeasures concerning physical security, concluding that physical security mechanisms need continuous improvement and revision in order to remain at the leading edge.

Hardware security may be considered a subset of physical security, referring to security issues concerning the hardware parts of a computer system. Hardware-level attacks exploit circuit and technological vulnerabilities and take advantage of possible hardware defects. These attacks do not necessarily require very sophisticated and expensive equipment.
Anderson and Kuhn [7] describe several ways to attack smartcards and microcontrollers, through the use of unusual voltages and temperatures that affect the behavior of specific hardware parts, or through microprobing of a smartcard chip, such as the Subscriber Identity Module (SIM) chip found in cellular phones. Reverse engineering attack techniques are equally successful, as Blythe et al. [8] reported for a wide range of microprocessors. Their work concluded that special hardware protection mechanisms are necessary to avoid these types of attacks; such mechanisms include silicon coatings on the chip, increased complexity in the chip layout, etc.

One of the major goals in the design of secure systems is the development of secure software, free of flaws and security vulnerabilities that may appear under certain conditions. Numerous software security flaws have been identified in real systems, for example, by Landwehr et al. [9], and there have been several cases where malicious intruders hacked into systems by exploiting software defects [10]. Some methods for the prevention of such problems have been proposed by Tevis and Hamilton [11].

The use of the Internet, an unsafe medium for information transfer, as a backbone network for communicating entities, together with the wide deployment of wireless networks, demonstrates that improvements have to be made to existing protocol architectures in order to provide new, secure protocols [12,13]. Such protocols will ensure authentication between communicating entities, integrity of
communicated data, protection of the communicating parties, and nonrepudiation (the inability of an entity to deny its participation in a communication transaction). Furthermore, special attention has to be paid to the design of secure protocols for embedded systems, due to their physical constraints, that is, limited battery power and limited processing and memory resources, as well as their cost and communication requirements.
17.2.3 Implementation Technology and Operational Environment

In regard to implementation technology, systems can be classified as static versus programmable technology and as fixed versus extensible architecture. When static technology is used, the hardware-implemented functions are fixed and inflexible, but they offer higher performance and can reduce cost. However, static systems can be more vulnerable to attacks because, once a flaw is identified — for example, in the design of the system — it is impossible to patch already deployed systems, especially in the case of large installations, such as SIM cards for cellular telephony or pay-per-view TV. A static system must be implemented only once and correctly, which is an unattainable expectation in computing. Programmable systems, in contrast, are not limited in the way static ones are, but they can prove flexible in the hands of an attacker as well: system flexibility may allow an attacker to manipulate the system in ways not expected or defined by the designer. Programmability is typically achieved through the use of specialized software over a general-purpose processor or hardware.

Fixed architectures are composed of specific hardware components that cannot be altered. It is typically almost impossible to add functionality at later stages, but fixed architectures have a lower implementation cost and are, in general, less vulnerable because they offer limited choices to attackers. An extensible architecture, like a general-purpose processor, is capable of interfacing with several peripherals through standardized connections. Peripherals can be changed or upgraded easily to increase security or to provide new functionality. However, an attacker can connect malicious peripherals or interface with the system in untested or unexpected ways. As testing is more difficult than for static systems, one cannot be confident that the system operates correctly under every possible input.
Field Programmable Gate Arrays (FPGAs) combine benefits of all these types of systems and architectures, since they combine the performance of hardware implementation with programmability, enabling system reconfiguration. They are widely used to implement cryptographic primitives in various systems; thus, significant attention has to be paid to the security of FPGAs as independent systems. Research efforts addressing this issue develop systematic approaches and identify open problems in FPGA security; for example, Wollinger et al. [14] provide such an approach and address several open problems, including resistance to physical attacks.
17.3 Security Constraints in Embedded Systems Design

The design of secure systems requires special consideration, because security functions are resource demanding, especially in terms of processing power and energy consumption. The limited resources of embedded systems require novel design approaches in order to deal with trade-offs between efficiency — speed and cost — and effectiveness — satisfaction of the functional and operational requirements.
17.3.1 Energy Considerations

Embedded systems are often battery powered, that is, they are power constrained. Battery capacity constitutes a major bottleneck for security processing on embedded systems. Unfortunately, improvements in battery capacity do not keep pace with the increasing performance, complexity, and functionality of the systems batteries power. Gunther et al. [15], Buchmann [16], and Lahiri et al. [17] report a widening "battery gap," due to the exponential growth of power requirements and the merely linear growth in energy density. Thus, the power subsystem of embedded systems is a weak point of system security. A malicious
attacker, for example, may mount a DoS attack by draining the system's battery more quickly than usual. Martin et al. [18] describe three ways in which such an attack may take place: (1) service request power attacks, (2) benign power attacks, and (3) malignant power attacks. In a service request attack, a malicious user repeatedly requests that the device serve a power-hungry application, even if the application is not supported by the device. In a benign power attack, the legitimate user is forced to execute an application with high power requirements, while in a malignant power attack malicious users modify the executable code of an existing application in order to drain as much battery power as possible without changing the application's functionality. Martin et al. conclude that such attacks may reduce battery life by one to two orders of magnitude.

Inclusion of security functions in an embedded system places extra requirements on power consumption due to: (1) the extra processing power necessary to perform various security functions, such as authentication, encryption, decryption, signing, and data verification, (2) the transmission of security-related data between various entities, if the system is distributed, for example, a wireless sensor network, and (3) the energy required to store security-related parameters.

Embedded systems are often used to deploy performance-critical functions, which require a lot of processing power. Inclusion of the cryptographic algorithms used as building blocks in secure embedded design may lead to heavy consumption of the system battery. The energy consumption of the cryptographic algorithms used in security protocols has been analyzed thoroughly, for example, by Potlapally et al. [19]. They present a general framework showing that asymmetric algorithms have the highest energy cost, symmetric algorithms are the next most power-hungry category, and hash algorithms are at the bottom.
The power required by cryptographic algorithms is significant, as measurements indicate [20]. Importantly, in many applications the power consumed by security functions is larger than that used by the applications themselves. For example, Raghunathan et al. [21] present the battery gap for a sensor node with an embedded processor, calculating the number of transactions that the node can serve, working in secure or insecure mode, until the system battery runs out. Their results show that working in secure mode drains the battery in less than half the time compared with insecure mode.

Many applications that involve embedded systems are implemented on distributed, networked platforms, resulting in a power overhead due to communication between the various nodes of the system [1]. Considering a wireless sensor network, a typical distributed embedded system, one can easily see that significant energy is consumed in communication between nodes. Factors such as modulation type, data rate, transmit power, and security overhead affect power consumption significantly [22]. Savvides et al. [23] showed that radio communication between nodes consumes most of the power, that is, 50 to 60% of the total, when using the WINS (Wireless Integrated Network Sensor) platform [24]. Furthermore, in a wireless sensor network the security functions consume energy through extra internode exchange of cryptographic information (key exchange, authentication information) and through per-message security overhead, which is a function of both the number and the size of messages [20].

It is important to identify the energy consumption of alternative security mechanisms. Hodjat and Verbauwhede [25], for example, have measured the energy consumption of two widely used protocols for key exchange between entities in a distributed environment: (1) the Diffie–Hellman protocol [26] and (2) the basic Kerberos protocol [26].
Their results show that Diffie–Hellman, implemented using elliptic curve public key cryptography, consumes 1213.7 mJ, 4296 mJ, and 9378.3 mJ for 128-bit, 192-bit, and 256-bit keys, respectively, while the Kerberos key exchange protocol, using symmetric cryptography, consumes 139.62 mJ; the Kerberos configuration thus consumes significantly less energy.
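The scale of that difference can be made concrete with a back-of-the-envelope comparison; the energy figures below come from the measurements cited above, while the script itself is only illustrative:

```python
# Energy per key exchange, in millijoules, as reported by
# Hodjat and Verbauwhede [25] for the two protocols compared above.
diffie_hellman_mJ = {128: 1213.7, 192: 4296.0, 256: 9378.3}
kerberos_mJ = 139.62  # symmetric-key Kerberos exchange

# Ratio of asymmetric to symmetric cost at each key size.
for bits, dh in diffie_hellman_mJ.items():
    ratio = dh / kerberos_mJ
    print(f"{bits}-bit ECC Diffie-Hellman costs {ratio:.1f}x "
          f"a Kerberos exchange")
```

The ratio ranges from roughly 9x at 128-bit keys to roughly 67x at 256-bit keys, which illustrates why protocol choice matters so much on a battery-powered node.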
17.3.2 Processing Power Limitations

Security processing places significant additional requirements on the processing power of embedded systems, since conventional architectures are quite limited. The term security processing denotes the portion of the system's computational effort that is dedicated to implementing the security requirements. Since embedded systems have limited processing power, they cannot cope efficiently with
the execution of the complex cryptographic algorithms used in the secure design of an embedded system. For example, generating a 512-bit key for the RSA public key algorithm requires 3.4 min on the PalmIIIx PDA, while encryption using DES takes 4.9 msec per block, corresponding to an encryption rate of only 13 Kbps [27]. The adoption of modern embedded systems in high-end systems (servers, firewalls, and routers), with increasing data transmission rates and complex security protocols such as SSL, makes the security processing gap wider and demonstrates that existing embedded architectures need to be improved in order to keep up with the increasing computational requirements placed by security processing. The width of the processing gap has been exposed by measurements, such as those of Ravi et al. [4], who measured the security processing gap in the client-server model using the SSL protocol for various embedded microprocessors. Specifically, a StrongARM (206 MHz SA-1110) processor, as might be used in a low-end system such as a PDA or a mobile device, can achieve data rates of up to 1.8 Mbps with 100% of its processing power dedicated to SSL processing, while a 2.8 GHz Xeon achieves data rates of up to 29 Mbps. Considering that the data rates of low-end systems range between 128 Kbps and 2 Mbps, while those of high-end systems range between 2 and 100 Mbps, it is clear that these processors cannot reach the upper end of their respective ranges, leaving a security processing gap.
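The gap follows directly from the cited figures: if a processor sustains a given SSL throughput at 100% utilization, the fraction of the CPU that a target data rate would demand is a simple ratio. The calculation below is purely illustrative, using only the rates quoted above:

```python
def cpu_fraction(target_kbps: float, max_secure_kbps: float) -> float:
    """Fraction of the processor needed to sustain `target_kbps` of
    SSL traffic, given the measured maximum secure throughput."""
    return target_kbps / max_secure_kbps

# Cited maximum for the StrongARM SA-1110: ~1.8 Mbps of SSL processing.
strongarm_max_kbps = 1800.0

for target in (128, 512, 2000):   # sample low-end target rates, in kbps
    frac = cpu_fraction(target, strongarm_max_kbps)
    print(f"{target} kbps of SSL traffic needs {frac:.0%} of the StrongARM")
```

A 2 Mbps target, the top of the low-end range, exceeds 100% of the processor, which is precisely the "processing gap": there is simply no CPU left over for the application itself.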
17.3.3 Flexibility and Availability Requirements

The design and implementation of security in an embedded system does not mean that the system's operational security characteristics will not change over time. Considering that security requirements evolve and security protocols are continuously strengthened, embedded systems need to be flexible and adaptable to changes in security requirements, without losing their performance and availability goals or their primary security objectives. Modern embedded systems are characterized by their ability to operate in different environments, under various conditions. Such an embedded system must be able to achieve different security objectives in every environment; thus, the system must offer significant flexibility and efficient adaptation. For example, consider a PDA with mobile telecommunication capabilities that may operate in a wireless environment [28–30] or provide 3G cellular services [31]; different security objectives must be satisfied in each case. Another issue that must be addressed is the implementation of different security requirements at different layers of the protocol architecture. Consider, for example, a mobile PDA that must be able to execute several security protocols, such as IPSec [13], SSL [12], and WEP [32], depending on the specific application. Importantly, availability is a significant requirement that needs special support, considering that it must be provided in a world of evolving functionality and increasing system complexity. Conventional embedded systems should aim to provide high availability not only in their expected, attack-free environment but in an emerging hostile environment as well.
17.3.4 Cost of Implementation

Including security in embedded system design can increase system cost dramatically. The problem originates from the strong resource limitations of embedded systems: the system is required to exhibit high performance as well as a high level of security while retaining a low implementation cost. It is necessary to perform a careful, in-depth analysis of the designed system, in terms of the abilities of possible adversaries, the environmental conditions under which the system will operate, etc., in order to estimate cost realistically. Consider, for example, the incorporation of a tamper-resistant cryptographic module in an embedded system. As described by Ravi et al. [4], according to the Federal Information Processing Standard [33], a designer can distinguish four levels of security requirements for cryptographic modules. The choice of security level influences design and implementation cost significantly; thus, the manufacturer faces a trade-off between the security requirements to be implemented and the cost of manufacturing.
17.4 Design of Secure Embedded Systems

Secure embedded systems must provide basic security properties, such as data integrity, as well as mechanisms and support for more complex security functions, such as authentication and confidentiality. Furthermore, they have to support the security requirements of applications, which are implemented, in turn, using the security mechanisms offered by the system. In this section, we describe the main design issues at both the system and the application level.
17.4.1 System Design Issues

The design of secure embedded systems needs to address several issues and parameters, ranging from the employed hardware technology to software development methodologies. Although several techniques used in general-purpose systems can be employed effectively in embedded system development as well, there are specific design issues that need to be addressed separately, because they are unique to, or weaker in, embedded systems, due to the high volume of available low-cost systems that malicious users can use to develop attacks. The major design issues are tamper resistance, memory protection, IP protection, management of processing power, communication security, and embedded software design. These issues are covered in the following paragraphs.

Modern secure embedded systems must be able to operate under various environmental conditions, without loss of performance or deviation from their primary goals. In many cases they must survive various physical attacks, and so must have tamper-resistance mechanisms. Tamper resistance is the property that enables systems to prevent the distortion of their physical parts. In addition to tamper-resistance mechanisms, there exist tamper-evidence mechanisms, which allow users or technical staff to identify tampering attacks and take countermeasures. Computer systems are vulnerable to tampering attacks, in which malicious users intervene in hardware system parts and compromise them in order to take advantage of them. The security of many critical systems relies on the tamper resistance of smartcards and other embedded processors. Anderson and Kuhn [7] describe various techniques and methods for attacking tamper-resistant systems, concluding that tamper-resistance mechanisms need to be extended or reevaluated.

Memory technology may be an additional weakness in system implementation. Typical embedded systems have ROM, RAM, and EEPROM memory to store data.
EEPROM memory constitutes the vulnerable spot of such systems, because it can be erased by malicious users with the use of appropriate electrical signaling [7].

Intellectual Property (IP) protection for manufacturers is an important issue addressed in secure embedded systems. Complicated systems tend to be partitioned into smaller independent modules, leading to module reusability and cost reduction. These modules embody the IP of their manufacturers, which needs to be protected from third parties who might claim and use them. The illegal user of an IP block does not necessarily need full, detailed knowledge of the IP component, since IP blocks are independent modules that can very easily be incorporated and integrated with the rest of the system components. Lach et al. [34] propose a fingerprinting technique for IP blocks implemented on FPGAs, in which a unique marker embedded in the IP hardware identifies both the origin and the recipient of the IP block. They also state that the removal of such a mark is extremely difficult, with a probability of success of less than one in a million.

Implementation of security techniques for tamper resistance, tamper prevention, and IP protection may require additional processing power, which is limited in embedded systems. The "processing gap" between the computational requirements of security and the available processing power of embedded processors requires special consideration. A variety of architectures and enhancements in security protocols have been proposed in order to bridge that gap. Burke et al. [35] propose enhancements to the Instruction Set Architecture (ISA) of embedded processors, in order to calculate efficiently various cryptographic primitives, such as permutations, bit rotations, fast substitutions, and modular arithmetic. Another approach is to build dedicated cryptographic embedded coprocessors with their own ISA; the Cryptomaniac coprocessor [36] is an example of this approach.
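Bit rotation, one of the primitives such ISA extensions accelerate, shows why they help: it is a single instruction on an enhanced ISA but must be emulated with shifts and masks on processors that lack it. A minimal sketch of the software emulation (illustrative, not taken from the cited work):

```python
def rotl32(x: int, r: int) -> int:
    """Rotate a 32-bit word left by r bits.

    One instruction on a crypto-enhanced ISA; several shift, OR, and
    mask operations when emulated in software, as here."""
    r %= 32
    return ((x << r) | (x >> (32 - r) if r else 0)) & 0xFFFFFFFF

# Rotations like this dominate the round functions of many ciphers,
# which is why a hardware rotate instruction pays off so directly.
assert rotl32(0x80000000, 1) == 0x00000001   # high bit wraps around
assert rotl32(0x12345678, 8) == 0x34567812
```

Inner loops of symmetric ciphers execute such rotations millions of times per second, so collapsing the shift/mask sequence into one instruction translates directly into throughput and energy savings.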
Several vendors, for example,
Infineon [37] and ARM [38], have manufactured microcontrollers with embedded coprocessors dedicated to cryptographic functions. Intel [39] announced a new generation of 64-bit embedded processors with features that can speed up processing-hungry algorithms, such as cryptographic ones; these features include larger register sets, parallel execution of computations, improvements in large-integer multiplication, etc. A third approach exploits software optimizations. Potlapally et al. [40] have conducted extensive research on the improvement of public-key algorithms, studying various algorithmic optimizations and identifying an algorithm design space in which performance improves significantly. Also, SmartMIPS [41] provides system flexibility and adaptation to changes in security requirements through high-performance software-based enhancements of its cryptographic modules, while supporting various cryptographic algorithms.

Even if the "processing gap" is bridged and security functions are provided, embedded systems are required to support secure communications as well, considering that embedded applications are often implemented in a distributed environment where communicating systems may exchange (possibly) sensitive data over an untrusted network (wired, wireless, or mobile), such as the Internet, a Virtual Private Network, or the public telephone network. In order to fulfill the basic security requirements for secure communications, embedded systems must be able to use strong cryptographic algorithms and to support various protocols. One of the fundamental requirements regarding secure protocols is interoperability, leading to the requirement for system flexibility and adaptability. Since an embedded system can operate in several environments, for example, a mobile phone may provide 3G cellular services or connect to a wireless LAN, it is necessary for the system to operate securely in all environments without loss of performance.
Furthermore, as security protocols are developed for various layers of the OSI reference model, embedded systems must be adaptable to different security requirements at each layer of the architecture. Finally, the continuous evolution of security protocols requires system flexibility, as new standards are developed, requirements are reevaluated, and new cryptographic techniques are added to the overall architecture. A comprehensive presentation of the evolution of security protocols in wireless communications, such as WTLS [42], MET [43], and IPSec [13], is provided by Raghunathan et al. [21]. An important consideration in the development of flexible secure communication subsystems for embedded systems is the limitation of energy, processing, and memory resources. The performance/cost trade-off requires careful decisions about placing protocol functions in hardware (for high performance) or software (for cost reduction).

Embedded software, such as the operating system or application-specific code, constitutes a crucial factor in secure embedded system design. Kocher and co-workers [3] identify three basic factors that make embedded software development a challenging area of security: (1) complexity of the system, (2) system extensibility, and (3) connectivity. Embedded systems serve critical, complex, and hard-to-implement applications with many parameters that need to be considered, which, in turn, leads to "buggy" and vulnerable software. Furthermore, the required extensibility of conventional embedded systems makes the exploitation of vulnerabilities relatively easy. Finally, as modern embedded systems are designed with network connectivity, the higher the connectivity degree of the system, the higher the risk that a software breach will spread as time goes by.
Many attacks exploit software glitches and lead to system unavailability, which can have a disastrous impact, for example, in the case of a DoS attack on a military embedded system. Landwehr et al. [9] present a survey of common software security faults, helping designers learn from past mistakes. Tevis and Hamilton [11] propose methods to detect and prevent software vulnerabilities, focusing on weaknesses that must be avoided in order to prevent buffer overflow attacks, heap overflow attacks, array indexing attacks, etc.; they also provide security analysis programs that help designers assess the security of their software. Buffer overflow attacks constitute the most widely used type of attack leading to unavailability of the attacked system; with these attacks, malicious users exploit system vulnerabilities to execute malicious code, which can cause several problems, such as a system crash (preventing legitimate users from using the system), loss of sensitive data, etc. Shao et al. [44] propose a technique, called the Hardware–Software Defender, which aims to protect an embedded system from buffer overflow attacks; their proposal is to design a secure instruction set, extending the instruction set of existing microprocessors, and to require outside
software developers to call secure functions from that set. The limited memory resources of embedded systems, specifically the lack of disk space and virtual memory, make a system vulnerable to memory-hungry applications: an application that requires an excessive amount of memory has no swap file to grow into and can very easily cause an out-of-memory unavailability of the system. Given the significance of this potential problem and attack, Biswas et al. [45] propose mechanisms to protect an embedded system from such memory overflows, thus preserving system reliability and availability: (1) the use of software runtime checks to detect possible out-of-memory conditions, (2) allowing out-of-memory data segments to be placed in free system space, and (3) compressing already used and unnecessary data.
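The first of these mechanisms, software runtime checks, can be sketched as a simple allocation guard that refuses requests before the system runs out of memory; the guard interface and the 64 KB budget below are hypothetical, not from the cited work:

```python
class MemoryBudget:
    """Guard allocations against a fixed budget, mimicking the kind of
    software runtime check proposed for systems with no swap space."""

    def __init__(self, budget_bytes: int):
        self.budget = budget_bytes
        self.used = 0

    def allocate(self, nbytes: int) -> bytearray:
        if self.used + nbytes > self.budget:
            # The out-of-memory condition is caught *before* the system
            # becomes unavailable; the caller can degrade gracefully.
            raise MemoryError(
                f"budget exceeded: {self.used + nbytes} > {self.budget}")
        self.used += nbytes
        return bytearray(nbytes)

heap = MemoryBudget(budget_bytes=64 * 1024)   # hypothetical 64 KB device
buf = heap.allocate(16 * 1024)                # succeeds
try:
    heap.allocate(60 * 1024)                  # would exceed the budget
except MemoryError as err:
    print("allocation refused:", err)
```

The point of the check is that the failure mode becomes a recoverable error inside the application rather than a system-wide unavailability.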
17.4.2 Application Design Issues

Embedded system applications present significant challenges to designers seeking efficient and secure systems. A key issue in secure embedded design is user identification and access control. User identification includes the mechanisms that guarantee that only legitimate users have access to system resources and that can verify, whenever requested, the identity of the user accessing the system. The explosive growth of mobile devices and their use in critical, sensitive transactions, such as banking and e-commerce, demand secure systems with high performance and low cost. This demand has become urgent and crucial considering the successful attacks on these systems, such as the recent hardware hacking attacks on PIN (Personal Identification Number)-based bank ATMs (Automatic Teller Machines), which have led to significant financial losses and have reduced people's trust in financial organizations. A solution to this problem may come from an emerging technology for user identification based on biometric recognition, covering both user identification and verification. Biometric systems apply pattern recognition to biological data acquired from a user who wants to gain access to a system, such as palm prints [46], fingerprints [47], or iris scans, comparing them with stored data identifying the legitimate users of the system [48]. Moon et al. [49] propose a secure smartcard that uses biometric capabilities, claiming that such a system is less vulnerable to attacks than software-based solutions and that the combination of smartcard and fingerprint recognition is much more robust than PIN-based identification. Implementation of such systems is realistic, as Tang et al.
[50] demonstrated with the implementation of a fingerprint recognition system with high reliability and high speed; they achieved an average computational time per fingerprint image of less than 1 sec, using a fixed-point arithmetic StrongARM 200 MHz embedded processor. As mentioned previously, an embedded system must store information that enables it to identify and validate the users that have access to it. But how does an embedded system store this information? Embedded systems use several types of memory to store different types of data: (1) ROM/EPROM to store programming data used to serve generic applications, (2) RAM to store temporary data, and (3) EEPROM and Flash memories to store mobile downloadable code [20]. In an embedded device such as a PDA or a mobile phone, several pieces of sensitive information, such as PINs, credit card numbers, personal data, keys, and certificates for authorization purposes, may be permanently stored in secondary storage media. The requirement to protect this information, together with the rapid growth of the communication capabilities of embedded devices (for example, mobile Internet access), which makes embedded systems vulnerable to network attacks as well, leads to increasing demand for secure storage. The use of strong cryptographic algorithms to ensure data integrity and confidentiality is not feasible in most embedded systems, mainly due to their limited computational resources. Benini et al. [51] present a survey of architectures and techniques used to implement memory for embedded systems, taking their energy limitations into consideration. Rosenthal [52] presents an effective way to ensure that data cannot be erased or destroyed by "hiding" memory from the processor through the use of a serial EEPROM, which differs from standard EEPROM only in that a serial link binds the memory to the processor, which reads and writes data using a strict protocol.
Actel [53] describes security issues and design considerations for the implementation of embedded memories using FPGAs, claiming
© 2006 by Taylor & Francis Group, LLC
17-10
Embedded Systems Handbook
that SRAM FPGAs are vulnerable to Level I [5] attacks, and that it is preferable to use nonvolatile Flash and antifuse-based FPGA memories, which provide higher levels of security relative to SRAM FPGAs. Another key issue in secure embedded systems design is to ensure that any digital content stored in or downloaded to the embedded system is used according to the terms and conditions the content provider has set and in accordance with the agreements between user and provider; such content includes software for a specific application or a hardware component embedded in the system by a third-party vendor. It is essential that conventional embedded devices, mobile or not, be enhanced with Digital Rights Management (DRM) mechanisms, in order to protect the digital IP of manufacturers and vendors. Trusted computing platforms constitute one approach to this problem. Such platforms are significant in general, as indicated by the Trusted Computing Platform Alliance (TCPA) [54], which aims to standardize the methods for building trusted platforms. For embedded systems, IP protection can be implemented in various ways. A method to produce a trusted computing platform based on a trusted, secure hardware component, called a spy, can lead to systems executing one or more applications securely [55,56]. Ways to transform a 3G mobile device into a trusted one capable of protecting content have been investigated by Messerges and Dabbish [57], who analyze the various components of a trusted system; for example, the operating system of the embedded device is enhanced with a DRM hardware–software security component, which transforms the system into a trusted one. Alternatively, Thekkath et al. [58] propose a method to prevent unauthorized reading, modification, and copying of proprietary software code, using the eXecute Only Memory (XOM) system, which permits only code execution.
The concept is that code stored in a device can be marked as "execute only" and content-sensitive applications can be stored in independent compartments [59]. If an application tries to access data outside its compartment, it is stopped. Significant attention must be paid to protection against possible attacks through malicious downloadable software, such as viruses, Trojans, logic bombs, etc. [60]. The wide deployment of distributed embedded systems and the Internet has created the need for portable embedded systems, for example, mobile phones and PDAs, to download and execute various software applications. This ability may be new to the world of portable, highly constrained embedded systems, but it is not new in the world of general-purpose systems, which have long been able to download and execute Java applets and executable files from the Internet or from other network resources. One major problem with this service is that users cannot be sure about the content of the software that is downloaded and executed on their systems, who its creator is, or what its origin is. Kingpin and Mudge [61] provide a comprehensive presentation of security issues in personal digital assistants, analyzing in detail what malicious software is (viruses, Trojans, backdoors, etc.), where it resides, and how it spreads, giving future users of such devices a deeper understanding of the extra security risks that arise with the use of mobile downloadable code. An additional important consideration is the robustness of the downloadable code: once the mobile code is considered secure, downloaded, and executed, it must not affect preinstalled system software. Various techniques have been proposed to protect remote hosts from malicious mobile code.
The sandbox technique, proposed by Rubin and Geer [62], is based on the idea that the mobile code cannot execute system functions, that is, it cannot affect the file system or open network connections. Instead of disabling mobile code from execution, one can empower it using enhanced security policies as Venkatakrishnan et al. [63] propose. Necula [64] suggests the use of proof-carrying code. The producer of the mobile code, a possibly untrusted source, must embed some type of proof that can be tested by the remote host in order to prove the validity of the mobile code.
17.5 Cryptography and Embedded Systems

Secure embedded systems should support the basic security functions of (1) confidentiality, (2) integrity, and (3) authentication. Cryptography provides mechanisms that ensure these three requirements are met. However, implementing cryptography in embedded systems can be a challenging task. The requirement of high performance has to be achieved in a resource-limited environment; this
Design Issues in Secure Embedded Systems
task is even more challenging when low-power constraints exist. Performance usually dictates an increased cost, which is not always desirable or possible. Cryptography can protect digital assets provided that the secret keys of the algorithms are stored and accessed in a secure manner. For this reason, the use of specialized hardware devices to store the secret keys and to implement cryptographic algorithms is preferred over the use of general-purpose computers. However, this also increases the implementation cost and reduces flexibility. On the other hand, flexibility is required, because modern cryptographic protocols do not rely on a specific cryptographic algorithm but rather allow the use of a wide range of algorithms for increased security and adaptability to advances in cryptanalysis. For example, both the SSL and IPSec network protocols support numerous cryptographic algorithms for the same function, for example, encryption. The protocol enables negotiation of the algorithms to be used, in order to ensure that both parties use the level of protection dictated by their security policies. Apart from the performance issue, a correct cryptographic implementation requires expertise that is not always available or affordable during the lifecycle of a system. Insecure implementations of theoretically secure algorithms have made headline news quite often in the past. An excellent survey of cryptography implementation faults is provided in [65], while Anderson [66] focuses on the causes of cryptographic system failures in banking applications. A common misunderstanding concerns the use of random numbers. Pure Linear Feedback Shift Registers (LFSRs) and other pseudorandom number generators produce random-looking sequences that may be sufficient for scientific experiments but can be disastrous for cryptographic algorithms that require truly unpredictable random input.
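The failure mode is easy to demonstrate. The following sketch (tap positions and names are illustrative, not taken from any cited design) implements a small Fibonacci LFSR; its output looks random, but because the generator is linear, a short observed window of output reveals the entire internal state and, with it, every future bit.

```python
# A 16-bit right-shifting Fibonacci LFSR. The stream "looks" random, but
# observing 16 output bits reveals the full internal state, after which
# the rest of the stream is perfectly predictable: the unpredictability
# failure described in the text.

def lfsr_bits(state, n, taps=(0, 2, 3, 5)):
    """Return n output bits (low bit first) of a 16-bit LFSR."""
    out = []
    for _ in range(n):
        out.append(state & 1)
        feedback = 0
        for t in taps:
            feedback ^= (state >> t) & 1
        state = (state >> 1) | (feedback << 15)
    return out

seed = 0xACE1                      # the "secret" internal state
stream = lfsr_bits(seed, 64)

# For this construction, the first 16 output bits are literally the bits
# of the initial state, so an observer recovers the state directly...
recovered = sum(bit << i for i, bit in enumerate(stream[:16]))
assert recovered == seed
# ...and can then reproduce the remainder of the stream exactly.
assert lfsr_bits(recovered, 64) == stream
```

For keys and nonces, a cryptographically secure source must be used instead, for example a hardware noise source on an embedded platform, or Python's `secrets` module in a hosted environment.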
On the other hand, the cryptographic community has focused on proving the theoretical security of various cryptographic algorithms and has paid little attention to actual implementations on specific hardware platforms. In fact, many algorithms are designed with portability in mind, and efficient implementation on a specific platform meeting specific requirements can be quite tricky. This communication gap between vendors and cryptographers intensifies in the case of embedded systems, which can have many design choices and constraints that are not easily comprehensible. In the late 1990s, Side-Channel Attacks (SCAs) were introduced. SCAs are a class of cryptanalysis techniques that focus on the implementation characteristics of a cryptographic algorithm in order to derive its secret keys. This advancement bridged the gap between embedded systems, a common target of such attacks, and cryptographers. Vendors became aware of and concerned about this new form of attack, while cryptographers focused on the specifics of implementations, in order to advance their cryptanalysis techniques. In this section, we present side-channel cryptanalysis. First, we introduce the concept of tamper resistance and show how side channels arise and leak information from otherwise secure devices; then, we demonstrate how this information can be exploited to recover the secret keys of a cryptographic algorithm, presenting case studies of attacks on the RSA algorithm.
17.5.1 Physical Security

Secrecy is always a desirable property. In the case of cryptographic algorithms, the secret keys of the algorithm must be stored, accessed, used, and destroyed in a secure manner, in order to provide the required security functions. This requirement is often overlooked, and design or implementation flaws result in insecure cryptographic implementations. It is well known that general-purpose computing systems and operating systems cannot provide sufficient protection mechanisms for cryptographic keys. For example, SSL certificates for web servers are stored unprotected on servers' disks and rely on file system permissions for protection. This is necessary because web servers must offer secure services unattended; the alternative, a human providing the password to access the certificate for each connection, would be impractical in the era of e-commerce, where thousands of transactions are made every day. On the other hand, any software bug in the operating system, in a high-privileged application, or in the web server software itself may expose this certificate to malicious users. Embedded systems are commonly used for implementing security functions. Since they are complete systems, they can perform the necessary cryptographic operations in a sealed and controlled environment [67–69]. Tamper resistance refers to the ability of a system to resist tampering attacks, that is,
attempts to bypass its attack-prevention mechanisms. The IBM PCI Cryptographic Coprocessor [70] is such a system, having achieved FIPS 140-2 Level 4 certification [33]. The advance of DRM technology into consumer devices and general-purpose computers drives the use of embedded systems for cryptographic protection of IP. Smartcards are a well-known example of tamper-resistant embedded systems that are used for financial transactions and subscription-based service provision. In many cases, embedded systems used for security-critical operations do not implement any tamper-resistance mechanisms. Rather, a thin layer of obscurity is preferred, for reasons of both simplicity and performance. However, as users become more interested in bypassing the security mechanisms of a system, the thin layer of obscurity is easily broken and the cryptographic keys are publicly exposed. The Adobe eBook software encryption [71], the Microsoft Xbox case [72], the USB hardware token devices [73], and the DVD CSS copy protection scheme [74] are examples of systems that implemented security by obscurity and were easily broken. Finally, an often neglected issue is lifecycle-wide management of cryptographic systems. While a device may be withdrawn from operation, the data it has stored or processed over time may still need to be protected. Key security that relies on the assumption that only authorized personnel have access to the system may not hold for a recycled device. Garfinkel and Shelat [75], Skorobogatov [76], and Gutman [77] present methods for recovering data from devices using noninvasive techniques.
17.5.2 Side-Channel Cryptanalysis

Until the mid-1990s, academic research on cryptography focused on the mathematical properties of cryptographic algorithms. Paul Kocher was the first to present cryptanalysis attacks on implementations of cryptographic algorithms, based on the implementation properties of a system. Kocher observed that a cryptographic implementation of the RSA algorithm required varying amounts of time to encrypt a block of data depending on the secret key used. Careful analysis of the timing differences allowed him to derive the secret key, and he extended this method to other algorithms as well [78]. This result came as a surprise, since the RSA algorithm had withstood years of mathematical cryptanalysis and was considered secure [79]. A short time later, Boneh et al. presented theoretical attacks deriving the secret keys of implementations of the RSA algorithm and the Fiat-Shamir and Schnorr identification schemes [80], revised in Reference 81, while similar results were presented by Bao et al. [82]. These findings revealed a new class of attacks on cryptographic algorithms. The term side-channel attacks (SCAs), which first appeared in Reference 83, has been widely used to refer to this type of cryptanalysis, while the terms fault-based cryptanalysis, implementation cryptanalysis, active/passive hardware attacks, leakage attacks, and others have been used as well. Cryptographic algorithms acquired a new security dimension: that of their exact implementation. Cryptographers had previously focused on understanding the underlying mathematical problems, in order to prove or conjecture the security of a cryptographic algorithm in terms of abstract mathematical symbols. Now, despite the hardness of the underlying mathematical problems, an implementation may be vulnerable and allow the extraction of secret keys or other sensitive material. Implementation vulnerabilities are, of course, not a new security concept.
In the previous section, we presented some impressive attacks on security that were based on implementation faults. The new insight of SCAs is that even cryptographic algorithms that are otherwise considered secure can be vulnerable to such faults. This observation is of significant importance, since cryptography is widely used as a major building block for security; if cryptographic algorithms can be rendered insecure, the whole construction collapses. Embedded systems, and especially smartcards, are a popular target for SCAs. To understand this, recall that such systems are usually owned by a service provider, such as a mobile phone operator, a TV broadcaster, or a bank, and possessed by service clients. The service provider relies on the security of the embedded system in order to prove service usage by the clients, such as phone calls, movie viewing, or purchases, and to charge the client accordingly. On the other hand, consumers have an incentive to bypass these mechanisms in order to enjoy free services. Given that SCAs are implementation specific and rely, as we will present later,
on the ability to interfere, passively or actively, with the device implementing a cryptographic algorithm, embedded systems are an especially attractive target: their resource limitations make attack efforts easier. In the following, we present the classes of SCAs and the countermeasures that have been developed. The technical field remains highly active, since ingenious channels continuously appear in the literature. Embedded system vendors must study the attacks carefully, evaluate the associated risks for their environment, and ensure that appropriate countermeasures are implemented in their systems; furthermore, they must be prepared to adapt promptly to new techniques for deriving secrets from their systems.
17.5.3 Side-Channel Implementations

A side channel is any physical channel that can carry information from the operation of a device while it performs a cryptographic operation; such channels are not captured by the existing abstract mathematical models. The definition is quite broad, and the inventiveness of attackers is remarkable. Timing differences, power consumption, electromagnetic emissions, acoustic noise, and faults have all been exploited to leak information out of cryptographic systems. The channel realizations can be categorized into three broad classes: physical or probing attacks, fault-induction or glitch attacks, and emission attacks, such as TEMPEST. We briefly review the first two classes; readers interested in TEMPEST attacks are referred to Reference 84. Side channels may seem unavoidable and a frightening threat. However, it should be strongly emphasized that in most cases, reported attacks, both theoretical and practical, rely for their success on detailed knowledge of the platform under attack and of the specific implementation of the cryptographic algorithm. For example, power analysis is successful in most cases because cryptographic algorithms tend to use only a small subset of a processor's instruction set, especially simple instructions such as LOAD, STORE, XOR, AND, and SHIFT, in order to develop elegant, portable, and high-performance implementations. This allows an attacker to minimize the power profiles he or she must construct and simplifies the distinction of the different instructions being executed.

17.5.3.1 Fault-Induction Techniques

Devices are always susceptible to erroneous computations or other kinds of faults, for several reasons. Faulty computations are a known issue in space systems because, in deep space, devices are exposed to radiation that can cause temporary or permanent bit flips, gate destruction, or other problems.
Incomplete testing during manufacturing may allow imperfect designs to reach the market, as in the case of the Intel Pentium FDIV bug [85], and devices may be operated in conditions outside their specifications [86]. Careful manipulation of the power supply or the clock oscillator can also cause glitches in code execution by tricking the processor, for example, into executing unknown instructions or bypassing a control statement [87]. Some researchers have questioned the feasibility of fault-injection attacks on real systems [88]. While fault injection may seem to require expensive and specialized equipment, there have been reports that it can be achieved with low-cost and readily available equipment. Anderson and Kuhn [89] and Anderson [66] present low-cost attacks on tamper-resistant devices, which achieve extraction of secret information from smartcards and similar devices. Kömmerling and Kuhn [87] present noninvasive fault-injection techniques, for example, by manipulating the power supply. Anderson [90] supports the view that the underground community has long been using such techniques to break the security of smartcards in pay-TV systems. Furthermore, Weingart [6] and Aumüller et al. [91] present attacks performed in a controlled lab environment, proving that fault-injection attacks are feasible. Skorobogatov and Anderson [140] introduce low-cost light flashes, such as a camera flash, as a means to introduce errors, while eddy-current attacks are introduced in Reference 92. A comprehensive presentation of fault-injection methods is given in Reference 93, along with experimental evidence of the applicability of the methods to industrial systems and anecdotal information.
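The distinction between the temporary and permanent faults mentioned above can be made concrete with a toy model. The class below is purely illustrative (not drawn from any cited work): it shows why a transient glitch disappears at the next write of the affected location, while a stuck-at fault caused by gate destruction survives every write.

```python
# Illustrative model of fault lifetime: a transient fault is erased by
# the next write, a permanent (stuck-at) fault is not.

class FaultyCell:
    def __init__(self, value=0):
        self.value = value
        self.stuck_mask = 0           # bits permanently stuck at 1

    def write(self, value):
        # A write clears transient corruption but not permanent damage.
        self.value = value | self.stuck_mask

    def transient_flip(self, bit):
        # e.g. a power glitch: corrupts the stored value once.
        self.value ^= (1 << bit)

    def permanent_stuck_at_1(self, bit):
        # e.g. gate destruction: the bit reads 1 from now on.
        self.stuck_mask |= (1 << bit)
        self.value |= (1 << bit)

cell = FaultyCell()
cell.write(0b0000)
cell.transient_flip(2)
assert cell.value == 0b0100       # corrupted once...
cell.write(0b0000)
assert cell.value == 0b0000       # ...but the next write restores it

cell.permanent_stuck_at_1(0)
cell.write(0b0000)
assert cell.value == 0b0001       # the stuck bit survives every write
```

This difference matters to the attacker: a transient fault must be injected at exactly the right moment, whereas a permanent fault affects every subsequent computation.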
The combined time–space isolation problem [94] is of significant importance in fault-induction attacks. The space isolation problem refers to isolating the appropriate space (area) of the chip in which to introduce the fault. The space isolation problem has four parameters:

1. Macroscopic. The part of the chip where the fault can be injected. Possible answers can be one or more of the following: main memory, address bus, system bus, register file.
2. Bandwidth. The number of bits that can be affected. It may be possible to change just one bit or multiple bits at once. The exact number of changed bits may be controllable (e.g., one) or follow a random distribution.
3. Granularity. The area where the error can occur. The attacker may target the fault-injection position at the bit level or over a wider area, such as a byte or a multibyte region. The fault-injected area can be covered by a single error or by multiple errors; these errors may cluster around the mark or be evenly distributed.
4. Lifetime. The time duration of the fault. It may be a transient fault or a permanent fault. For example, a power glitch may cause a transient fault at a memory location, since the next time the location is written, a new value will be stored correctly. In contrast, cell or gate destruction results in a permanent error, since the output bit will be stuck at 0 or 1, independently of the input.

The time isolation problem refers to the time at which a fault is injected. An attacker may be able to synchronize exactly with the clock of the chip or may introduce the error in a random fashion. This granularity is the only parameter of the time isolation problem. Clearly, the ability to inject a fault with clock-period granularity is desirable, but impractical in real-world applications.

17.5.3.2 Passive Side Channels

Passive side channels are not a new concept in cryptography and security.
The information available from the now partially declassified TEMPEST project reveals helpful insights into how electromagnetic emissions occur and can be used to reconstruct signals for surveillance purposes. A good review of the subject is provided in chapter 15 of Reference 90. Kuhn [84,95,96] presents innovative uses of electromagnetic emissions to reconstruct information from CRT and LCD displays, while Loughry and Umphress [97] reconstruct information flowing through network devices from the emissions of their LEDs. The new concept in this area is that such emissions can also be used to derive secret information from an otherwise secure device. Probably the first such attack took place in 1956 [98]. MI5, the British intelligence service, used a microphone to capture the sound of the rotor clicks of a Hagelin machine in order to deduce the core position of some of its rotors. This reduced the problem of calculating the initial setup of the machine to within the range of their then-available resources, and allowed them to eavesdrop on the encrypted communications for quite a long time. While this so-called acoustic cryptanalysis may seem outdated, researchers have recently taken a fresh look at the topic, monitoring low-frequency (kHz) sounds and correlating them with operations performed by a high-frequency (GHz) processor [99]. Researchers have been quite creative and have used many types of emissions and other physical interactions of the device with its operating environment. Kocher [78] introduces the idea of monitoring the execution time of a cryptographic algorithm to identify the secret keys used. The key concept in this approach is that an implementation of an algorithm may contain branches and other conditional execution, or may follow different execution paths. If these variations depend on the bit values of a secret key, then statistical analysis can reveal the secret key bit by bit. Coron et al.
[100] explain the sources and causes of power dissipation, while Kocher et al. [101] present how power consumption can also be correlated with key bits. Rao and Rohatgi [102] and Quisquater and Samyde [86] introduce electromagnetic analysis. Probing attacks can also be applied to reveal the Hamming weight of data transferred across a bus or stored in memory; this approach is also heavily dependent on the exact hardware platform [103,104]. While passive side channels are usually considered in the context of embedded systems and other resource-limited environments, complex computing systems may also exhibit them. Page [105] explores the theoretical use of timing variations due to the processor cache in order to extract secret keys.
Song et al. [106] take advantage of a timing channel in the secure communication protocol SSH to recover user passwords, while Felten and Schneider [107] present timing attacks on web privacy. A malicious web server can inject client-side code that transparently fetches specific pages on behalf of the user; the server would like to know whether the user has visited these pages before. The time difference between fetching a web page from the remote server and accessing it from the user's cache is sufficient to determine whether the user has visited the page before. A more impressive result, directly related to cryptography, is presented in Reference 108, where remote timing attacks on web servers implementing the SSL protocol are shown to be practical: a malicious user can extract the private key of the server's certificate by measuring its response times.
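The root cause of such timing channels, key-dependent work, can be shown with a minimal sketch of naive left-to-right square-and-multiply, the exponentiation pattern exploited by Kocher's original attack. The function name and the operation-count model of execution time are ours, used for illustration only.

```python
# Naive left-to-right square-and-multiply: a multiplication is performed
# only when the current exponent bit is 1, so execution time (modelled
# here as an operation count) depends directly on the secret exponent.

def modexp_counting(base, exp, mod):
    result, ops = 1, 0
    for bit in bin(exp)[2:]:          # most significant bit first
        result = (result * result) % mod
        ops += 1                      # square: performed for every bit
        if bit == '1':
            result = (result * base) % mod
            ops += 1                  # multiply: only for 1-bits (leakage)
    return result, ops

_, ops_sparse = modexp_counting(7, 0b10000001, 1009)   # two 1-bits
_, ops_dense  = modexp_counting(7, 0b11111111, 1009)   # eight 1-bits
assert ops_dense > ops_sparse    # a denser key takes measurably longer
```

Real attacks refine this crude signal statistically, measuring many operations and correlating timing with hypotheses about individual key bits; constant-time implementations (e.g., always performing the multiply and discarding it for 0-bits) are the standard countermeasure.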
17.5.4 Fault-Based Cryptanalysis

The first theoretical active attacks are presented in References 80 and 82. The attacks in the former paper focused on RSA, when implemented with the Chinese Remainder Theorem (CRT) and the Montgomery multiplication method, and on the Fiat-Shamir and Schnorr identification schemes. The latter work focuses on cryptosystems whose security is based on the Discrete Logarithm Problem and presents attacks on the ElGamal signature scheme, the Schnorr signature scheme, and DSA. The attack on the Schnorr signature scheme is extended, with some modification, to the identification scheme as well. Furthermore, the second paper independently reports an attack on RSA–Montgomery. Since then, this area has been quite active, both in developing fault-induction attacks and in devising countermeasures. The attacks have succeeded against most popular and widely used algorithms. In the following, we give a brief review of the literature. The attacks on RSA with Montgomery multiplication have been extended by attacking the signing key instead of the message [109]. Furthermore, similar attacks are presented for the LUC and KMOV (elliptic-curve-based) cryptosystems. In Reference 110, the attacks are generalized to any RSA-type cryptosystem, with the LUC and Demytko cryptosystems as examples. Faults can be used to expose the private key of the RSA–KEM scheme [111], and transient faults can be used to derive RSA and DSA secret keys from applications compatible with the OpenPGP format [112]. The Bellcore attack on the Fiat-Shamir scheme is shown to be incomplete in Reference 94, where the Precautious Fiat-Shamir scheme, which defends against it, is introduced. A new attack that succeeds against both the classical and the Precautious Fiat-Shamir schemes is presented in Reference 113. Beginning with Biham and Shamir [114], fault-based cryptanalysis also turned to symmetric key cryptosystems.
DES is shown to be vulnerable to so-called Differential Fault Analysis (DFA), using only 50 to 200 faulty ciphertexts. The method is also extended to unknown cryptosystems, with an example attack on the once-classified algorithm SkipJack. Another variant of the attack on DES takes advantage of permanent instead of transient faults. The same ideas are explored and extended for completely unknown cryptosystems in Reference 115, while Jacob et al. [116] use faults to attack obfuscated ciphers in software and extract secret material without de-obfuscating the code. For some time it was believed that fault-induction attacks could only succeed against cryptographic schemes based on algebraic hard mathematical problems, such as integer factoring and discrete logarithm computation. Elliptic Curve Cryptosystems (ECCs) are a preferred choice for implementing cryptography, since they offer security equivalent to that of algebraic public key algorithms while requiring only about a tenth of the key bits. Biehl et al. [117] extend DFA to ECCs and, especially, to schemes whose security is based on the discrete logarithm problem over elliptic curve fields. Furthermore, Zheng and Matsumoto [118] use transient and permanent faults to attack random number generators, a crucial building block of cryptographic protocols, and the ElGamal signature scheme. Rijndael [119] was selected as the AES algorithm [120], the replacement of DES. The case of the AES algorithm is quite interesting, considering that it was submitted after the introduction of SCAs; thus, its authors had taken all the appropriate countermeasures to ensure that the algorithm resisted all known cryptanalysis techniques applicable to their design. The original proposal [119] even noted timing attacks and how they could be prevented. Koeune and Quisquater [121] describe how a careless implementation
of the AES algorithm can enable a timing attack that derives the secret key used. The experiments carried out show that the key can be derived from 3000 samples per key byte, with minimal cost and high probability. The proposal of the algorithm is aware of this issue, and the reference design is immune to such a simple attack. However, DFA proved to be successful against AES. Although DFA was designed for attacking algorithms with a Feistel structure, such as DES, Dusart et al. [122] show that it can be applied to AES, which does not have such a structure. Four different fault-injection models are presented, and the attacks succeed for all key sizes (128, 192, and 256 bits). Their experiments show that with ten pairs of faulty/correct messages in hand, a 128-bit AES key can be extracted in a few minutes. Blömer et al. [123] present additional fault-based attacks on AES, assuming multiple kinds of fault models. The strictest model, requiring exact synchronization in space and time for the error injection, succeeds in deriving a 128-bit secret key after collecting 128 faulty ciphertexts, while the least strict model derives the 128-bit key after collecting 256 faulty ciphertexts.

17.5.4.1 Case Study: RSA–Chinese Remainder Theorem

The RSA cryptosystem remains a viable and preferred public key cryptosystem, having withstood years of cryptanalysis [79]. The security of the RSA public key algorithm relies on the hardness of factoring large numbers into prime factors. The elements of the algorithm are N = pq, the product of two large prime numbers; e and d, the public and secret exponents, respectively; and the modular exponentiation operation m^k mod N. To sign a message m, the sender computes s = m^d mod N, using his or her private key. The receiver computes m = s^e mod N to verify the signature of the received message. The modular exponentiation operation is computationally intensive for large numbers and is the major computational bottleneck in an RSA implementation.
The CRT allows fast modular exponentiation. Using RSA with CRT, the sender computes s1 = m^d mod p and s2 = m^d mod q, and combines the two results, based on the CRT, to compute S = a·s1 + b·s2 mod N for some predefined values a and b. The CRT method is quite popular, especially for embedded systems, since it allows four times faster execution and smaller memory storage for intermediate results (observe that typically p and q have half the size of N). The Bellcore attack [80,81], as it is commonly referenced, is quite simple and powerful against RSA with CRT. It suffices to have one correct signature S for a message m and one faulty signature S′, caused by an incorrect computation of one of the two intermediate results s1 and s2. It does not matter whether the error occurred in the first or the second intermediate result, or how many bits were affected by the error. Assuming that an error indeed occurred, it suffices to compute gcd(S − S′, N), which will equal q if the error occurred in the computation of s1, and p if it occurred in s2. This allows an attacker to factor N and thus the security of the algorithm is broken. Lenstra [124] improves this attack by requiring only a known message and a faulty signature, instead of two signatures. In this case, it suffices to compute gcd(M − (S′)^e, N) to reveal one of the two prime factors. Boneh et al. [80] propose double computation as a means to detect such erroneous computations. However, this is not always efficient, especially in resource-limited environments or where performance is an important issue. Also, this approach is of no help in the case where a permanent error has occurred. Kaliski and Robshaw [125] propose signature verification, by checking the equality S^e mod N = M. Since the public exponent may be quite large, this check can be rather time consuming for a resource-limited system. Shamir [126] describes a software method for protecting RSA with CRT from fault and timing attacks.
The idea is to use a random integer t and perform a "blinded" CRT by computing S_pt = m^d mod (p · t) and S_qt = m^d mod (q · t). If the equality S_pt ≡ S_qt (mod t) holds, then the initial computation is considered error-free and the result of the CRT can be released from the device. Yen et al. [127] further improve this countermeasure for efficient implementation without performance penalties, but Blömer et al. [128] show that this improvement in fact renders RSA with CRT totally insecure. Aumüller et al. [91] provide another software implementation countermeasure for faulty RSA–CRT computations. However, Yen et al. [129], using a weak fault model, show that both these countermeasures [91,126] are still vulnerable if the attacker focuses on the modular reduction operation (the reduction mod p) used in the countermeasures. The attacks are valid for both transient and permanent errors and, again, appropriate countermeasures are proposed.
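The whole case study can be condensed into a short Python sketch under simplifying assumptions: toy primes, the fault modeled as a one-bit corruption of s1, and a fixed blinding value t where the method calls for a fresh random t per signature.

```python
from math import gcd

# Toy parameters -- illustrative only; real RSA uses primes of 512+ bits.
p, q = 61, 53
N = p * q
e, d = 17, 2753                      # e*d = 1 mod (p-1)(q-1)

def crt_sign(m, fault=False):
    s1 = pow(m, d % (p - 1), p)      # s1 = m^d mod p
    s2 = pow(m, d % (q - 1), q)      # s2 = m^d mod q
    if fault:
        s1 ^= 1                      # Bellcore fault: one bit of s1 flips
    h = (s2 - s1) * pow(p, -1, q) % q
    return (s1 + p * h) % N          # CRT recombination of s1 and s2

m = 1234
S, S_faulty = crt_sign(m), crt_sign(m, fault=True)

# Bellcore attack: the fault left s2 intact, so S - S' is a multiple of
# q but not of p, and a single gcd factors the public modulus.
assert gcd(S - S_faulty, N) == q

def blinded_crt_ok(m, fault=False):
    # Shamir's check: both values reduce to m^d mod t, so they agree
    # mod t exactly when the computation was fault-free.
    t = 101                          # fixed here; random per signature in practice
    s_pt = pow(m, d, p * t) + (1 if fault else 0)   # fault as off-by-one
    s_qt = pow(m, d, q * t)
    return s_pt % t == s_qt % t

assert blinded_crt_ok(m) and not blinded_crt_ok(m, fault=True)
```

Note how little the attacker needs: one correct and one faulty signature of the same message, and a single gcd computation.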
© 2006 by Taylor & Francis Group, LLC
Design Issues in Secure Embedded Systems
17-17
As we show, the implementation of error checking functions using the final or intermediate results of RSA computations can create an additional side meta-channel, even though faulty computations never leave a sealed device. Assume that an attacker knows that a bit in a register holding part of the key was invasively set to zero during the computation, and that the device checks the correctness of the output by double computation. If the device outputs a signed message, then no error was detected and thus the respective bit of the key is zero. If the device does not output a signed message, or outputs an error message, then the respective bit of the key is one. Such a safe-error attack is presented in Reference 130, focusing on RSA when implemented with Montgomery multiplication. Yen et al. [131] extend the idea of safe-error attacks from memory faults to computational faults and present such an attack on RSA with Montgomery multiplication, which can also be applied to scalar multiplication on elliptic curves. An even simpler attack would be to attack both an intermediate computation and the condition check. A condition check can be a single point of failure and an attacker can easily mount an attack against it, provided that he or she has means to introduce errors into computations [128]. Indeed, in most cases, a condition check is implemented as a bit comparison with a zero flag. Blömer et al. [128] extend the idea of checking vulnerable points of computation by exhaustively testing every computation performed for an RSA–CRT signing, including the CRT combination. The proposed solution seems the most promising at the moment, allowing only attacks by powerful adversaries that can solve precisely the time–space isolation problem. However, it should already be clear that advancements in this area of cryptanalysis are continuous, and designers should always be prepared to adapt to new attacks.
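A minimal simulation of this safe-error reasoning, with hypothetical toy parameters, the fault modeled as forcing one key-register bit to zero, and the device performing a double-computation check before releasing output:

```python
d, N = 2753, 3233            # toy secret exponent and modulus (illustrative)

def attacked_sign(m, target_bit):
    # Fault model: `target_bit` of the key register is forced to zero
    # during one of the two redundant computations.
    d_faulty = d & ~(1 << target_bit)
    s_faulty = pow(m, d_faulty, N)
    s_check = pow(m, d, N)           # double computation for error detection
    return s_faulty if s_faulty == s_check else None   # suppress on mismatch

# Output present -> the fault changed nothing -> that key bit was already 0.
# Output missing -> the check fired           -> that key bit is 1.
assert attacked_sign(2, 1) is not None   # bit 1 of d is 0
assert attacked_sign(2, 0) is None       # bit 0 of d is 1
```

The presence or absence of output leaks one key bit per fault, even though the faulty signature itself never leaves the device.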
17.5.5 Passive Side-Channel Cryptanalysis

Passive side-channel cryptanalysis has received a lot of attention since its introduction in 1996 by Paul Kocher [78]. Passive attacks are considered harder to defend against, and their noninvasive nature is a source of much concern. Fault-induction attacks require some form of manipulation of the device, and thus sensors or other similar means can be used to detect such actions and shut down, or even zero out, the device. In the case of passive attacks, the physical characteristics of the device are merely monitored, usually with readily available probes and other hardware. So, it is not an easy task to detect the presence of a malicious user, especially when only a few measurements are required or when abnormal operation (such as continuous requests for encryptions/decryptions) cannot be identified. The first results are due to Kocher [78]. Timing variations in the execution of a cryptographic algorithm such as Diffie–Hellman key exchange, RSA, and DSS are used to derive bit-by-bit the secret keys of these algorithms. Although mentioned before, we should emphasize that timing attacks and other forms of passive SCA require knowledge of the exact implementation of the cryptographic algorithm under attack. Dhem et al. [132] describe a timing attack against the RSA signature algorithm. The attack derives a 512-bit secret key with 200,000 to 300,000 timing measurements. Schindler et al. [133] improve the timing attacks on RSA modular exponentiation by a factor of 50, allowing extraction of a 512-bit key using as few as 5,000 timing measurements. The approach used is an error-correction (estimator) function, which can detect erroneous bit detections as the key extraction process evolves.
Hevia and Kiwi [134] introduce a timing attack against DES, which reveals the Hamming weight of the key, by exploiting the fact that a conditional bit "wrap around" function results in variable execution time of the software implementing the algorithm. They succeed in recovering the Hamming weight of the key, equivalent to 3.95 key bits (out of a 56-bit key). The most threatening issue is that keys with low or high Hamming weight are sparse; so, if the attack reveals that the key has such a weight, the key space that must be searched is reduced dramatically. The RC5 algorithm has also been subjected to timing attacks, due to conditional statement execution in its code [135]. Kocher et al. [101] extend the attackers' arsenal further by introducing the vulnerability of DES to power analysis attacks and, more specifically, to Differential Power Analysis (DPA), a technique that combines differential cryptanalysis and careful engineering, and to Simple Power Analysis (SPA). SPA refers to power analysis attacks that can be performed by monitoring only a single or a few power traces, probably with the same encryption key. SPA succeeds in revealing the operations performed by the device, such as permutations, comparisons, and multiplications. Practically, any algorithm implementation that executes
Embedded Systems Handbook

INPUT:  M, N, d = (d_{n-1} d_{n-2} ... d_1 d_0)_2
OUTPUT: S = M^d mod N

S = 1;
for (i = n - 1; i >= 0; i--) {
    S = S * S mod N;
    if (d_i == 1) {
        S = S * M mod N;
    }
}
return S;
FIGURE 17.1 Left-to-right repeated square-and-multiply algorithm.
some statements conditionally, based on data or key material, is at least susceptible to power analysis attacks. This holds for public-key, secret-key, and elliptic curve cryptography. DPA has been successfully applied at least to block ciphers, such as IDEA, RC5, and DES [83]. Electromagnetic Attacks (EMAs) have contributed some impressive results on what information can be reconstructed. Gandolfi et al. [136] report results from cryptanalysis of real-world cryptosystems, such as DES and RSA. Furthermore, they demonstrate that electromagnetic emissions may be preferable to power analysis, in the sense that fewer traces are needed to mount an attack and these traces carry richer information for deriving the secret keys. However, the full power of EMA attacks has not been utilized yet and we should expect more results on real-world cryptanalysis of popular algorithms.

17.5.5.1 Case Study: RSA–Montgomery

Previously, we explained the importance of a fast modular exponentiation primitive for the RSA cryptosystem. Montgomery multiplication is a fast implementation of this primitive function [137]. The left-to-right repeated square-and-multiply method is depicted in Figure 17.1, in C pseudocode. The timing attack of Kocher [78] exploits the timing variation caused by the condition statement on the fourth line. If the respective bit of the secret exponent is "1," then square (line 3) and multiply (line 5) operations are executed, while if the bit is "0" only a square operation is performed. In summary, the exact time of executing the loop n times depends only on the exact values of the bits of the secret exponent. An attacker proceeds as follows. Assume that the first m bits of the secret exponent are known. The attacker has a device identical to the one containing the secret exponent and can control the key used for each encryption. The attacker collects from the attacked device the total execution times T1, T2, . . . , Tk of each signature operation on some known messages M1, M2, . . .
, Mk. He also performs the same operations on the controlled device, so as to collect another set of measurements, t1, t2, . . . , tk, where he fixes the first m bits of the key, targeting the (m + 1)st bit. Kocher's key observation is that, if the unknown bit d_{m+1} = 1, then the two sets of measurements are correlated. If d_{m+1} = 0, then the two sets behave like independent random variables. This differentiation allows the attacker to extract the secret exponent bit-by-bit. Depending on the implementation, a simpler form of the attack could be mounted. SPA does not require lengthy statistical computations but rather relies on power traces of execution profiles of a cryptographic algorithm. For this example, Schindler et al. [133] explain how the power profiles can be used. Execution of line 5 in the above code requires an additional multiplication. Even if the spikes in power consumption of the squaring and multiplication operations are indistinguishable, the multiplication requires additional load operations and thus its power spikes will be wider than in the case where only squaring is performed.
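The leak can be made explicit by instrumenting the square-and-multiply loop of Figure 17.1 to count multiplications, a crude proxy for execution time. The exponent bit patterns below are hypothetical examples.

```python
def square_and_multiply(M, d_bits, N):
    """Left-to-right square-and-multiply of Figure 17.1, instrumented to
    count the data-dependent multiplications."""
    S, mults = 1, 0
    for bit in d_bits:               # most significant bit first
        S = S * S % N                # square: executed every iteration
        if bit == 1:
            S = S * M % N            # multiply: only when the key bit is 1
            mults += 1
    return S, mults

N = 3233
S, m1 = square_and_multiply(5, [1, 0, 1], N)        # exponent 5
assert S == pow(5, 5, N)                            # result is correct
_, m2 = square_and_multiply(5, [1, 1, 1, 1], N)     # exponent 15
# The loop length is fixed, but the multiply count -- and hence the
# running time -- tracks the Hamming weight of the secret exponent.
assert (m1, m2) == (2, 4)
```

This is exactly the data dependence that both the timing measurements and the width of the power spikes expose.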
17.5.6 Countermeasures

In the previous sections we provided a review of SCA, both fault-based and passive. In this section, we review the countermeasures that have been proposed. The list is by no means exhaustive, and new results appear continuously, since countermeasures are steadily improving.
The proposed countermeasures can be classified into two main classes: hardware protection mechanisms and mathematical protection mechanisms. A first layer of protection against SCA consists of hardware protection layers, such as passivation layers that do not allow direct access between a (malicious) user and the system implementing the cryptographic algorithm, or memory address bus obfuscation. Various sensors can also be embedded in the device, in order to detect and react to abnormal environmental conditions, such as extreme temperatures, power, and clock variations. Such mechanisms are widely employed in smartcards for financial transactions and other high-risk applications. Such protection layers can be effective against fault-injection attacks, since they shield the device against external manipulation. However, they cannot protect the device from attacks based on external observation, such as power analysis techniques. The previous countermeasures do not alter the current designs of the circuits, but rather add protection layers on top of them. A second approach is the design of a new generation of chips to implement cryptographic algorithms and to process sensitive information. Such circuits have asynchronous/self-clocking/dual-rail logic; each part of the circuit may be clocked independently [138]. Fault attacks that rely on external clock manipulation (such as glitch attacks) are not feasible in this case. Furthermore, timing or power analysis attacks become harder for the attacker, since there is no global clock that correlates the input data and the emitted power. Such countermeasures have the potential to become common practice. Their application, however, must be carefully evaluated, since they may occupy a large area of the circuit; manufacturers usually justify such area expansions as a way to increase the system's available memory, not to implement another security feature.
Furthermore, such mechanisms require changes in the production line, which is not always feasible. A third approach aims to implement the cryptographic algorithms so that no key information leaks. Proposed approaches include modifying the algorithm to run in constant time, adding random delays to the execution of the algorithm, randomizing the exact sequence of operations without affecting the final result, and adding dummy operations to the execution of the algorithm. These countermeasures can defeat timing attacks, but careful design must be employed to defeat power analysis attacks too. For example, dummy operations or random delays are easily distinguishable in a power trace, since they tend to consume less power than ordinary cryptographic operations. Furthermore, differences in power traces between profiles of known operations can also reveal permutation of operations. For example, a modular multiplication is known to consume more power than a simple addition, so if the execution order is interchanged, they will still be identifiable. In more resource-rich systems, where high-level programming languages are used, compiler or human optimizations can remove these artifacts from the program or change the implementation, resulting in vulnerability to SCA. The same holds if memory caches are used and the algorithm is implemented so that the latency between cache and main memory can be detected, either from timing or from power traces. Insertion of random delays or other forms of noise should also be considered carefully, because a large mean delay translates directly to reduced performance, which is not always acceptable. The second class of countermeasures focuses on the mathematical strengthening of the algorithms against such attacks.
The RSA blinding technique by Shamir [126] is such an example; the proposed method guards the system from leaking meaningful information, because the leaked information is related to the random number used for blinding instead of the key; thus, even if the attacker manages to reveal a number, this will be the random number and not the key. It should be noted that a different random number is used for each signing or encryption operation. Thus, the faults injected into the system will be applied to a different, random number every time and the collected information is useless. At the boundary between mathematical and implementation protection, it has been proposed to check cryptographic operations for correctness, as a defense against fault-injection attacks. However, these checks can also be exploited as side channels of information or can degrade performance significantly. For example, double computation and comparison of the results halves the throughput an implementation can achieve; furthermore, in the absence of other countermeasures, the comparison function can be bypassed (e.g., by a clock glitch or a fault injection in the comparison function) or used as a side channel as well. If multiple checks are employed, measuring the rejection time can reveal in what stage of the algorithm the error
occurred; if the checks are independent, this can be utilized to extract the secret key, even when the implementation does not output the faulty computation [111,139].
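As one concrete illustration of the dummy-operation idea discussed above (a sketch of the well-known square-and-multiply-always variant, not a construction taken from this chapter), the multiplication can be executed in every iteration and its result discarded when the key bit is 0, making the operation sequence independent of the secret bits:

```python
def modexp_always(M, d_bits, N):
    # Square-and-multiply-always: the multiply executes unconditionally,
    # so the sequence of operations no longer depends on the key bits.
    S = 1
    for bit in d_bits:
        S = S * S % N
        T = S * M % N                # dummy multiply when bit == 0
        S = T if bit == 1 else S     # selection; branch-free in practice
    return S

N = 3233
assert modexp_always(5, [1, 0, 1], N) == pow(5, 5, N)
assert modexp_always(7, [1, 1, 1, 1], N) == pow(7, 15, N)
```

Note that the conditional selection here is still a branch; hardened implementations replace it with constant-time arithmetic selection. Moreover, dummy multiplications are precisely what safe-error attacks target, so this sketch addresses timing and SPA leakage only and must be combined with the other countermeasures discussed above.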
17.6 Conclusions

Security constitutes a significant requirement in modern embedded computing systems. Their widespread use in services that involve sensitive information, in conjunction with their resource limitations, has led to a significant number of innovative attacks that exploit system characteristics and result in the loss of critical information. Development of secure embedded systems is an emerging field in computer engineering, requiring skills from cryptography, communications, hardware, and software. In this chapter, we surveyed the security requirements of embedded computing systems and described the technologies that are more critical to them, relative to general-purpose computing systems. Considering the innovative system (side-channel) attacks that were developed with the motivation of breaking secure embedded systems, we presented the known SCA in detail and described the countermeasure technologies against them. Clearly, the technical area of secure embedded systems is far from mature. Innovative attacks and successful countermeasures are continuously emerging, promising an attractive and rich technical area for research and development.
References

[1] W. Wolf, Computers as Components — Principles of Embedded Computing Systems Design. Elsevier, Amsterdam, 2000.
[2] W. Freeman and E. Miller, An experimental analysis of cryptographic overhead in performance-critical systems. In Proceedings of the Seventh International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, 1999, p. 348.
[3] S. Ravi, P. Kocher, R. Lee, G. McGraw, and A. Raghunathan, Security as a new dimension in embedded system design. In Proceedings of the 41st Annual Conference on Design Automation, 2004, pp. 753–760.
[4] S. Ravi, A. Raghunathan, P. Kocher, and S. Hattangady, Security in embedded systems: design challenges. Transactions on Embedded Computing Systems, 3, 461–491, 2004.
[5] D.G. Abraham, G.M. Dolan, G.P. Double, and J.V. Stevens, Transaction security system. IBM Systems Journal, 30, 206–229, 1991.
[6] S.H. Weingart, Physical security devices for computer subsystems: a survey of attacks and defenses. In Cryptographic Hardware and Embedded Systems — CHES 2000: Second International Workshop, 2000, p. 302.
[7] R. Anderson and M. Kuhn, Tamper resistance — a cautionary note. In Proceedings of the Second Usenix Workshop on Electronic Commerce, 1996, pp. 1–11.
[8] S. Blythe, B. Fraboni, S. Lall, H. Ahmed, and U. de Riu, Layout reconstruction of complex silicon chips. IEEE Journal of Solid-State Circuits, 28, 138–145, 1993.
[9] C.E. Landwehr, A.R. Bull, J.P. McDermott, and W.S. Choi, A taxonomy of computer program security flaws. ACM Computing Surveys, 26, 211–254, 1994.
[10] G. Hoglund and G. McGraw, Exploiting Software: How to Break Code. Addison-Wesley Professional, Reading, MA, 2004.
[11] J.J. Tevis and J.A. Hamilton, Methods for the prevention, detection and removal of software security vulnerabilities. In Proceedings of the 42nd Annual Southeast Regional Conference, 2004, pp. 197–202.
[12] P. Kocher, SSL 3.0 specification. http://wp.netscape.com/eng/ssl3/
[13] IETF, IPSec working group. http://www.ietf.org/html.charters/ipsec-charter.html
[14] T. Wollinger, J. Guajardo, and C. Paar, Security on FPGAs: state-of-the-art implementations and attacks. Transactions on Embedded Computing Systems, 3, 534–574, 2004.
[15] S.H. Gunther, F. Binns, D.M. Carmean, and J.C. Hall, Managing the impact of increasing microprocessor power consumption. Intel Technology Journal, Q1: 9, 2001. http://developer.intel.com/technology/itj/q12001/articles/art_4.htm
[16] I. Buchmann, Batteries in a Portable World, 2nd ed. Cadex Electronics Inc, May 2001.
[17] K. Lahiri, S. Dey, D. Panigrahi, and A. Raghunathan, Battery-driven system design: a new frontier in low power design. In Proceedings of the 2002 Conference on Asia South Pacific Design Automation/VLSI Design, 2002, p. 261.
[18] T. Martin, M. Hsiao, D. Ha, and J. Krishnaswami, Denial-of-service attacks on battery-powered mobile computers. In Proceedings of the Second IEEE International Conference on Pervasive Computing and Communications (PerCom'04), 2004, p. 309.
[19] N.R. Potlapally, S. Ravi, A. Raghunathan, and N.K. Jha, Analyzing the energy consumption of security protocols. In Proceedings of the 2003 International Symposium on Low Power Electronics and Design, 2003, pp. 30–35.
[20] D.W. Carman, P.S. Kruus, and B.J. Matt, Constraints and approaches for distributed sensor network security. NAI Labs, Technical report 00-110, 2000. Available at: http://www.cs.umbc.edu/courses/graduate/CMSC691A/Spring04/papers/nailabs_report_00010_final.pdf
[21] A. Raghunathan, S. Ravi, S. Hattangady, and J. Quisquater, Securing mobile appliances: new challenges for the system designer. In Design, Automation and Test in Europe Conference and Exhibition (DATE'03). IEEE, 2003, p. 10176.
[22] V. Raghunathan, C. Schurgers, S. Park, and M. Srivastava, Energy aware wireless microsensor networks. IEEE Signal Processing Magazine, 19, 40–50, 2002.
[23] A. Savvides, S. Park, and M.B. Srivastava, On modeling networks of wireless microsensors. In Proceedings of the 2001 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, 2001, pp. 318–319.
[24] Rockwell Scientific, Wireless integrated networks systems. http://wins.rsc.rockwell.com
[25] A. Hodjat and I. Verbauwhede, The energy cost of secrets in ad-hoc networks (Short paper). http://citeseer.ist.psu.edu/hodjat02energy.html
[26] B. Schneier, Applied Cryptography: Protocols, Algorithms, and Source Code in C. John Wiley & Sons, New York, 1995.
[27] N. Daswani and D. Boneh, Experimenting with electronic commerce on the PalmPilot. In Proceedings of the Third International Conference on Financial Cryptography, 1999, pp. 1–16.
[28] A. Perrig, J. Stankovic, and D. Wagner, Security in wireless sensor networks. Communications of the ACM, 47, 53–57, 2004.
[29] S. Ravi, A. Raghunathan, and N. Potlapally, Securing wireless data: system architecture challenges. In Proceedings of the 15th International Symposium on System Synthesis, 2002, pp. 195–200.
[30] IEEE 802.11 Working Group, IEEE 802.11 wireless LAN standards. http://grouper.ieee.org/groups/802/11/
[31] 3GPP, 3G Security; Security Architecture. 3GPP Organization, TS 33.102, 30-09-2003, Rel-6, 2003.
[32] Intel Corporation, VPN and WEP, wireless 802.11b security in a corporate environment. http://www.intel.com/business/bss/infrastructure/security/vpn_wep.htm
[33] NIST, FIPS PUB 140-2 security requirements for cryptographic modules. Available at http://csrc.nist.gov/cryptval/140-2.htm
[34] J. Lach, W.H. Mangione-Smith, and M. Potkonjak, Fingerprinting digital circuits on programmable hardware. In Information Hiding: Second International Workshop, IH'98, Vol. 1525 of Lecture Notes in Computer Science, Springer-Verlag, 1998, pp. 16–31.
[35] J. Burke, J. McDonald, and T. Austin, Architectural support for fast symmetric-key cryptography. In Proceedings of the Ninth International Conference on Architectural Support for Programming Languages and Operating Systems, 2000, pp. 178–189.
[36] L. Wu, C. Weaver, and T. Austin, CryptoManiac: a fast flexible architecture for secure communication. In Proceedings of the 28th Annual International Symposium on Computer Architecture, 2001, pp. 110–119.
[37] Infineon, SLE 88 Family Products. http://www.infineon.com/
[38] ARM, ARM SecurCore Family, Vol. 2004. http://www.arm.com/products/CPUs/securcore.html
[39] S. Moore, Enhancing Security Performance Through IA-64 Architecture, 2000. Intel Corp., http://www.intel.com/cd/ids/developer/asmo-na/eng/microprocessors/itanium/index.htm
[40] N. Potlapally, S. Ravi, A. Raghunathan, and G. Lakshminarayana, Optimizing public-key encryption for wireless clients. In Proceedings of the IEEE International Conference on Communications, May 2002.
[41] MIPS Inc., SmartMIPS Architecture, Vol. 2004. http://www.mips.com/ProductCatalog/P_SmartMIPSASE/productBrief
[42] Open Mobile Appliance, http://www.wapforum.org/what/technical.htm
[43] Mobile Electronic Transactions, http://www.mobiletransaction.org/
[44] Z. Shao, C. Xue, Q. Zhuge, E.H. Sha, and B. Xiao, Security protection and checking in embedded system integration against buffer overflow attacks. In Proceedings of the International Conference on Information Technology: Coding and Computing (ITCC'04), Vol. 2, 2004, p. 409.
[45] S. Biswas, M. Simpson, and R. Barua, Memory overflow protection for embedded systems using run-time checks, reuse and compression. In Proceedings of the 2004 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, 2004, pp. 280–291.
[46] J. You, W.-K. Kong, D. Zhang, and K.H. Cheung, On hierarchical palmprint coding with multiple features for personal identification in large databases. IEEE Transactions on Circuits and Systems for Video Technology, 14, 234–243, 2004.
[47] K.C. Chan, Y.S. Moon, and P.S. Cheng, Fast fingerprint verification using subregions of fingerprint images. IEEE Transactions on Circuits and Systems for Video Technology, 14, 95–101, 2004.
[48] A.K. Jain, A. Ross, and S. Prabhakar, An introduction to biometric recognition. IEEE Transactions on Circuits and Systems for Video Technology, 14, 4–20, 2004.
[49] Y.S. Moon, H.C. Ho, and K.L. Ng, A secure smart card system with biometrics capability. In Proceedings of the IEEE 1999 Canadian Conference on Electrical and Computer Engineering, 1999, pp. 261–266.
[50] T.Y. Tang, Y.S. Moon, and K.C. Chan, Efficient implementation of fingerprint verification for mobile embedded systems using fixed-point arithmetic. In Proceedings of the 2004 ACM Symposium on Applied Computing, 2004, pp. 821–825.
[51] L. Benini, A. Macii, and M. Poncino, Energy-aware design of embedded memories: a survey of technologies, architectures, and optimization techniques. Transactions on Embedded Computing Systems, 2, 5–32, 2003.
[52] S. Rosenthal, Serial EEPROMs provide secure data storage for embedded systems. SLTF Consulting, http://www.sltf.com/articles/pein/pein9101.htm
[53] Actel Corporation, Design security in nonvolatile flash and antifuse FPGAs. Technical report 5172163-0/11.01, 2001.
[54] Trusted Computing Group: Home. TCG, https://www.trustedcomputinggroup.org/home
[55] D.N. Serpanos and R.J. Lipton, Defense against man-in-the-middle attack in client-server systems with secure servers. In Proceedings of IEEE ISCC'2001. Hammammet, Tunisia, July 3–5, 2001, pp. 9–14.
[56] R.J. Lipton, S. Rajagopalan, and D.N. Serpanos, Spy: a method to secure clients for network services. In Proceedings of the 22nd International Conference on Distributed Computing Systems Workshops (Workshop ADSN'2002). Vienna, Austria, July 2–5, 2002, pp. 23–28.
[57] T.S. Messerges and E.A. Dabbish, Digital rights management in a 3G mobile phone and beyond. In Proceedings of the 2003 ACM Workshop on Digital Rights Management, 2003, pp. 27–38.
[58] D.L.C. Thekkath, M. Mitchell, P. Lincoln, D. Boneh, J. Mitchell, and M. Horowitz, Architectural support for copy and tamper resistant software. In Proceedings of the Ninth International Conference on Architectural Support for Programming Languages and Operating Systems, 2000, pp. 168–177.
[59] J.H. Saltzer and M.D. Schroeder, The protection of information in computer systems. Proceedings of the IEEE, 63, 1278–1308, 1975.
[60] T. King, Security + Training Guide. Que Certification, 2003.
[61] Kingpin and Mudge, Security analysis of the Palm operating system and its weaknesses against malicious code threats. In Proceedings of the 10th Usenix Security Symposium, 2001, pp. 135–152.
[62] A.D. Rubin and D.E. Geer Jr., Mobile code security. IEEE Internet Computing, 2, 30–34, 1998.
[63] V.N. Venkatakrishnan, R. Peri, and R. Sekar, Empowering mobile code using expressive security policies. In Proceedings of the 2002 Workshop on New Security Paradigms, 2002, pp. 61–68.
[64] G.C. Necula, Proof-carrying code. In Proceedings of the 24th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL '97), 1997, pp. 106–119.
[65] P. Gutmann, Lessons learned in implementing and deploying crypto software. In Proceedings of the 11th USENIX Security Symposium, 2002, pp. 315–325.
[66] R.J. Anderson, Why cryptosystems fail. In Proceedings of ACM CCS'93, ACM Press, November 1993, pp. 215–217.
[67] A.J. Clark, Physical protection of cryptographic devices. In Proceedings of Eurocrypt '87, 1987, pp. 83–93.
[68] D. Chaum, Design concepts for tamper responding systems. In Advances in Cryptology: Proceedings of Crypto '83, 1983, pp. 387–392.
[69] S.H. Weingart, S.R. White, W.C. Arnold, and G.P. Double, An evaluation system for the physical security of computing systems. In Proceedings of the Sixth Annual Computer Security Applications Conference, 1990, pp. 232–243.
[70] IBM Corporation, IBM PCI Cryptographic Coprocessor, September 2004. Available at http://www-3.ibm.com/security/cryptocards/html/pcicc.shtml
[71] EFF, U.S. v. ElcomSoft and Sklyarov FAQ, September 2004. Available at http://www.eff.org/IP/DMCA/US_v_Elcomsoft/us_v_sklyarov_faq.html
[72] A. Huang, Keeping secrets in hardware: the Microsoft Xbox case study. In Revised Papers from the Fourth International Workshop on Cryptographic Hardware and Embedded Systems, 2003, pp. 213–227.
[73] Kingpin, Attacks on and countermeasures for USB hardware token devices. In Proceedings of the Fifth Nordic Workshop on Secure IT Systems Encouraging Co-operation, 2000, pp. 135–151.
[74] D.S. Touretzky, Gallery of CSS Descramblers, September 2004. Available at http://www.cs.cmu.edu/~dst/DeCSS/Gallery
[75] S.L. Garfinkel and A. Shelat, Remembrance of data passed: a study of disk sanitization practices. IEEE Security and Privacy Magazine, 1, 17–27, 2003.
[76] S. Skorobogatov, Low temperature data remanence in static RAM. Technical report UCAM-CL-TR-536, University of Cambridge, 2002.
[77] P. Gutmann, Data remanence in semiconductor devices. In Proceedings of the 10th USENIX Security Symposium, 2001.
[78] P.C. Kocher, Timing attacks on implementations of Diffie–Hellman, RSA, DSS, and other systems. In Proceedings of CRYPTO '96, Lecture Notes in Computer Science, 1996, pp. 104–113.
[79] D. Boneh, Twenty years of attacks on the RSA cryptosystem. Notices of the American Mathematical Society (AMS), 46, 203–213, 1999.
[80] D. Boneh, R.A. DeMillo, and R.J. Lipton, On the importance of checking cryptographic protocols for faults. In Proceedings of Eurocrypt '97, Vol. 1233 of Lecture Notes in Computer Science, 1997, pp. 37–51.
[81] D. Boneh, R.A. DeMillo, and R.J. Lipton, On the importance of eliminating errors in cryptographic computations. Journal of Cryptology: The Journal of the International Association for Cryptologic Research, 14, 101–119, 2001.
[82] F. Bao, R.H. Deng, Y. Han, A.B. Jeng, A.D. Narasimhalu, and T. Ngair, Breaking public key cryptosystems on tamper resistant devices in the presence of transient faults. In Proceedings of the Fifth International Workshop on Security Protocols, 1998, pp. 115–124.
[83] J. Kelsey, B. Schneier, D. Wagner, and C. Hall, Side channel cryptanalysis of product ciphers. In Proceedings of ESORICS 1998, 1998, pp. 97–110.
[84] M.G. Kuhn, Compromising emanations: eavesdropping risks of computer displays. Technical report UCAM-CL-TR-577, University of Cambridge, December 2003.
[85] Intel Corporation, Analysis of the floating point flaw in the Pentium processor. November 1994. Available at http://support.intel.com/support/processors/pentium/fdiv/wp/ (September 2004).
[86] J.-J. Quisquater and D. Samyde, ElectroMagnetic analysis (EMA): measures and countermeasures for smart cards. In Proceedings of the International Conference on Research in Smart Cards, E-Smart 2001, Lecture Notes in Computer Science, 2001, pp. 200–210.
[87] O. Kömmerling and M.G. Kuhn, Design principles for tamper-resistant smartcard processors. In Proceedings of the USENIX Workshop on Smartcard Technology (Smartcard '99). USENIX Association, Chicago, IL, May 10–11, 1999, pp. 9–20.
[88] D.P. Maher, Fault induction attacks, tamper resistance, and hostile reverse engineering in perspective. In Proceedings of the First International Conference on Financial Cryptography, 1997, pp. 109–122.
[89] R.J. Anderson and M.G. Kuhn, Low cost attacks on tamper resistant devices. In Proceedings of the Fifth International Security Protocols Conference, Vol. 1361 of Lecture Notes in Computer Science, M. Lomas et al., Eds. Springer-Verlag, Paris, France, April 7–9, 1997, pp. 125–136.
[90] R.J. Anderson, Security Engineering: A Guide to Building Dependable Distributed Systems. John Wiley & Sons, New York, 2001.
[91] C. Aumüller, P. Bier, W. Fischer, P. Hofreiter, and J. Seifert, Fault attacks on RSA with CRT: concrete results and practical countermeasures. In Revised Papers from the Fourth International Workshop on Cryptographic Hardware and Embedded Systems, Springer-Verlag, 2003, pp. 260–275.
[92] D. Samyde, S. Skorobogatov, R. Anderson, and J.-J. Quisquater, On a new way to read data from memory. In Proceedings of CHES 2002, Lecture Notes in Computer Science, 2003.
[93] H. Bar-El, H. Choukri, D. Naccache, M. Tunstall, and C. Whelan, The Sorcerer's apprentice guide to fault attacks. In Workshop on Fault Diagnosis and Tolerance in Cryptography, 2004.
[94] A.G. Voyiatzis and D.N. Serpanos, Active hardware attacks and proactive countermeasures. In Proceedings of IEEE ISCC 2002, 2002.
[95] M.G. Kuhn, Optical time-domain eavesdropping risks of CRT displays. In Proceedings of the IEEE Symposium on Security and Privacy, 2002, pp. 3–18.
[96] M.G. Kuhn, Electromagnetic eavesdropping risks of flat-panel displays. Presented at the Fourth Workshop on Privacy Enhancing Technologies, May 26–28, 2004, Toronto, Canada.
[97] J. Loughry and D.A. Umphress, Information leakage from optical emanations. ACM Transactions on Information and System Security, 5, 262–289, 2002.
[98] P. Wright, Spycatcher: The Candid Autobiography of a Senior Intelligence Officer. Viking, New York, 1987.
[99] A. Shamir and E. Tromer, Acoustic cryptanalysis — on noisy people and noisy machines. Eurocrypt 2004 Rump Session presentation, September 2004. Available at http://www.wisdom.weizmann.ac.il/~tromer/acoustic/
[100] J. Coron, D. Naccache, and P. Kocher, Statistics and secret leakage. ACM Transactions on Embedded Computing Systems, 3, 492–508, 2004.
© 2006 by Taylor & Francis Group, LLC
Design Issues in Secure Embedded Systems
II System-on-Chip Design

18 System-on-Chip and Network-on-Chip Design
Grant Martin
19 A Novel Methodology for the Design of Application-Specific Instruction-Set Processors Andreas Hoffmann, Achim Nohl, and Gunnar Braun
20 State-of-the-Art SoC Communication Architectures José L. Ayala, Marisa López-Vallejo, Davide Bertozzi, and Luca Benini
21 Network-on-Chip Design for Gigascale Systems-on-Chip Davide Bertozzi, Luca Benini, and Giovanni De Micheli
22 Platform-Based Design for Embedded Systems Luca P. Carloni, Fernando De Bernardinis, Claudio Pinello, Alberto L. Sangiovanni-Vincentelli, and Marco Sgroi
23 Interface Specification and Converter Synthesis Roberto Passerone
24 Hardware/Software Interface Design for SoC Wander O. Cesário, Flávio R. Wagner, and A.A. Jerraya
25 Design and Programming of Embedded Multiprocessors: An Interface-Centric Approach Pieter van der Wolf, Erwin de Kock, Tomas Henriksson, Wido Kruijtzer, and Gerben Essink
26 A Multiprocessor SoC Platform and Tools for Communications Applications Pierre G. Paulin, Chuck Pilkington, Michel Langevin, Essaid Bensoudane, Damien Lyonnard, and Gabriela Nicolescu
18 System-on-Chip and Network-on-Chip Design

Grant Martin
Tensilica Inc.

18.1 Introduction
18.2 System-on-a-Chip
18.3 System-on-a-Programmable-Chip
18.4 IP Cores
18.5 Virtual Components
18.6 Platforms and Programmable Platforms
18.7 Integration Platforms and SoC Design
18.8 Overview of the SoC Design Process
18.9 System-Level Design
18.10 Interconnection and Communication Architectures for SoC
18.11 Computation and Memory Architectures for SoC
18.12 IP Integration Quality and Certification Methods and Standards
18.13 Summary
References
18.1 Introduction

“System-on-Chip” (SoC) is a phrase that has been much talked about in recent years [1]. It is more than a design style, more than an approach to the design of Application-Specific Integrated Circuits (ASICs), more than a methodology. Rather, SoC represents a major revolution in IC design — a revolution enabled by advances in process technology that allow the integration of all or most of the major components and subsystems of an electronic product onto a single chip, or integrated chipset [2]. This revolution in design has been embraced by many designers of complex chips, as the performance, power consumption, cost, and size advantages of using the highest available level of integration have proven to be extremely important for many designs. In fact, the design and use of SoCs is arguably one of the key problems in designing real-time embedded systems.

The move to SoC began sometime in the mid-1990s. At that point, the leading CMOS-based semiconductor process technologies of 0.35 and 0.25 µm were sufficiently capable of allowing the integration of many of the major components of a second-generation wireless handset or a digital set-top box onto a
single chip. The digital baseband functions of a cell phone — a Digital Signal Processor (DSP), hardware (HW) support for voice encoding and decoding, and a RISC processor — could all be placed onto a single die. Although such a baseband SoC was far from the complete cell phone electronics — there were major components such as the RF transceiver, the analog power control, analog baseband, and passives that were not integrated — the evolutionary path with each new process generation, to integrate more and more onto a single die, was clear. Today’s chipset would become tomorrow’s chip. The problems of integrating the hybrid technologies involved in making up a complete electronic system would be solved. Thus, eventually, SoC could encompass design components drawn from the standard and more adventurous domains of digital, analog, RF, reconfigurable logic, sensors, actuators, optical, chemical, microelectromechanical systems, and even biological and nanotechnology. With this viewpoint of continued process evolution leading to ever-increasing levels of integration into ever-more-complex SoC devices, the issue of a SoC being a single chip at any particular point in time is somewhat moot. Rather, the word “system” in System-on-Chip is more important than “chip.” What is most important about a SoC — whether packaged as a single chip, an integrated chipset, a System-in-Package (SiP), or a System-on-Package (SoP) — is that it is designed as an integrated system, making design trade-offs across the processing domains and across the individual chip and package boundaries.
18.2 System-on-a-Chip

Let us define a SoC as a complex integrated circuit, or integrated chipset, which combines the major functional elements or subsystems of a complete end product into a single entity. These days, all interesting SoC designs include at least one programmable processor, and very often a combination of at least one RISC control processor and one DSP. They also include on-chip communications structures — processor bus(es), peripheral bus(es), and perhaps a high-speed system bus. A hierarchy of on-chip memory units, and links to off-chip memory, are especially important for the SoC processors (caches, main memories, and very often separate instruction and data caches are included). For most signal processing applications, some degree of HW-based acceleration is provided through dedicated functional units, offering higher performance and lower energy consumption. For interfacing to the external, real world, SoCs include a number of peripheral processing blocks, and owing to the analog nature of the real world, these may include analog components as well as digital interfaces (e.g., to system buses at a higher packaging level). Although there is much interesting research in incorporating MEMS-based sensors and actuators, and in SoC applications incorporating chemical processing (lab-on-a-chip), these are, with rare exceptions, research topics only. However, future SoCs of a commercial nature may include such subsystems as well as optical communications interfaces. Figure 18.1 illustrates what a typical SoC might contain for consumer applications.
One key point about SoCs that is often forgotten by those approaching them from a HW-oriented perspective is that all interesting SoC designs encompass both hardware (HW) and software (SW) components — that is, programmable processors, Real-Time Operating Systems (RTOSs), and other aspects of HW-dependent SW such as peripheral device drivers, as well as middleware stacks for particular application domains, and possibly optimized assembly code for DSPs. Thus, the design and use of SoCs cannot remain a HW-only concern — it involves aspects of system-level design and engineering, HW–SW trade-off and partitioning decisions, and SW architecture, design, and implementation.
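To make the composition rules above concrete (at least one programmable processor plus on-chip communication structures, with both HW and SW components), here is a minimal sketch of how an SoC's component inventory might be modeled. The data model and all block names are invented for illustration; they are not drawn from the handbook:

```python
from dataclasses import dataclass, field

@dataclass
class Core:
    name: str
    kind: str              # e.g., "risc", "dsp", "accelerator"
    programmable: bool = False

@dataclass
class SocSpec:
    name: str
    cores: list = field(default_factory=list)
    buses: list = field(default_factory=list)
    memories: list = field(default_factory=list)

    def is_interesting_soc(self) -> bool:
        """Per the text: an 'interesting' SoC has at least one
        programmable processor and on-chip communication structures."""
        return any(c.programmable for c in self.cores) and bool(self.buses)

soc = SocSpec(
    name="consumer_soc",
    cores=[Core("risc_ctrl", "risc", programmable=True),
           Core("dsp0", "dsp", programmable=True),
           Core("mpeg_decode", "accelerator")],
    buses=["system_bus", "peripheral_bus"],
    memories=["icache", "dcache", "on_chip_ram", "flash_if"],
)
print(soc.is_interesting_soc())  # True
```

A checker like this is a toy, but it mirrors the chapter's definition: a bag of accelerators without a programmable processor or an on-chip bus would not qualify.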
18.3 System-on-a-Programmable-Chip

Recently, attention in the SoC world has begun to expand from SoC implementations using custom, ASIC, or Application-Specific Standard Part (ASSP) design approaches to include the design and use of complex reconfigurable logic parts with embedded processors and other application-oriented blocks of intellectual property. These complex FPGAs (Field-Programmable Gate Arrays) are offered by several vendors, including Xilinx (Virtex-II PRO Platform FPGA) and Altera (SOPC), but are referred to by
[FIGURE 18.1 A typical SoC device for consumer applications. The original figure shows a microprocessor and a DSP, each with instruction and data caches, RAM, and flash, connected via a system bus with DMA and external memory access, bridged to a peripheral bus carrying blocks such as an MPEG decoder, video interface, USB, audio codec, disk controller, PCI, 100Base-T, PLL, and test logic.]
several names: highly programmable SoCs, system-on-a-programmable-chip, embedded FPGAs. The key idea behind this approach to SoC is to combine large amounts of reconfigurable logic with embedded RISC processors (either custom laid-out, “hardened” blocks, or synthesizable processor cores), in order to allow very flexible and tailorable combinations of HW and SW processing to be applied to a particular design problem. Algorithms that consist of significant amounts of control logic, plus significant quantities of dataflow processing, can be partitioned between the control RISC processor (e.g., in the Xilinx Virtex-II PRO, a PowerPC processor) and reconfigurable logic offering HW acceleration. Although the resulting combination does not offer the highest performance, lowest energy consumption, or lowest cost in comparison with custom IC or ASIC/ASSP implementations of the same functionality, it does offer tremendous flexibility in modifying the design in the field, and it avoids expensive Non-Recurring Engineering (NRE) charges. Thus, new applications, interfaces, and improved algorithms can be downloaded to products working in the field. Products in this area also include other processing and interface cores, such as Multiply-Accumulate (MAC) blocks, which are specifically aimed at DSP-type dataflow signal and image processing applications, and high-speed serial interfaces for wired communications, such as SERDES (serializer/de-serializer) blocks. In this sense, system-on-a-programmable-chip SoCs are not exactly application-specific, but not completely generic either.
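The chapter gives no numbers for this control/dataflow partitioning, but its benefit can be sketched with Amdahl's law: if a fraction f of an algorithm's runtime is dataflow processing that the reconfigurable logic accelerates by a factor s, the overall speedup is 1 / ((1 - f) + f/s). The figures below are invented for illustration:

```python
def partition_speedup(dataflow_fraction: float, hw_accel_factor: float) -> float:
    """Amdahl's-law estimate of overall speedup when the dataflow part
    of an algorithm moves from the RISC core into reconfigurable logic,
    while the control logic stays in software at its original speed."""
    sw_part = 1.0 - dataflow_fraction
    return 1.0 / (sw_part + dataflow_fraction / hw_accel_factor)

# Illustrative (invented) numbers: 80% of cycles are dataflow, and the
# FPGA fabric runs that portion 20x faster than the processor.
print(round(partition_speedup(0.8, 20.0), 2))  # 4.17
```

The sketch makes the familiar point that the residual software fraction dominates: even an infinitely fast fabric cannot push this example past 5x.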
It remains to be seen whether system-on-a-programmable-chip SoCs will become a successful way of delivering high-volume consumer applications, or will end up restricted to the two main applications for high-end FPGAs: rapid prototyping of designs which will be re-targeted to ASIC or ASSP implementations, and use in high-end, relatively expensive parts of the communications infrastructure that require in-field
flexibility and can tolerate the trade-offs in cost, energy consumption, and performance. Certainly, the use of synthesizable processors on more moderate FPGAs to realize SoC-style designs is one alternative to the cost issue. Intermediate forms, such as the use of metal-programmable gate-array-style logic fabrics together with hard-core processor subsystems and other cores, as offered in the “Structured ASIC” products of LSI Logic (RapidChip) and NEC (Instant Silicon Solutions Platform), sit between the full-mask ASIC and ASSP approach and the field-programmable gate array approach. Here the trade-offs are much slower design creation (a few weeks rather than a day or so), higher NRE than FPGA (but much lower than a full set of masks), and better cost, performance, and energy consumption than FPGA (though perhaps 15 to 30% worse than an ASIC approach). Further interesting compromise or hybrid approaches, such as ASIC/ASSP with on-chip FPGA regions, are also emerging to give design teams more choices.
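The cost trade-off among FPGA, structured ASIC, and full-mask ASIC implementations is essentially NRE versus unit cost: total cost = NRE + unit cost × volume, so the option with higher NRE wins only above a break-even volume. A hedged sketch of that calculation, with all dollar figures invented for illustration:

```python
def total_cost(nre: float, unit_cost: float, volume: int) -> float:
    """Total cost of producing `volume` parts: one-time NRE plus per-unit cost."""
    return nre + unit_cost * volume

def breakeven_volume(nre_a: float, unit_a: float,
                     nre_b: float, unit_b: float) -> float:
    """Volume above which option B (higher NRE, lower unit cost)
    becomes cheaper than option A."""
    if unit_a <= unit_b:
        raise ValueError("option B must have the lower unit cost")
    return (nre_b - nre_a) / (unit_a - unit_b)

# Invented numbers: FPGA (no NRE, $40/unit) versus a structured ASIC
# ($100k NRE, $10/unit).
v = breakeven_volume(0.0, 40.0, 100_000.0, 10.0)
print(int(v))  # 3333
```

Under these made-up figures, the structured ASIC pays for its NRE after a few thousand units; a full-mask ASIC, with far higher NRE but lower unit cost, would shift its break-even point to much higher volumes.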
18.4 IP Cores

The design of SoCs would not be possible if every design started from scratch. In fact, SoC design depends heavily on the reuse of Intellectual Property blocks — what are called “IP cores.” IP reuse has emerged as a strong trend over the last 8 to 9 years [3] and has been one key element in closing what the International Technology Roadmap for Semiconductors [4] calls the “design productivity gap” — the difference between the rate of increase in complexity offered by advancing semiconductor process technology and the rate of increase in designer productivity offered by advances in design tools and methodologies. But reuse is not just a way of enhancing designer productivity — although it has a dramatic impact on that. It also provides a mechanism for design teams to create SoC products that span multiple design disciplines and domains. The availability of both hard (laid-out and characterized) and soft (synthesizable) processor cores from a number of processor IP vendors allows design teams who would not be able to design their own processor from scratch to drop such cores into their designs, and thus add RISC control and DSP functionality to an integrated SoC without having to master the art of processor design within the team. In this sense, the advantages of IP reuse go beyond productivity: it offers both a large reduction in design risk and a way to carry out SoC designs that would otherwise be infeasible owing to the time it would take to acquire expertise and design the IP from scratch. This ability, when acquiring and reusing IP cores, to obtain in prepackaged form design domain expertise outside one’s own team’s set of core competencies is a key requirement for the evolution of SoC design going forward. SoC design up to this point has concentrated largely on integrating digital components, perhaps with some analog interface blocks treated as black boxes.
The hybrid SoCs of the future, incorporating domains unfamiliar to the integration team, such as RF or MEMS, require the concept of “drop-in” IP to be extended to these new domains. We are not yet at that state — considerable evolution in the IP business and in the methodologies of IP creation, qualification, evaluation, integration, and verification is required before we will be able to easily specify and integrate truly heterogeneous sets of disparate IP blocks into a complete hybrid SoC. However, the same issues existed at the beginning of the SoC revolution in the digital domain. They have been solved to a large extent through the creation of standards for IP creation, evaluation, exchange, and integration — primarily for digital IP blocks, but extending also to Analog/Mixed-Signal (AMS) cores. Among the leading organizations in the identification and creation of such standards has been the Virtual Socket Interface Alliance (VSIA) [5], formed in 1996 and having at its peak more than 200 IP, systems, semiconductor, and Electronic Design Automation (EDA) corporate members. Although often criticized over the years for a lack of formal and acknowledged adoption of its IP standards, VSIA has had a more subtle influence on the electronics industry. Many companies instituting reuse programs internally, many IP, systems, and semiconductor companies engaged in IP creation and exchange, and many design groups have used VSIA IP standards as a key starting point for developing their own standards and methods for IP-based design. In this sense, use of VSIA outputs has enabled a kind of IP reuse in the IP business itself.
VSIA, for example, in its early architectural documents of 1996 to 1997, helped define the strong industry-adopted understanding of what it means for an IP block to be in “hard” or “soft” form. Other important contributions include the widely read system-level design model taxonomy created by one of its working groups. Its standards, specifications, and documents thus represent a very useful resource for the industry [6]. Also important for the rise of IP-based design, and for the emergence of a third-party IP industry (which has taken much longer to develop than originally hoped in the mid-1990s), are the business issues surrounding IP evaluation, purchase, delivery, and use. Organizations such as the Virtual Component Exchange (VCX) [7] emerged to address these issues and provide solutions. Although the VCX is still in existence, it is clear that the vast majority of IP business relationships between firms occur within a more ad hoc supplier-to-customer business framework.
18.5 Virtual Components

The VSIA has had a strong influence on the nomenclature of the SoC- and IP-based design industry. The concept of the “virtual socket” — a description of all the design interfaces which an IP core must satisfy, together with the design models and integration information which must be provided with it, so that it can be more easily integrated or “dropped into” an SoC design — comes from Printed Circuit Board (PCB) design, where components are sourced and purchased in prepackaged form and can be dropped into a board design in a standardized way. The dual of the “virtual socket” then becomes the “virtual component.” Not only in the VSIA context, but also more generally in the industry, an IP core represents a design block which might be reusable; a virtual component represents a design block that is intended for reuse, and which has been developed and qualified to be highly reusable. The things that separate IP cores from virtual components are, in general:

• Virtual components conform in their development and verification processes to well-established design processes and quality standards.
• Virtual components come with design data, models, associated design files, scripts, characterization information, and other deliverables which conform to one or another well-accepted standard for IP reuse — for example, the VSIA deliverables, or another internal or external set of standards.
• Virtual components in general should have been fabricated at least once, and characterized post-fabrication to validate their claims.
• Virtual components should have been reused at least once by an external design team, and usage reports and feedback should be available.
• Virtual components should have been rated for quality using an industry-standard quality metric such as OpenMORE (originated by Synopsys and Mentor Graphics) or the VSI Quality standard (which has OpenMORE as one of its inputs).
To a large extent, the developments over the last decade in IP reuse have been focused on defining the standards and processes to turn the ad hoc reuse of IP cores into a well-understood and reliable process for acquiring and reusing virtual components — thus enhancing the analogy with PCB design.
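The five criteria above amount to a qualification checklist. As an illustrative sketch only (the field names are invented, and no such tool is described in the chapter), an audit of an IP core against that checklist might look like:

```python
# The five criteria from the text, as a checklist; field names invented.
VC_CRITERIA = [
    "follows_quality_standards",   # established design/verification process
    "standard_deliverables",       # data, models, scripts per e.g. VSIA
    "silicon_proven",              # fabricated and characterized at least once
    "externally_reused",           # reused by an external team, with feedback
    "quality_rated",               # e.g., an OpenMORE / VSI Quality score
]

def audit_virtual_component(block: dict) -> list:
    """Return the criteria an IP core still fails before it can
    reasonably be called a virtual component."""
    return [c for c in VC_CRITERIA if not block.get(c, False)]

ip_core = {"follows_quality_standards": True,
           "standard_deliverables": True,
           "silicon_proven": False}
print(audit_virtual_component(ip_core))
# ['silicon_proven', 'externally_reused', 'quality_rated']
```

The point of the sketch is simply that "virtual component" is a checkable status an IP core earns, not a synonym for "IP core."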
18.6 Platforms and Programmable Platforms

The emphasis in the preceding sections has been on IP (or virtual component) reuse on a somewhat ad hoc, block-by-block basis in SoC design. Over the past several years, however, a more integrated approach to the design of complex SoCs and the reuse of virtual components has arisen — what has been called “platform-based design.” This will be dealt with at much greater length in another chapter in this book, and much more information is available in References 8–11. Suffice it here to define platform-based design in the SoC context from one perspective.
We can define platform-based design as a planned design methodology which reduces the time, effort, and risk involved in designing and verifying a complex SoC. This is accomplished by extensive reuse of combinations of HW and SW IP. As an alternative to IP reuse in a block-by-block manner, platform-based design assembles groups of components into a reusable platform architecture. This reusable architecture, together with libraries of preverified and precharacterized, application-oriented HW and SW virtual components, is a SoC integration platform. There are several reasons for the growing popularity of the platform approach in industrial design. These include the increase in design productivity, the reduction in risk, the ability to utilize preintegrated virtual components from other design domains more easily, and the ability to reuse SoC architectures created by experts. Industrial platforms include full application platforms, reconfigurable platforms, and processor-centric platforms [12]. Full application platforms, such as Philips Nexperia and TI OMAP, provide a complete implementation vehicle for specific product domains [13]. Processor-centric platforms, such as ARM PrimeXsys, concentrate on the processor, its required bus architecture, and basic sets of peripherals, along with an RTOS and basic SW drivers. Reconfigurable or “highly programmable” platforms, such as the Xilinx Platform FPGA and Altera’s SOPC, deliver hard-core processors plus reconfigurable logic, along with associated IP libraries and design tool flows.
18.7 Integration Platforms and SoC Design

The use of SoC integration platforms changes the SoC design process in two fundamental ways:

1. The basic platform must be designed, using whatever ad hoc or formalized SoC design process the platform creators decide on. Section 18.8 outlines some of the basic steps required to build a SoC, whether building a platform or using a block-based, more ad hoc integration process. However, when constructing a SoC platform for reuse in derivative design, it is important to remember that it may not be necessary to take the whole platform and its associated HW and SW component libraries through complete implementation. Enough implementation must be done to allow the platform and its constituent libraries to be fully characterized and modeled for reuse. It is also essential that the platform creation phase produce, in an archivable and retrievable form, all the design files required for the platform and its libraries to be reused in a derivative design process. This must also include the setup of the appropriate configuration programs or scripts to allow automatic creation of a configured platform during derivative design.

2. A design process must be created and qualified for all the derivative designs which will be created based on the SoC integration platform. This must include processes for retrieving the platform from its archive, for entering the derivative design configuration into a platform configurator, for generating the design files for the derivative, and for generating the appropriate verification environment(s) for the derivative; it must also give derivative design teams the ability to select components from libraries, to modify these components and validate them within the overall platform context, and, to the extent supported by the platform, to create new components for their particular application.
Reconfigurable or highly programmable platforms introduce an interesting addition to the platform-based SoC design process [14]. Platform FPGAs and SOPC devices can be thought of as a "meta-platform": a platform for creating platforms. Design teams can obtain these devices from companies such as Xilinx and Altera, containing a basic set of more generic capabilities and IP: embedded processors, on-chip buses, special IP blocks such as MACs and SERDES, and a variety of other prequalified IP blocks. They can then customize the meta-platform to their own application space by adding application domain-specific IP libraries. Finally, the combined platform can be provided to derivative design teams, who can select the basic meta-platform and configure it within the scope intended by the intermediate platform creation team, selecting the IP blocks needed for their exact derivative application. More on platform-based design can be found in another chapter of this book.
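The "platform for creating platforms" idea can be sketched as scope narrowing: the intermediate team takes the generic meta-platform and restricts it to a domain-specific platform for derivative teams. All device names, catalog entries, and resource numbers below are invented for illustration; they do not describe any actual Xilinx or Altera product.

```python
# Illustrative meta-platform: a hard processor core, programmable fabric,
# and a generic catalog of prequalified IP blocks.
meta_platform = {
    "hard_cpu": "embedded RISC core",
    "fabric_luts": 50_000,
    "ip_catalog": {"mac", "serdes", "uart", "dsp48", "video_in"},
}

def specialize(meta, domain_ip, reserved_luts):
    """Create a domain-specific platform: keep the hard core, reserve
    fabric for the domain IP, and restrict the catalog to the blocks the
    intermediate team has qualified for its application space."""
    assert domain_ip <= meta["ip_catalog"], "domain IP must come from the catalog"
    return {
        "hard_cpu": meta["hard_cpu"],
        "fabric_luts": meta["fabric_luts"] - reserved_luts,
        "ip_catalog": set(domain_ip),
    }

# A networking team narrows the meta-platform to its domain.
net_platform = specialize(meta_platform, {"mac", "serdes"}, reserved_luts=8_000)
```

Derivative teams then configure `net_platform` exactly as they would any other platform, but within the tighter scope the intermediate team intended.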
© 2006 by Taylor & Francis Group, LLC
System-on-Chip and Network-on-Chip Design
18.8 Overview of the SoC Design Process

The most important thing to remember about SoC design is that it is a multidisciplinary design process that draws on design processes from across the spectrum of electronics. Design teams must gain some fluency with all these disciplines, but the integrative and reuse nature of SoC design means that they may not need to become deep experts in all of them. Indeed, avoiding the need for designers to understand all methodologies, flows, and domain-specific design techniques is one of the key motivations for reuse and enablers of productivity. Nevertheless, from Design-for-Test (DFT) through digital and analog HW design, from verification through system-level design, from embedded SW through IP procurement and integration, from SoC architecture through IC analysis, a wide variety of knowledge is required by the team, if not by every designer. Figure 18.2 illustrates some of the basic constituents of the SoC design process. We will now define each of these steps as illustrated:

SoC requirements analysis. This is the basic step for defining and specifying a complex SoC, based on the needs of the end product into which it will be integrated. The primary input into this step is the marketing definition of the end product and the resulting characteristics of what the SoC should be, both functional and nonfunctional (e.g., cost, size, energy consumption, performance: latency and throughput, package selection). This process of requirements analysis must ultimately answer the question: is the product feasible? Is the desired SoC feasible to design, and with what effort and in what timeframe? How much
[Figure 18.2 lists the steps in the SoC design process: SoC requirements analysis; SoC architecture (communications architecture, choice of processor(s)); system-level design (HW–SW partitioning, system modeling, performance analysis); acquisition of HW and SW IP; building the transaction-level golden testbench; defining the SW architecture; configuring and floorplanning the SoC HW microarchitecture; DFT architecture and implementation; HW IP assembly and implementation; SW assembly and implementation; AMS HW implementation; final SoC HW assembly and verification; and fabrication, testing, packaging, and lab verification with SW. HW and HW–SW verification runs alongside these steps.]

FIGURE 18.2 Steps in the SoC design process.
reuse will be possible? Is the SoC design based on legacy designs of previous-generation products (or, in the case of platform-based design, is it to be built on an existing platform offering)?

SoC architecture. In this phase, the basic structure of the desired SoC is defined. It is vitally important to decide on the communications architecture that will be used as the backbone of the SoC on-chip communications network. An inadequate communications architecture will cripple the SoC and have as big an impact as the use of an inappropriate processor subsystem. Of course, the choice of communications architecture cannot be divorced from the basic processor choice(s) — for example, do I use a RISC control processor? Do I have an on-board DSP? How many of each? What are the processing demands of my SoC application? Do I integrate the bare processor core, or use a whole processor subsystem provided by an IP company (most processor IP companies have moved from offering just processor cores to whole processor subsystems, including hierarchical bus fabrics tuned to their particular processor's needs)? Do I have some ideas, based on legacy SoC design in this space, as to how SW and HW should be partitioned? What memory hierarchy is appropriate? What are the sizes, levels, performance requirements, and configurations of the embedded memories most appropriate to the application domain for the SoC?

System-level design. This is an important phase of the SoC process — but one that is often done in a relatively ad hoc way. The whiteboard and the spreadsheet are as much used by SoC architects as more capable toolsets. However, there has long been use of ad hoc C/C++ based models for the system design phase — to validate basic architectural choices.
And designers of complex signal processing algorithms for voice and image processing have long adopted dataflow models and associated tools to define their algorithms, determine optimal bit-widths, and validate performance, whether destined for HW or SW implementation. A flurry of activity in the last few years on different C/C++ modeling standards for system architects has consolidated on SystemC [15]. The system nature of SoC demands a growing use of system-level design modeling and analysis as these devices grow more complex. The basic processes carried out in this phase include HW–SW partitioning: the allocation of functions to be implemented in dedicated HW blocks, in SW on processors (and the decision of RISC versus DSP), or in a combination of both, together with decisions on the communications mechanisms used to interface HW and SW, or HW–HW and SW–SW. In addition, the construction of system-level models, and the analysis of correct functioning, performance, and other nonfunctional attributes of the intended SoC through simulation and other analytical tools, is necessary. Finally, all additional IP blocks required, which can be sourced outside or reused from the design group's legacy, must be identified — both HW and SW. The remaining new functions will need to be implemented as part of the overall SoC design process.

IP acquisition. After system-level design and the identification of the processors, communications architecture, and other HW or SW IP required for the design, the group must undertake an IP acquisition stage. This can, to a large extent, be done in parallel with other work such as system-level design (assuming early identification of major external IP is made) or building golden transaction-level testbench models.
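The HW–SW partitioning step described above can be given a concrete, if toy, flavor: a simple cost model that moves the tasks with the best time saved per unit of silicon area into dedicated HW until an area budget is exhausted. The task names, timings, and areas are entirely invented for illustration; real partitioning weighs many more attributes (energy, bus bandwidth, legacy constraints) than this sketch does.

```python
# Per-invocation time (ms) of each task in SW versus a dedicated HW block,
# and the silicon area a HW implementation would cost (arbitrary units).
tasks = {
    "fft":    {"sw_ms": 4.0, "hw_ms": 0.5, "area": 3.0},
    "parser": {"sw_ms": 0.6, "hw_ms": 0.4, "area": 2.0},
    "codec":  {"sw_ms": 6.0, "hw_ms": 1.0, "area": 4.0},
}

def partition(tasks, area_budget):
    """Greedy HW–SW partitioning: pick tasks by time saved per unit area."""
    hw, left = [], area_budget
    gain = lambda t: (tasks[t]["sw_ms"] - tasks[t]["hw_ms"]) / tasks[t]["area"]
    for name in sorted(tasks, key=gain, reverse=True):
        if tasks[name]["area"] <= left:
            hw.append(name)
            left -= tasks[name]["area"]
    sw = [t for t in tasks if t not in hw]
    return sorted(hw), sorted(sw)

hw, sw = partition(tasks, area_budget=7.0)
print(hw, sw)  # ['codec', 'fft'] ['parser']
```

Even a crude model like this makes the trade-off explicit: the codec and FFT earn their silicon, while the parser's modest speedup does not justify a dedicated block.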
Fortunate design groups will be working in companies with a large legacy of existing, well-crafted IP (rather, "virtual components") organized in databases which can be easily searched; or those with access via supplier agreements to large external IP libraries; or at least those with experience in IP search, evaluation, purchase, and integration. For these lucky groups, the problems at this stage are greatly ameliorated. Others with less experience or infrastructure will need to explore these processes for the first time, hopefully making use of IP suppliers' experience with the legal and other processes required. Here the external standards bodies such as VSIA and VCX have done much useful work that will smooth the path, at least a little. One key issue in IP acquisition is to conduct rigorous and thorough incoming inspection of IP to ensure its completeness and correctness to the greatest extent possible prior to use, and to resolve any quality problems with suppliers early — long before SoC integration. Every hour spent at this stage will pay back by avoiding much longer schedule slips later. The IP quality guidelines discussed earlier are a foundation for a quality process at this point.

Build a transaction-level golden testbench. The system model built up during the system-level design stage can form the basis for a more elaborated design model, using "transaction-level abstractions" [16], which represents the underlying HW–SW architecture and components in more detail — sufficient detail to act as a functional virtual prototype for the SoC design. This golden model can be used at this stage to
verify the microarchitecture of the design and to verify detailed design models for HW IP at the Hardware Description Language (HDL) level within the overall system context. It can thus be reused all the way down the SoC design and implementation cycle.

Define the SoC SW architecture. SoC is of course not just about HW [17]. Along with often defining the right on-chip communications architecture, the choice of processor(s) and the nature of the application domain have a very heavy influence on the SW architecture. For example, RTOS choice is limited by the available processor ports and by the application domain (OSEK is an RTOS for automotive systems; Symbian OS for portable wireless devices; PalmOS for personal digital assistants; etc.). Beyond the basic RTOS, every SoC peripheral device will need a device driver — hopefully based on reuse and configuration of templates; various middleware application stacks (e.g., telephony, multimedia image processing) are important parts of the SW architecture; and voice and image encoding and decoding on portable devices is often based on assembly-code IP for DSPs. There is thus a strong need, in defining the SoC, to fully elaborate the SW architecture to allow reuse, easy customization, and effective verification of the overall HW–SW device.

Configure and floorplan the SoC microarchitecture. At this point we are beginning to deal with the SoC on a more physical and detailed logical basis. Of course, during high-level architecture and system-level design, the team has been looking at physical implementation issues (although our design process diagram shows everything as a waterfall kind of flow, in reality SoC design, like all electronics design, is more of an iterative, incremental process — that is, more akin to the famous "spiral" model for SW).
But before beginning detailed HW design and integration, it is important that the team agree on the basic physical floorplan; that all the IP blocks are properly and fully configured; that the basic microarchitectures (test, power, clocking, bus, timing) have been fully defined and configured; and that HW implementation can proceed. In addition, this process should also generate the downstream verification environments which will be used throughout the implementation processes — whether SW-simulation based, emulation based, using rapid prototypes, or other hybrid verification approaches.

DFT architecture and implementation. The test architecture is only one of the key microarchitectures which must be implemented; it is complicated by IP legacy and the fact that it is often impossible to impose one DFT style (such as BIST or scan) on all IP blocks. Rather, wrappers or adaptations of standard test interfaces (such as JTAG ports) may be necessary to fit all IP blocks together into a coherent test architecture and plan.

AMS HW implementation. Most SoCs incorporating AMS blocks use them to interface to the external world. VSIA, among other groups, has done considerable work in defining how AMS IP blocks should be created to allow them to be more easily integrated into mainly digital SoCs (the "Big D/little a" SoC), along with guidelines and rules for such integration. Experiences with these rules, guidelines, and extra deliverables have been, on the whole, promising — but today they have more impact between internal design groups than on the industry as a whole. The "Big A/Big D" mixed-signal SoC is still relatively rare.

HW IP assembly and integration. This design step is in many ways the most traditional. Many design groups have experience in assembling design blocks created by various designers or subgroups, in an incremental fashion, into the agreed-upon architectures for communications, bussing, clocking, power, etc.
The main difference with SoC is that many of the design blocks may be externally sourced IP. To avoid difficulties at this stage, the importance of rigorous qualification of incoming IP and of early definition of the SoC microarchitecture, to which all blocks must conform, cannot be overstated.

SW assembly and implementation. Just as with HW, the SW IP, together with new or modified SW tasks created for the particular SoC under design, must be assembled and validated for conformance to interfaces and expected operational quality. It is important to verify as much of the SW in its normal system operating context as possible.

HW and HW–SW verification. Although represented as a single box on the diagram, this is perhaps one of the largest consumers of design time and effort and the major determinant of final SoC quality. Vital to effective verification is the setup of a targeted SoC verification environment, reusing the golden testbench models created at higher levels of the design process. In addition, highly capable, multi-language, mixed simulation environments are important (e.g., SystemC models and HDL implementation models need to
be mixed in the verification process, and effective links between them are crucial). There are a large number of different verification tools and techniques [18], ranging from SW-based simulation environments to HW emulators, HW accelerators, and FPGA- and bonded-core-based rapid prototyping approaches. In addition, formal techniques such as equivalence checking and model/property checking have enjoyed some successful usage in verifying parts of SoC designs, or the design at multiple stages in the process. Mixed approaches to HW–SW verification range from incorporating Instruction Set Simulators (ISSs) of processors in SW-based simulation, to linking HW emulation of the HW blocks (compiled from the HDL code) to SW running natively on a host workstation, linked in an ad hoc fashion by design teams or using a commercial mixed verification environment. Alternatively, HDL models of new HW blocks running in a SW simulator can be linked to emulation of the rest of the system running in HW — a mix of emulation and use of bonded-out processor cores for executing SW. It is important that as much of the system SW as possible be exercised in the context of the whole system, using the most appropriate verification technology that can get the design team close to real-time execution speed (running no more than about 100× slower than real time is the minimum needed to exercise significant amounts of SW). The trend toward transaction-based modeling of systems — where transactions range in abstraction from untimed functional communications via message calls, through abstract bus communications models, through cycle-accurate bus functional models, and finally to cycle- and pin-accurate transformations of transactions onto the fully detailed interfaces — allows verification to occur at several levels, or with mixed levels of design description.
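The idea of reusing one golden functional model across transaction abstraction levels can be sketched in a few lines. This is plain Python standing in for what a real flow would do in SystemC; the `Memory` and `TimedBus` classes and the fixed cycle cost are invented for the example.

```python
# Golden functional model: an untimed memory exercised by transactions.
class Memory:
    def __init__(self):
        self.data = {}
    def write(self, addr, value):
        self.data[addr] = value
    def read(self, addr):
        return self.data.get(addr, 0)

class TimedBus:
    """A lower abstraction level wrapping the same functional model:
    behavior is unchanged, but each transaction now accrues bus cycles,
    making timing observable for performance analysis."""
    def __init__(self, target, cycles_per_txn=4):
        self.target, self.cost, self.cycles = target, cycles_per_txn, 0
    def write(self, addr, value):
        self.cycles += self.cost
        self.target.write(addr, value)
    def read(self, addr):
        self.cycles += self.cost
        return self.target.read(addr)

mem = Memory()
bus = TimedBus(mem)
bus.write(0x10, 42)
assert bus.read(0x10) == 42   # same function at both abstraction levels
assert bus.cycles == 8        # but timing is now part of the model
```

The same testbench stimulus can drive either interface, which is what lets the golden model follow the design down the implementation cycle.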
Finally, a new trend in verification is assertion-based verification, using a variety of input languages (PSL/Sugar, e, Vera, or regular Verilog and VHDL) to model design properties, which can then be monitored during simulation to ensure that certain properties are satisfied or that certain error conditions never occur. Combinations of formal property checking and simulation-based assertion checking have been created — viz. "semiformal verification." The most important thing to remember about verification is that, armed with a host of techniques and tools, design teams must craft a well-ordered verification process that allows them to definitively answer the question "how do we know that verification is done?" and thus allows the SoC to be fabricated.

Final SoC HW assembly and verification. Often done in parallel with, or overlapping, "those final few simulation runs" in the verification stage, the final SoC HW assembly and verification phase includes final place and route of the chip, any hand modifications required, and final physical verification (using design rule checking and layout-versus-schematic [netlist] tools), as well as important analysis steps for issues which occur in advanced semiconductor processes, such as IR drop, signal integrity, and power network integrity, together with analysis and design transformation for manufacturability (OPC, etc.).

Fabrication, testing, packaging, and lab verification. When a SoC has been shipped to fabrication, it would seem time for the design team to relax. Instead, this is an opportunity for additional verification to be carried out — especially more verification of system SW running in the context of the HW design — and for fixes, either of the SW or of the SoC HW (in hopefully no more than one expensive iteration of the design), to be determined and planned.
When the tested packaged parts arrive back for verification in the lab, the ideal scenario is to load the SW into the system and have the SoC and its system booted up and running SW within a few hours. Interestingly, the most advanced SoC design teams, with well-ordered design methodologies and processes, are able to achieve this quite regularly.
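The assertion-based verification mentioned above can be illustrated with a monitor checked against a simulation trace. The request/acknowledge protocol, signal names, and wait bound here are invented for the example; a real flow would express such properties in PSL or SystemVerilog Assertions and monitor them during HDL simulation.

```python
def check_req_ack(trace, max_wait=3):
    """Property: every 'req' must be answered by an 'ack' within max_wait
    cycles, and 'ack' must never appear without a pending request.
    trace is a list of per-cycle signal dicts."""
    pending = None
    for cycle, signals in enumerate(trace):
        if signals.get("ack"):
            if pending is None:
                return False, f"cycle {cycle}: ack without request"
            pending = None
        if signals.get("req") and pending is None:
            pending = cycle
        if pending is not None and cycle - pending > max_wait:
            return False, f"cycle {cycle}: request starved"
    return True, "property holds"

good = [{"req": 1}, {}, {"ack": 1}, {}]
bad = [{"req": 1}, {}, {}, {}, {}]   # the ack never arrives
assert check_req_ack(good)[0]
assert not check_req_ack(bad)[0]
```

The monitor embodies both flavors the text mentions: a liveness-style property that must eventually be satisfied, and an error condition (a spurious ack) that must never occur.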
18.9 System-Level Design

As discussed earlier when describing the overall SoC design flow, system-level design and SoC are essentially made for each other. A key aim of IP reuse and of SoC techniques such as platform-based design is to make the "back-end" (RTL to GDS II) design implementation processes easier — fast and low-risk — and to shift the major design phase for SoC up in time and in abstraction level to the system level. This also means that the back-end tools and flows for SoC designs do not necessarily differ from those used for complex ASIC, ASSP, and custom IC design; it is the methodology overlaying the underlying design tools and flows (how they are used, and how blocks are sourced and integrated) that may differ for SoC. However, the fundamental IP-based nature of SoC design has a stronger influence on the system level.
It is at the system level that the vital tasks of deciding on and validating the basic system architecture and choice of IP blocks are carried out. In general, this is known as "design space exploration" (DSE). As part of this exploration, SoC platform customization for a particular derivative is carried out, should the SoC platform approach be used. Essentially, one can think of platform DSE as a task similar to general DSE, except that the scope and boundaries of the exploration are much more tightly constrained — the basic communications architecture and platform processor choices may be fixed, and the design team may be restricted to choosing certain customization parameters and optional IP from a library. Other tasks include HW–SW partitioning, usually restricted to decisions about key processing tasks which might be mapped onto either HW or SW and which have a big impact on system performance, energy consumption, on-chip communications bandwidth consumption, or other key attributes. Of course, in multiprocessor systems there are "SW–SW" partitioning or codesign issues as well: deciding on the assignment of SW tasks to the various processor options. Again, perhaps 80 to 95% of these decisions are or can be made a priori, especially if a SoC is based on either a platform or an evolution of an existing system; such codesign decisions are usually made on a small number of functions which have critical impact. Because partitioning, codesign, and DSE tasks at the system level involve much more than HW–SW issues, a more appropriate term for this is "function-architecture codesign" [19,20]. In this codesign model, systems are described on two equivalent levels:

• The functional intent of the system — for example, a network of applications, decomposed into individual sets of functional tasks which may be modeled using a variety of models of computation such as discrete event, finite state machine, or dataflow.
• The architectural structure of the system — the communications architecture and major IP blocks such as processor(s), memory(ies), and HW blocks — captured or modeled, for example, using some kind of IP or platform configurator.

The methodology implied in this approach is then to build explicit mappings between the functional view of the system and the architectural view, which carry within them the implicit partitioning that is made for both computation and communications. This hybrid model can then be simulated, the results analyzed, and a variety of ancillary models (e.g., cost, power, performance, communications bandwidth consumption) utilized in order to examine the suitability of the system architecture as a vehicle for realizing or implementing the end-product functionality. The function-architecture codesign approach has been implemented and used in both research and commercial tools [21] and forms the foundation of many system-level codesign approaches going forward. In addition, it has been found extremely suitable as the best system-level design approach for platform-based design of SoC [22].
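The explicit function-to-architecture mapping described above can be made concrete with a minimal sketch: a functional task set, an architectural resource set, a mapping between them, and an ancillary performance model evaluating the mapping. All the task names, workloads, and capacities are invented for illustration.

```python
# Functional view: tasks and their workload (arbitrary units per frame).
functions = {"ctrl": 2.0, "filter": 8.0, "ui": 1.0}

# Architectural view: resources and their capacity (units per frame).
resources = {"risc": 5.0, "dsp": 10.0}

# The explicit mapping carries the implicit computation partitioning.
mapping = {"ctrl": "risc", "ui": "risc", "filter": "dsp"}

def utilization(functions, resources, mapping):
    """Ancillary performance model: per-resource load relative to capacity.
    Values above 1.0 would flag an infeasible mapping."""
    load = {r: 0.0 for r in resources}
    for task, resource in mapping.items():
        load[resource] += functions[task]
    return {r: load[r] / resources[r] for r in resources}

util = utilization(functions, resources, mapping)
# risc carries ctrl + ui = 3.0 of 5.0; dsp carries filter = 8.0 of 10.0
assert util["risc"] == 0.6 and util["dsp"] == 0.8
```

Trying alternative mappings and rerunning the model is exactly the DSE loop: the functional description stays fixed while candidate architectures and mappings are evaluated.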
18.10 Interconnection and Communication Architectures for SoC

This topic is dealt with in more detail in other chapters in this book. Suffice it to say here that current SoC architectures use fairly traditional hierarchies of standard on-chip buses — for example, processor-specific buses, high-speed system buses, and lower-speed peripheral buses — using standards such as ARM's AMBA and IBM's CoreConnect [13] and traditional master–slave bus approaches. Recently, there has been a lot of interest in Network-on-Chip (NoC) communications architectures based on packet switching, and a number of approaches have been reported in the literature, but this remains primarily a research topic in both universities and industrial research labs [23].
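The master–slave bus approach mentioned above centers on arbitration: when several masters request the shared bus in the same cycle, an arbiter grants it to exactly one. The sketch below models the simplest fixed-priority scheme in Python; the master names and the scheme itself are illustrative, not taken from AMBA or CoreConnect, which define their own arbitration signaling.

```python
def arbitrate(requests, priority):
    """Fixed-priority bus arbitration for one cycle.
    requests: set of master names raising a request this cycle.
    priority: masters in descending priority order.
    Returns the granted master, or None if the bus is idle."""
    for master in priority:
        if master in requests:
            return master
    return None

priority = ["cpu", "dma", "lcd"]          # cpu always wins ties
assert arbitrate({"dma", "lcd"}, priority) == "dma"
assert arbitrate({"lcd"}, priority) == "lcd"
assert arbitrate(set(), priority) is None
```

Real on-chip buses refine this with round-robin or weighted schemes to prevent low-priority masters from starving, which is one of the communications-architecture choices the SoC architect must evaluate.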
18.11 Computation and Memory Architectures for SoC

The primary processors used in SoC are embedded RISCs such as ARM processors, PowerPCs, MIPS architecture processors, and some of the configurable processors designed specifically for SoC, such as
Tensilica and ARC. In addition, embedded DSPs from traditional suppliers such as TI, Motorola, ParthusCeva, and others are also quite common in many consumer applications, for embedded signal processing of voice and image data. Research groups have looked at compiling or synthesizing application-specific processors or coprocessors [24,25], and these have interesting potential in future SoCs, which may incorporate networks of heterogeneous configurable processors collaborating to offer large amounts of computational parallelism. This is an especially interesting prospect given wider use of reconfigurable logic, which opens up the prospect of dynamic adaptation of the SoC to application needs. However, most multiprocessor SoCs today involve at most two to four processors of conventional design; the larger networks are today more often found in the industrial or university lab. Although several years ago most embedded processors in early SoCs did not use cache-based memory hierarchies, this has changed significantly over the years, and most RISC and DSP processors now involve significant amounts of Level 1 cache memory, as well as higher-level memory units both on- and off-chip (off-chip flash memory is often used for embedded SW tasks which may be only infrequently required). System design tasks and tools must consider the structure, size, and configuration of the memory hierarchy as one of the key SoC configuration decisions that must be made.
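A back-of-the-envelope model often used for the memory-hierarchy sizing decision is average memory access time (AMAT): hit latency plus miss rate times miss penalty. The latencies and hit rates below are invented for illustration, not measurements, but the shape of the trade-off (a bigger cache raises the hit rate yet may add hit latency) is the one the SoC architect must weigh.

```python
def amat(hit_cycles, hit_rate, miss_penalty_cycles):
    """Average memory access time for one cache level backed by
    slower memory, in cycles."""
    return hit_cycles + (1.0 - hit_rate) * miss_penalty_cycles

# Two candidate L1 configurations in front of a 40-cycle backing store:
small = amat(1, 0.90, 40)   # 1 + 0.10 * 40, about 5.0 cycles
large = amat(2, 0.97, 40)   # 2 + 0.03 * 40, about 3.2 cycles

# Despite the extra cycle of hit latency, the larger cache wins here.
assert round(small, 9) == 5.0
assert large < small
```

The same formula nests for multi-level hierarchies by substituting the next level's AMAT for the miss penalty, which is how the levels-and-sizes question in the text can be explored quickly in a spreadsheet or script.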
18.12 IP Integration Quality and Certification Methods and Standards

We have emphasized the design reuse aspects of SoC and the need for reuse of both internally and externally sourced IP blocks by design teams creating SoCs. In the discussion of the design process above, we mentioned issues such as IP quality standards and the need for incoming inspection and qualification of IP. The issue of IP quality remains one of the biggest impediments to the use of IP-based design for SoC [26]. The quality standards and metrics available from VSIA and OpenMORE, and their further enhancement, help, but only to a limited extent. The industry could clearly use a formal certification body or lab for IP quality that would ensure conformance to IP transfer requirements and the integration quality of the blocks. Such a certification process would of necessity be quite complex, owing to the large number of configurations possible for many IP blocks and the almost infinite variety of SoC contexts into which they might be integrated. Certified IP would begin to deliver the "virtual components" of the VSIA vision. In the absence of formal external certification (and such third-party labs seem a long way off, if they ever emerge), design groups must provide their own certification processes and real reuse-quality metrics, based on their internal design experiences. Platform-based design methods help, owing to the advantages of prequalifying and characterizing groups of IP blocks and libraries of compatible domain-specific components. Short of independent evaluation and qualification, this is the best that design groups can do currently. One key issue to remember is that IP not created for reuse — with all the deliverables created and validated according to a well-defined set of standards — is inherently not reusable.
The effort required to make a reusable IP block has been estimated at 50 to 200% more than that required for a single use; however, even assuming the most conservative (highest) extra cost, this implies positive payback by the third use of the IP block. Planned and systematic IP reuse, with investment in those blocks with the greatest SoC use potential, gives a high chance of achieving significant productivity gains soon after starting a reuse program. But ad hoc attempts to reuse existing design blocks not designed to reuse standards have failed in the past and are unlikely to provide the quality and productivity desired.
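The payback arithmetic in the text can be made explicit. Taking the one-off design cost as 1 unit and the reuse premium as the stated 50 to 200%, a few lines show where the reusable block breaks even against redesigning from scratch each time (the model deliberately ignores per-use integration cost, which would shift break-even slightly later):

```python
def breakeven_uses(extra_cost_fraction):
    """Smallest number of uses at which a reusable block (one-off cost 1,
    plus a reuse premium paid once) costs no more than designing a
    single-use block of cost 1 each time."""
    reusable_cost = 1.0 + extra_cost_fraction
    uses = 1
    while reusable_cost > uses * 1.0:
        uses += 1
    return uses

# Most conservative premium (200%): break-even by the third use,
# matching the estimate in the text. At a 50% premium, by the second.
assert breakeven_uses(2.0) == 3
assert breakeven_uses(0.5) == 2
```

This is why the text stresses planned investment in the blocks with the greatest reuse potential: the premium is only recovered if the uses actually materialize.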
18.13 Summary

In this chapter, we have defined SoC and surveyed a large number of the issues involved in its design. The outline of the important methods and processes involved in SoC design defines a methodology which can be adopted by design groups and adapted to their specific requirements. Productivity in SoC design demands high levels of design reuse; the existence of third-party and internal IP groups, and the chance to create a library of reusable IP blocks (true virtual components), make this possible for most design groups today.
The wide variety of design disciplines involved in SoC means that unprecedented collaboration between designers of all backgrounds — from systems experts through embedded SW designers and architects to HW designers — is required. But the rewards of SoC justify the effort required to succeed.
References

[1] Merrill Hunt and Jim Rowson, Blocking in a system on a chip. IEEE Spectrum, 33(11), 35–41, November 1996.
[2] Rochit Rajsuman, System-on-a-Chip Design and Test. Artech House, Norwood, Massachusetts, 2000.
[3] Michael Keating and Pierre Bricaud, Reuse Methodology Manual for System-on-a-Chip Designs. Kluwer Academic Publishers, Dordrecht, 1998 (1st ed.), 1999 (2nd ed.), 2002 (3rd ed.).
[4] International Technology Roadmap for Semiconductors (ITRS), 2001 edn. http://public.itrs.net/.
[5] Virtual Socket Interface Alliance, on the web at URL: http://www.vsia.org. This includes access to its various public documents, including the original Reuse Architecture document of 1997, as well as more recent documents supporting IP reuse released to the public domain.
[6] B. Bailey, G. Martin, and T. Anderson (eds.), Taxonomies for the Development and Verification of Digital Systems. Springer, New York, 2005.
[7] The Virtual Component Exchange (VCX). Available at http://www.thevcx.com/.
[8] Henry Chang, Larry Cooke, Merrill Hunt, Grant Martin, Andrew McNelly, and Lee Todd, Surviving the SOC Revolution: A Guide to Platform-Based Design. Kluwer Academic Publishers, Dordrecht, 1999.
[9] K. Keutzer, S. Malik, A.R. Newton, J. Rabaey, and A. Sangiovanni-Vincentelli, System-Level Design: Orthogonalization of Concerns and Platform-Based Design. IEEE Transactions on CAD of ICs and Systems, 19, 1523, 2000.
[10] Alberto Sangiovanni-Vincentelli and Grant Martin, Platform-Based Design and Software Design Methodology for Embedded Systems. IEEE Design and Test of Computers, 18, 23–33, 2001.
[11] IEEE Design and Test of Computers, Special Issue on Platform-Based Design of SoCs, 19, 4–63, 2002.
[12] G. Martin and F. Schirrmeister, A Design Chain for Embedded Systems. IEEE Computer, Embedded Systems Column, 35(3), 100–103, March 2002.
[13] Grant Martin and Henry Chang, Eds., Winning the SOC Revolution: Experiences in Real Design. Kluwer Academic Publishers, Dordrecht, May 2003.
[14] Patrick Lysaght, FPGAs as Meta-Platforms for Embedded Systems. In Proceedings of the IEEE Conference on Field Programmable Technology. Hong Kong, December 2002.
[15] Thorsten Groetker, Stan Liao, Grant Martin, and Stuart Swan, System Design with SystemC. Kluwer Academic Publishers, Dordrecht, May 2002.
[16] Janick Bergeron, Writing Testbenches, 3rd ed. Kluwer Academic Publishers, Dordrecht, 2003.
[17] G. Martin and C. Lennard, Improving Embedded SW Design and Integration for SOCs. Invited paper, Custom Integrated Circuits Conference, May 2000, pp. 101–108.
[18] Prakash Rashinkar, Peter Paterson, and Leena Singh, System-on-a-Chip Verification: Methodology and Techniques. Kluwer Academic Publishers, Dordrecht, 2001.
[19] F. Balarin, M. Chiodo, P. Giusto, H. Hsieh, A. Jurecska, L. Lavagno, C. Passerone, A. Sangiovanni-Vincentelli, E. Sentovich, K. Suzuki, and B. Tabbara, Hardware–Software Co-Design of Embedded Systems: The POLIS Approach. Kluwer Academic Publishers, Dordrecht, 1997.
[20] S. Krolikoski, F. Schirrmeister, B. Salefski, J. Rowson, and G. Martin, Methodology and Technology for Virtual Component Driven Hardware/Software Co-Design on the System Level. Paper 94.1, ISCAS '99, Orlando, FL, May 30–June 2, 1999.
[21] G. Martin and B. Salefski, System Level Design for SOCs: A Progress Report — Two Years On. In System-on-Chip Methodologies and Design Languages, Jean Mermet, Ed. Kluwer Academic Publishers, Dordrecht, 2001, pp. 297–306.
[22] G. Martin, Productivity in VC Reuse: Linking SOC Platforms to Abstract Systems Design Methodology. In Virtual Component Design and Reuse, Ralf Seepold and Natividad Martinez Madrid, Eds. Kluwer Academic Publishers, Dordrecht, 2001, pp. 33–46.
[23] Axel Jantsch and Hannu Tenhunen, Eds., Networks on Chip. Kluwer Academic Publishers, Dordrecht, 2003.
[24] Vinod Kathail, Shail Aditya, Robert Schreiber, B. Ramakrishna Rau, Darren C. Cronquist, and Mukund Sivaraman, PICO: Automatically Designing Custom Computers. IEEE Computer, 35, 39–47, 2002.
[25] T.J. Callahan, J.R. Hauser, and J. Wawrzynek, The Garp Architecture and C Compiler. IEEE Computer, 33, 62–69, 2000.
[26] DATE 2002 Proceedings, Session 1A: How to Choose Semiconductor IP?: Embedded Processors, Memory, Software, Hardware. In Proceedings of DATE 2002, Paris, March 2002, pp. 14–17.
19
A Novel Methodology for the Design of Application-Specific Instruction-Set Processors

Andreas Hoffmann, Achim Nohl, and Gunnar Braun
CoWare Inc.

19.1 Introduction
19.2 Related Work
19.3 ASIP Design Flow
    Architecture Exploration • LISA Language
19.4 LISA Processor Design Platform
    Hardware Designer Platform — For Exploration and Processor Generation • Software Designer Platform — For Software Application Design • System Integrator Platform — For System Integration and Verification
19.5 SW Development Tools
    Assembler and Linker • Simulator
19.6 Architecture Implementation
    LISA Language Elements for HDL Synthesis • Implementation Results
19.7 Tools for Application Development
    Examined Architectures • Efficiency of the Generated Tools
19.8 Requirements and Limitations
    LISA Language • HLL C-compiler • HDL Generator
19.9 Conclusion and Future Work
References
19.1 Introduction

From Andreas Hoffmann, Tim Kogel, Achim Nohl, Gunnar Braun, Oliver Schliebusch, Oliver Wahlen, Andreas Wieferink, and Heinrich Meyr, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 20, 2001. With permission.

In consumer electronics and telecommunications, high product volumes increasingly go along with short life-times. Driven by the advances in semiconductor technology combined with the need for new
applications like digital TV and wireless broadband communications, the amount of system functionality realized on a single chip is growing enormously. Higher integration and thus increasing miniaturization have led to a shift from distributed hardware components towards heterogeneous system-on-chip (SOC) designs [1]. Due to the complexity introduced by such SOC designs and time-to-market constraints, the designer's productivity has become the vital factor for successful products. For this reason, a growing share of system functions and signal-processing algorithms is implemented in software rather than in hardware by employing embedded processor cores.

In the current technical environment, embedded processors and the necessary development tools are designed manually, with very little automation. This is because the design and implementation of an embedded processor, such as a DSP device embedded in a cellular phone, is a highly complex process composed of the following phases: architecture exploration, architecture implementation, application software design, and system integration and verification.

During the architecture exploration phase, software development tools (i.e., HLL compiler, assembler, linker, and cycle-accurate simulator) are required to profile and benchmark the target application on different architectural alternatives. This process is usually iterative and is repeated until the best fit between the selected architecture and the target application is obtained. Every change to the architecture specification requires a completely new set of software development tools. Because these tool changes are carried out mainly by hand, the process is long, tedious, and extremely error-prone. Furthermore, the lack of automation makes it very difficult to match the profiling tools to an abstract specification of the target architecture.
In the architecture implementation phase, the specified processor has to be converted into a synthesizable HDL model. With this additional manual transformation, considerable consistency problems arise between the architecture specification, the software development tools, and the hardware implementation.

During the software application design phase, software designers need a set of production-quality software development tools. Since the software application designer and the hardware processor designer place different demands on these tools, new tools are required. For example, the processor designer needs a cycle/phase-accurate simulator for hardware/software partitioning and profiling, which is very accurate but inevitably slow, whereas the application designer values simulation speed over accuracy. At this point, the complete software development tool-suite is usually re-implemented by hand, with self-evident consistency problems.

In the system integration and verification phase, co-simulation interfaces must be developed to integrate the software simulator for the chosen architecture into a system simulation environment. These interfaces vary with the architecture that is currently under test. Again, manual modification of the interfaces is required with each change of the architecture.

The effort of designing a new architecture can be reduced significantly by using a retargetable approach based on a machine description. The Language for Instruction Set Architectures (LISA) [2,3] was developed for the automatic generation of consistent software development tools and synthesizable HDL code. A LISA processor description covers the instruction-set, the behavioral, and the timing model of the underlying hardware, thus providing all essential information for the generation of a complete set of development tools including compiler, assembler, linker, and simulator.
Moreover, it contains enough micro-architectural detail to generate synthesizable HDL code of the modelled architecture. Changes to the architecture are easily transferred to the LISA model and are applied automatically to the generated tools and hardware implementation. In addition, the speed and functionality of the generated tools allow their use even after product development has finished; consequently, there is no need to rewrite the tools to upgrade them to production-quality standard. As an unambiguous abstraction of the real hardware, a LISA model description bridges the gap between hardware and software design: it provides the software developer with all required information and enables the hardware designer to synthesize the architecture from the same specification the software tools are based on.

The chapter is organized as follows: Section 19.2 reviews existing approaches to machine description languages and discusses their applicability to the design of application-specific instruction-set processors. Section 19.3 presents an overview of a typical ASIP design flow using LISA, from specification to implementation. Moreover, different processor models are worked out which contain the required information
that the tools need for retargeting. In addition, sample LISA code segments show how the different models are expressed in the LISA language. Section 19.4 introduces the LISA processor design platform. Following that, the different areas of application are examined in more detail. In Section 19.5 the generated software development tools are presented, with a focus on the different simulation techniques that are applicable. Section 19.6 shows the path to implementation and gives results for a case study that was carried out using the presented methodology. To demonstrate the quality of the generated software development tools, Section 19.7 gives simulation benchmark results for models of state-of-the-art processors. In Section 19.8, requirements and limitations of the presented approach are explained. Section 19.9 summarizes the chapter and gives an outlook on future research topics.
19.2 Related Work

Hardware description languages (HDLs) like VHDL or Verilog are widely used to model and simulate processors, but mainly with the goal of developing hardware. Using these models for architecture exploration and production-quality software development tool generation has a number of disadvantages, especially for cycle-based or instruction-level processor simulation. They cover a huge amount of hardware implementation detail that is not needed for performance evaluation, cycle-based simulation, and software verification. Moreover, the description of detailed hardware structures has a significant impact on simulation speed [4,5]. Another problem is that the extraction of the instruction set is a highly complex, manual task, and some instruction set information, for example, assembly syntax, cannot be obtained from HDL descriptions at all.

There are many publications on machine description languages providing instruction-set models. Most approaches using such models address retargetable code generation [6–9]. Other approaches address retargetable code generation and simulation. The approaches of Maril [10], as part of the Marion environment, and of a system for VLIW compilation [11] both use latency annotations and reservation tables for code generation. However, models based on operation latencies are too coarse for cycle-accurate simulation or even generation of synthesizable HDL code. The language nML was developed at TU Berlin [12,13] and adopted in several projects [14–17]. However, the underlying instruction sequencer does not allow describing the mechanisms of pipelining as required for cycle-based models. Processors with more complex execution schemes and instruction-level parallelism, like the Texas Instruments TMS320C6x, cannot be described, even at the instruction-set level, because of the numerous combinations of instructions. The same restriction applies to ISDL [18], which is very similar to nML.
The language ISDL is an enhanced version of the nML formalism and allows the generation of a complete tool-suite consisting of HLL compiler, assembler, linker, and simulator. Even the possibility of generating synthesizable HDL code is reported, but no results are given on the efficiency of either the generated tools or the generated HDL code. The EXPRESSION language [19] allows cycle-accurate processor description based on a mixed behavioral/structural approach. However, no results on simulation speed have been published, nor is it clear whether it is feasible to generate synthesizable HDL code automatically. The FlexWare2 environment [20] is capable of generating assembler, linker, simulator, and debugger from the Insulin formalism. There is no link to implementation, but test vectors can be extracted from the Insulin description to verify the HDL model. The HLL compiler is derived from a separate description targeting the CoSy [21] framework.

Recently, various ASIP development systems have been introduced [22–24] for systematic co-design of instruction-set and micro-architecture implementation using a given set of application benchmarks. The PEAS-III system [25] is an ASIP development environment based on a micro-operation description of instructions that allows the generation of a complete tool-suite consisting of HLL compiler, assembler, linker, and simulator, including HDL code. However, no further information is given about the formalism that parameterizes the tool generators, nor have any results been published on the efficiency of the generated tools. The MetaCore system [26] is a benchmark-driven ASIP development system based on a formal representation language. The system accepts a set of benchmark programs and estimates the hardware cost and performance for the configuration under test. Following that, software development tools and
synthesizable HDL code are generated automatically. As the formal specification of the ISA is similar to the ISPS formalism [27], complex pipeline operations such as flushes and stalls can hardly be modelled. In addition, flexibility in designing the instruction-set is limited to a predefined set of instructions. Tensilica Inc. customizes a RISC processor within the Xtensa system [28]. As the system is based on an architecture template comprising quite a number of base instructions, it is far too powerful and thus not suitable for highly application-specific processors, which in many cases employ only very few instructions.

Our interest in a complete retargetable tool-suite for architecture exploration, production-quality software development, architecture implementation, and system integration for a wide range of embedded processor architectures motivated the introduction of the LISA language used in our approach. In many aspects, LISA incorporates ideas similar to nML. However, our experience with different DSP architectures showed that significant limitations of existing machine description languages must be overcome to allow the description of modern commercial embedded processors. For this reason, LISA includes improvements in the following areas:

• Capability to provide cycle-accurate processor models, including constructs to specify pipelines and their mechanisms such as stalls, flushes, operation injection, etc.
• Extension of the target class of processors to real-world SIMD, VLIW, and superscalar architectures.
• Explicit language statements addressing compiled simulation techniques.
• Distinction between, on the one hand, the detailed bit-true description of operation behavior, including side-effects, for simulation and implementation and, on the other hand, the assignment to arithmetical functions for the instruction selection task of the compiler, which allows the abstraction level of the behavioral part of the processor model to be chosen freely.
• Strong orientation towards the programming languages C/C++: LISA is a framework that encloses pure C/C++ behavioral operation descriptions.
• Support for instruction aliasing and complex instruction coding schemes.
19.3 ASIP Design Flow

Powerful application-specific programmable architectures are increasingly required in the DSP, multimedia, and networking application domains in order to meet demanding cost and performance requirements. The complexity of algorithms and architectures in these application domains prohibits an ad hoc implementation and calls for an elaborate design methodology with efficient tool support. In this section, a seamless ASIP design methodology based on LISA will be introduced. Moreover, it will be demonstrated how the outlined concepts are captured by the LISA language elements. The expressiveness of the LISA formalism, providing high flexibility with respect to abstraction level and architecture category, is especially valuable for the design of high-performance processors.
19.3.1 Architecture Exploration

The LISA-based methodology sets in after the algorithms intended for execution on the programmable platform have been selected. The algorithm design is beyond the scope of LISA and is typically performed in an application-specific system-level design environment, such as COSSAP [29] for wireless communications or OPNET [30] for networking. The outcome of the algorithmic exploration is a pure functional specification, usually represented by means of an executable prototype written in a high-level language (HLL) like C, together with a requirement document specifying cost and performance parameters. In the following, the steps of our proposed design flow depicted in Figure 19.1 are described, in which the ASIP designer successively refines the application jointly with the LISA model of the programmable target architecture.

First, the performance-critical algorithmic kernels of the functional specification have to be identified. This task can easily be performed with a standard profiling tool that instruments the application
FIGURE 19.1 LISA-based ASIP development flow. [The figure tabulates four refinement steps, each pairing an application view with a LISA model and an exploration result: (1) assembly algorithm kernel / data-path model / ISA-accurate profiling (data); (2) assembly program / instruction model / ISA-accurate profiling (data + control); (3) revised assembly program / cycle-true model / cycle-accurate profiling (data + control); (4) assembly program / RTL model / hardware cost + timing.]
code in order to generate HLL execution statistics during the simulation of the functional prototype. Thus the designer becomes aware of the performance-critical parts of the application and is prepared to define the data path of the programmable architecture at the assembly instruction level. Starting from a LISA processor model which implements an arbitrary basic instruction set, the LISA model can be enhanced with parallel resources, special-purpose instructions, and registers in order to improve the performance of the considered application. At the same time, the algorithmic kernel of the application code is translated into assembly by making use of the specified special-purpose instructions. By employing the assembler, linker, and processor simulator derived from the LISA model (cf. Section 19.5), the designer can iteratively profile and modify the programmable architecture in cadence with the application until both fulfill the performance requirements.

After the processing-intensive algorithmic kernels are considered and optimized, the instruction set needs to be completed. This is accomplished by adding instructions to the LISA model which are dedicated to the low-speed control and configuration parts of the application. While these parts usually represent major portions of the application in terms of code amount, they have only negligible influence on the overall performance. Therefore it is very often feasible to employ the HLL C-compiler derived from the LISA model and accept suboptimal assembly code quality in return for a significant cut in design time.

So far, the optimization has only been performed with respect to the software-related aspects, while neglecting the influence of the micro-architecture. For this purpose the LISA language provides capabilities to model cycle-accurate behavior of pipelined architectures.
The LISA model is supplemented by the instruction pipeline, and the execution of all instructions is assigned to the respective pipeline stages. If the architecture does not provide automatic interlocking mechanisms, the application code has to be revised to take pipeline effects into account. Now the designer is able to verify that the cycle-true processor model still satisfies the performance requirements. At the last stage of the design flow, the HDL generator (see Section 19.6) can be employed to generate synthesizable HDL code for the base structure and the control path of the architecture. After implementing the dedicated execution units of the data path, reliable numbers on hardware cost and performance parameters (e.g., design size, power consumption, clock frequency) can be derived by running the HDL processor model through the standard synthesis flow. At this level of detail the designer can tweak the computational efficiency of the architecture by applying different implementations of the data-path execution units.
19.3.2 LISA Language

The language LISA [2,3] aims at the formalized description of programmable architectures, their peripherals, and their interfaces. LISA closes the gap between purely structure-oriented languages (VHDL, Verilog) and instruction-set languages.
FIGURE 19.2 Model requirements for ASIP design. [The figure is a matrix mapping each generated tool to the model components it requires. HLL compiler: register allocation (memory model), instruction scheduling (resource and timing models), instruction selection (instruction set model). Assembler: instruction translation (instruction set model). Linker: memory allocation (memory model). Simulator: simulation of storage (memory model), operation simulation (behavioral model), decoder/disassembler (instruction set model), operation scheduling (timing model). Debugger: display configuration (memory model), profiling. HDL generator: basic structure (memory model), write-conflict resolution (resource model), instruction decoder (instruction set model), operation scheduling (timing model), operation grouping (microarchitecture model).]
LISA descriptions are composed of resources and operations. The declared resources represent the storage objects of the hardware architecture (e.g., registers, memories, pipelines) which capture the state of the system. Operations are the basic objects in LISA. They represent the designer’s view of the behavior, the structure, and the instruction set of the programmable architecture. A detailed reference of the LISA language can be found in Reference 31. The process of generating software development tools and synthesizing the architecture requires information on architectural properties and the instruction set definition as depicted in Figure 19.2. These requirements can be grouped into different architectural models — the entirety of these models constitutes the abstract model of the target architecture. The LISA machine description provides information consisting of the following model components:
The memory model. This lists the registers and memories of the system with their respective bit widths, ranges, and aliasing. The compiler gets information on available registers and memory spaces. The memory configuration is provided to perform object-code linking. During simulation, the entirety of storage elements represents the state of the processor, which can be displayed in the debugger. The HDL code generator derives the basic architecture structure from it. In LISA, the resource section lists the definitions of all objects which are required to build the memory model. A sample resource section of the ICORE architecture described in Reference 32 is shown in Figure 19.3. The resource section begins with the keyword RESOURCE followed by (curly) braces enclosing all object definitions. The definitions are made in C-style and can be attributed with keywords like, for example, REGISTER, PROGRAM_COUNTER, etc. These keywords are not mandatory, but they are used to classify the definitions in order to configure the debugger display. The resource section in Figure 19.3 shows the declaration of the program counter, register file, memories, the four-stage instruction pipeline, and pipeline registers.

The resource model. This describes the available hardware resources and the resource requirements of operations. Resources reflect properties of hardware structures which can be accessed exclusively by one operation at a time. The instruction scheduling of the compiler depends on this information, and the HDL code generator uses it for resource-conflict resolution. Besides the definition of all objects, the resource section in a LISA processor description provides information about the availability of hardware resources; this reflects, for example, the provision of several ports to a register bank or a memory. Moreover, the behavior section within LISA operations announces the use of processor resources.
This takes place in the section header using the keyword USES in conjunction with the resource name and an indication of whether the resource is read, written, or both (IN, OUT, or INOUT, respectively).
RESOURCE {
    PROGRAM_COUNTER  int           PC;
    REGISTER         signed int    R[0..7];
    DATA_MEMORY      signed int    RAM[0..255];
    PROGRAM_MEMORY   unsigned int  ROM[0..255];

    PIPELINE ppu_pipe = { FI; ID; EX; WB };
    PIPELINE_REGISTER IN ppu_pipe {
        bit[6] Opcode;
        short  operandA;
        short  operandB;
    };
}
FIGURE 19.3 Specification of the memory model.
RESOURCE {
    REGISTER     unsigned int  R([0..7])6;
    DATA_MEMORY  signed int    RAM([0..15]);
}

OPERATION NEG_RM {
    BEHAVIOR USES (IN R[]; OUT RAM[];) {
        /* C-code */
        RAM[address] = (-1) * R[index];
    }
}
FIGURE 19.4 Specification of the resource model.
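The port counts declared in Figure 19.4 (six ports on register bank R, one port on RAM) boil down to a per-cycle budget of simultaneous accesses. The C++ sketch below shows the kind of conflict check that a compiler's scheduler or the HDL generator can derive from such declarations; the class name and interface are invented for illustration and are not part of any LISA tool.

```cpp
#include <map>
#include <string>

// Per-cycle bookkeeping of resource accesses against declared port
// counts, as implied by declarations like "REGISTER unsigned int
// R([0..7])6;" (six ports) and "DATA_MEMORY signed int RAM([0..15]);"
// (one port, the default when the count is omitted).
class ResourceTracker {
public:
    // Register a resource and its number of simultaneous ports.
    void declare(const std::string& name, int ports) { capacity_[name] = ports; }

    // Claim one access in the current cycle; false means a port conflict.
    bool request(const std::string& name) {
        if (used_[name] >= capacity_[name]) return false;
        ++used_[name];
        return true;
    }

    // All ports become free again at the next cycle boundary.
    void next_cycle() { used_.clear(); }

private:
    std::map<std::string, int> capacity_;  // declared port counts
    std::map<std::string, int> used_;      // accesses claimed this cycle
};
```

A scheduler would call request() once per operand access when placing operations into a cycle and defer the operation whenever the call fails.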
For illustration purposes, a sample LISA code excerpt taken from the ICORE architecture is shown in Figure 19.4. The definition of the availability of resources is carried out by enclosing the C-style resource definition in round braces, followed by the number of simultaneously allowed accesses. If the number is omitted, one allowed access is assumed. The figure shows the declaration of a register bank with six ports and a memory with one port. Furthermore, the behavior section of the operation announces the use of these hardware resources for read and write.

The instruction set model. This identifies valid combinations of hardware operations and admissible operands. It is expressed by the assembly syntax, the instruction word coding, and the specification of legal operands and addressing modes for each instruction. Compilers and assemblers can identify instructions based on this model; the same information is used in the reverse process of decoding and disassembling. In LISA, the instruction set model is captured within operations. Operation definitions collect the description of different properties of the instruction set model, which are defined in several sections:

• The CODING section describes the binary image of the instruction word.
• The SYNTAX section describes the assembly syntax of instructions, operands, and execution modes.
• The SEMANTICS section specifies the transition function of the instruction.
© 2006 by Taylor & Francis Group, LLC
19-8
Embedded Systems Handbook
OPERATION COMPARE_IMM {
    DECLARE {
        LABEL index;
        GROUP src1, dest = { register };
    }
    CODING    { 0b10011 index=0bx[5] src1 dest }
    SYNTAX    { "CMP" src1 ~"," index ~"," dest }
    SEMANTICS { CMP (dest,src1,index) }
}
FIGURE 19.5 Specification of the instruction set model.
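From CODING patterns like the one in Figure 19.5, a generated decoder can derive a mask/value pair per instruction and match the fetched word against it. The C++ sketch below illustrates the idea with an invented 16-bit field layout (opcode 0b10011 in the top five bits, a 5-bit immediate, and two 3-bit register fields); the actual ICORE encoding is not reproduced here.

```cpp
#include <cstdint>

// Hypothetical 16-bit encoding for the CMP instruction:
// bits [15:11] = opcode 10011, [10:6] = 5-bit immediate index,
// [5:3] = src1 register, [2:0] = dest register.
struct CmpFields {
    bool     valid;   // opcode pattern matched
    unsigned index;   // 5-bit immediate
    unsigned src1;    // source register number
    unsigned dest;    // destination register number
};

inline CmpFields decode_cmp(uint16_t insn) {
    CmpFields f{false, 0, 0, 0};
    const uint16_t mask  = 0xF800;   // fixed opcode bits [15:11]
    const uint16_t match = 0x9800;   // 0b10011 << 11
    f.valid = (insn & mask) == match;
    if (f.valid) {
        f.index = (insn >> 6) & 0x1F;
        f.src1  = (insn >> 3) & 0x7;
        f.dest  =  insn       & 0x7;
    }
    return f;
}
```

A full decoder would hold one such mask/value pair per operation and try them in order (or via a precomputed lookup table) until one matches.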
OPERATION register {
    DECLARE { LABEL index; }
    CODING  { index=0bx[4] }
    EXPRESSION { R[index] }
}

OPERATION ADD {
    DECLARE {
        GROUP src1, src2, dest = { register };
    }
    CODING { 0b010010 src1 src2 dest }
    BEHAVIOR {
        /* C-code */
        dest = src1 + src2;
        saturate(&dest);
    }
}
FIGURE 19.6 Specification of the behavioral model.
Figure 19.5 shows an excerpt of the ICORE LISA model contributing the instruction set model information for the compare-immediate instruction. The DECLARE section contains local declarations of identifiers and admissible operands. Operation register is not shown in the figure but comprises the definition of the valid coding and syntax for src1 and dest, respectively.

The behavioral model. This abstracts the activities of hardware structures to operations changing the state of the processor for simulation purposes. The abstraction level of this model can range widely between the hardware implementation level and the level of HLL statements. The BEHAVIOR and EXPRESSION sections within LISA operations describe components of the behavioral model. Here, the behavior section contains pure C-code that is executed during simulation, whereas the expression section defines the operands and execution modes used in the context of operations. An excerpt of the ICORE LISA model is shown in Figure 19.6. Depending on the coding of the src1, src2, and dest fields, the behavior code of operation ADD works with the respective registers of register bank R. As arbitrary C-code is allowed, function calls can be made to libraries which are later linked to the executable software simulator.

The timing model. This specifies the activation sequence of hardware operations and units. The instruction latency information lets the compiler find an appropriate schedule and provides timing relations between operations for simulation and implementation. Several parts of a LISA model contribute to the timing model. First, pipelines are declared in the resource section; the declaration starts with the keyword PIPELINE, followed by an identifying name and the list of stages. Second, operations are assigned to pipeline stages by using the keyword IN and providing the name of the pipeline and the identifier of the respective stage, such as:

OPERATION name_of_operation IN ppu_pipe.EX
(19.1)
RESOURCE {
    PIPELINE ppu_pipe = { FI; ID; EX; WB };
}

OPERATION CORDIC IN ppu_pipe.EX {
    ACTIVATION { WriteBack }
    BEHAVIOR {
        PIPELINE_REGISTER(ppu_pipe, EX/WB).ResultE = cordic();
    }
}

OPERATION WriteBack IN ppu_pipe.WB {
    BEHAVIOR {
        R[value] = PIPELINE_REGISTER(ppu_pipe, EX/WB).ResultE;
    }
}
FIGURE 19.7 Specification of the timing model.
Third, the ACTIVATION section in the operation description is used to activate other operations in the context of the current instruction. The activated operations are launched as soon as the instruction enters the pipeline stage to which the activated operation is assigned; non-assigned operations are launched in the pipeline stage of their activation. To exemplify this, Figure 19.7 shows sample LISA code taken from the ICORE architecture. Operations CORDIC and WriteBack are assigned to stages EX and WB of pipeline ppu_pipe, respectively. Here, operation CORDIC activates operation WriteBack, which will be launched in the following cycle (in correspondence with the spatial ordering of pipeline stages) in the case of an undisturbed flow of the pipeline. Moreover, in the ACTIVATION section, pipelines are controlled by means of the predefined functions stall, shift, flush, insert, and execute, which are automatically provided by the LISA environment for each pipeline declared in the resource section. All these pipeline control functions can be applied to single stages as well as to whole pipelines, for example: PIPELINE(ppu_pipe,EX/WB).stall();
(19.2)
Using this very flexible mechanism, arbitrary pipelines, hazards, and mechanisms like forwarding can be modelled in LISA.

The micro-architecture model. This allows the grouping of hardware operations into functional units and contains the exact micro-architectural implementation of structural components such as adders, multipliers, etc. This enables the HDL generator to generate the appropriate HDL code from a more abstract specification. In analogy to the syntax of the VHDL language, the grouping of operations into functional units is formalized using the keyword ENTITY in the resource section of the LISA model, for example: ENTITY Alu { Add, Sub }
(19.3)
Here, LISA operations Add and Sub are assigned to the functional unit Alu. Information on the exact micro-architectural implementation of structural components can be included into the LISA model, for example, by calling DesignWare components [33] from within the behavior section or by inlining HDL code.
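The stall and flush semantics sketched above have a simple operational core that can be mimicked in a few lines of C++. The sketch below is a hedged illustration (class name, stage encoding, and exact semantics are invented for this example; the generated simulator is far more elaborate): a stall freezes a stage and everything upstream of it for one cycle, letting a bubble enter behind it, while a flush simply drops an instruction.

```cpp
#include <array>
#include <optional>
#include <string>

// Minimal model of a four-stage pipeline (FI, ID, EX, WB) with the
// shift/stall/flush control functions that LISA predefines for every
// declared pipeline. Instructions are carried as opaque string tokens.
class Pipeline {
public:
    static constexpr int kStages = 4;   // FI=0, ID=1, EX=2, WB=3

    // One simulated cycle: move each instruction forward, insert
    // bubbles behind stalled stages, fetch a new instruction into FI.
    void shift(std::optional<std::string> fetched) {
        for (int s = kStages - 1; s > 0; --s) {
            if (stalled_[s]) continue;               // stage keeps its instruction
            if (stalled_[s - 1]) slots_[s].reset();  // bubble behind a stall
            else { slots_[s] = slots_[s - 1]; slots_[s - 1].reset(); }
        }
        if (!stalled_[0]) slots_[0] = fetched;       // a stalled FI refetches later
        stalled_.fill(false);                        // a stall lasts one cycle
    }

    // Freeze the given stage and everything upstream of it.
    void stall(int stage) { for (int s = 0; s <= stage; ++s) stalled_[s] = true; }

    // Drop the instruction in one stage (e.g., on a mispredicted branch).
    void flush(int stage) { slots_[stage].reset(); }

    const std::optional<std::string>& at(int stage) const { return slots_[stage]; }

private:
    std::array<std::optional<std::string>, kStages> slots_{};
    std::array<bool, kStages> stalled_{};
};
```

Calling stall(1) before a shift, for instance, freezes FI and ID for that cycle while EX and WB drain normally, which is exactly the behavior needed to model an interlock on a data hazard.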
19.4 LISA Processor Design Platform The LISA processor design platform (LPDP) is an environment that allows the automatic generation of software development tools for architecture exploration, hardware implementation, software development tools for application design, and hardware–software co-simulation interfaces from one sole specification of the target architecture in the LISA language. Figure 19.8 shows the components of the LPDP environment.
19.4.1 Hardware Designer Platform — For Exploration and Processor Generation As indicated in Section 19.3, architecture design requires the designer to work in two fields (see Figure 19.9): on the one hand the development of the software part including compiler, assembler, linker, and simulator and on the other hand the development of the target architecture itself. The software simulator produces profiling data and thus may answer questions concerning the instruction set, the performance of an algorithm and the required size of memory and registers. The required silicon area or power consumption can only be determined in conjunction with a synthesizable HDL model. To accommodate these requirements, the LISA hardware designer platform can generate the following tools: • LISA language debugger for debugging the instruction-set with a graphical debugger frontend. • Exploration C-compiler for the non-critical parts of the application. • Exploration assembler which translates text-based instructions into object code for the respective programmable architecture. • Exploration linker which is controlled by a dedicated linker command file. • Instruction-set architecture (ISA) simulator providing extensive profiling capabilities, such as instruction execution statistics and resource utilization. Besides the ability to generate a set of software development tools, synthesizable HDL code (both VHDL and Verilog) for the processors control path and instruction decoder can be generated automatically from the LISA processor description. This also comprises the pipeline and pipeline controller including complex Hardware designer Application C compiler
Assembler
Simulator Linker
Architecture exploration Software designer C-compiler
LISA Architecture specification
Assembler/ linker
Architecture implementation System integrator System on chip
Simulator/debug.
Software application design
FIGURE 19.8
LISA processor design environment.
© 2006 by Taylor & Francis Group, LLC
Integration and verification
A Novel Methodology for the Design of ASIPs
19-11
Target architecture
LISA description E x p l o r a t i o n
Language compiler
Language C-compiler
HDL description
LISA assembler LISA linker
Synthesis tools
LISA simulator
I m p l e m e n t a t i o n
Gate level model Evalution results Profiling data, execution speed
Evalution results Chip size, clock speed, power consumption
FIGURE 19.9 Exploration and implementation.
interlocking mechanisms, forwarding, etc. For the data path, hand-optimized HDL code has to be inserted manually into the generated model. This approach has been chosen as the data path typically represents the critical part of the architecture in terms of power consumption and speed (critical path). It is obvious that deriving both software tools and hardware implementation model from one sole specification of the architecture in the LISA language has significant advantages: only one model needs to be maintained, changes on the architecture are applied automatically to the software tools and the implementation model and the consistency problem among the software tools and between software tools and implementation model is reduced significantly.
19.4.2 Software Designer Platform — For Software Application Design
To cope with the requirements of functionality and speed in the software design phase, the tools generated for this purpose are an enhanced version of the tools generated during the architecture exploration phase. The generated simulation tools are accelerated by applying the compiled simulation principle [34] — where applicable — and are one to two orders of magnitude faster than the tools currently provided by architecture vendors. The compiled simulation principle requires that the content of the program memory does not change during the simulation run, which holds true for most DSPs. However, for architectures running the program from external memory, or working with operating systems that load/unload applications to/from internal program memory, this simulation technique is not suitable. For this purpose, an interpretive simulator is also provided.
19.4.3 System Integrator Platform — For System Integration and Verification
Once the processor software simulator is available, it must be integrated and verified in the context of the whole system (SoC), which can include a mixture of different processors, memories, and interconnect components. To support system integration and verification, the LPDP system integrator platform provides a well-defined application programmer interface (API) to interconnect the instruction-set simulator generated from the LISA specification with other simulators. The API allows the simulator to be controlled by stepping, running, and setting breakpoints in the application code, and provides access to the processor resources.
Embedded Systems Handbook
The following sections present the different areas addressed by the LISA processor design platform in more detail: software development tools and HDL code generation. Additionally, Section 19.7 will demonstrate the high quality of the generated software development tools by comparing them to those shipped by the processor vendors.
19.5 SW Development Tools
The ability to automatically generate HLL C-compilers, assemblers, linkers, and ISA simulators from LISA processor models enables the designer to explore the design space rapidly. In this section, the special features and requirements of these tools are discussed, with particular focus on the different simulation techniques.
19.5.1 Assembler and Linker
The LISA assembler processes textual assembly source code and transforms it into linkable object code for the target architecture. The transformation is driven by the instruction-set information defined in the LISA processor description. Besides the processor-specific instruction set, the generated assembler provides a set of pseudo-instructions (directives) to control the assembling process and to initialize data. Section directives enable the grouping of assembled code into sections, which can be positioned separately in memory by the linker. Symbolic identifiers for numeric values and addresses are standard assembler features and are supported as well. Moreover, besides mnemonic-based instruction formats, C-like algebraic assembly syntax can be processed by the LISA assembler. The linking process is controlled by a linker command file, which keeps a detailed model of the target memory environment and an assignment table mapping the module sections to their respective target memories. Moreover, the linker can be provided with an additional memory model, separate from the memory configuration in the LISA description, which allows linking code into external memories outside the architecture model.
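The division of labor between section directives and the linker command file can be illustrated with a minimal sketch (hypothetical C model, not the LISA object format): the assembler emits section-relative offsets, and the linker resolves them against the base addresses assigned by the command file.

```c
#include <stdint.h>

/* A symbol reference as an assembler might emit it: an index into the
   section table plus an offset relative to the section start. */
typedef struct { int section; uint32_t offset; } symref_t;

/* The linker command file boils down to a memory map: one base address
   per section. Resolving a reference is then a single addition. */
uint32_t link_address(symref_t ref, const uint32_t section_base[])
{
    return section_base[ref.section] + ref.offset;
}
```

An additional linker memory model, as described above, would simply extend the memory map with entries for external memories outside the architecture model.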
19.5.2 Simulator
Due to the large variety of architectures and the ability to develop models on different levels of abstraction in the domains of time and architecture (see Section 19.3), the LISA software simulator incorporates several simulation techniques, ranging from the most flexible interpretive simulation to more application- and architecture-specific compiled simulation techniques. Compiled simulators offer a significant increase in instruction (cycle) throughput; however, the compiled simulation technique is not applicable in every case. To cope with this problem, the most appropriate simulation technique for the desired purpose (debugging, profiling, verification), architecture (instruction-accurate, cycle-accurate), and application (DSP kernel, operating system) can be chosen before the simulation is run. An overview of the available simulation techniques in the generated LISA simulator is given in the following.
The interpretive simulation technique is employed in most commercially available instruction-set simulators. In general, interpretive simulators run significantly slower than compiled simulators; however, unlike compiled simulation, this technique can be applied to any LISA model and application.
Dynamically scheduled, compiled simulation reduces simulation time by performing the steps of instruction decoding and operation sequencing prior to simulation. This technique cannot be applied to models using external memories or to applications containing self-modifying program code.
Besides the compilation steps performed in dynamic scheduling, static scheduling and code translation additionally implement operation instantiation. While the latter technique is used for instruction-accurate models, the former is suitable for cycle-accurate models including instruction pipelines. Beyond that, the same restrictions apply as for dynamically scheduled simulation.
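As a baseline for comparison, the interpretive technique can be sketched as the classic fetch–decode–execute loop, here for a hypothetical three-instruction accumulator ISA (all names are illustrative, not the generated LISA simulator):

```c
#include <stdint.h>

/* Hypothetical mini-ISA: LOAD sets the accumulator, ADD adds an
   immediate, HALT stops simulation. */
enum { OP_LOAD, OP_ADD, OP_HALT };
typedef struct { uint8_t opcode; int32_t imm; } insn_t;

/* Interpretive simulation: the switch re-decodes every instruction on
   every visit, even inside loops, which is exactly the redundant work
   the compiled techniques avoid. */
int32_t interpret(const insn_t *prog)
{
    int32_t acc = 0;
    unsigned pc = 0;
    for (;;) {
        insn_t i = prog[pc++];      /* fetch */
        switch (i.opcode) {         /* decode, repeated at every visit */
        case OP_LOAD: acc = i.imm;  break;   /* execute */
        case OP_ADD:  acc += i.imm; break;
        case OP_HALT: return acc;
        }
    }
}
```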
A detailed discussion of the different compiled simulation techniques is given in the following sections, while performance results are given in Section 19.7. The interpretive simulator is not discussed further.
19.5.2.1 Compiled Simulation
The objective of compiled simulation is to reduce simulation time. Considering instruction-set simulation, an efficient run-time reduction can be achieved by performing repeatedly executed operations only once, before the actual simulation is run, thus inserting an additional translation step between application load and simulation. The preprocessing of the application code can be split into three major steps [35]:
1. Within the step of instruction decoding, instructions, operands, and modes are determined for each instruction word found in the executable object file. In compiled simulation, the instruction decoding is performed only once per instruction, whereas interpretive simulators decode the same instruction multiple times, for example, if it is part of a loop. This way, instruction decoding is completely omitted at run-time, reducing simulation time significantly.
2. Operation sequencing is the process of determining all operations to be executed for the accomplishment of each instruction found in the application program. During this step, the program is translated into a table-like structure indexed by the instruction addresses. The table lines contain pointers to functions representing the behavioral code of the respective LISA operations. Although all involved operations are identified during this step, their temporal execution order is still unknown.
3. The determination of the operation timing (scheduling) is performed within the step of operation instantiation and simulation loop unfolding.
Here, the behavior code of the operations is instantiated by generating the respective function calls for each instruction in the application program, thus unfolding the simulation loop that drives the simulation into the next state. Besides fully compiled simulation, which incorporates all of the above steps, partial implementations of the compiled principle are possible by performing only some of them. Each additional step gives a further run-time reduction, but also requires a non-negligible amount of compilation time. The trade-off between compilation time and simulation time is shown (qualitatively) in Figure 19.10. Two levels of compiled simulation are of particular interest: dynamic scheduling and static scheduling/code translation. In the case of dynamic scheduling, the task of selecting operations from overlapping instructions in the pipeline is performed at run-time of the simulation. Static scheduling already schedules the operations at compile-time.
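Steps 1 and 2 of the preprocessing can be sketched as follows (hypothetical mini-ISA and names, a strong simplification of the actual LISA scheme): decoding runs once over the program, operation sequencing fills a table of behavior-function pointers indexed by instruction address, and the run-time loop only dispatches.

```c
#include <stdint.h>
#include <stddef.h>

#define MAX_PROG 64

static int32_t acc;                 /* processor resource of the mini-ISA */
static int32_t imm_tab[MAX_PROG];   /* decoded immediate per address      */

/* Behavioral code of the (LISA-like) operations. */
static void op_load(size_t pc) { acc  = imm_tab[pc]; }
static void op_add (size_t pc) { acc += imm_tab[pc]; }

typedef void (*behavior_fn)(size_t);
static behavior_fn table[MAX_PROG]; /* table-like structure, indexed by address */

/* Compile-time decoding + operation sequencing: each 16-bit word
   (high byte = opcode, low byte = immediate) is decoded exactly once. */
void compile(const uint16_t *code, size_t n)
{
    for (size_t pc = 0; pc < n; ++pc) {
        imm_tab[pc] = code[pc] & 0xFF;
        table[pc]   = (code[pc] >> 8) ? op_add : op_load;
    }
}

/* Run-time: pure dispatch through function pointers, no decoding left. */
int32_t simulate(size_t n)
{
    acc = 0;
    for (size_t pc = 0; pc < n; ++pc)
        table[pc](pc);
    return acc;
}
```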
FIGURE 19.10 Levels of compiled simulation.
19.5.2.2 Dynamic Scheduling
As shown in Figure 19.10, dynamic scheduling performs instruction decoding and operation sequencing at compile-time. However, the temporal execution order of LISA operations is determined at simulator run-time. While the operation scheduling is rather simple for instruction-accurate models, it becomes a complex task for models with instruction pipelines. In order to reflect the instructions' timing exactly and to account for all possible pipeline effects, such as flushes and stalls, a generic pipeline model is employed that simulates the instruction pipeline at run-time. The pipeline model is parameterized by the LISA model description and can be controlled via predefined LISA operations. These operations include:
• Insertion of operations into the pipeline (stages)
• Execution of all operations residing in the pipeline
• Pipeline shift
• Removal of operations (flush)
• Halt of the entire pipeline or of particular stages (stall)
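A heavily simplified version of such a generic pipeline model might look as follows (illustrative C, with operations reduced to integer ids; the real model stores LISA operation descriptors and activation chains):

```c
#define N_STAGES 4              /* e.g., FE, DC, EX, WB */
#define EMPTY   (-1)

typedef struct {
    int slot[N_STAGES];         /* operation id per stage, EMPTY if none */
    int stalled[N_STAGES];      /* stall flags, cleared after each shift */
} pipe_t;

void pipe_init(pipe_t *p)
{
    for (int s = 0; s < N_STAGES; ++s) { p->slot[s] = EMPTY; p->stalled[s] = 0; }
}
void pipe_insert(pipe_t *p, int stage, int op) { p->slot[stage] = op; }  /* insertion  */
void pipe_stall (pipe_t *p, int stage) { p->stalled[stage] = 1; }        /* halt stage */
void pipe_flush (pipe_t *p, int stage) { p->slot[stage] = EMPTY; }       /* removal    */

/* Pipeline shift: an operation advances into the next stage unless its
   own stage or the next stage is stalled; stalls last one cycle here. */
void pipe_shift(pipe_t *p)
{
    for (int s = N_STAGES - 1; s > 0; --s)
        if (!p->stalled[s - 1] && !p->stalled[s]) {
            p->slot[s] = p->slot[s - 1];
            p->slot[s - 1] = EMPTY;
        }
    for (int s = 0; s < N_STAGES; ++s) p->stalled[s] = 0;
}
```

Execution of all operations residing in the pipeline would simply walk the slot array and call each operation's behavior; it is omitted here for brevity.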
Unlike in statically scheduled simulation, operations are inserted into and removed from the pipeline dynamically, that is, each operation injects further operations upon its execution. The information about operation timing is provided in the LISA description, that is, by the activation section as well as by the assignment of operations to pipeline stages (see Section 19.3.2 — timing model). It is obvious that the maintenance of the pipeline model at simulation time is expensive. Execution profiling of the generated simulators for the Texas Instruments TMS320C62xx [36] and TMS320C54x [37] revealed that more than fifty percent of the simulator's run-time is consumed by the simulation of the pipeline. The situation can be improved by implementing the step of operation instantiation, consequently superseding the need for pipeline simulation. This, in turn, implies static scheduling; in other words, the determination of the operation timing due to overlapping instructions in the pipeline takes place at compile-time. Although there is no pipeline model in instruction-accurate processor models, it will be shown that operation instantiation also gives a significant performance increase for these models. Beyond that, operation instantiation is relatively easy to implement for instruction-accurate models (in contrast to pipelined models).
19.5.2.3 Static Scheduling
Generally, operation instantiation can be described as the generation of an individual piece of (behavioral) simulator code for each instruction found in the application program. While this is straightforward for instruction-accurate processor models, cycle-true, pipelined models require a more sophisticated approach. Considering instruction-accurate models, the shortest temporal unit that can be executed is an instruction. That means the actions to be performed for the execution of an individual instruction are determined by the instruction alone.
In the simulation of pipelined models, the granularity is defined by cycles. However, since several instructions may be active at the same time due to overlapping execution, the actions performed during a single cycle are determined by the respective state of the instruction pipeline. As a consequence, instead of instantiating operations for each single instruction of the application program, behavioral code has to be generated for each occurring pipeline state. Several such pipeline states may exist for each instruction, depending on the execution context of the instruction, that is, the instructions executed in the preceding and following cycles. As pointed out previously, the principle of compiled simulation relies on an additional translation step taking place before the simulation is run. This step is performed by a so-called simulation compiler, which implements the three steps presented in Section 19.5.2.1. Obviously, the simulation compiler is a highly architecture-specific tool, which is therefore retargeted from the LISA model description.
19.5.2.3.1 Operation Instantiation
The objective of static scheduling is the determination of all possible pipeline states according to the instructions found in the application program. For purely sequential pipeline flow, that is, in the case that no control hazards occur, the determination of the pipeline states can be achieved simply by overlapping consecutive instructions subject to the structure of the pipeline. In order to store the generated pipeline states, pipeline state tables are used, providing an intuitive representation of the instruction flow in the pipeline. Inserting instructions into pipeline state tables is referred to as scheduling in the following. A pipeline state table is a two-dimensional array storing pointers to LISA operations. One dimension represents the location within the application, the other the location within the pipeline, that is, the stage in which the operation is executed. When a new instruction has to be inserted into the state table, both intra-instruction and inter-instruction precedence must be considered to determine the table elements in which the corresponding operations will be entered. Consequently, the actual time at which an operation is executed depends on the scheduling of the preceding instruction as well as on the scheduling of the operation(s) assigned to the preceding pipeline stage within the current instruction. Furthermore, control hazards causing pipeline stalls and/or flushes influence the scheduling of the instruction following the occurrence of the hazard. A simplified illustration of the scheduling process is given in Figure 19.11. Figure 19.11(a) shows the pipeline state table after a branch instruction has been inserted, composed of the operations fetch, decode, branch, and update_pc as well as a stall operation. The table columns represent the pipeline stages; the rows represent consecutive cycles (with earlier cycles in upper rows). The arrows indicate activation chains.
The scheduling of a new instruction always follows the intra-instruction precedence, that is, fetch is scheduled before decode, decode before branch, and so on. The appropriate array element for fetch is determined by its assigned pipeline stage (FE) and according to inter-instruction precedences. Since the branch instruction follows the add instruction (which has already been scheduled), the fetch operation is inserted below the first operation of add (not shown in Figure 19.11[a]). The other operations are inserted according to their precedences. The stall of pipeline stage FE, which is issued by the decode operation of branch, is processed by tagging the respective table element as stalled. When the next instruction is scheduled, the stall is accounted for by moving its decode operation to the next table row, that is, the next cycle (see Figure 19.11[b]). Pipeline flushes are handled in a similar manner: if a selected table element is marked as flushed, the scheduling of the current instruction is abandoned. Assuming purely sequential instruction flow, the task of establishing a pipeline state table for the entire application program is very straightforward. However, every (sensible) application contains a certain amount of control flow (e.g., loops) interrupting this sequential execution. The occurrence of such control flow instructions makes the scheduling process extremely difficult or, in a few cases, even impossible.
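The core of this scheduling step can be sketched as follows (illustrative: operation ids instead of LISA operation pointers, and only the stall handling of the mechanism described above):

```c
#define T_ROWS   32            /* cycles                          */
#define T_STAGES 4             /* FE, DC, EX, WB                  */
#define STALLED  (-1)          /* tag for a stalled table element */

/* Pipeline state table: rows are consecutive cycles, columns are
   pipeline stages; zero means an empty element. */
static int state_table[T_ROWS][T_STAGES];

/* Insert one instruction: intra-instruction precedence advances one
   stage (and one cycle) per operation; a STALLED tag pushes the
   operation down to the next row, delaying all later operations of
   the same instruction as well. */
void schedule(int start_row, const int ops[T_STAGES])
{
    int row = start_row;
    for (int stage = 0; stage < T_STAGES; ++stage) {
        while (state_table[row][stage] == STALLED)
            ++row;                 /* account for pipeline stalls        */
        state_table[row][stage] = ops[stage];
        ++row;                     /* next stage executes one cycle later */
    }
}
```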
FIGURE 19.11 Inserting instructions into pipeline state table.
FIGURE 19.12 Pipeline behavior for a conditional branch.
Generally, all instructions modifying the program counter cause interruptions of the control flow. Furthermore, only instructions providing an immediate target address, that is, branches and calls whose target address is known at compile-time, can be scheduled statically. If indirect branches or calls occur, the simulator has to switch back to dynamic scheduling at run-time. Fortunately, most control flow instructions can be scheduled statically. Figure 19.12 shows, as an example, the pipeline states for a conditional branch instruction as found in the TMS320C54x's instruction set. Since the respective condition cannot be evaluated until the instruction is executed, scheduling has to be performed for both eventualities (condition true and condition false), splitting the program into alternative execution paths. The selection of the appropriate block of prescheduled pipeline states is performed by switching among different state tables at simulator run-time. In order to avoid doubling the entire pipeline state table each time a conditional branch occurs, alternative execution paths are left as soon as an already generated state has been reached. Unless several conditional instructions reside in the pipeline at the same time, these paths usually have a length of only a few rows.
19.5.2.3.2 Simulator Instantiation
After all instructions of the application program have been processed, and thus the entire operation schedule has been established, the simulator code can be instantiated. The simulation compiler backend thereby generates either C code or an operation table with the respective function pointers, both being alternative representations of the application program. Figure 19.13 shows a simplified excerpt of the generated C code for a branch instruction. Cases represent instructions, while a new line starts a new cycle.
    switch (pc) {
    case 0x1584: fetch(); decode(); sub(); write_registers();
    case 0x1585: fetch(); decode(); test_condition(); add();
    case 0x1586: branch(); write_registers();
                 fetch(); update_pc();
                 fetch(); decode();
                 fetch(); decode(); load(); goto_0x1400_;
    }

FIGURE 19.13 Generated simulator code.
19.5.2.4 Instruction-Based Code Translation
The need for a scheduling mechanism arises from the presence of an instruction pipeline in the LISA model. However, even instruction-accurate processor models without a pipeline benefit from the step of operation instantiation. The technique applied here is called instruction-based code translation. Due to the absence of instruction overlap, simulator code can be instantiated for each instruction independently, thus simplifying simulator generation to the concatenation of the respective behavioral code specified in the LISA description. In contrast to direct binary-to-binary translation techniques [38], the translation of target-specific into host-specific machine code uses C source code as an intermediate format. This keeps the simulator portable and thus independent of the simulation host. Since instruction-based code translation generates program code that increases linearly in size with the number of instructions in the application, the use of this simulation technique is restricted to small and medium-sized applications (less than ≈10k instructions, depending on model complexity). For large applications, the resulting poor cache utilization on the simulation host reduces the performance of the simulator significantly.
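For illustration, the output of instruction-based code translation for a hypothetical three-instruction program might look like this: one fragment of host C code per target instruction, concatenated with no dispatch loop in between (names are illustrative, not the actual generator output):

```c
static int acc;                            /* target processor resource */
static void load_imm(int v) { acc  = v; }  /* behavioral code taken     */
static void add_imm (int v) { acc += v; }  /* from the LISA description */

/* The "translated" application: behavioral code instantiated per
   instruction and simply concatenated into straight-line host code. */
int run_translated(void)
{
    acc = 0;
    load_imm(5);   /* 0x0000: LOAD #5 */
    add_imm(3);    /* 0x0001: ADD  #3 */
    add_imm(3);    /* 0x0002: ADD  #3 */
    return acc;
}
```

The linear growth restriction mentioned above follows directly from this structure: each target instruction contributes its own copy of host code.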
19.6 Architecture Implementation
As we are targeting the development of application-specific instruction-set processors (ASIPs), which are highly optimized for one specific application domain, the HDL code generated from a LISA processor description has to fulfill tight constraints to be an acceptable replacement for HDL code handwritten by experienced designers. Power consumption, chip area, and execution speed are especially critical for this class of architectures. For this reason, the LPDP platform does not claim to be able to efficiently synthesize the complete HDL code of the target architecture. The data path in particular is highly critical and must in most cases be optimized manually. Frequently, full-custom design techniques must be used to meet power consumption and clock speed constraints. The generated HDL code is therefore limited to the following parts of the architecture:
• Coarse processor structure, such as register set, pipeline, pipeline registers, and test interface.
• Instruction decoder, setting the data and control signals which are carried through the pipeline and activate the respective functional units executed in the context of the decoded instruction.
• Pipeline controller, handling the different pipeline interlocks and pipeline register flushes, and supporting mechanisms such as data forwarding.
Additionally, hardware operations as described in the LISA model can be grouped into functional units (see Section 19.3.2 — micro-architecture model). These functional units are generated as wrappers, that is, the ports of the functional units as well as the interconnects to the pipeline registers and other functional units are generated automatically, while the content needs to be filled manually with code. Driver conflicts emerging in the context of the interconnects are resolved automatically by the insertion of multiplexers.
The disadvantage of rewriting the data path in the HDL description by hand is that the behavior of hardware operations within those functional units has to be described and maintained twice: once in the LISA model and once in the HDL model of the target architecture. Consequently, verification is a problem here, which will be addressed in future research.
19.6.1 LISA Language Elements for HDL Synthesis
The following sections show in detail how the different parts of a LISA model contribute to the generated HDL model of the target architecture.
19.6.1.1 The Resource Section
The resource section provides general information about the structure of the architecture (e.g., registers, memories, and pipelines; see Section 19.3.2 — resource/memory model). Based on this information, the coarse structure of the architecture can be generated automatically. Figure 19.14 shows an excerpt from the resource declaration of the LISA model of the ICORE architecture [32], which was used in our case study. The ICORE architecture has two different register sets: one for general-purpose use, named R, consisting of eight separate registers of 32 bits width, and one for the address registers, named AR, consisting of four elements of eleven bits each. The round brackets indicate the maximum number of simultaneous accesses allowed for the respective register bank — six for the general-purpose register bank R and one for the address register set. From this, the respective number of access ports to the register banks can be generated automatically. With this information — bit-true widths, ranges, and access ports — the register banks can easily be synthesized. Moreover, a data memory and a program memory resource are declared, both 32 bits wide and with just one allowed access per cycle. Since many different memory types exist, are generally very technology dependent, and cannot be further specified in the LISA model, wrappers are generated with the appropriate number of access ports. Before synthesis, these wrappers need to be filled manually with code for the respective technology. The resources labelled as PORT are accessible from outside the model and can be attached to a testbench — in the ICORE, the RESET pin and the STATE_BUS. Besides processor resources such as memories, ports, and registers, pipelines and pipeline registers are also declared.
The ICORE architecture contains a four-stage instruction pipeline consisting of the stages FI (instruction fetch), ID (instruction decode), EX (instruction execution), and WB (write-back to registers). Between these pipeline stages, pipeline registers are located which forward information about the instruction, such as the instruction opcode, operand registers, etc. The declared pipeline registers are instantiated multiple times between the stages and are generated completely from the LISA model. For the pipeline and its stages, entities are created which, in a subsequent phase of the HDL generator run, are filled with code for the functional units, instruction decoder, pipeline controller, etc.

    RESOURCE {
        REGISTER       S32     R([0..7])6;    /* GP Registers      */
        REGISTER       bit[11] AR([0..3]);    /* Address Registers */

        DATA_MEMORY    S32     RAM([0..255]); /* Memory Space      */
        PROGRAM_MEMORY U32     ROM([0..255]); /* Instruction ROM   */

        PORT bit[1]  RESET;       /* Reset pin           */
        PORT bit[32] STATE_BUS;   /* Processor state bus */

        PIPELINE ppu_pipe = { FI; ID; EX; WB };
        PIPELINE_REGISTER IN ppu_pipe { bit[6] Opcode; ... };
    }

FIGURE 19.14 Resource declaration in the LISA model of the ICORE architecture.
FIGURE 19.15 Entity hierarchy in generated HDL model.
19.6.1.2 Grouping Operations to Functional Units
The LISA language describes the target architecture's behavior and timing at the granularity of hardware operations, whereas synthesis requires hardware operations to be grouped into functional units that can then be filled with hand-optimized HDL code for the data path. For this purpose, a well-known construct from the VHDL language was adopted: the ENTITY (see Section 19.3.2 — micro-architecture model). Using the ENTITY to group hardware operations into a functional unit provides essential information not only for the HDL code generator but also for retargeting the HLL C-compiler, which requires information about the availability of hardware resources to schedule instructions. As indicated in Section 19.6.1.1, the HDL code derived from the LISA resource section already comprises a pipeline entity, including further entities for each pipeline stage and the respective pipeline registers. The entities defined in the LISA model now become part of the respective pipeline stages, as shown in Figure 19.15. Here, a Branch entity is placed into the entity of the decode stage. Moreover, the EX stage contains an ALU and a Shifter entity. As hardware operations can be assigned to pipeline stages in LISA, this information is sufficient to locate the functional units within the pipeline stages they are assigned to. As already pointed out, the entities of the functional units are wrappers which need to be filled with HDL code by hand. Nevertheless, Section 19.6.2.1 will show that by far the largest part of the target architecture can be generated automatically from a LISA model.
19.6.1.3 Generation of the Instruction Decoder
The generated HDL decoder is derived from the information in the LISA model on the coding of instructions (see Section 19.3.2 — instruction-set model). Depending on the structuring of the LISA architecture description, decoder processes are generated in several pipeline stages.
The specified signal paths within the target architecture can be divided into data signals and control signals. The control signals are a straightforward derivation of the operation activation tree, which is part of the LISA timing model (see Section 19.3.2 — timing model). The data signals are modelled explicitly by the designer, by writing values into pipeline registers, and fixed implicitly by the declaration of the resources used in the behavior sections of LISA operations.
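Conceptually, a generated decoder process turns an instruction word into a bundle of control and data signals, for example (hypothetical encoding and signal names, not the actual ICORE decoder):

```c
#include <stdint.h>

/* Control/data signal bundle carried through the pipeline registers. */
typedef struct {
    int alu_en, shifter_en, wb_en;  /* control: which units to activate */
    uint8_t dst_reg;                /* data: destination register index */
} decode_signals_t;

decode_signals_t decode(uint32_t insn)
{
    decode_signals_t sig = {0, 0, 0, 0};
    uint8_t opcode = (insn >> 26) & 0x3F;  /* 6-bit opcode field         */
    sig.dst_reg    = (insn >> 23) & 0x07;  /* 8 GP registers -> 3 bits   */
    switch (opcode) {                      /* derived from the coding of */
    case 0x01: sig.alu_en = 1;     sig.wb_en = 1; break; /* e.g., ADD   */
    case 0x02: sig.shifter_en = 1; sig.wb_en = 1; break; /* e.g., SHL   */
    default: break;                                      /* NOP etc.    */
    }
    return sig;
}
```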
19.6.2 Implementation Results
The ICORE used in our case study is a low-power application-specific instruction-set processor (ASIP) for DVB-T acquisition and tracking algorithms. It was developed in cooperation with Infineon Technologies. The primary tasks of this architecture are FFT-window positioning, sampling-clock synchronization for interpolation/decimation, and carrier frequency offset estimation. In a previous project, this architecture was completely designed by hand using semi-custom design. A large amount of effort was thereby spent on optimizing the architecture toward extremely low power consumption while keeping the clock frequency up at 120 MHz. At that time, a LISA model had already been realized for architecture exploration purposes and for verifying the model against the handwritten HDL implementation.
FIGURE 19.16 The complete generated HDL model.
Except for the data path within the functional units, the HDL code of the architecture has been generated completely. Figure 19.16 shows the composition of the model. The dark boxes have been filled manually with HDL code, whereas the light boxes and the interconnects are the result of the generation process.
19.6.2.1 Comparison of Development Time
The LISA model of the ICORE as well as the original handwritten HDL model of the ICORE architecture were developed by one designer. The initial manual realization of the HDL model (without the time needed for architecture exploration) took approximately three months. As already indicated, a LISA model was built during this first realization of the ICORE for architecture exploration and verification purposes. It took the designer approximately one month to learn the LISA language and to create a cycle-accurate LISA model. After completion of the HDL generator, it took another two days to refine the LISA model to RTL accuracy. The handwritten functional units (data path) that were added manually to the generated HDL model could be completed in less than a week. This comparison clearly indicates that the time-expensive work in realizing the HDL model was creating the structure, controller, and decoder of the architecture. In addition, a major decrease in total architecture design time can be seen, as the LISA model results directly from the design exploration phase.
19.6.2.2 Gate Level Synthesis
To verify the feasibility of automatically generating HDL code from LISA architecture descriptions in terms of power consumption, clock speed, and chip area, a gate-level synthesis was carried out. The model was not changed (i.e., manually optimized) to enhance the results.
© 2006 by Taylor & Francis Group, LLC
A Novel Methodology for the Design of ASIPs
19.6.2.2.1 Timing and Size Comparison
The results of the gate-level synthesis regarding timing and area optimization were compared to those of the handwritten ICORE model, which comprises the same architectural features. Moreover, the same synthesis scripts were used for both models. It should be emphasized that the performance values are nearly the same for both models. Furthermore, it is interesting that the same critical paths were found in both the handwritten and the generated model. The critical paths occur exclusively in the data path, which confirms the presumption that the data path is the most critical part of the architecture and should therefore not be generated automatically from an abstract processor model.

19.6.2.2.2 Critical Path
The synthesis was performed with a clock period of 8 nsec, which equals a frequency of 125 MHz. The critical path, running from the pipeline register through the shifter unit and multiplexer to the next pipeline register, violates this timing constraint by 0.36 nsec. This matches the handwritten ICORE model, which was improved at this point manually at gate level. The longest combinational path of the ID stage runs through the decoder and the DAG entity and amounts to 3.7 nsec. Therefore, the generated decoder does not affect the critical path in any way.

19.6.2.2.3 Area
The synthesized area was a minor criterion, since the constraints for the handwritten ICORE model are not area sensitive. The total area of the generated ICORE model is 59,009 gates, of which combinational logic accounts for 57%. The handwritten ICORE model takes a total area of 58,473 gates. The most complex part of the generated ICORE is the decoder. The area of the automatically generated decoder in the ID stage is 4693 gates, whereas the area of the handwritten equivalent is 5500 gates.
This result must be considered carefully, as the control logic differs in some implemented features; for example, the handwritten decoder and program flow controller support an idle and a suspended state of the core.

19.6.2.2.4 Power Consumption Comparison
Figure 19.17 shows the comparison of power consumption of the handwritten versus the generated ICORE realization. The handwritten model consumes 12.64 mW, whereas the implementation generated from a LISA model consumes 14.51 mW. The slightly worse power consumption of the generated model is due to the early version of the LISA HDL generator, which in its current state allows access to all registers and memories within the model via the test interface. Without this unnecessary overhead, the same results as for the hand-optimized model are achievable.
FIGURE 19.17 Power consumption of different ICORE realizations: the handwritten ICORE consumes 12.64 mW, the generated ICORE 14.51 mW.
Embedded Systems Handbook

FIGURE 19.18 Graphical debugger frontend.
To summarize, this chapter has shown that it is feasible to generate efficient HDL code from architecture descriptions in the LISA language.
19.7 Tools for Application Development

The LPDP application software development tool-suite includes an HLL C-compiler, assembler, linker, and simulator, as well as a graphical debugger frontend. With these tools, a complete software development environment is available, ranging from the C/assembly source file up to simulation within a comfortable graphical debugger frontend. The tools are an enhanced version of those used for architecture exploration. For the software simulator, the enhancement is the ability to graphically visualize the debugging process of the application under test.

The LISA debugger frontend ldb is a generic GUI for the generated LISA simulator (see Figure 19.18). It visualizes the internal state of the simulation process. Both the C source code and the disassembly of the application, as well as all configured memories and (pipeline) registers, are displayed. All contents can be changed in the frontend at run-time of the application. The progress of the simulator can be controlled by stepping and running through the application and setting breakpoints.

The code generation tools (assembler and linker) are enhanced in functionality as well. The assembler supports more than 30 common assembler directives, labels, and symbols, named user sections, generation of a source listing and symbol table, and provides detailed error reporting and debugging facilities. The linker is driven by a powerful linker command file with the ability to link sections into different address spaces, paging support, and the possibility to define user-specific memory models.
19.7.1 Examined Architectures

To examine the quality of the generated software development tools, four different architectures have been considered. The architectures were carefully chosen to cover a broad range of architectural characteristics and are widely used in the fields of digital signal processing (DSP) and microcontrollers (µC). Moreover, the abstraction level of the models ranges from phase accuracy (TMS320C62x) to instruction-set accuracy (ARM7):

ARM7. The ARM7 core is a 32 bit microcontroller from Advanced RISC Machines Ltd [39]. The realization of a LISA model of the ARM7 µC at instruction-set accuracy took approx. two weeks.
ADSP2101. The Analog Devices ADSP2101 is a 16 bit fixed-point DSP with a 20 bit instruction-word width [40]. The realization of the LISA model of the ADSP2101 at cycle accuracy took approx. three weeks.

TMS320C54x. The Texas Instruments TMS320C54x is a high-performance 16 bit fixed-point DSP with a six-stage instruction pipeline [37]. The realization of the model at cycle accuracy (including pipeline behavior) took approx. eight weeks.

TMS320C62x. The Texas Instruments TMS320C62x is a general-purpose fixed-point DSP based on a very long instruction word (VLIW) architecture containing an eleven-stage pipeline [36]. The realization of the model at phase accuracy (including pipeline behavior) took approx. six weeks.

These architectures were modelled at the respective abstraction level with LISA, and software development tools were generated successfully. The speed of the generated tools was then compared with that of the tools shipped by the respective architecture vendor. Of course, the LISA tools work on the same level of accuracy as the vendor tools. The vendor tools exclusively use the interpretive simulation technique.
19.7.2 Efficiency of the Generated Tools

Measurements took place on an AMD Athlon system with a clock frequency of 800 MHz. The system is equipped with 256 MB of RAM and is part of a network. It runs the Linux operating system, kernel version 2.2.14. Tool compilation was performed with GNU GCC, version 2.92. The generation of the complete tool-suite (HLL C-compiler, simulator, assembler, linker, and debugger frontend) takes, depending on the complexity of the considered model, between 12 sec (ARM7 µC, instruction-set accurate) and 67 sec (C6x DSP, phase-accurate). Due to the early stage of research on the retargetable compiler (see Section 19.8), no results on code quality are presented.

19.7.2.1 Performance of the Simulator

Figures 19.19 to 19.22 show the speed of the generated simulators in instructions per second and cycles per second, respectively. Simulation speed was quantified by running an application on the respective simulator and counting the number of processed instructions/cycles. The set of applications simulated on the architectures comprises a simple 20-tap FIR filter, an ADPCM G.721 (Adaptive Differential Pulse Code Modulation) coder/decoder, and a GSM speech codec. For the ARM7, an ATM-QFC protocol application was additionally run, which is responsible for flow control and configuration in an ATM port-processor chip. As expected, the compiled simulation technique applied by the generated LISA simulators outperforms the vendor simulators by one to two orders of magnitude.
FIGURE 19.19 Speed of the ARM7 µC at instruction-accuracy (in mega-instructions per second) on the FIR, ADPCM, and ATM-QFC benchmarks: LISA compiled simulation (code translation and dynamic scheduling) versus the interpretive ARMulator and the real hardware running at 25 MHz.
FIGURE 19.20 Speed of the ADSP2101 DSP at cycle-accuracy (in mega-cycles per second) on the FIR, ADPCM, and GSM benchmarks: LISA compiled simulation (code translation and dynamic scheduling) versus the interpretive Analog Devices xsim2101, which reaches 0.01 mega-cycles per second on all three benchmarks.
FIGURE 19.21 Speed of the C54x DSP at cycle-accuracy (in mega-cycles per second) on the FIR, ADPCM, and GSM benchmarks: LISA compiled simulation (static and dynamic scheduling) versus the interpretive Texas Instruments sim54x, which reaches 0.075 mega-cycles per second on all three benchmarks.
FIGURE 19.22 Speed of the C6x DSP at cycle-accuracy (in kilo-cycles per second) on the FIR, ADPCM, and GSM benchmarks: LISA compiled simulation (dynamic scheduling) versus the interpretive Texas Instruments vendor simulator, which reaches 15 kilo-cycles per second on all three benchmarks.
As both the ARM7 and the ADSP2101 LISA models contain no instruction pipeline, two different flavors of compiled simulation are applied in the benchmarks: instruction-based code translation and dynamic scheduling (see Section 19.5.2.4). The results show that the highest possible degree of simulation compilation offers an additional speed-up of a factor of 2 to 7 compared to dynamically scheduled compiled simulation. As explained in Section 19.5.2.4, the speed-up decreases with bigger applications due to cache misses on the simulating host. It is interesting to see that, considering an ARM7 µC running at a frequency of 25 MHz, the software simulator running at 31 MIPS even outperforms the real hardware. This makes application development practical before the actual silicon is at hand.

The LISA model of the C54x DSP is cycle-accurate and contains an instruction pipeline. Therefore, compiled simulation with static scheduling is applied (see Section 19.5.2.3). This pays off with an additional speed-up of a factor of 5 compared to a dynamically scheduled compiled simulator. Due to the superscalar instruction dispatching mechanism used in the C62x architecture, which is highly run-time dependent, the LISA simulator for the C62x DSP uses only compiled simulation with dynamic scheduling. However, the dynamically scheduled compiled simulator still offers a significant speed-up of a factor of 65 compared to the native TI simulator.

19.7.2.2 Performance of Assembler and Linker
The generated assembler and linker are not as time critical as the simulator. It should be mentioned, though, that the performance (i.e., the number of assembled/linked instructions per second) of the automatically generated tools is comparable to that of the vendor tools.
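The speed advantage of compiled over interpretive simulation discussed above stems from paying the instruction-decoding cost once, at load time, instead of on every executed instruction. The following sketch illustrates the principle on an invented two-instruction toy ISA; all names are made up for illustration, and this is a sketch of the general technique only, not the LISA-generated simulator:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy ISA: high byte 0 = ADD r0, imm; nonzero = SUB r0, imm (imm in low byte).
struct Cpu { int32_t r0 = 0; };

enum class Op { Add, Sub };
struct Decoded { Op op; int32_t imm; };

static Decoded decode(uint16_t word) {
    return { (word >> 8) ? Op::Sub : Op::Add, int32_t(word & 0xFF) };
}

// Interpretive simulation: fetch *and decode* on every executed instruction.
void run_interpretive(Cpu& cpu, const std::vector<uint16_t>& prog) {
    for (uint16_t w : prog) {
        Decoded d = decode(w);                    // decoding cost paid each time
        cpu.r0 += (d.op == Op::Add) ? d.imm : -d.imm;
    }
}

// Compiled simulation: decode once at load time, then execute the
// pre-decoded program, amortizing the decoding cost over all runs.
std::vector<Decoded> compile(const std::vector<uint16_t>& prog) {
    std::vector<Decoded> out;
    for (uint16_t w : prog) out.push_back(decode(w));
    return out;
}

void run_compiled(Cpu& cpu, const std::vector<Decoded>& prog) {
    for (const Decoded& d : prog)
        cpu.r0 += (d.op == Op::Add) ? d.imm : -d.imm;
}
```

In a real compiled simulator the pre-decoded program would map to host code or function pointers rather than a struct array, but the amortization argument is the same.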
19.8 Requirements and Limitations

This section discusses the requirements and current limitations of different aspects of processor design using the LISA language. These concern the modelling capabilities of the language itself as well as the generated tools.
19.8.1 LISA Language

Common to all models described in LISA is the underlying zero-delay model: all transitions are carried out correctly at each control step. Control steps may be clock phases, clock cycles, instruction cycles, or even higher levels. Events between these control steps are not considered. However, this property meets the requirements that current co-simulation environments [41–43] place on processor simulators used for HW/SW co-design [44,45]. Besides, the LISA language currently contains no formalism to describe memory hierarchies such as multi-level caches. However, existing C/C++ models of memory hierarchies can easily be integrated into the LISA architecture model.
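As an illustration of the kind of C/C++ memory model that might be plugged into such a simulator, the sketch below shows a minimal direct-mapped cache that only tracks hits and misses for cycle accounting. All class and member names are hypothetical; the actual integration interface of the LISA environment is not shown here.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Minimal direct-mapped cache model of the kind that could be hooked into a
// processor simulator's memory access path (names are illustrative only).
class DirectMappedCache {
public:
    explicit DirectMappedCache(std::size_t lines) : tags_(lines, kInvalid) {}

    // Returns true on a hit; on a miss, installs the line. No data is stored:
    // the model only tracks hit/miss statistics for cycle accounting.
    bool access(uint32_t addr) {
        uint32_t    line = addr >> 5;             // 32-byte cache lines
        std::size_t idx  = line % tags_.size();   // direct-mapped placement
        if (tags_[idx] == line) { ++hits_; return true; }
        tags_[idx] = line;
        ++misses_;
        return false;
    }

    unsigned hits()   const { return hits_; }
    unsigned misses() const { return misses_; }

private:
    static constexpr uint32_t kInvalid = 0xFFFFFFFF;
    std::vector<uint32_t> tags_;
    unsigned hits_ = 0, misses_ = 0;
};
```

A cycle-accurate LISA model could call such an `access` method from its memory operations and add a miss penalty to the cycle count on a `false` return.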
19.8.2 HLL C-compiler

Due to the early stage of this research, no further details on the retargetable compiler are presented within the scope of this chapter. At its current status, the quality of the generated code is only fair. However, it is evident that the proposed new ASIP design methodology can only be carried out efficiently in the presence of an efficient retargetable compiler. In the case study presented in Section 19.6, major parts of the application were realized in assembly code.
19.8.3 HDL Generator

As LISA allows modelling the architecture using a combination of both LISA language elements and pure C/C++ code, certain coding guidelines need to be obeyed in order to generate synthesizable HDL code for the target architecture. First, only the LISA language elements are considered; thus the usage of C code in the model needs to be limited to the description of the data path, which is not taken into account for HDL code generation anyway. Second, architectural properties that can be modelled in LISA but are not synthesizable include pipelined functional units and multiple instruction-word decoders.
19.9 Conclusion and Future Work

In this chapter we presented the LISA processor design platform (LPDP), a novel framework for the design of application-specific integrated processors. The LPDP platform helps the architecture designer in different domains: architecture exploration, implementation, application software design, and system integration/verification. In a case study it was shown that an ASIP, the ICORE architecture, was completely realized using this novel design methodology, from exploration to implementation. The implementation results concerning maximum frequency, area, and power consumption were comparable to those of the hand-optimized version of the same architecture realized in a previous project. Moreover, the quality of the generated software development tools was compared to those of the semiconductor vendors. LISA models were realized and tools successfully generated for the ARM7 µC, the Analog Devices ADSP2101, the Texas Instruments C62x, and the Texas Instruments C54x at instruction-set, cycle, and phase accuracy, respectively. Due to the use of the compiled simulation principle, the generated simulators run one to two orders of magnitude faster than the vendor simulators. In addition, the generated assembler and linker can compete well in speed with the vendor tools.

Our future work will focus on modelling further real-world processor architectures and improving the quality of our retargetable C-compiler. In addition, formal ways to model memory hierarchies will be addressed. For the HDL generator, data path synthesis will be examined in the context of the SystemC modelling language.
References

[1] M. Birnbaum and H. Sachs, How VSIA answers the SOC dilemma. IEEE Computer, 32, 42–50, 1999.
[2] S. Pees, A. Hoffmann, V. Zivojnovic, and H. Meyr, LISA — machine description language for cycle-accurate models of programmable DSP architectures. In Proceedings of the Design Automation Conference (DAC). New Orleans, June 1999.
[3] V. Živojnović, S. Pees, and H. Meyr, LISA — machine description language and generic machine model for HW/SW co-design. In Proceedings of the IEEE Workshop on VLSI Signal Processing. San Francisco, October 1996.
[4] K. Olukotun, M. Heinrich, and D. Ofelt, Digital system simulation: methodologies and examples. In Proceedings of the Design Automation Conference (DAC), June 1998.
[5] J. Rowson, Hardware/software co-simulation. In Proceedings of the Design Automation Conference (DAC), 1994.
[6] R. Stallman, Using and Porting the GNU Compiler Collection, gcc-2.95 ed. Free Software Foundation, Boston, MA, 1999.
[7] G. Araujo, A. Sudarsanam, and S. Malik, Instruction set design and optimization for address computation in DSP architectures. In Proceedings of the International Symposium on System Synthesis (ISSS), 1996.
[8] C. Liem et al., Industrial experience using rule-driven retargetable code generation for multimedia applications. In Proceedings of the International Symposium on System Synthesis (ISSS), September 1995.
[9] D. Engler, VCODE: a retargetable, extensible, very fast dynamic code generation system. In Proceedings of the International Conference on Programming Language Design and Implementation (PLDI), May 1996.
[10] D. Bradlee, R. Henry, and S. Eggers, The Marion system for retargetable instruction scheduling. In Proceedings of the ACM SIGPLAN'91 Conference on Programming Language Design and Implementation. Toronto, Canada, 1991, pp. 229–240.
[11] B. Rau, VLIW compilation driven by a machine description database. In Proceedings of the 2nd Code Generation Workshop. Leuven, Belgium, 1996.
[12] M. Freericks, The nML machine description formalism. Technical Report 1991/15, Technische Universität Berlin, Fachbereich Informatik, Berlin, 1991.
[13] A. Fauth, J. Van Praet, and M. Freericks, Describing instruction set processors using nML. In Proceedings of the European Design and Test Conference. Paris, March 1995.
[14] M. Hartoog et al., Generation of software tools from processor descriptions for hardware/software codesign. In Proceedings of the Design Automation Conference (DAC), June 1997.
[15] W. Geurts et al., Design of DSP systems with chess/checkers. In Proceedings of the 2nd International Workshop on Code Generation for Embedded Processors. Leuven, March 1996.
[16] J. Van Praet et al., A graph based processor model for retargetable code generation. In Proceedings of the European Design and Test Conference (ED&TC), March 1996.
[17] V. Rajesh and R. Moona, Processor modeling for hardware software codesign. In Proceedings of the International Conference on VLSI Design. Goa, India, January 1999.
[18] G. Hadjiyiannis, S. Hanono, and S. Devadas, ISDL: an instruction set description language for retargetability. In Proceedings of the Design Automation Conference (DAC), June 1997.
[19] A. Halambi et al., EXPRESSION: a language for architecture exploration through compiler/simulator retargetability. In Proceedings of the Conference on Design, Automation & Test in Europe (DATE), March 1999.
[20] P. Paulin, Design automation challenges for application-specific architecture platforms. In Proceedings of the SCOPES 2001 — Workshop on Software and Compilers for Embedded Systems, March 2001.
[21] ACE — Associated Compiler Experts, The COSY Compilation System, 2001. http://www.ace.nl/products/cosy.html
[22] T. Morimoto, K. Saito, H. Nakamura, T. Boku, and K. Nakazawa, Advanced processor design using hardware description language AIDL. In Proceedings of the Asia South Pacific Design Automation Conference (ASPDAC), March 1997.
[23] I. Huang, B. Holmer, and A. Despain, ASIA: automatic synthesis of instruction-set architectures. In Proceedings of the SASIMI Workshop, October 1993.
[24] M. Gschwind, Instruction set selection for ASIP design. In Proceedings of the International Workshop on Hardware/Software Codesign, May 1999.
[25] S. Kobayashi et al., Compiler generation in PEAS-III: an ASIP development system. In Proceedings of the SCOPES 2001 — Workshop on Software and Compilers for Embedded Systems, March 2001.
[26] C.-M. Kyung, Metacore: an application specific DSP development system. In Proceedings of the Design Automation Conference (DAC), June 1998.
[27] M. Barbacci, Instruction set processor specifications (ISPS): the notation and its application. IEEE Transactions on Computers, C-30, 24–40, 1981.
[28] R. Gonzales, Xtensa: a configurable and extensible processor. IEEE Micro, 20, 2000.
[29] Synopsys, COSSAP. http://www.synopsys.com
[30] OPNET, http://www.opnet.com
[31] LISA Homepage, ISS, RWTH Aachen, 2001. http://www.iss.rwth-aachen.de/lisa
[32] T. Gloekler, S. Bitterlich, and H. Meyr, Increasing the power efficiency of application-specific instruction set processors using datapath optimization. In Proceedings of the IEEE Workshop on Signal Processing Systems (SIPS). Lafayette, October 2001.
[33] Synopsys, DesignWare Components, 1999. http://www.synopsys.com/products/designware/designware.html
[34] A. Hoffmann, A. Nohl, G. Braun, and H. Meyr, Generating production quality software development tools using a machine description language. In Proceedings of the Conference on Design, Automation & Test in Europe (DATE), March 2001.
[35] S. Pees, A. Hoffmann, and H. Meyr, Retargeting of compiled simulators for digital signal processors using a machine description language. In Proceedings of the Conference on Design, Automation & Test in Europe (DATE). Paris, March 2000.
[36] Texas Instruments, TMS320C62x/C67x CPU and Instruction Set Reference Guide, March 1998.
[37] Texas Instruments, TMS320C54x CPU and Instruction Set Reference Guide, October 1996.
[38] R. Sites et al., Binary translation. Communications of the ACM, 36, 69–81, 1993.
[39] Advanced RISC Machines Ltd., ARM7 Data Sheet, December 1994.
[40] Analog Devices, ADSP2101 User's Manual, September 1993.
[41] Synopsys, Eaglei, 1999. http://www.synopsys.com/products/hwsw
[42] Cadence, Cierto, 1999. http://www.cadence.com/technology/hwsw
[43] Mentor Graphics, Seamless, 1999. http://www.mentor.com/seamless
[44] L. Guerra et al., Cycle and phase accurate DSP modeling and integration for HW/SW co-verification. In Proceedings of the Design Automation Conference (DAC), June 1999.
[45] R. Earnshaw, L. Smith, and K. Welton, Challenges in cross-development. IEEE Micro, 17, 28–36, 1997.
20 State-of-the-Art SoC Communication Architectures

José L. Ayala and Marisa López-Vallejo
Universidad Politécnica de Madrid

Davide Bertozzi and Luca Benini
University of Bologna

20.1 Introduction
20.2 AMBA Bus
AMBA System Bus • AMBA AHB Basic Operation • Advanced Peripheral Bus • Advanced AMBA Evolutions
20.3 CoreConnect Bus
Processor Local Bus • On-Chip Peripheral Bus • Device Control Register Bus
20.4 STBus
Bus Topologies
20.5 Wishbone
The Wishbone Bus Transactions
20.6 SiliconBackplane MicroNetwork
System Interconnect Bandwidth • Configuration Resources
20.7 Other On-Chip Interconnects
Peripheral Interconnect Bus • Avalon • CoreFrame
20.8 Analysis of Communication Architectures
Scalability Analysis
20.9 Packet-Switched Interconnection Networks
20.10 Conclusions
References
20.1 Introduction

The current high levels of on-chip integration allow for the implementation of increasingly complex Systems-on-Chip (SoC), consisting of heterogeneous components such as general-purpose processors, Digital Signal Processors (DSPs), coprocessors, memories, I/O units, and dedicated hardware accelerators. In this context, MultiProcessor Systems-on-Chip (MPSoC) are emerging as an effective solution to meet the demand for computational power posed by application domains such as network processors and parallel media processors. MPSoCs combine the advantages of parallel processing with the high integration levels of SoCs. It is expected that future MPSoCs will integrate hundreds of processing units and storage elements, and their performance will be increasingly interconnect dominated [1]. Interconnect technology and
architecture will become the limiting factor for achieving operational goals, and the efficient design of low-power, high-performance on-chip communication architectures will pose novel challenges. The main issue regards the scalability of system interconnects, since the trend toward system integration is expected to continue. State-of-the-art on-chip buses rely on shared communication resources and on an arbitration mechanism that is in charge of serializing bus access requests. This widely adopted solution unfortunately suffers from power and performance scalability limitations; therefore, a lot of effort is being devoted to the development of advanced bus topologies (e.g., partial or full crossbars, bridged buses) and protocols, some of which are already implemented in commercially available products. In the long run, a more aggressive approach will be needed, and a design paradigm shift will most probably lead to packetized on-chip communication based on micronetworks of interconnects, or Networks-on-Chip (NoC) [2,3].

This chapter focuses on state-of-the-art SoC communication architectures, providing an overview of the most relevant ones from an industrial and research viewpoint. Beyond describing the distinctive features of each of them, the chapter sketches the main evolution guidelines for these architectures by means of a protocol and topology analysis framework. Finally, some basic concepts on packet-switched interconnection networks will be put forward. Open bus specifications such as the Advanced Microcontroller Bus Architecture (AMBA) and CoreConnect will obviously be described in more detail, providing the background needed to understand the more general description of proprietary industrial bus architectures, while at the same time making it possible to assess their contribution to the advance of the field.
20.2 AMBA Bus

AMBA is a bus standard originally conceived by ARM to support communication among ARM processor cores. However, nowadays AMBA is one of the leading on-chip busing systems because it is licensed and deployed for use with third-party Intellectual Property (IP) cores [4]. Designed for custom silicon, the AMBA specification provides standard bus protocols for connecting on-chip components, custom logic, and specialized functions. These bus protocols are independent of the ARM processor and generalized for different SoC structures.

AMBA defines a segmented bus architecture, wherein two bus segments are connected with each other via a bridge that buffers data and operations between them. A system bus is defined, which provides a high-speed, high-bandwidth communication channel between embedded processors and high-performance peripherals. Two system buses are actually specified: the AMBA Advanced High-performance Bus (AHB) and the Advanced System Bus (ASB). Moreover, a low-performance and low-power peripheral bus (called the Advanced Peripheral Bus, APB) is specified, which accommodates communication with general-purpose peripherals and is connected to the system bus via a bridge, acting as the only APB master. The overall AMBA architecture is illustrated in Figure 20.1.
20.2.1 AMBA System Bus

ASB is the first generation of the AMBA system bus and sits above the APB in that it implements the features required for high-performance systems, including burst transfers, pipelined transfer operation, and multiple bus masters. AHB is a later generation of the AMBA bus, intended to address the requirements of high-performance, high-clock-frequency synthesizable designs. ASB is used for simpler, more cost-effective designs, whereas more sophisticated designs call for the employment of AHB. For this reason, a detailed description of AHB follows. The main features of AMBA AHB can be summarized as follows:

Multiple bus masters. Optimized system performance is obtained by sharing resources among different bus masters. A simple request-grant mechanism is implemented between the arbiter and each bus master. In this way, the arbiter ensures that only one bus master is active on the bus and also that, when no masters are requesting the bus, a default master is granted.
SoC Communication Architectures
FIGURE 20.1 Schematic architecture of the AMBA bus: an ARM CPU, test interface controller, SRAM, external memory, SDRAM controller, and color LCD controller on the high-speed AMBA AHB system bus, connected through a bridge to low-power peripherals (audio codec interface, smart card interface, synchronous serial port, UART) on the AMBA APB peripheral bus.
Pipelined and burst transfers. Address and data phases of a transfer occur during different clock periods. In fact, the address phase of any transfer occurs during the data phase of the previous transfer. This overlapping of address and data is fundamental to the pipelined nature of the bus and allows for high-performance operation, while still providing adequate time for a slave to provide its response to a transfer. This also implies that ownership of the data bus is delayed with respect to ownership of the address bus. Moreover, support for burst transfers allows for efficient use of memory interfaces by providing transfer information in advance.

Split transactions. They maximize the use of bus bandwidth by enabling high-latency slaves to release the system bus during the dead time in which they complete processing of their access requests.

Wide data bus configurations. Support for high-bandwidth, data-intensive applications is provided using wide on-chip memories. System buses support 32-, 64-, and 128-bit data bus implementations with a 32-bit address bus, as well as smaller byte and half-word designs.

Nontristate implementation. AMBA AHB implements separate read and write data buses in order to avoid the use of tristate drivers. In particular, master and slave signals are multiplexed onto the shared communication resources (read and write data buses, address bus, control signals).

A typical AMBA AHB system contains the following components:

AHB master. Only one bus master at a time is allowed to initiate and complete read and write transactions. Bus masters drive out the address and control signals, and the arbiter determines which master has its signals routed to all of the slaves. A central decoder controls the read data and response signal multiplexer, which selects the appropriate signals from the slave that has been addressed.

AHB slave. It signals back to the active master the status of the pending transaction.
It can indicate that the transfer completed successfully, that there was an error, that the master should retry the transfer, or the beginning of a split transaction.

AHB arbiter. The bus arbiter serializes bus access requests. The arbitration algorithm is not specified by the standard and its selection is left as a design parameter (fixed priority, round-robin, latency-driven, etc.), although the request-grant based arbitration protocol has to be kept fixed.

AHB decoder. This is used for address decoding and provides the select signal to the intended slave.
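Since the standard fixes the request-grant protocol but leaves the arbitration algorithm open, a behavioral model of the arbiter can be very small. The following C++ sketch uses fixed priority with a default master, one of the several policies mentioned above; the class and member names are illustrative only, not part of any AMBA deliverable:

```cpp
#include <cassert>
#include <vector>

// Sketch of an AHB-style arbiter: among the asserted request lines, grant the
// lowest-numbered master (fixed priority; the AMBA spec leaves the policy
// open). When no master requests the bus, fall back to a default master.
class FixedPriorityArbiter {
public:
    FixedPriorityArbiter(int masters, int default_master)
        : n_(masters), default_(default_master) {}

    int grant(const std::vector<bool>& request) const {
        for (int m = 0; m < n_; ++m)
            if (request[m]) return m;   // highest priority = lowest index
        return default_;                // bus parked on the default master
    }

private:
    int n_, default_;
};
```

A round-robin or latency-driven policy would only change the body of `grant`; the request-grant interface toward the masters stays the same, mirroring the protocol/policy separation in the standard.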
20.2.2 AMBA AHB Basic Operation

In a normal bus transaction, the arbiter grants the bus to the master until the transfer completes, and the bus can then be handed over to another master. However, in order to avoid excessive arbitration latencies, the arbiter can break up a burst. In that case, the master must rearbitrate for the bus in order to complete the remaining data transfers.

A basic AHB transfer consists of four clock cycles. During the first one, the request signal is asserted; in the best case, a grant signal from the arbiter can be sampled by the master at the end of the second cycle. Then, address and control signals are asserted for slave sampling on the next rising edge, and during the last cycle the data phase is carried out (the read data bus is driven, or the information on the write data bus is sampled). A slave may insert wait states into any transfer, thus extending the data phase, and a ready signal is available for this purpose.

Four-, eight-, and sixteen-beat bursts are defined in the AMBA AHB protocol, as well as undefined-length bursts. During a burst transfer, the arbiter rearbitrates the bus when the penultimate address has been sampled, so that the asserted grant signal can be sampled by the respective master at the same point where the last address of the burst is sampled. This makes bus master handover at the end of a burst transfer very efficient.

For long transactions, the slave can decide to split the operation, warning the arbiter that the master should not be granted access to the bus until the slave indicates it is ready to complete the transfer. This transfer-splitting mechanism is supported by all advanced on-chip interconnects, since it prevents high-latency slaves from keeping the bus busy without performing any actual transfer of data. On the contrary, split transfers can significantly improve bus efficiency, that is, reduce the number of bus busy cycles used just for control (e.g., protocol handshake) and not for actual data transfers.
Advanced arbitration features are required in order to support split transfers, as well as more complex master and slave interfaces.
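As a rough illustration, the best-case transfer timing and the burst rearbitration point described above can be modeled with a toy cycle counter (a Python sketch under simplifying assumptions: no wait states, one address sampled per cycle; the signal names HBUSREQ and HGRANT follow AHB convention, everything else is illustrative):

```python
def ahb_basic_transfer():
    """Cycle-by-cycle phases of a best-case AMBA AHB transfer (no wait states)."""
    return [
        (1, "HBUSREQ asserted"),        # master requests the bus
        (2, "HGRANT sampled"),          # best case: grant sampled at end of cycle 2
        (3, "address/control driven"),  # slave samples them on the next rising edge
        (4, "data phase"),              # read data driven, or write data sampled
    ]

def rearbitration_cycle(burst_len, addr_start_cycle=1):
    """Cycle at which the arbiter rearbitrates during a fixed-length burst:
    when the penultimate address has been sampled, so that the next grant is
    sampled together with the last address of the burst."""
    return addr_start_cycle + burst_len - 2
```

For a four-beat burst whose first address is sampled on cycle 1, rearbitration happens on cycle 3, and the new grant lines up with the fourth (last) address on cycle 4.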
20.2.3 Advanced Peripheral Bus

The AMBA APB is intended for general-purpose, low-speed, low-power peripheral devices. It enables their connection to the main system bus via a bridge. All bus devices are slaves, the bridge being the only peripheral bus master. This is a static bus that provides simple addressing, with latched addresses and control signals for easy interfacing. ARM recommends a dual Read and Write bus implementation, but APB can be implemented with a single tristated data bus. The main features of this bus are the following:

• Unpipelined architecture
• Low gate count
• Low-power operation:
  (a) Reduced loading of the main system bus is obtained by isolating the peripherals behind the bridge.
  (b) Peripheral bus signals are only active during low-bandwidth peripheral transfers.

AMBA APB operation can be abstracted as a state machine with three states. The default state for the peripheral bus is IDLE, which switches to the SETUP state when a transfer is required. The SETUP state lasts just one cycle, during which the peripheral select signal is asserted. The bus then moves to the ENABLE state, which also lasts only one cycle and which requires the address, control, and data signals to remain stable. Then, if other transfers are to take place, the bus goes back to the SETUP state, otherwise to IDLE. As can be observed, AMBA APB should be used to interface to any peripherals that are low-bandwidth and do not require the high performance of a pipelined bus interface.
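The three-state operation just described can be captured in a small state-machine sketch (Python, illustrative only; the `pending` counter is a modeling convenience, not an APB signal):

```python
class APBBus:
    """Toy model of AMBA APB operation: IDLE -> SETUP -> ENABLE."""

    def __init__(self):
        self.state = "IDLE"
        self.pending = 0  # transfers waiting to be issued (modeling aid)

    def request(self, n=1):
        self.pending += n

    def step(self):
        """Advance one bus clock cycle and return the new state."""
        if self.state == "IDLE":
            # switch to SETUP when a transfer is required (PSEL asserted)
            self.state = "SETUP" if self.pending else "IDLE"
        elif self.state == "SETUP":
            self.state = "ENABLE"  # SETUP lasts exactly one cycle
        elif self.state == "ENABLE":
            self.pending -= 1      # transfer completes after one ENABLE cycle
            # back to SETUP if another transfer is queued, otherwise IDLE
            self.state = "SETUP" if self.pending else "IDLE"
        return self.state
```

Stepping the model with two queued transfers yields the SETUP/ENABLE alternation and the final return to IDLE described in the text.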
© 2006 by Taylor & Francis Group, LLC
SoC Communication Architectures
20.2.4 Advanced AMBA Evolutions

Recently, some advanced specifications of the AMBA bus have appeared, featuring increased performance and better link utilization. In particular, the Multi-Layer AHB and the AMBA AXI interconnect schemes will be briefly addressed in the following subsections. It should be observed that interconnect performance improvement can be achieved by adopting new topologies and by choosing new protocols, at the expense of silicon area. The former strategy leads from shared buses to bridged clusters, partial or full crossbars, and eventually to NoCs, in an attempt to increase available bandwidth and to reduce local contention. The latter strategy instead tries to maximize link utilization by adopting more sophisticated control schemes, thus permitting better sharing of existing resources. Multi-Layer AHB can be seen as an evolution of bus topology while keeping the AHB protocol unchanged. By contrast, AMBA AXI represents an advanced interconnect fabric protocol.

20.2.4.1 Multi-Layer AHB

The Multi-Layer AHB specification emerged with the aim of increasing the overall bus bandwidth and providing a more flexible interconnect architecture with respect to AMBA AHB. This is achieved by using a more complex interconnection matrix which enables parallel access paths between multiple masters and slaves in a system [5]. Therefore, the multi-layer bus architecture allows the interconnection of unmodified standard AHB master and slave modules with increased available bus bandwidth. The resulting architecture is very simple and flexible: each AHB layer has only one master, so no arbitration or master-to-slave muxing is needed within a layer. Moreover, the interconnect protocol implemented in these layers can be very simple: it does not have to support request and grant, nor retry or split transactions.
The additional hardware needed for this architecture with respect to AHB is a multiplexer to connect the multiple masters to the peripherals; some arbitration is also required at each slave port when more than one master wants to access the same slave simultaneously. Figure 20.2 shows a schematic view of the multi-layer concept. The interconnect matrix contains a decode stage for every layer in order to determine which slave is required during the transfer. The multiplexer is used to route the request from the specific layer to the desired slave. The arbitration protocol decides the sequence of accesses of layers to slaves based on a priority assignment; a layer with lower priority has to wait for the slave to be freed. Different arbitration schemes can be used, and every slave port has its own arbitration. Input layers can be served in a round-robin fashion, changing every transfer or every burst transaction, or based on a fixed priority scheme. The number of input/output ports on the interconnect matrix is completely flexible and can be adapted to suit system requirements. As the number of masters and slaves implemented in the system increases, the complexity of the interconnection matrix can become significant, and some optimization techniques have to be used: defining multiple masters on a single layer, making multiple slaves appear as a single slave to the interconnect matrix, and defining local slaves to a particular layer. Finally, it is interesting to note the capability of this topology to support multi-port slaves. Some devices, such as SDRAM controllers, work much more efficiently when processing transfers from different layers in parallel.

20.2.4.2 AMBA AXI Protocol

AXI is the latest-generation AMBA interface. It is designed to be used as a high-speed interconnect in deep-submicron designs, and also includes optional extensions for low-power operation [6].
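The per-slave-port arbitration of the Multi-Layer AHB interconnect matrix described in Section 20.2.4.1 can be sketched as follows (a Python toy model assuming one fixed-priority scheme; round-robin per slave port would be equally valid, and the layer and slave names are hypothetical):

```python
def route_multilayer(requests, priority):
    """Resolve one cycle of a multi-layer interconnect matrix.

    requests: {layer: slave} -- each layer (single-master) addresses one slave.
    priority: list of layers, highest priority first.
    Returns ({slave: granted_layer}, [stalled_layers]).
    """
    granted, stalled = {}, []
    for layer in priority:
        if layer not in requests:
            continue
        slave = requests[layer]
        if slave in granted:
            stalled.append(layer)   # slave busy: lower-priority layer waits
        else:
            granted[slave] = layer  # parallel paths for distinct slaves
    return granted, stalled
```

Two layers addressing distinct slaves proceed in parallel; only contention for the same slave stalls the lower-priority layer.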
This high-performance protocol provides flexibility in the implementation of interconnect architectures while still keeping backward-compatibility with existing AHB and APB interfaces. AMBA AXI builds upon the concept of point-to-point connection. AMBA AXI does not provide masters and slaves with visibility of the underlying interconnect, instead featuring the concept of master interfaces and symmetric slave interfaces. This approach, besides allowing seamless topology scaling, has
FIGURE 20.2 Schematic view of the multi-layer AHB interconnect.
the advantage of simplifying the handshake logic of attached devices, which only need to manage a point-to-point link. To provide high scalability and parallelism, five logical monodirectional channels are provided in AXI interfaces: a read address channel, a write address channel, a read data channel, a write data channel, and a write response channel. Activity on different channels is mostly asynchronous (e.g., data for a write can be pushed to the write data channel before or after the write address is issued to the write address channel) and can be parallelized, allowing multiple outstanding read and write requests. Figure 20.3(a) shows how a read transaction uses the read address and read data channels. The write operation over the write address and write data channels is presented in Figure 20.3(b). As can be observed, data is transferred from the master to the slave using the write data channel, and from the slave to the master using the read data channel. In write transactions, in which all the data flows from the master to the slave, the AXI protocol provides the additional write response channel to allow the slave to signal the completion of the write transaction to the master. Moreover, the AXI protocol is a definition of the interface between masters/slaves and the interconnect, and this enables a variety of different interconnect implementations. Therefore, the mapping of channels, as seen by the interfaces, to actual internal communication lanes is decided by the interconnect designer; single resources might be shared by all channels of a certain type in the system, or a variable number of dedicated signals might be available, up to a full crossbar scheme. The rationale of this split-channel implementation is based upon the observation that the required bandwidth for addresses is usually much lower than that for data (e.g., a burst requires a single address but maybe four or eight data transfers).
Availability of independently scalable resources might, for example, lead to medium-complexity designs sharing a single internal address channel while providing multiple read and write data channels. Finally, some of the key incremental features of the AXI protocol can be listed as follows:

• Support for out-of-order completion of transactions
• Easy addition of register stages to provide timing closure
• Support for multiple address issuing
• Separate read and write data channels to enable low-cost Direct Memory Access (DMA)
• Support for unaligned data transfers
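The decoupling of address and data channels, and the matching of out-of-order responses back to their requests, can be illustrated with a toy read-path model (Python sketch; transactions are tagged with IDs in the spirit of AXI, but this is not a signal-level model and all names are illustrative):

```python
class AxiReadChannelModel:
    """Toy model of AXI read address / read data channel decoupling.

    Multiple outstanding reads are issued on the address channel; the
    slave may return data out of order, matched back by transaction ID.
    """

    def __init__(self):
        self.outstanding = {}  # transaction ID -> requested address

    def issue_read(self, txn_id, addr):
        # read address channel: address tagged with an ID
        self.outstanding[txn_id] = addr

    def complete_read(self, txn_id, data):
        # read data channel: data tagged with the same ID, in any order
        addr = self.outstanding.pop(txn_id)
        return (txn_id, addr, data)
```

Issuing two reads and completing the second one first shows how out-of-order completion is resolved purely through the IDs.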
FIGURE 20.3 Architecture of transfers: (a) read operation, (b) write operation.
20.3 CoreConnect Bus

CoreConnect is an IBM-developed on-chip bus that eases the integration and reuse of processor, subsystem, and peripheral cores within standard product platform designs. It is a complete and versatile architecture clearly targeting high-performance systems, and many of its features might be overkill in simple embedded applications [7]. The CoreConnect bus architecture serves as the foundation of IBM Blue Logic™ and other non-IBM devices. The Blue Logic ASIC/SoC design methodology is the approach proposed by IBM [8] to extend conventional ASIC design flows to current design needs: low-power and multiple-voltage products, reconfigurable logic, custom design capability, and analog/mixed-signal designs. Each of these offerings requires a well-balanced coupling of technology capabilities and design methodology. The use of this bus architecture allows the hierarchical design of SoCs. As can be seen in Figure 20.4, the IBM CoreConnect architecture provides three buses for interconnecting cores, library macros, and custom logic:

• Processor Local Bus (PLB)
• On-Chip Peripheral Bus (OPB)
• Device Control Register (DCR) bus

The PLB connects the processor to high-performance peripherals, such as memories, DMA controllers, and fast devices. Bridged to the PLB, the OPB supports slower-speed peripherals. Finally, the DCR bus is a separate control bus that connects all devices, controllers, and bridges and provides a separate
FIGURE 20.4 Schematic structure of the CoreConnect bus.
path to set and monitor the individual control registers. It is designed to transfer data between the CPU's general-purpose registers and the slave logic's device control registers. It removes configuration registers from the memory address map, which reduces loading and improves the bandwidth of the PLB. This architecture shares many high-performance features with the AMBA bus specification: both architectures allow split, pipelined, and burst transfers, multiple bus masters, and 32-, 64-, or 128-bit architectures. In addition, CoreConnect also supports multiple masters on the peripheral bus. Design toolkits are available for the CoreConnect bus and include functional models, monitors, and a bus functional language to drive the models. These toolkits provide an advanced validation environment for engineers designing macros to attach to the PLB, OPB, and DCR buses.
20.3.1 Processor Local Bus

The PLB is the main system bus, targeting high-performance and low-latency on-chip communication. More specifically, PLB is a synchronous, multi-master, arbitrated bus. It supports concurrent read and write transfers, thus yielding a maximum bus utilization of two data transfers per clock cycle. Moreover, PLB implements address pipelining, which reduces bus latency by overlapping a new write request with an ongoing write transfer, and up to three read requests with an ongoing read transfer [9]. Access to PLB is granted through a central arbitration mechanism that allows masters to compete for bus ownership. This arbitration mechanism is flexible enough to allow the implementation of various priority schemes; in fact, four levels of request priority for each master allow PLB implementations with various arbitration priority schemes. Additionally, an arbitration locking mechanism is provided to support master-driven atomic operations. PLB also exhibits the ability to overlap the bus request/grant protocol with an ongoing transfer. The PLB specification describes a system architecture along with a detailed description of the signals and transactions. PLB-based custom logic systems require the use of a PLB macro to interconnect the various master and slave macros. The PLB macro is the key component of the PLB architecture, and consists of a bus arbitration control unit and the control logic required to manage the address and data flow through the PLB. Each PLB master is attached to the PLB through separate address, read data, and write data buses and a number of transfer qualifier signals, while PLB slaves are attached through shared, but decoupled, address, read data, and write data buses (each one with its own transfer control and status signals). The separate address and data buses from the masters allow simultaneous transfer requests. The PLB macro arbitrates among them and sends
the address, data, and control signals from the granted master to the slave bus. The slave response is then routed back to the appropriate master. Up to 16 masters can be supported by the arbitration unit, while there are no restrictions on the number of slave devices.
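A minimal sketch of the four-level priority arbitration might look like this (Python; the tie-breaking rule by master index is one possible secondary policy, not something mandated by the PLB specification):

```python
def plb_arbitrate(requests):
    """Pick the winning master for this cycle.

    requests: {master_index: priority} with priority in 0..3
    (four request levels per master, as in the PLB spec).
    Highest priority wins; ties broken by lowest master index.
    """
    assert len(requests) <= 16  # the arbitration unit supports up to 16 masters
    if not requests:
        return None
    # sort key: highest priority first, then lowest master index
    return min(requests, key=lambda m: (-requests[m], m))
```

With masters 3 and 7 both requesting at the top priority level, master 3 wins under this illustrative tie-break.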
20.3.2 On-Chip Peripheral Bus

The OPB typically connects low-bandwidth devices such as serial and parallel ports, UARTs, and timers, and represents a separate, independent level of bus hierarchy. It is implemented as a multi-master, arbitrated bus. It is a fully synchronous interconnect with a common clock, but its devices can run with slower clocks, as long as all of the clocks are synchronized with the rising edge of the main clock. This bus uses a distributed multiplexer attachment implementation instead of tristate drivers. The OPB supports multiple masters and slaves by implementing the address and data buses as distributed multiplexers. This type of structure is suitable for the less data-intensive OPB and allows peripherals to be added to a custom core logic design without changing the I/O on either the OPB arbiter or existing peripherals. All of the masters are capable of providing an address to the slaves, whereas both masters and slaves are capable of driving and receiving the distributed data bus. PLB masters gain access to the peripherals on the OPB through the OPB bridge macro. The OPB bridge acts as a slave device on the PLB and a master on the OPB. It supports word (32-bit), half-word (16-bit), and byte read and write transfers on the 32-bit OPB data bus, as well as bursts, and has the capability to perform target-word-first line read accesses. The OPB bridge performs dynamic bus sizing, allowing devices with different data widths to communicate efficiently. When the OPB bridge master performs an operation wider than the selected OPB slave can support, the bridge splits the operation into two or more smaller transfers. Some of the main features of the OPB specification are:

• Fully synchronous operation
• Dynamic bus sizing: byte, half-word, full-word, and double-word transfers
• Separate address and data buses
• Support for multiple OPB bus masters
• Single-cycle transfer of data between OPB bus masters and OPB slaves
• Sequential address (burst) protocol
• 16-cycle fixed bus timeout provided by the OPB arbiter
• Bus arbitration overlapped with the last cycle of bus transfers
• Optional OPB DMA transfers
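The dynamic bus sizing performed by the OPB bridge can be illustrated with a simple address/width split (Python sketch; byte addressing and naturally aligned accesses are assumed, and the function names are illustrative):

```python
def split_transfer(addr, width, slave_width):
    """Split a bridge operation wider than the selected slave supports
    into two or more smaller transfers (dynamic bus sizing sketch).

    Widths are in bytes: 1 (byte), 2 (half-word), 4 (word).
    Returns a list of (address, width) transfers.
    """
    if width <= slave_width:
        return [(addr, width)]  # slave can take the operation as-is
    # otherwise slice the access into consecutive slave-width transfers
    return [(a, slave_width) for a in range(addr, addr + width, slave_width)]
```

For example, a 32-bit write aimed at a 16-bit slave becomes two half-word transfers at consecutive addresses.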
20.3.3 Device Control Register Bus

The DCR bus provides an alternative path to the system for setting the individual device control registers. The latter are on-chip registers that, from an architectural viewpoint, are implemented outside the processor core. Through the DCR bus, the host CPU can set up the device control register sets without loading down the main PLB. This bus has a single master, the CPU interface, which can read or write the individual device control registers. The DCR bus architecture allows data transfers among OPB peripherals to occur independently from, and concurrently with, data transfers between processor and memory, or among other PLB devices. The DCR bus architecture is based on a ring topology connecting the CPU interface to all devices. The DCR bus is typically implemented as a distributed multiplexer across the chip, such that each subunit not only has a path to place its own DCRs on the CPU read path, but also has a path which bypasses its DCRs and places another unit's DCRs on the CPU read path. The DCR bus consists of a 10-bit address bus and a 32-bit data bus. This is a synchronous bus, wherein slaves may be clocked either faster or slower than the master, although synchronization with the DCR bus clock is required. Finally, bursts are not supported by this bus, and read or write transfers take a minimum of two cycles; optionally, they can be extended by slaves or by the single master.
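The distributed-multiplexer read path of the DCR ring can be sketched as follows (Python toy model; each "unit" is reduced to a dictionary mapping its 10-bit DCR addresses to 32-bit values, which is an illustrative simplification):

```python
def dcr_ring_read(units, addr):
    """Read one device control register over a DCR-style ring.

    Each unit either places its own register on the read path or
    passes the upstream value through (distributed multiplexer).
    units: list of dicts {dcr_address: value}, in ring order.
    """
    value = 0
    for regs in units:        # walk the ring starting from the CPU interface
        if addr in regs:
            value = regs[addr]  # this unit drives the read path
        # otherwise the unit bypasses its DCRs and forwards 'value' unchanged
    return value
```

Whichever unit owns the addressed register drives the read path; all downstream units simply forward the value back to the CPU interface.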
20.4 STBus

STBus is an STMicroelectronics proprietary on-chip bus protocol. STBus is dedicated to SoCs designed for high-bandwidth applications such as audio/video processing [10]. The STBus interfaces and protocols are closely related to the industry-standard VCI (Virtual Component Interface). The components interconnected by an STBus are either initiators (which initiate transactions on the bus by sending requests) or targets (which respond to requests). The bus architecture is decomposed into nodes (sub-buses in which initiators and targets can communicate directly), and internode communications are performed through First-In First-Out (FIFO) buffers. Figure 20.5 shows a schematic view of the STBus interconnect. STBus implements three different protocols that can be selected by the designer in order to meet complexity, cost, and performance constraints. From lower to higher, they can be listed as follows:

Type 1: Peripheral protocol. This type is the low-cost implementation for low/medium performance. Its simple design allows a synchronous handshake protocol and provides a limited transaction set. The peripheral STBus is targeted at modules that require a low-complexity, medium data rate communication path with the rest of the system. This typically includes standalone modules such as general-purpose input/output, or modules which require independent control interfaces in addition to their main memory interface.

Type 2: Basic protocol. In this case, the limited operation set of the peripheral interface is extended to a full operation set, including compound operations, source labeling, and some priority and transaction labeling. Moreover, this implementation supports split and pipelined accesses, and is aimed at devices which need high performance but do not require the additional system efficiency associated with shaped request/response packets or the ability to reorder outstanding operations.

Type 3: Advanced protocol.
The most advanced implementation upgrades the previous interfaces with support for out-of-order execution and shaped packets, and is equivalent to the advanced VCI protocol. Split and pipelined accesses are supported. It allows performance to be improved either by allowing more operations to occur concurrently, or by rescheduling operations more efficiently.

The type 2 protocol preserves the order of requests and responses. One constraint is that, when communicating with a given target, an initiator cannot send a request to a new target until it has received all the responses from the current target. The unanswered requests are called pending, and a pending request controller manages them. A given type 2 target is assumed to send its responses in the same order in which the requests arrived. In the type 3 protocol, the order of responses is not guaranteed, and an initiator can communicate with any target, even if it has not received all responses from a previous one.
FIGURE 20.5 Schematic view of the STBus interconnect.
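The type 2 ordering constraint described above (an initiator cannot address a new target while responses from the current target are still pending) can be sketched as a small pending-request controller (Python toy model; method and attribute names are illustrative, not STBus terminology):

```python
class PendingRequestController:
    """Toy model of the type 2 STBus ordering rule: an initiator may not
    address a new target until all responses from the current target
    have arrived."""

    def __init__(self):
        self.target = None   # target of the currently pending requests
        self.pending = 0     # outstanding (unanswered) requests

    def can_send(self, target):
        # allowed if nothing is pending, or if we stay on the same target
        return self.pending == 0 or target == self.target

    def send(self, target):
        assert self.can_send(target)
        self.target = target
        self.pending += 1

    def receive_response(self):
        self.pending -= 1
```

Under the type 3 protocol this check would simply be dropped, since responses may arrive out of order from any target.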
Associated with these protocols, hardware components have been designed in order to build complete reconfigurable interconnections between initiators and targets. A toolkit with a graphical interface has been developed around STBus to automatically generate the top-level backbone, cycle-accurate high-level models, a path to implementation, bus analysis (latencies, bandwidth), and bus verification (protocol and behavior). An STBus system includes three generic architectural components. The node arbitrates and routes the requests and, optionally, the responses. The converter is in charge of converting requests from one protocol to another (for instance, from basic to advanced). Finally, the size converter is used between two buses of the same type but of different widths; it includes buffering capability. The STBus can implement various arbitration strategies and allows them to be changed dynamically. In a simplified single-node system example, a communication between an initiator and a target is performed in several steps:

• A request/grant step between the initiator and the node takes place, corresponding to an atomic rendezvous operation of the system.
• The request is transferred from the node to the target.
• A response-request/grant step is carried out between the target and the node.
• The response-request is transferred from the node to the initiator.
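The four steps above can be transcribed directly as an event trace (Python sketch; the string format is purely illustrative):

```python
def stbus_transaction(initiator, target):
    """Order of events for one transaction on a single STBus node,
    following the four steps listed above."""
    return [
        f"{initiator} -> node : request/grant (rendezvous)",
        f"node -> {target} : request transferred",
        f"{target} -> node : response-request/grant",
        f"node -> {initiator} : response transferred",
    ]
```

The node mediates both directions: the initiator and the target each perform a request/grant handshake with the node, never with each other directly.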
20.4.1 Bus Topologies

STBus can instantiate different bus topologies, trading off communication parallelism against architectural complexity. In particular, system interconnects with different scalability properties can be instantiated, such as:

• Single shared bus: suitable for simple low-performance implementations. It features minimum wiring area but limited scalability.
• Full crossbar: targets complex high-performance implementations, at the cost of a large wiring area overhead.
• Partial crossbar: an intermediate solution with medium performance, implementation complexity, and wiring overhead.

It is worth observing that STBus allows for the instantiation of complex bus systems such as heterogeneous multi-node buses (thanks to size or type converters) and facilitates bridging with different bus architectures, provided proper protocol converters are made available (e.g., between STBus and AMBA).
20.5 Wishbone

The Wishbone SoC interconnect [11] defines two types of interfaces, called master and slave. Master interfaces are cores capable of generating bus cycles, while slave interfaces are capable of receiving bus cycles. Some relevant Wishbone features worth mentioning are the multi-master capability, which enables multiprocessing; the arbitration methodology, defined by end users according to their needs; and the scalable data bus widths and operand sizes. Moreover, the hardware implementation of bus interfaces is simple and compact, and the hierarchical view of the Wishbone architecture supports structured design methodologies [12]. The hardware implementation supports various IP core interconnection schemes, including: point-to-point connection, shared bus, crossbar switch implementation, dataflow interconnection, and off-chip interconnection. The crossbar switch interconnection is usually used when connecting two or more masters together so that every one can access two or more slaves. In this scheme, the master initiates an addressable bus cycle to a target slave. The crossbar switch interconnection allows more than one master to use the bus, provided they do not access the same slave. In this way, the master requests a channel on the switch and, once this is established, data is transferred in a point-to-point way.
On one hand, the overall data transfer rate of the crossbar switch is higher than that of shared bus mechanisms, and can be expanded to support extremely high data transfer rates. On the other hand, its main disadvantage is more complex interconnection logic and greater routing resources.
20.5.1 The Wishbone Bus Transactions

The Wishbone architecture defines different transaction cycles according to the action performed (read or write) and the blocking/nonblocking access type. For instance, single read/write transfers are carried out as follows: the master requests the operation and places the slave address onto the bus; the slave then places data onto the data bus and asserts an acknowledge signal; the master monitors this signal and releases the request signals when the data have been latched. Two or more back-to-back read/write transfers can also be strung together; in this case, the starting and stopping points of the transfers are identified by the assertion and negation of a specific signal [13]. A Read-Modify-Write (RMW) transfer is also specified, which can be used in multiprocessor and multitasking systems in order to allow multiple software processes to share common resources by using semaphores. This is commonly done on interfaces for disk controllers, serial ports, and memory. The RMW transfer reads and writes data to a memory location in a single bus cycle. For the correct implementation of this bus transaction, shared bus interconnects have to be designed in such a way that once the arbiter grants the bus to a master, it will not rearbitrate the bus until the current master gives it up. Also, it is important to note that a master device must support the RMW transfer in order for it to be effective, and this is generally done by means of special instructions forcing RMW bus transactions.
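The atomicity requirement can be illustrated with a toy shared-bus model in which the grant is held for the whole RMW cycle (Python sketch; the semaphore convention of 0 = free, 1 = taken is an assumption for illustration, not part of the Wishbone specification):

```python
class SharedBus:
    """Toy shared bus: once granted, the bus is not rearbitrated until
    the current master gives it up, making RMW transfers atomic."""

    def __init__(self, mem_size=16):
        self.mem = [0] * mem_size
        self.owner = None  # master currently holding the grant

    def rmw(self, master, addr, modify):
        """Read-Modify-Write: read and write in a single bus cycle."""
        assert self.owner in (None, master)  # no rearbitration mid-cycle
        self.owner = master
        old = self.mem[addr]          # read phase
        self.mem[addr] = modify(old)  # write phase, same bus cycle
        self.owner = None             # master releases the bus
        return old

def acquire_semaphore(bus, master, addr):
    """Try to take a semaphore: atomically read it and set it to 1.
    Returns True only if it was free (old value 0)."""
    return bus.rmw(master, addr, lambda _: 1) == 0
```

Because the read and the write happen under a single grant, two masters can never both observe the semaphore as free.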
20.6 SiliconBackplane MicroNetwork

SiliconBackplane MicroNetwork is a family of innovative communication architectures licensed by Sonics for use in SoC design. The Sonics architecture provides CPU independence, true mix-and-match of IP cores, a unified communication medium, and a structure that makes an SoC design simpler to partition, analyze, design, verify, and test [14]. The SiliconBackplane MicroNetwork allows high-speed pipelined transactions (the data bandwidth of the interconnect scales from 50 MB/sec to 4.8 GB/sec) where the real-time Quality of Service (QoS) of multiple simultaneous dataflows is guaranteed. A network utilization of up to 90% can be achieved. The SiliconBackplane relies on the SonicsStudio™ development environment for architectural exploration, and the availability of pre-characterization results enables reliable performance analysis and the reduction of interconnect timing closure uncertainties. The ultimate goal is to avoid over-designing interconnects. The architecture can be described as a distributed communication infrastructure (thus facilitating place-and-route) which can easily be extended hierarchically in the form of Tiles (collections of functions requiring minimal assistance from the rest of the die). Among other features, the SiliconBackplane MicroNetwork provides advanced error handling in hardware (features for SoC-wide error detection and support mechanisms for software clean-up and recovery of unresponsive cores), runtime operating reconfiguration to meet changing application demands, and data multicast. The SiliconBackplane system consists of a physical interconnect bus configured with a combination of agents. Each IP core communicates with an attached agent through ports implementing the Open Core Protocol (OCP) standard interface. The agents then communicate with each other using a network of interconnects based on the SiliconBackplane protocol.
The latter includes patented transfer mechanisms aimed at maximizing interconnect bandwidth utilization and optimized for streaming multimedia applications [15]. Figure 20.6 shows a schematic view of the SiliconBackplane system.
FIGURE 20.6 Schematic view of the SiliconBackplane system.
A few specific components can be identified in an agent architecture:

Initiators. These implement the interface between the bus and a master core (CPU, DSP, DMA, etc.). The initiator receives requests over the OCP, transmits the requests according to the SiliconBackplane standard, and finally processes the responses from the target.

Targets. These implement the interface between the physical bus and a slave device (memories, UARTs, etc.). This module serves as the bridge between the system and the OCP.

Service agents. These are enhanced initiators that provide additional capabilities such as debug and test.
20.6.1 System Interconnect Bandwidth

One of the most interesting features of the SiliconBackplane network is the possibility of allocating bandwidth based on a two-level arbitration policy. The system designer can preallocate bandwidth to high-priority initiators by means of Time-Division Multiple Access (TDMA). An initiator agent with a preassigned time slot has the rights over that slot. If the owner does not need it, the slot is reallocated in a round-robin fashion to one of the system devices, and this represents the second level of the arbitration policy. The TDMA approach provides fast access to variable-latency subsystems and is a simple mechanism to guarantee QoS. The TDMA bandwidth allocation tables are stored in a configuration register at every initiator, and can be dynamically overwritten to fit the system's needs. On the other hand, the fair round-robin allocation scheme can be used to guarantee bandwidth availability to initiators with less predictable access patterns, since some or many of the TDMA slots may turn out to be left unallocated. The round-robin arbitration policy is particularly suitable for best-effort traffic.
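The two-level policy can be sketched as follows (Python toy model; the table layout and the handling of unclaimed slots are illustrative assumptions, not the actual SiliconBackplane implementation):

```python
class TwoLevelArbiter:
    """Two-level arbitration sketch: a TDMA table preallocates slots to
    initiators; slots whose owner is idle (or unallocated, marked None)
    fall back to round-robin over all initiators."""

    def __init__(self, tdma_table, initiators):
        self.table = tdma_table      # slot index -> owning initiator, or None
        self.initiators = initiators
        self.rr = 0                  # round-robin pointer (second level)
        self.slot = 0                # advancing TDMA slot counter

    def grant(self, requesting):
        """Grant the current slot given the set of requesting initiators."""
        owner = self.table[self.slot % len(self.table)]
        self.slot += 1
        if owner in requesting:
            return owner             # first level: owner keeps its slot
        # second level: owner idle or slot unallocated -> round-robin
        for _ in range(len(self.initiators)):
            cand = self.initiators[self.rr]
            self.rr = (self.rr + 1) % len(self.initiators)
            if cand in requesting:
                return cand
        return None                  # nobody requesting this cycle
```

With a two-slot table reserving slot 0 for a DSP, the DSP always wins its own slot when it requests, while the unallocated slot rotates fairly among the remaining requesters.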
20.6.2 Configuration Resources

All the configurable IP cores implemented in the SiliconBackplane system can be configured either at compile time or dynamically by means of specific configuration registers. These configuration devices are accessible by the operating system. Configuration registers are individually set for each agent, depending upon the services provided to the attached cores. The types of configuration registers are:

• Unbuffered registers hold configuration values for the agent or its subsystem core.
• Buffered registers hold configuration values that must be simultaneously updated in all agents.
• Broadcast configuration registers hold values that must remain identical in multiple agents.
20.7 Other On-Chip Interconnects

20.7.1 Peripheral Interconnect Bus

The PI Bus was developed by several European semiconductor companies (Advanced RISC Machines, Philips Semiconductors, SGS-THOMSON Microelectronics, Siemens, and TEMIC/MATRA MHS) within the framework of a European project (OMI, the Open Microprocessor Initiative).1 Subsequently, an extended, backward-compatible PI Bus protocol standard, frequently used in many hardware systems, was developed by Philips [16]. The high bandwidth and low overhead of the PI Bus provide a comfortable environment for connecting processor cores, memories, coprocessors, I/O controllers, and other functional blocks in high-performance chips for time-critical applications. The PI Bus functional modules are arranged in macrocells, and a wide range of functions is provided. Macrocells with a PI Bus interface can be easily integrated into a chip layout even if they are designed by different manufacturers. The potential bus agents require only a PI Bus interface of low complexity. Since no concrete implementation is specified, PI Bus can be adapted to the individual requirements of the target chip design; for instance, the widths of the address and data buses may be varied. The main features of this bus are:

• Processor-independent implementation and design
• Demultiplexed operation
• Clock-synchronous operation
• Peak transfer rate of 200 MB/sec (50 MHz bus clock)
• Scalable address and data buses (up to 32 bits)
• 8-, 16-, and 32-bit data access
• Broad range of transfer types, from single to multiple data transfers
• Multi-master capability
The PI Bus does not provide cache coherency support, broadcasts, dynamic bus sizing, or unaligned data access. Finally, the University of Sussex has developed a VHDL toolkit to meet the needs of embedded system designers using the PI Bus. Macrocell testing for PI Bus compliance is also possible using the framework available in the toolkit [17].
20.7.2 Avalon

Avalon is Altera's parameterized interface bus used by the Nios embedded processor. The Avalon switch fabric has a set of predefined signal types with which a user can connect one or more IP blocks. It can only be implemented on Altera devices using SOPC Builder, a system development tool that automatically generates the Avalon switch fabric logic [18].

The Avalon switch fabric enables simultaneous multi-master operation for maximum system performance by using a technique called slave-side arbitration, which determines which master gains access to a certain slave in the event that multiple masters attempt to access the same slave at the same time. Simultaneous transactions for all bus masters are therefore supported, and arbitration for peripherals or memory interfaces shared among masters is included automatically.

The Avalon interconnect includes chip-select signals for all peripherals, even user-defined ones, to simplify the design of the embedded system. Separate, dedicated address and data paths provide an easy interface to on-chip user logic; user-defined peripherals are not required to decode data and address bus cycles. Dynamic bus sizing allows developers to use low-cost, narrow memory devices that do not match the native bus size of their CPU. The switch fabric supports each type of transfer defined by the Avalon interface. Each peripheral port into the switch is generated with the minimum amount of logic needed to meet the requirements of the peripheral, including wait-state logic, data-width matching, and the passing of wait signals.
¹ The PI Bus has been incorporated as OMI Standard OMI 324.3D.
SoC Communication Architectures
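The slave-side arbitration scheme described earlier can be sketched in a few lines. This is an illustration of the idea only, not Altera's implementation: in each cycle every slave arbitrates independently among the masters requesting it, so transactions aimed at different slaves proceed in parallel and only contention for the same slave is serialized.

```python
def arbitrate(requests):
    """Slave-side arbitration for one bus cycle.

    requests: list of (master, slave) pairs, in fixed priority order.
    Returns the granted pairs: each slave accepts at most one master,
    but distinct slaves are granted independently and concurrently.
    """
    granted = []
    busy_slaves = set()
    for master, slave in requests:
        if slave not in busy_slaves:
            busy_slaves.add(slave)
            granted.append((master, slave))
    return granted
```

With this scheme, a CPU fetching from SDRAM and a DMA engine writing a UART never contend with each other, which is exactly the "simultaneous multi-master operation" the text describes.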
Read and write operations with latency can be performed. Latent transfers are useful to masters that issue multiple sequential read or write requests to a slave which may require multiple cycles for the first transfer but fewer cycles for subsequent sequential transfers. This can be beneficial for instruction-fetch operations and for DMA transfers to or from SDRAM: in these cases, the CPU or DMA master may prefetch (post) multiple requests prior to completion of the first transfer and thereby reduce overall access latency. Interestingly, the Avalon interface includes signals for streaming data between master/slave pairs. These signals indicate the peripheral's capacity to provide or accept data, so a master does not have to access status registers in the slave peripheral to determine whether the slave can send or receive data. Streaming transactions maximize throughput between master–slave pairs while avoiding data overflow or underflow on the slave peripherals. This is especially useful for DMA transfers [19].
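The benefit of latent (pipelined) transfers can be put in simple numbers. The latency figures below are illustrative assumptions, not values from the Avalon specification:

```python
def total_cycles(n_transfers, first_latency, latent=True):
    """Cycles to complete n sequential reads from a slave whose first
    access costs `first_latency` cycles.

    With latent (pipelined) transfers the master posts requests back to
    back, so only the first access pays the full latency; without them,
    every access pays it.
    """
    if n_transfers == 0:
        return 0
    if latent:
        return first_latency + (n_transfers - 1)
    return n_transfers * first_latency
```

For a four-word fetch from a slave with a three-cycle first access, the non-latent case costs 12 cycles while the latent case costs 6, which is why prefetching masters such as DMA engines benefit most.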
20.7.3 CoreFrame

The CoreFrame architecture has been developed by Palmchip Corporation and relies on point-to-point signals and multiplexing instead of shared tristate lines. It aims at delivering high performance while simultaneously reducing design and verification time. The distinctive features of CoreFrame are [20]:

• 400 MB/sec bandwidth at 100 MHz (bus speed is scalable to technology and design requirements)
• Unidirectional buses only
• Central, shared memory controller
• Single-clock-cycle data transfers
• Zero-wait-state register accesses
• Separate peripheral I/O and DMA buses
• Simple protocol for reduced gate count
• Low capacitive loading for high-frequency operation
• Hidden arbitration for DMA bus masters
• Application-specific memory map and peripherals
The most distinctive feature of CoreFrame is the separation of I/O and memory transfers onto different buses: the PalmBus provides the I/O backplane and allows the processor to configure and control peripheral blocks, while the MBus provides a DMA connection from peripherals to main memory, allowing direct data transfers without processor intervention.

Other on-chip interconnects are not described here owing to lack of space: IPBus from IDT [21], IP Interface from Motorola [22], the MARBLE asynchronous bus from the University of Manchester [23], Atlantic from Altera [24], ClearConnect from ClearSpeed Technology [25], and FISPbus from Mentor Graphics [26].
20.8 Analysis of Communication Architectures

Traditional SoC interconnects, as exemplified by AMBA AHB, are based upon low-complexity shared buses, in an attempt to minimize area overhead. Such architectures, however, are not adequate to support the trend for SoC integration, motivating the need for more scalable designs. Interconnect performance improvement can be achieved by adopting new topologies and by choosing new protocols, at the expense of silicon area. The former strategy leads from shared buses to bridged clusters, partial or full crossbars, and eventually to NoCs, in an attempt to increase available bandwidth and to reduce local contention. The latter strategy instead tries to maximize link utilization by adopting more sophisticated control schemes, thus permitting a better sharing of existing resources. While both approaches can be followed at the same time, we perform separate analyses for the sake of clarity. At first, the scalability of evolving interconnect fabric protocols is assessed. Three state-of-the-art shared buses are stressed under an increasing traffic load: a traditional AMBA AHB link is compared against the more advanced, but also more expensive, evolutionary solutions offered by STBus (type 3) and AMBA AXI (based upon a Synopsys implementation).
These system interconnects were selected for analysis because their distinctive features sketch the evolution of shared-bus-based communication architectures.

AMBA AHB makes two data links available (one for read, one for write), but only one of them can be active at any time. Only one bus master can own the data wires at any time, preventing the multiplexing of requests and responses on the interconnect signals. Transaction pipelining (i.e., split ownership of data and address lines) is provided, but not as a means of allowing multiple outstanding requests, since address sampling is only allowed at the end of the previous data transfer. Bursts are supported, but only as a way to cut down on rearbitration times, and AHB slaves do not have a native notion of bursts. Overall, AMBA AHB is designed for a low silicon area footprint.

The STBus interconnect (with shared-bus topology) implements split request and response channels. This means that, while a system initiator is receiving data from an STBus target, another one can issue a second request to a different target. As soon as the response channel frees up, the second request can immediately be serviced, thus hiding target wait states behind those of the first transfer. The number of saved wait states depends on the depth of the prefetch FIFO buffers on the slave side. Additionally, the split-channel feature allows for multiple outstanding requests by masters, with support for out-of-order retirement. A further relevant feature of STBus is its low-latency arbitration, which is performed in a single cycle.

Finally, AMBA AXI builds upon the concept of point-to-point connection and exhibits complex features, such as multiple outstanding transaction support (with out-of-order or in-order delivery selectable by means of transaction IDs) and time interleaving of traffic toward different masters on internal data lanes.
Four different logical monodirectional channels are provided in AXI interfaces, and activity on them can be parallelized, allowing multiple outstanding read and write requests. In our protocol exploration, to provide a fair comparison, a "shared bus" topology is assumed, comprising a single internal lane for each of the four AXI channels.

Figure 20.7 shows an example of the efficiency improvements made possible by advanced interconnects in the test case of slave devices having two wait states, with three system processors and four-beat burst transfers.

FIGURE 20.7 Concept waveforms showing burst interleaving for the three interconnects. (a) AMBA AHB, (b) STBus (with minimal buffering), (c) STBus (with more buffering), and (d) AMBA AXI.

AMBA AHB has to pay two cycles of penalty per transferred datum. STBus is able to hide the latencies of subsequent transfers behind those of the first one, with an effectiveness that is a function of the available buffering. AMBA AXI is capable of interleaving transfers by sharing data-channel ownership in time. Under conditions of peak load, when transactions always overlap, AMBA AHB is limited to a 33% efficiency (transferred words over elapsed clock cycles), while both STBus and AMBA AXI can theoretically reach a 100% throughput.
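The 33% versus 100% efficiency figures follow directly from the cycle accounting. The toy calculation below is our own illustration of that accounting, not part of the handbook's simulation models, and it reproduces the figures for slaves with two wait states and three overlapping masters:

```python
def ahb_efficiency(wait_states):
    """AMBA AHB cannot overlap transactions: every datum costs
    (wait_states + 1) cycles, so efficiency is 1 / (wait_states + 1)."""
    return 1.0 / (wait_states + 1)


def interleaved_efficiency(wait_states, n_masters):
    """With split/interleaved channels (as in STBus or AXI), the wait
    states of one master are hidden behind transfers of the others;
    with enough concurrent masters a datum moves every cycle."""
    return min(1.0, n_masters / (wait_states + 1.0))
```

With two wait states, AHB transfers one word every three cycles (33%), and three interleaved masters are exactly enough to keep the data channel busy every cycle (100%).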
20.8.1 Scalability Analysis

SystemC models of AMBA AHB, AMBA AXI (provided within the Synopsys CoCentric/Designware® [27] suites), and STBus are used within the framework of the MPARM simulation platform [28–30]. For the STBus model, the depth of the FIFOs instantiated by the target side of the interconnect is a configurable parameter; their impact can be noticed in the concept waveforms in Figure 20.7. 1-stage ("STBus" hereafter) and 4-stage ("STBus [B]") FIFOs were benchmarked. The simulated on-chip multiprocessor consists of a configurable number of ARM cores attached to the system interconnect. Traffic workload and pattern can easily be tuned by running different benchmark code on the cores, by scaling the number of system processors, or by changing the amount of processor cache, which leads to different amounts of cache refills. Slave devices are assumed to introduce one wait state before responses.

To assess interconnect scalability, a benchmark runs independently but concurrently on every system processor, performing accesses to its private slave (involving bus transactions). This means that, while producing realistic functional traffic patterns, the test setup was not constrained by bottlenecks owing to shared slave devices. The scalability properties of the system interconnects can be observed in Figure 20.8, which reports the execution time variation when attaching an increasing number of system cores to a single shared interconnect under heavy traffic load. Core caches are kept very small (256 bytes) in order to cause many cache misses and therefore significant levels of interconnect congestion. Execution times are normalized against those for a two-processor system, trying to isolate the scalability factor alone.
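The saturation behavior underlying such an experiment can be mimicked with a back-of-the-envelope model. This is our own sketch with made-up parameters, not the MPARM setup: each core overlaps private computation with bus transactions, but the shared bus serves one transaction at a time, so beyond a certain core count execution time is set by total bus demand rather than by per-core work.

```python
def exec_time(n_cores, compute=1000, n_tx=100, tx_cycles=3):
    """Approximate execution time of n identical cores on one shared bus.

    Each core needs `compute` private cycles plus `n_tx` bus
    transactions of `tx_cycles` each; the bus serializes all cores'
    transactions, so the run is bounded by whichever dominates:
    a single core's work, or the aggregate bus demand.
    """
    per_core = compute + n_tx * tx_cycles
    bus_demand = n_cores * n_tx * tx_cycles
    return max(per_core, bus_demand)


def normalized(n_cores):
    """Execution time relative to the two-core system."""
    return exec_time(n_cores) / exec_time(2)
```

Below saturation the normalized time stays at 1.0; once aggregate bus demand exceeds per-core work, it grows linearly with the number of cores, which is qualitatively the degradation measured for AHB.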
The heavy bus congestion case is considered here because the same analysis performed under light traffic conditions (e.g., with 1 kB caches) shows that all of the interconnects perform very well (they always stay close to 100%), with only AHB showing a moderate performance decrease of 6% when moving from two to eight running processors. With 256-byte caches, the resulting execution times, as Figure 20.8 shows, get 77% worse for AMBA AHB when moving from two to eight cores, while AXI and STBus manage to stay within 12% and 15%, respectively. The impact of FIFOs in STBus is noticeable, since the interconnect with minimal buffering shows execution times 36% worse than in the two-core setup.

The reason behind the behavior pointed out in Figure 20.8 is that, under heavy traffic load and with many processors, interconnect saturation takes place. This is clearly indicated in Figure 20.9, which reports the fraction of cycles during which some transaction was pending on the bus with respect to total execution time. In such a congested environment, as Figure 20.10 shows, AMBA AXI and STBus (with 4-stage FIFOs) are able to achieve transfer efficiencies (defined as data actually moved over bus contention time) of up to 81% and 83%, respectively, while AMBA AHB reaches only 47%, near its maximum theoretical efficiency of 50% (one wait state per data word). These plots stress the impact that comparatively low-area-overhead optimizations can sometimes have in complex systems.

According to simulation results, some of the advanced features in AMBA AXI provided highly scalable bandwidth, but at the price of latency in low-contention setups. Figure 20.11 shows the minimum and average number of cycles required to complete a single write and a burst read transaction in STBus and AMBA AXI. STBus has minimal overhead for transaction initiation, as low as a single cycle if communication resources are free.
This is confirmed by the figures, which show a best-case three-cycle latency for single accesses (initiation, wait state, data transfer) and a nine-cycle latency for four-beat bursts. AMBA AXI, owing to its complex channel management and arbitration, requires more time to initiate and close a transaction: the recorded minimum completion times are 6 and 11 cycles for single writes and burst reads, respectively. As bus traffic increases, the completion latencies of AMBA AXI and STBus become more and more similar, because the bulk of the transaction latency is spent in contention.

FIGURE 20.8 Execution times with 256 bytes caches.

FIGURE 20.9 Bus busy time with 256 bytes caches.

It must be pointed out, however, that protocol improvements alone cannot overcome the intrinsic performance bound owing to the shared nature of the interconnect resources. While protocol features can push the saturation boundary further and get near to a 100% efficiency, traffic loads taking advantage of more parallel topologies will always exist. The charts reported here already show some traces of saturation even for the most advanced interconnects. However, the improved performance achieved by more parallel topologies strongly depends on the kind of bus traffic. In fact, if the traffic is dominated by accesses to
FIGURE 20.10 Bus usage efficiency with 256 bytes caches.

FIGURE 20.11 Transaction completion latency with 256 bytes caches.
shared devices (shared memory, semaphores, interrupt module), such accesses have to be serialized anyway, thus reducing the effectiveness of area-hungry parallel topologies. It is therefore evident that crossbars behave best when data accesses are local and no destination conflicts arise. This is reflected in Figure 20.12, showing average completion latencies of read accesses for different bus topologies: shared buses (AMBA AHB and STBus), partial crossbars (STBus-32 and STBus-54), and full crossbars (STBus-FC). Four benchmarks are considered, consisting of matrix multiplications performed independently by each processor or in pipeline, with or without an underlying OS (Operating System)
FIGURE 20.12 Reads average latency.
(OS-IND, OS-PIP, ASM-IND, and ASM-PIP, respectively). IND benchmarks do not give rise to interprocessor communication, which is instead at the core of the PIP benchmarks. Communication goes through the shared memory. Moreover, OS-assisted code implicitly uses both semaphores and interrupts, while standalone ASM applications rely on an explicit semaphore polling mechanism for synchronization purposes.

Crossbars show a substantial advantage in the OS-IND and ASM-IND benchmarks, wherein processors only access private memories: this operation is obviously suitable for parallelization. Both ST-FC and ST-54 achieve the minimum theoretical latency, since no conflict on private memories ever arises. ST-32 trails immediately behind ST-FC and ST-54, with rare conflicts which do not occur systematically because execution times shift among conflicting processors. OS-PIP still shows a significant improvement for crossbar designs. ASM-PIP, in contrast, puts ST-BUS on the same level as the crossbars, and sometimes the shared bus even proves slightly faster. This can be explained by the continuous semaphore polling performed by this (and only this) benchmark: while crossbars may have an advantage in private memory accesses, the resulting speedup only gives processors more opportunities to poll the semaphore device, which becomes a bottleneck. The unpredictability of conflict patterns can then explain why a simple shared bus can sometimes slightly outperform crossbars; the selection of bus topology should therefore carefully match the target communication pattern.
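The semaphore-polling effect can be captured with a two-line cost model. The cycle counts are illustrative assumptions only: on a shared bus every access from every core is serialized, while a crossbar parallelizes private-memory accesses but still serializes all accesses to the single semaphore device.

```python
def shared_bus_cycles(n_cores, private_acc, sem_acc):
    """One shared bus: all accesses from all cores are serialized."""
    return n_cores * (private_acc + sem_acc)


def crossbar_cycles(n_cores, private_acc, sem_acc):
    """Full crossbar: private-memory accesses proceed in parallel
    across cores, but the shared semaphore device serializes its own
    accesses, so they still cost n_cores * sem_acc in total."""
    return private_acc + n_cores * sem_acc
```

For an IND-style workload (no semaphore traffic) the crossbar finishes n times sooner; for a polling-dominated ASM-PIP-style workload the semaphore term dominates both expressions and the gap almost vanishes, matching the observed behavior.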
20.9 Packet-Switched Interconnection Networks

The previous sections have illustrated on-chip interconnection schemes based on shared buses and on evolutionary communication architectures. This section introduces a more revolutionary approach to on-chip communication, known as Network-on-Chip (NoC) [2,3]. The NoC architecture consists of a packet-switched interconnection network integrated onto a single chip, and it is likely to better support the trend for SoC integration. The basic idea is borrowed from the wide-area networks domain, and envisions router (or switch)-based networks of interconnects on which on-chip packetized communication takes place. Cores access the network by means of proper interfaces, and have their packets forwarded to destination through a certain number of hops. SoCs differ from wide-area networks in their local proximity and because they exhibit less nondeterminism. Local, high-performance networks, such as those developed for large-scale multiprocessors, have
similar requirements and constraints. However, some distinctive features, such as energy constraints and design-time specialization, are unique to SoC networks.

Topology selection for NoCs is a critical design issue. It is determined by how efficiently the communication requirements of an application can be mapped onto a certain topology, and by physical-level considerations. In fact, regular topologies can be designed with better control over electrical parameters and therefore over communication noise sources (such as crosstalk), although they might result in link under-utilization or localized congestion from an application viewpoint. On the contrary, irregular topologies have to deal with more complex physical design issues, but are more suitable for implementing customized, domain-specific communication architectures. Two-dimensional mesh networks are a reference solution for regular NoC topologies.

The scalable and modular nature of NoCs and their support for efficient on-chip communication potentially lead to NoC-based multiprocessor systems characterized by high structural complexity and functional diversity. On one hand, these features need to be properly addressed by means of new design methodologies, while on the other hand more effort has to be devoted to modeling on-chip communication architectures and integrating them into a single modeling and simulation environment combining both processing elements and communication architectures. The development of NoC architectures and their integration into a complete MPSoC design flow is the main focus of an ongoing worldwide research effort [30–33].
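For a regular 2D mesh, a deterministic dimension-ordered (XY) routing function, one of the simplest choices for such a topology, can be sketched as follows. This is a generic illustration, not tied to any particular NoC discussed above:

```python
def xy_route(src, dst):
    """Hop-by-hop path in a 2D mesh using XY routing: move along the
    x dimension first, then along y. Returns the list of visited nodes
    as (x, y) coordinates, including source and destination."""
    (x, y), (dx, dy) = src, dst
    path = [(x, y)]
    while x != dx:
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:
        y += 1 if dy > y else -1
        path.append((x, y))
    return path


def hop_count(src, dst):
    """The minimal hop count in a mesh is the Manhattan distance."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])
```

XY routing is popular for meshes precisely because it is minimal (path length equals the Manhattan distance) and deadlock-free without virtual channels, at the cost of offering no adaptivity around congested links.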
20.10 Conclusions

This chapter has addressed the critical issue of on-chip communication for gigascale MPSoCs. An overview of the most widely used on-chip communication architectures was provided, and evolution guidelines aimed at overcoming scalability limitations were sketched. Advances concern both communication protocols and topologies, although it is becoming clear that in the long term more aggressive approaches, namely packet-switched interconnection networks, will be required to sustain system performance.
References

[1] R. Ho, K.W. Mai, and M.A. Horowitz. The future of wires. Proceedings of the IEEE, 89: 490–504, 2001.
[2] L. Benini and G. De Micheli. Networks on chips: a new SoC paradigm. IEEE Computer, 35: 70–78, 2002.
[3] J. Henkel, W. Wolf, and S. Chakradhar. On-chip networks: a scalable, communication-centric embedded system design paradigm. In Proceedings of the International Conference on VLSI Design, January 2004, pp. 845–851.
[4] ARM. AMBA Specification v2.0, 1999.
[5] ARM. AMBA Multi-Layer AHB Overview, 2001.
[6] ARM. AMBA AXI Protocol Specification, 2003.
[7] IBM Microelectronics. CoreConnect Bus Architecture Overview, 1999.
[8] G.W. Doerre and D.E. Lackey. The IBM ASIC/SoC methodology. A recipe for first-time success. IBM Journal of Research & Development, 46: 649–660, 2002.
[9] IBM Microelectronics. The CoreConnect Bus Architecture White Paper, 1999.
[10] P. Wodey, G. Camarroque, F. Barray, R. Hersemeule, and J.P. Cousin. LOTOS code generation for model checking of STBus based SoC: the STBus interconnection. In Proceedings of ACM and IEEE International Conference on Formal Methods and Models for Co-Design, June 2003, pp. 204–213.
[11] Richard Herveille. Combining WISHBONE Interface Signals. Application Note, April 2001.
[12] Richard Herveille. WISHBONE System-on-Chip (SoC) Interconnection Architecture for Portable IP Cores. Specification, 2002.
[13] Rudolf Usselmann. OpenCores SoC Bus Review, 2001.
[14] Sonics Inc. µ-Networks. Technical Overview, 2002.
[15] Sonics Inc. SiliconBackplane III MicroNetwork IP. Product Brief, 2002.
[16] Philip de Nier. Property checking of PI-bus modules. In Proceedings of the Workshop on Circuits, Systems and Signal Processing (ProRISC99), J.P. Veen, Ed. STW, Technology Foundation, Mierlo, The Netherlands, 1999, pp. 343–354.
[17] ESPRIT, 1996, http://www.cordis.lu/esprit/src/results/res_area/omi/omi10.htm
[18] Altera. AHB to Avalon & Avalon to AHB Bridges, 2003.
[19] Altera. Avalon Bus Specification, 2003.
[20] Palmchip. Overview of the CoreFrame Architecture, 2001.
[21] IDT. IDT Peripheral Bus (IPBus). Intermodule Connection Technology Enables Broad Range of System-Level Integration, 2002.
[22] Motorola. IP Interface. Semiconductor Reuse Standard, 2001.
[23] W.J. Bainbridge and S.B. Furber. MARBLE: an asynchronous on-chip macrocell bus. Microprocessors and Microsystems, 24: 213–222, 2000.
[24] Altera. Atlantic Interface. Functional Specification, 2002.
[25] ClearSpeed. ClearConnect Bus. Scalable High Performance On-Chip Interconnect, 2003.
[26] Summary of SoC Interconnection Buses, 2004, http://www.silicore.net/uCbusum.htm
[27] Synopsys CoCentric, 2004, http://www.synopsys.com
[28] L. Benini, D. Bertozzi, D. Bruni, N. Drago, F. Fummi, and M. Poncino. SystemC cosimulation and emulation of multiprocessor SoC designs. IEEE Computer, 36: 53–59, 2003.
[29] F. Poletti, D. Bertozzi, A. Bogliolo, and L. Benini. Performance analysis of arbitration policies for SoC communication architectures. Journal of Design Automation for Embedded Systems, 8: 189–210, 2003.
[30] M. Loghi, F. Angiolini, D. Bertozzi, L. Benini, and R. Zafalon. Analyzing on-chip communication in a MPSoC environment. In Proceedings of the IEEE Design Automation and Test in Europe Conference (DATE04), February 2004, pp. 752–757.
[31] E. Rijpkema, K. Goossens, and A. Radulescu. Trade-offs in the design of a router with both guaranteed and best-effort services for networks on chip. In Proceedings of Design Automation and Test in Europe, March 2003, pp. 350–355.
[32] K. Lee et al. A 51 mW 1.6 GHz on-chip network for low-power heterogeneous SoC platform. In ISSCC Digest of Technical Papers, 2004, pp. 152–154.
[33] E. Bolotin, I. Cidon, R. Ginosar, and A. Kolodny. QNoC: QoS architecture and design process for network on chip. The Journal of Systems Architecture, Special Issue on Networks on Chip, 50(2–3): 105–128, February 2004.
21
Network-on-Chip Design for Gigascale Systems-on-Chip

Davide Bertozzi and Luca Benini
University of Bologna

Giovanni De Micheli
Stanford University

21.1 Introduction 21-1
21.2 Design Challenges for On-Chip Communication Architectures 21-3
21.3 Related Work 21-4
21.4 NoC Architecture 21-5
    Network Link • Switch • Network Interface
21.5 NoC Topology 21-13
    Domain-Specific NoC Synthesis Flow
21.6 Conclusions 21-16
Acknowledgment 21-17
References 21-17
21.1 Introduction

The increasing integration densities made available by shrinking device geometries will have to be exploited to meet the computational requirements of parallel applications, such as multimedia processing, automotive, multiwindow TV, ambient intelligence, etc. As an example, systems designed for ambient intelligence will be based on high-speed digital signal processing, with computational loads ranging from 10 MOPS for lightweight audio processing, through 3 GOPS for video processing and 20 GOPS for multilingual conversation interfaces, up to 1 TOPS for synthetic video generation. This computational challenge will have to be addressed at manageable power levels and affordable costs [1]. Such performance cannot be provided by a single processor, but requires a heterogeneous on-chip multiprocessor system containing a mix of general-purpose programmable cores, application-specific processors, and dedicated hardware accelerators.

In this context, the performance of gigascale Systems-on-Chip (SoCs) will be communication dominated, and only an interconnect-centric system architecture will be able to cope with this problem. Current on-chip interconnects consist of low-cost shared arbitrated buses, based on the serialization of bus access requests; only one master at a time can be granted access to the bus. The main drawback of this solution is its lack of scalability, which will result in unacceptable performance degradation for complex SoCs
FIGURE 21.1 Example of NoC architecture (NI: network interface, S: switch).
(more than a dozen integrated cores). Moreover, the connection of new blocks to a shared bus increases its associated load capacitance, resulting in more energy-consuming bus transactions.

A scalable communication infrastructure that better supports the trend of SoC integration consists of an on-chip micronetwork of interconnects, generally known as a Network-on-Chip (NoC) architecture [2–4]. The basic idea is borrowed from the wide-area networks domain, and envisions router (or switch)-based networks on which on-chip packetized communication takes place, as depicted in Figure 21.1. Cores access the network by means of proper interfaces, and have their packets forwarded to destination through a certain number of hops.

The scalable and modular nature of NoCs and their support for efficient on-chip communication potentially lead to NoC-based multiprocessor systems characterized by high structural complexity and functional diversity. On one hand, these features need to be properly addressed by means of new design methodologies [5], while on the other hand more effort has to be devoted to modeling on-chip communication architectures and integrating them into a single modeling and simulation environment combining both processing elements and communication infrastructures [6–8]. These efforts are needed to include the on-chip communication architecture in any quantitative evaluation of system design during design space exploration [9,10], so as to be able to assess the impact of the interconnect on achieving a target system performance.

An important design decision for NoCs regards the choice of topology. Several researchers [4,5,11,12] envision NoCs as regular tile-based topologies (such as mesh networks and fat trees), which are suitable for interconnecting homogeneous cores in a chip multiprocessor.
However, SoC component specialization (used by designers to optimize performance at low power consumption and competitive cost) leads to the on-chip integration of heterogeneous cores having varied functionality, size, and communication requirements. If a regular interconnect is designed to match the requirements of a few communication-hungry components, it is bound to be largely overdesigned with respect to the needs of the remaining components. This is the main reason why most current SoCs use irregular topologies, such as bridged buses and dedicated point-to-point links [13].

This chapter introduces basic principles and guidelines for NoC design. At first, the motivation for the design paradigm shift of SoC communication architectures from shared buses to NoCs is examined. Then, the chapter goes into the details of the NoC building blocks (switch, network interface, and switch-to-switch links), discussing design guidelines and presenting a case study where some of the most advanced concepts in NoC design have been applied to a real NoC architecture (called Xpipes and developed at the University of Bologna [14]). Finally, the challenging issue of heterogeneous NoC design is addressed, and the effects of mapping the communication requirements of an application onto a domain-specific NoC, instead of a network with a regular topology, are detailed by means of an illustrative example.
21.2 Design Challenges for On-Chip Communication Architectures

SoC design challenges that are driving the evolution of traditional bus architectures toward NoCs can be outlined as follows:

Technology issues. While gate delays scale down with technology, global wire delays typically increase or remain constant as repeaters are inserted. It is estimated that in 50 nm technology, at a clock frequency of 10 GHz, a global wire delay might range from 6 to 10 clock cycles [2]. Therefore, limiting the on-chip distance traveled by critical signals will be key to guaranteeing the performance of the overall system, and will be a common design guideline for all kinds of system interconnects. In contrast, other challenges posed by deep submicron technologies are leading to a paradigm shift in the design of SoC communication architectures. For instance, the global synchronization of cores on future SoCs will be unfeasible owing to deep submicron effects (clock skew, the power associated with the clock distribution tree, etc.), and an alternative scenario consists of self-synchronous cores that communicate with one another through a network-centric architecture [15]. Finally, signal integrity issues (crosstalk, power supply noise, soft errors, etc.) will lead to more transient and permanent failures of signals, logic values, devices, and interconnects, thus raising reliability concerns for on-chip communication [16]. In many cases, on-chip networks can be designed as regular structures, allowing the electrical parameters of wires to be optimized and well controlled. This leads to lower communication failure probabilities, thus enabling the use of low-swing signaling techniques [17], and to the capability of exploiting performance optimization techniques, such as wavefront pipelining [18].
However, once the bus is granted to a master, access occurs with no additional delay. NoCs, in contrast, can provide much better performance scalability. No delay is experienced in accessing the communication infrastructure, since multiple outstanding transactions originated by multiple cores can be handled at the same time, resulting in more efficient utilization of network resources. However, given a certain network dimension (e.g., number of instantiated switches), large latency fluctuations for packet delivery could be experienced as a consequence of network congestion. This is unacceptable when hard real-time constraints of an application have to be met, and two solutions are viable: network overdimensioning (for NoCs designed to support Best Effort [BE] traffic only) or implementation of dedicated mechanisms to provide guarantees for timing-constrained traffic (e.g., loss-less data transport, minimal bandwidth, bounded latency, minimal throughput) [19].

Design productivity issues. It is well known that synthesis and compiler technology development do not keep up with IC manufacturing technology development [20]. Moreover, time-to-market needs to be kept as low as possible. Reuse of complex preverified design blocks is an efficient means to increase productivity, and regards both computation resources and the communication infrastructure [21]. It would be highly desirable to have processing elements that could be employed in different platforms by means of a plug-and-play design style. To this purpose, a scalable and modular on-chip network represents a more efficient communication infrastructure than shared-bus-based architectures. The reuse of processing elements is facilitated by the definition of standard network interfaces, which also make the modularity property of the NoC effective. The Virtual Socket Interface Alliance (VSIA) has attempted to set the characteristics of this interface industry-wide [22].
Open Core Protocol (OCP) [23] is another example of a standard interface socket for cores. It is worth remarking that such network interfaces also decouple the development of new cores from the evolution of new communication architectures: the core developer does not have to make assumptions about the system into which the core will be plugged. Similarly, designers of new on-chip interconnects are not constrained by detailed interfacing requirements of particular legacy SoC components. Finally, let us observe that NoC components (e.g., switches or interfaces) can be instantiated multiple times in the same design (as opposed to the arbiter of traditional shared buses, which is instance-specific) and reused in a large number of products targeting a specific application domain.
© 2006 by Taylor & Francis Group, LLC
Embedded Systems Handbook
The development of NoC architectures and protocols is fueled by the aforementioned arguments, in spite of the challenges posed by the need for new design methodologies and the increased complexity of system design.
21.3 Related Work

The need to progressively replace on-chip buses with micronetworks was extensively discussed in [2,4]. A number of NoC architectures have been proposed in the literature so far. The Sonics MicroNetwork [24] is an on-chip network making use of communication-architecture-independent interface sockets. The MicroNetwork is an example of an evolutionary solution [25]: such solutions start from a physical implementation as a shared bus and propose generalizations to support higher bandwidth (such as partial and full crossbars). The STBUS interconnect from STMicroelectronics is another example of an evolutionary architecture, providing designers with the capability to instantiate shared-bus as well as partial or full crossbar interconnect configurations. Even though these architectures provide higher bandwidth than simple buses, addressing the wiring delay and scalability challenge in the long term requires more radical solutions. One of the earliest contributions in this area is the Maia heterogeneous signal processing architecture, proposed by Zhang et al. [26], based on a hierarchical mesh network. Unfortunately, Maia's interconnect is fully instance-specific. Furthermore, routing is static: network switches are programmed once and for all at configuration time for a given application (as in a Field Programmable Gate Array [FPGA]). Thus, communication is based on circuit switching, as opposed to packet switching. In the latter direction, Dally and Lacy [27] sketch the architecture of a VLSI multicomputer using 2009 technology. A chip with 64 processor-memory tiles is envisioned, with communication based on packet switching. This seminal work draws upon past experiences in designing parallel computers and reconfigurable architectures (FPGAs and their evolutions) [28–30]. Most proposed NoC platforms are packet switched and exhibit a regular structure. An example is a mesh interconnection, which relies on a simple layout and on switch designs that are independent of the network size.
The NOSTRUM network described in Reference 5 takes this approach: the platform includes both a mesh architecture and the associated design methodology. The Scalable Programmable Integrated Network (SPIN) described in Reference 31 is another regular, fat-tree-based network architecture. It adopts cut-through switching to minimize message latency and storage requirements in the design of network switches. The Linkoeping SoCBUS [32] is a two-dimensional mesh network that uses a packet connected circuit (PCC) to set up routes through the network: a packet is switched through the network, locking the circuit as it goes. This notion of virtual circuit leads to deterministic communication behavior but restricts routing flexibility for the rest of the communication traffic. The need to map communication requirements of heterogeneous cores may lead to the adoption of irregular topologies. The motivation for such architectures lies in the fact that each block can be optimized for a specific application (e.g., video or audio processing), and link characteristics can be adapted to the communication requirements of the interconnected cores. Supporting heterogeneous architectures requires a major design effort and leads to coarser-granularity control of physical parameters. Many recent heterogeneous SoC implementations are still based on shared buses (such as the single-chip MPEG-2 codec reported in Reference 33), but the growing complexity of customizable media embedded processor architectures for digital media processing will soon require NoC-based communication architectures and proper hardware/software development tools. The Aethereal NoC design framework presented in Reference 34 aims at providing a complete infrastructure for developing heterogeneous NoCs with end-to-end quality-of-service guarantees. The network supports guaranteed throughput (GT) for real-time applications and BE traffic for timing-unconstrained applications.
Support for heterogeneous architectures requires highly configurable network building blocks, customizable at instantiation time for a specific application domain. For instance, the Proteo NoC [35] consists of
a small library of predefined, parameterized components that allow the implementation of a large range of different topologies, protocols, and configurations. Xpipes interconnect [14] and its synthesizer XpipesCompiler [36] push this approach to the limit, by instantiating an application-specific NoC from a library of composable soft macros (network interface, link, and switch). The components are highly parameterizable and provide reliable and latency-insensitive operation.
21.4 NoC Architecture

Most of the terminology for on-chip packet-switched communication is adapted from the computer network and multiprocessor domains. Messages that have to be transmitted across the network are usually partitioned into fixed-length packets. Packets in turn are often broken into message flow control units called flits. In the presence of channel width constraints, multiple physical channel cycles can be used to transfer a single flit. A phit is the unit of information that can be transferred across a physical channel in a single step. Flits represent logical units of information, as opposed to phits, which correspond to physical quantities. In many implementations, a flit is set to be equal to a phit. The basic building blocks for packet-switched communication across NoCs are:

1. Network link
2. Switch
3. Network interface

They will be described hereafter.
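The relationship among these units can be made concrete with a toy partitioning model (all field widths below are illustrative assumptions, not the parameters of any particular NoC):

```python
# Toy model of message partitioning: message -> fixed-length packets -> flits -> phits.
# All sizes are illustrative assumptions, not taken from a specific NoC design.

def partition(message_bits, packet_len=128, flit_len=32, phit_len=8):
    """Return (num_packets, flits_per_packet, phits_per_flit, channel_cycles)."""
    num_packets = -(-message_bits // packet_len)   # ceiling division
    flits_per_packet = packet_len // flit_len      # flit: flow control unit
    phits_per_flit = flit_len // phit_len          # phit: physical transfer unit
    # One phit crosses the physical channel per cycle.
    channel_cycles = num_packets * flits_per_packet * phits_per_flit
    return num_packets, flits_per_packet, phits_per_flit, channel_cycles
```

With flit_len equal to phit_len, the common implementation choice noted above, phits_per_flit is 1 and a flit crosses the physical channel in a single step.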
21.4.1 Network Link

The performance of interconnect is a major concern in scaled technologies. As geometries shrink, gate delay improves much faster than the delay of long wires. Therefore, long wires increasingly determine the maximum clock rate, and hence performance, of the entire design. The problem becomes particularly serious for domain-specific heterogeneous SoCs, where the wire structure is highly irregular and may include both short and extremely long switch-to-switch links. Moreover, it has been estimated that only a fraction of the chip area (between 0.4 and 1.4%) will be reachable in one clock cycle [37]. A solution to overcome the interconnect-delay problem consists of pipelining interconnects [38,39]. Wires can be partitioned into segments (or relay stations, which have a function similar to that of the latches on a pipelined data path) whose length satisfies predefined timing requirements (e.g., the desired clock speed of the design). In this way, link delay is changed into latency, and the data introduction rate is no longer bounded by the link delay. However, the latency of a channel connecting two modules may end up being more than one clock cycle. Therefore, if the functionality of the design is based on the sequencing of the signals and not on their exact timing, then link pipelining does not change the functional correctness of the design. This requires the system to be made of modules whose behavior does not depend on the latency of the communication channels (latency-insensitive operation). As a consequence, the use of interconnect pipelining can be seen as part of a new and more general methodology for deep submicron (DSM) designs, which can be envisioned as synchronous distributed systems composed of functional modules that exchange data on communication channels according to a latency-insensitive protocol. This protocol ensures that functionally correct modules behave correctly independently of the channel latencies [38].
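A toy shift-register model illustrates the point: link delay becomes latency, while the data introduction rate stays at one flit per cycle regardless of link length (the cycle accounting below is a sketch, not a description of any specific implementation):

```python
def simulate_pipelined_link(flits, n_stages):
    """Shift-register model of an n_stages-deep pipelined link.
    Returns a dict mapping each flit to the cycle at which it exits the link."""
    stages = [None] * n_stages
    arrival = {}
    pending = list(flits)
    cycle = 0
    while pending or any(s is not None for s in stages):
        # Shift: the flit in the last relay station leaves the link.
        out = stages[-1]
        if out is not None:
            arrival[out] = cycle
        stages = [None] + stages[:-1]
        # One new flit enters per cycle: the rate is independent of link length.
        if pending:
            stages[0] = pending.pop(0)
        cycle += 1
    return arrival
```

For a 5-stage link, the first flit pays 5 cycles of latency, but subsequent flits still exit one per cycle: delay has been converted into latency without limiting throughput.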
The effectiveness of the latency-insensitive design methodology is strongly related to the ability of maintaining a sufficient communication throughput in the presence of increased channel latencies. The International Technology Roadmap for Semiconductors (ITRS) 2001 [15] assumes that interconnect pipelining is the strategy of choice in its estimates of achievable clock speeds for MPUs. Some industrial designs already make use of interconnect pipelining. For instance, the NETBURST microarchitecture of Pentium 4 contains instances of a stage dedicated exclusively to handle wire delays: in fact, a so-called drive
stage is used only to move signals across the chip without performing any computation and, therefore, can be seen as a physical implementation of a relay station [40]. The Xpipes interconnect makes use of pipelined links and of latency-insensitive operation in the implementation of its building blocks. Switch-to-switch links are subdivided into basic segments whose length guarantees that the desired clock frequency (i.e., the maximum speed provided by a certain technology) can be used. In this way, the system operating frequency is not bound by the delay of long links. According to the link length, a certain number of clock cycles is needed by a flit to cross the interconnect. If network switches are designed in such a way that their functional correctness depends on the order of flit arrival and not on flit timing, the input links of a switch can be different and of any length. These design choices are at the basis of the latency-insensitive operation of the NoC and allow the construction of an arbitrary network topology, and hence support for heterogeneous architectures. Figure 21.2 illustrates the link model, which is equivalent to a pipelined shift register. Pipelining has been used both for data and control lines. The figure also illustrates how pipelined links are used to support latency-insensitive link-level error control, ensuring robustness against communication errors. The retransmission of a corrupted flit between two successive switches is represented. Multiple outstanding flits propagate across the link during the same clock cycle. When flits are correctly received at the destination switch, an ACK is propagated back to the source, and after N clock cycles (where N is the
FIGURE 21.2 Pipelined link model and latency-insensitive link-level error control.
length of the link expressed in number of repeater stages) the flit will be discarded from the buffer of the source switch. A corrupted flit, on the other hand, is NACKed and will be retransmitted in due time. The implemented retransmission policy is GO-BACK-N, to keep the switch complexity as low as possible.
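The link-level GO-BACK-N behavior can be sketched in a few lines (a simplified model: corrupted transmission attempts are given explicitly, and the window parameter stands in for the 2N + M flits that may be outstanding before an acknowledgment returns):

```python
def go_back_n_transfer(flits, corrupted_attempts, window):
    """Sketch of GO-BACK-N link-level error control.
    corrupted_attempts: set of transmission-attempt indices (0-based) whose
    flit arrives corrupted at the destination switch.
    Returns (flits accepted in order, total transmission attempts)."""
    received = []
    base = 0          # oldest flit not yet acknowledged
    attempt = 0
    while base < len(flits):
        start = base
        # Transmit up to 'window' outstanding flits starting from 'base'.
        for offset in range(min(window, len(flits) - start)):
            idx = start + offset
            if attempt in corrupted_attempts:
                attempt += 1
                break     # NACK: go back and resend from the oldest unACKed flit
            received.append(flits[idx])   # ACK: flit accepted at destination
            base = idx + 1
            attempt += 1
    return received, attempt
```

With attempt 1 corrupted, flit B is NACKed and resent, so four flits arrive in order at the cost of five transmission attempts; the destination never has to reorder, which is what keeps switch complexity low.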
21.4.2 Switch

The task of the switch is to carry packets injected into the network to their final destination, following a statically defined or dynamically determined routing path. The switch transfers packets from one of its input ports to one or more of its output ports. Switch design is usually characterized by a power-performance trade-off: the memory resources needed to support high-performance on-chip communication are power-hungry. A specific design of a switch may include both input and output buffers, or only one type of buffer. Input queuing uses fewer buffers, but suffers from head-of-line blocking. Virtual output queuing has higher performance, but at the cost of more buffers. Network flow control (or routing mode) specifically addresses the limited amount of buffering resources in switches. Three policies are feasible in this context [41]. In store-and-forward routing, an entire packet is received and entirely stored before being forwarded to the next switch. This is the most demanding approach in terms of memory requirements and switch latency. Virtual cut-through routing also requires buffer space for an entire packet, but allows lower-latency communication, in that a packet is forwarded as soon as the next switch guarantees that the complete packet will be accepted. If this is not the case, the current router must be able to store the whole packet. Finally, a wormhole routing scheme can be employed to reduce switch memory requirements and to permit low-latency communication. The first flit of a packet contains routing information, and header flit decoding enables the switches to establish the path; subsequent flits simply follow this path in a pipelined fashion by means of switch output port reservation. A flit is passed to the next switch as soon as enough space is available to store it, even though there is not enough space to store the whole packet.
If a certain flit faces a busy channel, subsequent flits have to wait at their current locations and are therefore spread over multiple switches, thus blocking the intermediate links. This scheme avoids buffering the full packet at one switch and keeps end-to-end latency low, although it is more sensitive to deadlock and may result in low link utilization. Guaranteeing quality of service in switch operation is another important design issue, which needs to be addressed when time-constrained (hard or soft real-time) traffic is to be supported. Throughput guarantees and latency bounds are examples of time-related guarantees. Contention-related delays are responsible for large fluctuations of performance metrics, and a fully predictable system can be obtained only by means of contention-free routing schemes. With circuit switching, a connection is set up over which all subsequent data is transported. Therefore, contention resolution takes place at setup time at the granularity of connections, and time-related guarantees during data transport can be given. In time-division circuit switching (see Reference 24 for an example), bandwidth is shared by time-division multiplexing connections over circuits. In packet switching, contention is unavoidable since packet arrival cannot be predicted. Therefore, arbitration mechanisms and buffering resources must be implemented at each switch, thus delaying data in an unpredictable manner and making it difficult to provide guarantees. BE NoC architectures mainly rely on network overdimensioning to bound fluctuations of performance metrics. The Aethereal NoC architecture makes use of a router that combines GT and BE services [34]. The GT router subsystem is based on a time-division multiplexed circuit switching approach. A router uses a slot table to (1) avoid contention on a link, (2) divide up bandwidth per link between connections, and (3) switch data to the correct output.
Every slot table T has S time slots (rows) and N router outputs (columns). There is a logical notion of synchronicity: all routers in the network are in the same fixed-duration slot. In a slot s, at most one block of data can be read/written per input/output port. In the next slot, the read blocks are written to their appropriate output ports. Blocks thus propagate in a store-and-forward fashion. The latency a block incurs per router is equal to the duration of a slot, and bandwidth
is guaranteed in multiples of block size per S slots. The BE router uses packet switching, and it has been shown that both input queuing with wormhole routing or virtual cut-through routing and virtual output queuing with wormhole routing are feasible in terms of buffering cost. The BE and GT router subsystems are combined in the Aethereal router architecture of Figure 21.3. The GT router offers a fixed end-to-end latency for its traffic, which is given the highest priority by the arbiter. The BE router uses all the bandwidth (slots) that has not been reserved or used by GT traffic. GT router slot tables are programmed by means of BE packets (see the arrow "program" in Figure 21.3). Negotiations, resulting in slot allocation, can be done at compile time and configured deterministically at runtime; alternatively, negotiations can be done at runtime. A different perspective has been taken in the design of the switch for the BE Xpipes NoC. Figure 21.4 shows an example configuration with four inputs, four outputs, and two virtual channels multiplexed across the same physical output link. A physical link is assigned to different virtual channels on a flit-by-flit basis, thereby improving network throughput. Switch operation is latency-insensitive, in that correct operation is guaranteed for arbitrary link pipeline depth. In fact, as explained above, network
FIGURE 21.3 A combined GT–BE router: (a) conceptual view; (b) hardware view.
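The slot-table mechanism of the GT router can be sketched as follows (the connection names and table contents are invented for illustration; real Aethereal tables are configured per router at setup or runtime):

```python
# Toy GT router slot table: S time slots (rows) x N outputs (columns).
# An entry names the connection allowed to use that output in that slot;
# None means the slot is free for best-effort (BE) traffic.
S, N = 4, 2
slot_table = [
    ["video", None],     # slot 0: connection "video" owns output 0
    [None,    "audio"],  # slot 1
    ["video", None],     # slot 2
    [None,    None],     # slot 3: fully available to BE packets
]

def reserved_bandwidth(table, connection, output):
    """Guaranteed bandwidth: fraction of the output's slots owned by 'connection'."""
    owned = sum(1 for row in table if row[output] == connection)
    return owned / len(table)

def may_send(table, connection, output, current_slot):
    """Contention-free check: a GT connection may use an output only in its slots."""
    return table[current_slot % len(table)][output] == connection
```

Because ownership of each (slot, output) pair is exclusive, no arbitration is needed for GT traffic, which is how the fixed per-router latency and the bandwidth guarantee in multiples of block size per S slots arise.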
FIGURE 21.4 Example of switch configuration with two virtual channels.
links in Xpipes interconnect are pipelined with a flexible number of stages, thereby decoupling the link data introduction rate from its physical length. For latency-insensitive operation, the switch has virtual channel registers to store 2N + M flits, where N is the link length (expressed as the number of basic repeater stages) and M is a switch architecture-related contribution (12 cycles in this design). The reason is that each transmitted flit has to be acknowledged before being discarded from the buffer. Before an ACK is received, the flit has to travel across the link (N cycles), an ACK/NACK decision has to be taken at the destination switch (a portion of the M cycles), and the ACK/NACK signal has to be propagated back (N cycles) and recognized by the source switch (the remaining portion of the M cycles). During this time, another 2N + M flits are transmitted but not yet ACKed. Output buffering was chosen for Xpipes switches, and the resulting architecture is reported in Figure 21.5. It consists of a replication of the same output module, accepting all input ports as its own inputs. Flow control signals generated by each output block are directed to a centralized module, which takes care of generating proper ACKs or NACKs for the incoming flits from the different input ports. Each output module is deeply pipelined (seven pipeline stages) so as to maximize the operating clock frequency of the switch. Architectural details of the pipelined output module are illustrated in Figure 21.6. Forward flow control is used, and a flit is transmitted to the next switch only when adequate storage is available. The CRC decoders for error detection work in parallel with the switch operation, thereby hiding their impact on switch latency.
FIGURE 21.5 Architecture of output buffered Xpipes switch.
FIGURE 21.6 Architecture of an Xpipes switch output module.
The first pipeline stage checks the header of incoming packets on different input ports to determine whether those packets have to be routed through the output port under consideration. Only matching packets are forwarded to the second stage, which resolves contention based on a round-robin policy. Arbitration is carried out upon receipt of the tail flits of preceding packets, so that all other flits of a packet can be propagated without contention resolution at this stage. A NACK is generated for the flits of nonselected packets. The third stage is just a multiplexer, which selects the prioritized input port. The following arbitration stage keeps track of the status of the virtual channel registers and decides whether the flits can be stored into the registers or not. A header flit is sent to the register with the most free locations, followed by successive flits of the same packet. The fifth stage is the actual buffering stage, and the ACK/NACK message at this stage indicates whether a flit has been successfully stored or not. The following stage takes care of forward flow control, and finally a last arbitration stage multiplexes the virtual channels on the physical output link. Finally, the switch is highly parameterizable. Design parameters are: number of I/O ports, flit width, number of virtual channels, length of switch-to-switch links, and size of output registers.
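The required depth of the virtual channel registers follows directly from the ACK round trip described above: 2N + M flits are in flight before the first acknowledgment returns. A small helper makes the arithmetic explicit (M = 12, as stated for this design):

```python
def virtual_channel_depth(link_stages, switch_overhead=12):
    """Flits outstanding before the first ACK returns to the source switch.
    A flit crosses the link (N cycles), the ACK/NACK decision and its
    recognition together take M cycles, and the ACK travels back (N cycles).
    One flit is transmitted per cycle, so 2N + M flits must be buffered
    awaiting acknowledgment."""
    forward = link_stages       # flit: source switch -> destination switch
    backward = link_stages      # ACK/NACK: destination switch -> source switch
    return forward + backward + switch_overhead
```

For a 3-stage link this gives 18 flit slots per virtual channel, which shows how the register cost of latency-insensitive operation grows linearly with link length.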
21.4.3 Network Interface

The most relevant tasks of the network interface are: (1) hiding the details of the network communication protocol from the cores, so that they can be developed independently of the communication infrastructure; (2) communication protocol conversion (from end-to-end to network protocol); and (3) data packetization (packet assembly, delivery, and disassembly). The first objective can be achieved by means of standard interfaces. For instance, the VSIA vision [22] is to specify open standards and specifications that facilitate the integration of software and hardware
virtual components from multiple sources. Interfaces of different complexity are described in the standard, from the Peripheral Virtual Component Interface (VCI) to Basic VCI and Advanced VCI. Another example of a standard socket to interface cores to networks is represented by the Open Core Protocol (OCP) [23]. Its main characteristics are a high degree of configurability to adapt to the core's functionality and the independence of request and response phases, thus supporting multiple outstanding requests and pipelining of transfers. Data packetization is a critical task for the network interface, and has an impact on communication latency, in addition to the latency of the communication channel. The packet-preparation process consists of building the packet header, payload, and packet tail. The header contains the necessary routing and network control information (e.g., source and destination address). When source routing is used, the destination address is ignored and replaced with a route field that specifies the route to the destination. This overhead in terms of packet header is counterbalanced by simpler routing logic at the network switches: they simply have to look at the route field and route the packet over the specified switch output port. The packet tail indicates the end of a packet and usually contains parity bits for error-detecting or error-correcting codes. An insight into the Xpipes network interface implementation will provide an example of these concepts. It provides a standardized OCP-based interface to network nodes. The network interface for cores that initiate communication (initiators) needs to turn OCP-compliant transactions into packets to be transmitted across the network. It represents the slave side of an OCP end-to-end connection, and it is therefore referred to as a network interface slave (NIS). Its architecture is shown in Figure 21.7.
FIGURE 21.7 Architecture of the Xpipes NIS.
The NIS has to build the packet header, which has to be spread over a variable number of flits depending on the length of the path to the destination node. In fact, Xpipes relies on a static routing algorithm called street-sign routing. Routes are derived by the network interface by accessing a look-up table based on the destination address. Such information consists of direction bits read by each switch, indicating the output port of the switch to which flits belonging to a certain packet have to be directed. The look-up table is accessed by the STATIC_PACKETING block, a finite state machine that forwards the routing information numSB (number of hops to destination) and lutword (word read from the look-up table), as well as the request-related information datastream from the initiator core, to the DP_FAST block, provided the enable signal busy_dpfast is not asserted. Based on the input data, module DP_FAST has the task of building the flits to be transmitted via the output buffer BUFFER_OUT, according to the mechanism illustrated in Figure 21.8. Let us assume that a packet requires numSB = 5 hops to get to its destination, and that the direction to be taken at each switch is expressed by DIR. Module DP_FAST builds the first flit by concatenating the flit type field with path information. If there is some space left in the flit, it is filled with header information derived from the input datastream. The unused part of the datastream is stored in a regpark register, so that a new datastream can be read from the STATIC_PACKETING block. The following header and payload flits are formed by combining data stored in regpark and reg_datastream. No partially filled flits are transmitted, to make transmission more efficient. Finally, module BUFFER_OUT stores flits to be sent across the network, and allows the NIS to keep preparing successive flits when the network is congested. The size of this buffer is a design parameter.
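The flit-building mechanism of Figure 21.8 can be approximated as a bit-packing loop (the field widths and the 8-bit DATA granularity are assumptions for illustration; the actual flit format and widths are design parameters of Xpipes, not reproduced here):

```python
def build_header_flits(directions, datastream_words,
                       flit_payload_bits=32, dir_bits=2):
    """Sketch of street-sign packetization: concatenate one direction field
    per hop, then fill remaining flit space with words from the datastream.
    Returns a list of flits, each a list of (field_name, value) pairs."""
    # numSB direction fields, one per hop, read and consumed by each switch,
    # followed by header/payload words from the initiator's datastream.
    fields = ([("DIR", d) for d in directions] +
              [("DATA", w) for w in datastream_words])
    bits_of = {"DIR": dir_bits, "DATA": 8}   # assumed field widths
    flits, current, used = [], [], 0
    for name, value in fields:
        if used + bits_of[name] > flit_payload_bits:
            flits.append(current)   # flit full: emit it and start a new one
            current, used = [], 0   # leftover data waits (the 'regpark' role)
        current.append((name, value))
        used += bits_of[name]
    if current:
        flits.append(current)       # last, possibly partially filled, flit
    return flits
```

For the numSB = 5 example above, the five DIR fields fill the front of the first flit and the remaining space is packed with datastream words, mirroring the regpark/reg_datastream mechanism of the figure.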
The response phase is carried out by means of two modules. SYNCHRO receives incoming flits and reads out only useful information (e.g., it discards route fields). At the same time, it contains buffering resources
FIGURE 21.8 Mechanism for building header flits.
to synchronize the network's requests to transmit the remaining packet flits with the core's consumption rate. The RECEIVE_RESPONSE module translates useful header and payload information into OCP-compliant response fields. When a read transaction is initiated by the master core, the STATIC_PACKETING block asserts a start_receive_response signal that triggers the waiting phase of the RECEIVE_RESPONSE module for the requested data. As a consequence, the NIS supports only one outstanding read operation, to keep interface complexity low. Although no read-after-read transactions can be initiated unless the previous one has completed, an indefinite number of write transactions can be carried out after an outstanding read has been initiated. The architecture of a network interface master is similar to the one just described, and is not reported here for lack of space. At instantiation time, the main network-interface-related parameters to be set are: total number of core blocks, flit width, and maximum number of hops across the network.
21.5 NoC Topology

The individual components of SoCs are inherently heterogeneous, with widely varying functionality and communication requirements. The communication infrastructure should optimally match communication patterns among these components, accounting for individual component needs. As an example, consider the implementation of an MPEG4 decoder [42], depicted in Figure 21.9(b), where blocks are drawn roughly to scale and links represent interblock communication. First, the embedded memory (SDRAM) is much larger than all other cores, and it is a critical communication bottleneck. Block sizes are highly nonuniform, and the floorplan does not match the regular, tile-based floorplan shown in Figure 21.9(a). Second, the total communication bandwidth to/from the embedded SDRAM is much larger than that required for communication among the other cores. Third, many neighboring blocks do not need to communicate. Even though it may be possible to implement MPEG4 on a homogeneous fabric, there is a significant risk of either underutilizing many tiles and links or, at the opposite extreme, achieving poor performance because of localized congestion. These factors motivate the use of an application-specific on-chip network [26]. With an application-specific network, the designer is faced with the additional task of designing network components (e.g., switches) with different configurations (e.g., different I/Os, virtual channels, buffers) and interconnecting them with links of uneven length. These steps require significant design time and the need to verify network components and their communications for every design. The library-based nature of network building blocks seems the most appropriate solution to support domain-specific custom NoCs. Two relevant examples have been reported in the open literature: the Proteo and Xpipes interconnects. Proteo consists of a fully reusable and scalable component library where the
FIGURE 21.9 Homogeneous versus heterogeneous architectural template: (a) tile-based on-chip multiprocessor; (b) MPEG4 SoC.
© 2006 by Taylor & Francis Group, LLC
21-14
Embedded Systems Handbook
components can be used to implement networks from very simple bus emulation structures to complex packet networks. It uses a standardized VCI interface between the functional cores and the communication network. Proteo is described using synthesizable VHDL and relies on an interconnect node architecture that targets flexible on-chip communication. It is used as a testing platform when the efficiency of network topologies and routing schemes is investigated for on-chip environments. The node is constructed from a collection of parameterized and reusable hardware blocks, including components such as FIFO (first in first out) buffers, routing controllers, and standardized interface wrappers. A node can be tuned to fulfill the desired characteristics of communication by properly selecting the internal architecture of the node itself. The Xpipes NoC takes a similar approach. As described throughout this chapter, its network building blocks have been designed as highly configurable and design-time composable soft macros described in SystemC at the cycle-accurate level. An optimal system solution will also require an efficient mapping of high-level abstractions onto the underlying platform. This mapping procedure involves optimizations and trade-offs among many complex constraints, including quality of service, real-time response, power consumption, area, etc. Tools are urgently needed to explore this mapping process, and to assist and automate optimization where possible. The first challenge for these tools is to build custom NoCs that optimally match the communication requirements of the system. The network components they build should be highly optimized for that particular NoC design, providing large savings in area, power, and latency with respect to standard NoCs based on regular structures. In the following section, an example of a design methodology for heterogeneous SoCs is briefly illustrated.
It concerns the Xpipes interconnect and relies on a tool that automatically instantiates an application-specific NoC for heterogeneous on-chip multiprocessors (called XpipesCompiler [36]).
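Conceptually, such a tool reads a network description (switches with their I/O counts, links with their pipeline depths, routing tables) and emits configured instances drawn from the soft-macro library. The sketch below is only a schematic illustration of this instantiation step, with invented field names; the real XpipesCompiler emits a SystemC hierarchy.

```python
# Schematic sketch of an XpipesCompiler-like instantiation step.
# Field names are hypothetical; the real tool reads a network description
# file and generates SystemC component instances.

def instantiate_noc(spec, library):
    """Turn a network description into a list of configured component instances."""
    instances = []
    for sw in spec["switches"]:
        # Each switch is configured with its own I/O count (e.g., 3x3, 8x8).
        instances.append(library["switch"](name=sw["name"],
                                           inputs=sw["in"], outputs=sw["out"]))
    for ln in spec["links"]:
        # Pipeline depth follows from the clock target and the link length.
        instances.append(library["link"](name=ln["name"], stages=ln["stages"]))
    return instances
```

Feeding it a description with one 3x3 switch and one two-stage link yields two configured instances, ready to be connected according to the routing tables.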
21.5.1 Domain-Specific NoC Synthesis Flow

The complete XpipesCompiler-based NoC design flow is depicted in Figure 21.10. From the specification of an application, the designer (or a high-level analysis and exploration tool) creates a high-level view of the SoC floorplan, including nodes (with their network interfaces), links, and switches. Based on clock
FIGURE 21.10 NoC synthesis flow with XpipesCompiler.
NoC Design for Gigascale SoC
21-15
FIGURE 21.11 Core graph representation of an example MPEG4 design with annotated average communication requirements.
speed target and link routing, the number of pipeline stages for each link is also specified. The information on the network architecture is specified in an input file for the XpipesCompiler. Routing tables for the network interfaces are also specified. The tool takes as additional input the SystemC library of soft network components. The output is a SystemC hierarchical description, which includes all switches, links, network nodes, and interfaces, and specifies their topological connectivity. The final description can then be compiled and simulated at the cycle-accurate and signal-accurate level. At this point, the description can be fed to backend register transfer level (RTL) synthesis tools for silicon implementation. In a nutshell, the XpipesCompiler generates a set of network component instances that are custom-tailored to the specification contained in its input network description file. This tool allows a very instructive comparison of the effects (in terms of area, power, and performance) of mapping applications onto customized domain-specific NoCs versus regular mesh NoCs. Let us focus on the MPEG4 decoder already introduced in this chapter. Its core graph representation together with its communication requirements are reported in Figure 21.11. The edges are annotated with the average bandwidth requirements of the cores in MB/sec. Customized application-specific NoCs that closely match the application's communication characteristics have been manually developed and compared to a regular mesh topology. The different NoC configurations are reported in Figure 21.12. In the MPEG4 design considered, many of the cores communicate with each other through the shared SDRAM. Hence, a large switch is used for connecting the SDRAM with the other cores (Figure 21.12[b]), while smaller switches are used for the other cores. An alternate custom NoC is also considered (Figure 21.12[c]): it is an optimized mesh network, with superfluous switches and switch I/Os removed.
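The core graph itself is a small data structure: vertices are cores and directed edges carry the average bandwidth annotations in MB/sec. The sketch below uses an illustrative subset of edges (the pairings of endpoints are assumptions, not read off Figure 21.11); summing the bandwidth incident on a core exposes bottlenecks such as the shared SDRAM.

```python
# Core-graph sketch in the style of Figure 21.11. The edge endpoints here
# are illustrative assumptions; only the modeling idea matters.

def incident_bandwidth(core_graph, core):
    """Total average bandwidth (MB/sec) on edges touching `core`."""
    return sum(bw for (src, dst), bw in core_graph.items() if core in (src, dst))

example_edges = {                    # (source, destination): MB/sec
    ("up_samp", "sdram"): 910,
    ("media_cpu", "sdram"): 600,
    ("vu", "sdram"): 190,
    ("au", "sdram"): 0.5,
    ("risc", "sram2"): 500,
}
```

With these numbers, `incident_bandwidth(example_edges, "sdram")` dwarfs that of any other core, which is exactly the observation that motivates a large dedicated switch at the SDRAM.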
Area (in 0.1 µm technology) and power estimates for the different NoC configurations are reported in Table 21.1. Since all cores communicate with many other cores, many switches are needed, and therefore area savings are not extremely significant for custom NoCs. Based on the average traffic through each network component, the power dissipation for each NoC design has been calculated. Power savings for the custom solutions are not very significant, as most of the traffic traverses the larger switches connected to the memories. As power dissipation in a switch increases nonlinearly with switch size, there is more power dissipation in the switches of custom NoC1 (which has an 8 × 8 switch) than in the mesh NoC. However, most of the traffic traverses short links in this custom NoC, thereby giving marginal power savings for the whole design.
FIGURE 21.12 NoC configurations for MPEG4 decoder: (a) mesh NoC, (b) application-specific NoC1, and (c) application-specific NoC2.

TABLE 21.1 Area and Power Estimates for the MPEG4-Related NoC Configurations

  NoC configuration   Area (mm²)   Ratio mesh/cust   Power (mW)   Ratio mesh/cust
  Mesh                1.31         —                 114.36       —
  Custom 1            0.86         1.52              110.66       1.03
  Custom 2            0.71         1.85              93.66        1.22
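The mesh/custom ratios in Table 21.1 follow directly from the raw area and power figures, as a quick check confirms:

```python
# Recomputing the mesh/custom ratios of Table 21.1 from its raw entries.
area = {"mesh": 1.31, "custom1": 0.86, "custom2": 0.71}        # mm^2
power = {"mesh": 114.36, "custom1": 110.66, "custom2": 93.66}  # mW

area_ratios = {c: round(area["mesh"] / area[c], 2)
               for c in ("custom1", "custom2")}
power_ratios = {c: round(power["mesh"] / power[c], 2)
                for c in ("custom1", "custom2")}
```

This reproduces the 1.52/1.85 area ratios and the 1.03/1.22 power ratios reported in the table.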
Figure 21.13 reports the variation of average packet latency (for 64B packets, 32-bit flits) with link bandwidth. Custom NoCs, as synthesized by the XpipesCompiler, have lower packet latencies because the average number of switch and link traversals is lower. At the minimum plotted bandwidth value, almost a 10% latency saving is achieved. Moreover, the latency increases more rapidly for the mesh NoC as the link bandwidth decreases. Custom NoCs also have better link utilization: around 1.5 times that of a mesh topology. Area, power, and performance optimizations by means of custom NoCs turn out to be more difficult for MPEG4 than for other applications, such as Video Object Plane Decoders and Multi-Window Displayers [36].
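The packet parameters quoted above pin down the serialization component of latency: a 64B packet at 32 bits per flit is 16 flits. A simple first-order wormhole model (an assumption for illustration, not the Xpipes timing model) shows why fewer switch and link traversals translate into lower latency:

```python
# First-order wormhole-latency sketch: the header flit pays a per-hop cost,
# then the remaining flits stream out back-to-back. This is an illustrative
# model, not the Xpipes timing model.

def packet_flits(packet_bytes=64, flit_bits=32):
    return packet_bytes * 8 // flit_bits          # 64B / 32-bit flits -> 16

def packet_latency(hops, cycles_per_hop, packet_bytes=64, flit_bits=32):
    return hops * cycles_per_hop + (packet_flits(packet_bytes, flit_bits) - 1)
```

With the same per-hop cost, a route with fewer hops (as in the custom topologies) is strictly faster, and the gap widens as the per-hop cost grows, mirroring the steeper mesh curve at low link bandwidth.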
21.6 Conclusions

This chapter has described the motivation for packet-switched networks as a communication paradigm for deep submicron SoCs. After an overview of NoC proposals from the open literature, the chapter has gone into the details of NoC architectural components (switch, network interface, and point-to-point links), illustrating the Xpipes library of composable soft macros as a case study. Finally, the challenging issue of heterogeneous NoC design has been addressed, showing an example NoC synthesis flow and detailing area, power, and performance metrics of customized application-specific NoC architectures with respect to regular mesh topologies. The chapter aims at highlighting the main guidelines and open issues for NoC design on gigascale SoCs.
FIGURE 21.13 Average packet latency as a function of the link bandwidth.
Acknowledgment

This work was supported in part by the MARCO/DARPA Gigascale Silicon Research Center.
References

[1] F. Boekhorst. Ambient Intelligence, the Next Paradigm for Consumer Electronics: How will it Affect Silicon? In ISSCC 2002, Vol. 1, February 2002, pp. 28–31. [2] L. Benini and G. De Micheli. Networks on Chips: a New SoC Paradigm. IEEE Computer, 35, 2002, 70–78. [3] P. Wielage and K. Goossens. Networks on Silicon: Blessing or Nightmare? In Proceedings of the Euromicro Symposium on Digital System Design DSD02, September 2002, pp. 196–200. [4] W.J. Dally and B. Towles. Route Packets, not Wires: On-Chip Interconnection Networks. In Proceedings of the Design and Automation Conference DAC01, June 2001, pp. 684–689. [5] S. Kumar, A. Jantsch, J.P. Soininen, M. Forsell, M. Millberg, J. Oeberg, K. Tiensyrja, and A. Hemani. A Network on Chip Architecture and Design Methodology. In IEEE Symposium on VLSI ISVLSI02, April 2002, pp. 105–112. [6] L. Benini, D. Bertozzi, D. Bruni, N. Drago, F. Fummi, and M. Poncino. SystemC Cosimulation and Emulation of Multiprocessor SoC Designs. IEEE Computer, 36, 2003, 53–59.
[7] S. Nugent, D.S. Wills, and J.D. Meindl. A Hierarchical Block-Based Modeling Methodology for SOC in GENESYS. In Proceedings of the IEEE ASIC/SOC Conference, September 2002, pp. 239–243. [8] P. Gerin, S. Yoo, G. Nicolescu, and A.A. Jerraya. Scalable and Flexible Cosimulation of SoC Designs with Heterogeneous Multi-Processor Target Architecture. In Proceedings of the ASP-DAC 2001, January/February 2001, pp. 63–68. [9] H. Blume, H. Huebert, H.T. Feldkaemper, and T.G. Noll. Model-Based Exploration of the Design Space for Heterogeneous Systems on Chip. In Proceedings of the IEEE Conference on ApplicationSpecific Systems, Architectures and Processors ASAP02, 2002. [10] P.G. Paulin, C. Pilkington, and E. Bensoudane. StepNP: a System-level Exploration Platform for Network Processors. IEEE Design and Test of Computers, November–December 2002, pp. 17–26. [11] P. Guerrier and A. Greiner. A Generic Architecture for On-Chip Packet Switched Interconnections. In Proceedings of the Design, Automation and Testing in Europe DATE00, March 2000, pp. 250–256. [12] S.J. Lee et al. An 800 MHz Star-Connected On-Chip Network for Application to Systems on a Chip. In ISSCC03, February 2003. [13] H. Yamauchi et al. A 0.8 W HDTV Video Processor with Simultaneous Decoding of Two MPEG2 MP@HL Streams and Capable of 30 Frames/s Reverse Playback. In ISSCC02, Vol. 1, February 2002, pp. 473–474. [14] M. Dall’Osso, G. Biccari, L. Giovannini, D. Bertozzi, and L. Benini. Xpipes: a Latency Insensitive Parameterized Network-on-Chip Architecture for Multi-Processor SoCs. In ICCD03, October 2003. [15] ITRS. 2001, http://public.itrs.net/Files/2001ITRS/Home.htm. [16] D. Bertozzi, L. Benini, and G. De Micheli. Energy-Reliability Trade-Off for NoCs. In Networks on Chip, A. Jantsch and Hannu Tenhunen, Eds., Kluwer Academic Press, Boston, MA, 2003, pp. 107–129. [17] H. Zhang, V. George, and J.M. Rabaey. Low-Swing On-Chip Signaling Techniques: Effectiveness and Robustness. 
IEEE Transactions on VLSI Systems, 8, 2000, 264–272. [18] J. Xu and W. Wolf. Wave Pipelining for Application-Specific Networks-on-Chips. In CASES02, October 2002, pp. 198–201. [19] K. Goossens, J. Dielissen, J. van Meerbergen, P. Poplavko, A. Radulescu, E. Rijpkema, E. Waterlander, and P. Wielage. Guaranteeing the Quality of Services in Networks on Chip. In Networks on Chip, A. Jantsch and Hannu Tenhunen, Eds., Kluwer Academic Press, Boston, MA, 2003, pp. 61–82. [20] ITRS, 1999, http://public.itrs.net/files/1999_SIA_Roadmap/. [21] A. Jantsch and H. Tenhunen. Will Networks on Chip Close the Productivity Gap? In Networks on Chip, A. Jantsch and Hannu Tenhunen, Eds., Kluwer Academic Press, Boston, MA, 2003, pp. 3–18. [22] VSI Alliance. Virtual Component Interface Standard, 2000. [23] OCP International Partnership. Open Core Protocol Specification, 2001. [24] D. Wingard. MicroNetwork-Based Integration for SoCs. In Design Automation Conference DAC01, June 2001, pp. 673–677. [25] D. Flynn. AMBA: Enabling Reusable On-Chip Designs. IEEE Micro, 17, 1997, 20–27. [26] H. Zhang et al. A 1V Heterogeneous Reconfigurable DSP IC for Wireless Baseband Digital Signal Processing. IEEE Journal of SSC, 35, 2000, 1697–1704. [27] W.J. Dally and S. Lacy. VLSI Architecture: Past, Present and Future. In Conference of Advanced Research in VLSI, 1999, pp. 232–241. [28] D. Culler, J.P. Singh, and A. Gupta. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann, San Francisco, CA, 1999. [29] K. Compton and S. Hauck. Reconfigurable Computing: a Survey of Systems and Software. ACM Computing Surveys, 34, 2002, 171–210. [30] R. Tessier and W. Burleson. Reconfigurable Computing and Digital Signal Processing: a Survey. Journal of VLSI Signal Processing, 28, 2001, 7–27.
[31] J. Walrand and P. Varaiya. High Performance Communication Networks. Morgan Kaufmann, San Francisco, CA, 2000. [32] Dale Liu et al. SoCBUS: The Solution of High Communication Bandwidth on Chip and Short TTM, invited paper in Real Time and Embedded Computing Conference, September 2002. [33] S. Ishiwata et al. A Single Chip MPEG-2 Codec Based on Customizable Media Embedded Processor. IEEE JSSC, 38, 2003, 530–540. [34] E. Rijpkema, K. Goossens, A. Radulescu, J. van Meerbergen, P. Wielage, and E. Waterlander. Trade Offs in the Design of a Router with both Guaranteed and Best-Effort Services for Networks on Chip. In Design Automation and Test in Europe DATE03, March 2003, pp. 350–355. [35] I. Saastamoinen, D. Siguenza-Tortosa, and J. Nurmi. Interconnect IP Node for Future Systems-on-Chip Designs. In IEEE Workshop on Electronic Design, Test and Applications, January 2002, pp. 116–120. [36] A. Jalabert, S. Murali, L. Benini, and G. De Micheli. XpipesCompiler: a Tool for Instantiating Application Specific Networks-on-Chip. In DATE 2004, pp. 884–889. [37] V. Agarwal, M.S. Hrishikesh, S.W. Keckler, and D. Burger. Clock Rate Versus IPC: The End of the Road for Conventional Microarchitectures. In Proceedings of the 27th Annual International Symposium on Computer Architecture, June 2000, pp. 248–250. [38] L.P. Carloni, K.L. McMillan, and A.L. Sangiovanni-Vincentelli. Theory of Latency-Insensitive Design. IEEE Transactions on CAD of ICs and Systems, 20, 2001, 1059–1076. [39] L. Scheffer. Methodologies and Tools for Pipelined On-Chip Interconnects. In International Conference on Computer Design, 2002, pp. 152–157. [40] P. Glaskowsky. Pentium 4 (Partially) Previewed. Microprocessor Report, 14, 2000, 10–13. [41] J. Duato, S. Yalamanchili, and L. Ni. Interconnection Networks: an Engineering Approach. IEEE Computer Society Press, Washington, 1997. [42] E.B. Van der Tol and E.G.T. Jaspers. Mapping of MPEG4 Decoding on a Flexible Architecture Platform. In SPIE 2002, January 2002, pp. 1–13.
22 Platform-Based Design for Embedded Systems 22.1 Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-1 22.2 Platform-Based Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-3 22.3 Platforms at the Articulation Points of the Design Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-4 (Micro-)Architecture Platforms • API Platform • System Platform Stack
22.4 Network Platforms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-7 Definitions • Quality of Service • Design of Network Platforms
22.5 Fault-Tolerant Platforms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-9
Luca P. Carloni, Fernando De Bernardinis, Claudio Pinello, Alberto L. Sangiovanni-Vincentelli, and Marco Sgroi University of California at Berkeley
Types of Faults and Platform Redundancy • Fault-Tolerant Design Methodology • The API Platform (FTDF Primitives) • Fault-Tolerant Deployment • Replica Determinism
22.6 Analog Platforms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-15 Definitions • Building Performance Models • Mixed-Signal Design Flow with Platforms • Reconfigurable Platforms
22.7 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-22 Acknowledgments. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-24 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22-24
22.1 Introduction

Platform-Based Design (PBD) [1,2] has emerged as an important design style as the electronics industry faced serious difficulties owing to three major factors:

1. The disaggregation (or “horizontalization”) of the electronics industry began about a decade ago and has affected the structure of the industry, favoring the move from a vertically oriented business model to a horizontally oriented one. In the past, electronic system companies used to maintain full control of the product development cycle, from product definition to final manufacturing. Today, the identification of a new market opportunity, the definition of the detailed system specifications, the development and assembly of the components, and the manufacturing of the final product are tasks performed more and more frequently by distinct organizations. In fact, the complexity of electronic designs and the number of technologies that must be mastered to bring winning products to market have forced electronic companies to focus on their core competence. In this
scenario, the integration of the design chain becomes a serious problem at the hand-off points from one company to another.

2. The pressure for reducing time-to-market of electronic products in the presence of exponentially increasing complexity has forced designers to adopt methods that favor component reuse at all levels of abstraction. Furthermore, each organization that contributes a component to the final product naturally strives for the flexibility in its design approach that allows it to make continuous adjustments and accommodate last-minute engineering changes.

3. The dramatic increase in NonRecurring Engineering (NRE) costs owing to mask making at the Integrated Circuit (IC) implementation level (a set of masks for the 90 nm technology node costs more than two million US dollars), development of production plants (a new fab costs more than two billion US dollars), and design cost (a new generation microprocessor design requires more than 500 designers with all the associated costs in tools and infrastructure!) has created, on one hand, the necessity of correct-the-first-time designs and, on the other hand, the push for consolidation of efforts in manufacturing.1

The combination of these factors has caused several system companies to substantially reduce their ASIC (Application Specific Integrated Circuits) design efforts. Traditional paradigms in electronic system and IC design have to be revisited and readjusted or altogether abandoned. Along the same line of reasoning, IC manufacturers are moving toward the development of parts that have guaranteed high-volume production from a single mask set (or that are likely to have high-volume production, if successful), thus moving differentiation and optimization to reconfigurability and programmability. Platform-Based Design has emerged over the years as a way of coping with the problems listed earlier.
The term “platform” has been used in several domains: from service providers to system companies, from tier-one suppliers to IC companies. In particular, IC companies have lately been very active in espousing platforms. The TI OMAP platform for cellular phones, the Philips Viper and Nexperia platforms for consumer electronics, and the Intel Centrino platform for laptops are a few examples. Recently, Intel has been characterized by its CEO Otellini as a “platform company.” As is often the case for fairly radical new approaches, the methodology emerged as a sequence of empirical rules and concepts, but we have reached a point where a rigorous design process is needed together with supporting EDA environments and tools. PBD:

• Sets the foundation for developing economically feasible design flows because it is a structured methodology that theoretically limits the space of exploration, yet still achieves superior results within the fixed time constraints of the design.

• Provides a formal mechanism for identifying the most critical hand-off points in the design chain. The hand-off point between system companies and IC design companies and the one between IC design companies (or divisions) and IC manufacturing companies (or divisions) represent the articulation points of the overall design process.

• Eliminates expensive design iterations because it fosters design reuse at all abstraction levels, thus enabling the design of an electronic product by assembling and configuring platform components in a rapid and reliable fashion.

• Provides an intellectual framework for the complete electronic design process.

This chapter presents the foundations of this discipline and outlines a variety of domains where the PBD principles can be applied. In particular, Section 22.2 defines the main principles of PBD. Our goal is to provide a precise reference that may be used as the basis for reaching a common understanding in the electronic system and circuit design community.
Then, we present the platforms that define the articulation points between system definition and implementation (Section 22.3). In the following sections 1 The cost of fabs has changed the landscape of IC manufacturing in a substantial way forcing companies to team up
for developing new technology nodes (see, e.g., the recent agreement among Motorola, Philips, and ST Microelectronics and the creation of Renesas in Japan).
we show that the PBD paradigm can be applied to all levels of design: from very high levels of abstraction, such as communication networks (Section 22.4) and fault-tolerant platforms for the design of safety-critical feedback-control systems (Section 22.5), to low levels, such as analog parts (Section 22.6), where performance is the main focus.
22.2 Platform-Based Design

The basic tenets of PBD are:

• The identification of design as a meeting-in-the-middle process, where successive refinements of specifications meet with abstractions of potential implementations.

• The identification of precisely defined layers where the refinement and abstraction processes take place. Each layer supports a design stage that provides an opaque abstraction of lower layers while allowing accurate performance estimations. This information is incorporated in appropriate parameters that annotate design choices at the present layer of abstraction.

These layers of abstraction are called platforms to stress their role in the design process and their solidity. A platform is a library of components that can be assembled to generate a design at that level of abstraction. This library contains not only computational blocks that carry out the appropriate computation but also communication components that are used to interconnect the functional components. Each element of the library has a characterization in terms of performance parameters together with the functionality it can support. For every platform level, there is a set of methods used to map the upper layers of abstraction into the platform and a set of methods used to estimate the performance of lower-level abstractions. As illustrated in Figure 22.1, the meeting-in-the-middle process is the combination of two efforts:

• Top-down: map an instance of the top platform into an instance of the lower platform and propagate constraints.

• Bottom-up: build a platform by defining the library that characterizes it and a performance abstraction (e.g., number of literals for technology-independent optimization, or area and propagation delay for a cell in a standard cell library).

A platform instance is a set of architecture components that are selected from the library and whose parameters are set.
Often the combination of two consecutive layers and their “filling” can be interpreted as a unique abstraction layer with an “upper” view, the top abstraction layer, and a “lower” view, the bottom layer. A platform stack is a pair of platforms, along with the tools and methods that are used to map the upper layer of abstraction onto the lower layer. Note that we can allow a platform stack to include several sub-stacks if we wish to span a large number of abstractions.
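The vocabulary above maps onto a small data-structure sketch (all names invented for illustration): a platform is a library of characterized components, a platform instance fixes a selection and its parameters, and a platform stack couples two platforms through a mapping function.

```python
# Illustrative data-structure sketch of the PBD vocabulary; names invented.

class Platform:
    def __init__(self, library):
        # library: {component_name: {"perf": {...}, "params": [...]}}
        self.library = library

    def instantiate(self, selection):
        """selection: {component_name: {param: value}} -> platform instance."""
        unknown = set(selection) - set(self.library)
        if unknown:
            raise ValueError(f"not in platform library: {unknown}")
        return {"components": dict(selection)}

class PlatformStack:
    """Two platforms plus the tools/methods mapping the upper onto the lower."""
    def __init__(self, upper, lower, map_fn):
        self.upper, self.lower, self.map_fn = upper, lower, map_fn

    def map_down(self, upper_instance):
        # Top-down phase: map an upper instance onto the lower platform,
        # propagating constraints.
        return self.map_fn(upper_instance, self.lower)
```

A stack spanning several sub-stacks would simply chain `map_down` calls through the intermediate platforms.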
FIGURE 22.1 Interactions between abstraction layers.
Platforms should be defined to eliminate large loop iterations for affordable designs: they should restrict the design space via new forms of regularity and structure that surrender some design potential for lower cost and first-pass success. The library of function and communication components is the design space that we can explore at the appropriate level of abstraction. Establishing the number, location, and components of intermediate platforms is the essence of PBD. In fact, designs with different requirements and specifications may use different intermediate platforms, hence different layers of regularity and design-space constraints. A critical step of the PBD process is the definition of intermediate platforms to support predictability, which enables the abstraction of implementation detail to facilitate higher-level optimization, and verifiability, that is, the ability to formally ensure correctness. The trade-offs involved in the selection of the number and characteristics of platforms relate to the size of the design space to be explored and the accuracy of the estimation of the characteristics of the solution adopted. Naturally, the larger the step across platforms, the more difficult it is to predict performance, optimize at the higher levels of abstraction, and provide a tight lower bound. In fact, the design space for this approach may actually be smaller than the one obtained with smaller steps, because it becomes harder to explore meaningful design alternatives and the restriction on search impedes complete design-space exploration. Ultimately, predictions/abstractions may be so inaccurate that design optimizations are misguided and the lower bounds are incorrect. It is important to emphasize that the PBD paradigm applies to all levels of design. While it is rather easy to grasp the notion of a programmable hardware platform, the concept is completely general and should be exploited through the entire design flow to solve the design problem.
In the following sections, we will show that platforms can be applied to low levels of abstraction such as analog components, where flexibility is minimal and performance is the main focus, as well as to very high levels of abstraction such as networks, where platforms have to provide connectivity and services. In the former case platforms abstract hardware to provide (physical) implementation, while in the latter communication services abstract software layers (protocol) to provide global connectivity.
22.3 Platforms at the Articulation Points of the Design Process

As we mentioned in Section 22.2, the key to the application of the design principle is the careful definition of the platform layers. Platforms can be defined at several points of the design process. Some levels of abstraction are more important than others in the overall design trade-off space. In particular, the articulation point between system definition and implementation is a critical one for design quality and time. Indeed, the very notion of PBD originated at this point (see [1,3–5]). In References 1, 2, and 5, we have discovered that at this level there are indeed two distinct platforms forming a system platform stack. These need to be defined together with the methods and the tools necessary to link them: a (micro-)architecture platform and an Application Programming Interface (API) platform. The API platform allows system designers to use the services that a (micro-)architecture offers. In the world of Personal Computers (PCs), this concept is well known and is the key to the development of application software on different hardware that share some commonalities allowing the definition of a unique API.
22.3.1 (Micro-)Architecture Platforms

Integrated circuits used for embedded systems will most likely be developed as an instance of a particular (micro-)architecture platform. That is, rather than being assembled from a collection of independently developed blocks of silicon functionalities, they will be derived from a specific family of micro-architectures, possibly oriented toward a particular class of problems, that can be extended or reduced by the system developer. The elements of this family are a sort of “hardware denominator” that could be shared across multiple applications. Hence, an architecture platform is a family of micro-architectures that share some commonality, the library of components that are used to define the micro-architecture. Every element
of the family can be obtained quickly through the personalization of an appropriate set of parameters controlling the micro-architecture. Often, the family may have additional constraints on the components of the library that can or should be used. For example, a particular micro-architecture platform may be characterized by the same programmable processor and the same interconnection scheme, while the peripherals and the memories of a specific implementation may be selected from the predesigned library of components depending on the given application. Depending on the implementation platform that is chosen, each element of the family may still need to go through the standard manufacturing process, including mask making. This approach thus reconciles the need to save design time with the optimization of each element of the family for the application at hand. Although it does not solve the mask cost issue directly, it should be noted that the mask cost problem is primarily owing to the generation of multiple mask sets for multiple design spins, which is addressed by the architecture platform methodology. The less constrained the platform, the more freedom a designer has in selecting an instance and the more potential there is for optimization, if time permits. However, more constraints mean stronger standards and easier addition of components to the library that defines the architecture platform (as with PC platforms). Note that the basic concept is similar to the cell-based design layout style, where regularity and the reuse of library elements allow faster design time at the expense of some optimality. The trade-off between design time and design “quality” needs to be kept in mind. The economics of the design problem must dictate the choice of the design style. The higher the granularity of the library, the more leverage we have in shortening the design time. Given that the elements of the library are reused, there is a strong incentive to optimize them.
In fact, we argue that the "macro-cells" should be designed with great care and attention given to area and performance. It also makes sense to offer a variety of cells with the same functionality but with implementations that differ in performance, area, and power dissipation. Architecture platforms are, in general, characterized by (but not limited to) the presence of programmable components. Each of the platform instances that can be derived from the architecture platform then maintains enough flexibility to support an application space that guarantees the production volumes required for economically viable manufacturing. The library that defines the architecture platform may also contain reconfigurable components, which come in two flavors. With runtime reconfigurability, FPGA (Field-Programmable Gate Array) blocks can be customized by the user without the need to change the mask set, thus saving both design cost and fabrication cost. With design-time reconfigurability, where the silicon is still application specific, only design time is reduced. An architecture platform instance is derived from an architecture platform by choosing a set of components from its library and by setting the parameters of the library's reconfigurable components. The flexibility, or the capability of supporting different applications, of a platform instance is guaranteed by programmable components. Programmability comes in various forms. One is software programmability, indicating the presence of a microprocessor, Digital Signal Processor (DSP), or any other software-programmable component. Another is hardware programmability, indicating the presence of reconfigurable logic blocks such as FPGAs, whereby the logic function can be changed by software tools without requiring a custom set of masks. Some of the new architecture and/or implementation platforms being offered in the market mix the two types on a single chip.
For example, Triscend, Altera, and Xilinx are offering FPGA fabrics with embedded hard processors. Software programmability yields a more flexible solution, since modifying software is, in general, faster and cheaper than modifying FPGA personalities. On the other hand, logic functions mapped on FPGAs execute orders of magnitude faster and with much less power than the corresponding implementation as a software program. Thus, the trade-off here is between flexibility and performance.
22.3.2 API Platform

The concept of architecture platform by itself is not enough to achieve the level of application software reuse we require. The architecture platform has to be abstracted at a level where the application software "sees" a high-level interface to the hardware that we call the API or Programmer Model. A software
© 2006 by Taylor & Francis Group, LLC
layer is used to perform this abstraction. This layer wraps the essential parts of the architecture platform:

• The programmable cores and the memory subsystem via a Real-Time Operating System (RTOS).
• The I/O subsystem via the device drivers.
• The network connection via the network communication subsystem.

In our framework, the API is a unique abstract representation of the architecture platform via the software layer. Therefore, the application software can be reused for every platform instance. Indeed, the API is a platform itself, which we call the API platform. Of course, the higher the abstraction level at which a platform is defined, the more instances it contains. For example, to share source code, we need to have the same operating system but not necessarily the same instruction set, while to share binary code, we must add the architectural constraints that force us to use the same ISA (Instruction Set Architecture), thus greatly restricting the range of architectural choices. The RTOS is responsible for scheduling the available computing resources and the communication between them and the memory subsystem. Note that in several embedded system applications the available computing resources consist of a single microprocessor. In others, such as wireless handsets, the combination of a Reduced Instruction Set Computer (RISC) microprocessor or controller and a DSP has been used widely in 2G, and now in 2.5G, 3G, and beyond. In set-top boxes, a RISC for control and a media processor have also been used. In general, we can imagine a multiple-core architecture platform where the RTOS schedules software processes across different computing engines.
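As a purely illustrative sketch, the software layer can be viewed as a small C API behind which the driver, RTOS, or network stack is hidden, so the application can be recompiled for any platform instance. The function names and the trivial in-memory "loopback driver" below are our own assumptions, not an API from the chapter:

```c
#include <stddef.h>
#include <string.h>
#include <assert.h>

/* Hypothetical API platform: the application sees only these two calls,
 * never the underlying RTOS, device driver, or network subsystem. */
int platform_send(int channel, const void *buf, size_t len);
int platform_recv(int channel, void *buf, size_t len);

/* One possible binding of the API to an architecture platform instance:
 * a trivial in-memory loopback standing in for a real device driver. */
static unsigned char loopback[256];
static size_t loopback_len;

int platform_send(int channel, const void *buf, size_t len)
{
    (void)channel;
    if (len > sizeof loopback) return -1;
    memcpy(loopback, buf, len);
    loopback_len = len;
    return 0;
}

int platform_recv(int channel, void *buf, size_t len)
{
    (void)channel;
    if (len < loopback_len) return -1;
    memcpy(buf, loopback, loopback_len);
    return (int)loopback_len;
}
```

Porting the application to another instance would mean rebinding these two functions to a different driver or network stack while the application code above the API stays untouched.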
22.3.3 System Platform Stack

The basic idea of the system platform stack is captured in Figure 22.2. The vertex of the two cones represents the combination of the API and the architecture platform. A system designer maps an application onto the abstract representation that "includes" a family of architectures that can be chosen to optimize cost, efficiency, energy consumption, and flexibility. The mapping of the application onto the actual architecture in the family specified by the API can be carried out, at least in part, automatically if a set of appropriate software tools (e.g., software synthesis, RTOS synthesis, device-driver synthesis) is available. It is clear that the synthesis tools have to be aware of the architecture features as well as of the API. This set of tools makes use of the software layer to go from the API platform to the architecture platform. Note that the system platform effectively decouples the application development process (the upper triangle) from the architecture implementation process (the lower triangle). Note also that, once we use the abstract definition of "API" as described earlier, we may obtain extreme cases such as traditional PC platforms on
FIGURE 22.2 System platform stack.
one side and full hardware implementation on the other. Of course, the programmer model for a full custom hardware solution is trivial, since there is a one-to-one map between the functions to be implemented and the physical blocks that implement them. In the latter case, PBD amounts to adding some higher levels of abstraction to traditional design methodologies.
22.4 Network Platforms

In distributed systems, the design of the protocols and channels that support communication among the system components is a difficult task owing to tight constraints on performance and cost. To make the communication design problem more manageable, designers usually decompose the communication function into distinct protocol layers and design each layer separately. According to this approach, of which the Open Systems Interconnection (OSI) Reference Model is a particular instance, each protocol layer together with the lower layers defines a platform that provides CSs to the upper layers and to the application-level components. Identifying the most effective layered architecture for a given application requires solving a trade-off between performance, which improves by minimizing the number of layers, and design manageability, which improves with the number of intermediate steps. Present embedded system applications, owing to their tight constraints, increasingly demand the codesign of protocol functions that in less-constrained applications are assigned to different layers and considered separately (e.g., cross-layer design of MAC and routing protocols in sensor networks). The definition of an optimal layered architecture, the design of the correct functionality for each protocol layer, and the design-space exploration for the choice of the physical implementation must be supported by tools and methodologies that allow designers to evaluate performance and guarantee that the constraints are satisfied after each step. For these reasons, we believe that the PBD principles and methodology provide the right framework for designing communication networks. In this section, we first formalize the concept of Network Platform (NP). Then, we outline a methodology for selecting, composing, and refining NPs [6].
22.4.1 Definitions

A Network Platform is a library of resources that can be selected and composed together to form a Network Platform Instance (NPI) that supports the interaction among a group of interacting components. The structure of an NPI is defined by abstracting computation resources as nodes and communication resources as links. Ports interface nodes with links or with the environment of the NPI. The structure of a node or a link is defined by its input and output ports; the structure of an NPI is defined by a set of nodes and the links connecting them. The behaviors and the performance of an NPI are defined in terms of the type and the quality of the CSs it offers. We formalize the behaviors of an NPI using the Tagged Signal Model [7]. NPI components are modeled as processes, and events model the instances of the send and receive actions of the processes. An event is associated with a message that has a type and a value, and with tags that specify attributes of the corresponding action instance (e.g., when it occurs in time). The set of behaviors of an NPI is defined by the intersection of the behaviors of the component processes. An NPI is defined as a tuple NPI = (L, N, P, S), where:

• L = {L_1, L_2, ..., L_Nl} is a set of directed links.
• N = {N_1, N_2, ..., N_Nn} is a set of nodes.
• P = {P_1, P_2, ..., P_Np} is a set of ports. A port P_i is a triple (N_i, L_i, d), where N_i ∈ N is a node, L_i ∈ L ∪ Env is a link or the NPI environment, and d = in if it is an input port, d = out if it is an output port. The ports that interface the NPI with the environment define the sets P^in = {(N_i, Env, in)} ⊆ P and P^out = {(N_i, Env, out)} ⊆ P.
• S = ⋂_{i=1}^{Nn+Nl} R_i is the set of behaviors, where R_i indicates the set of behaviors of a resource that can be a link in L or a node in N.
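The structural part of this definition translates directly into data types. The sketch below encodes nodes, links, and ports as C structs, with a NULL link standing for the environment Env; the type and helper names are our own choices, used only to make the port triple (N_i, L_i, d) concrete:

```c
#include <assert.h>
#include <stddef.h>

typedef enum { DIR_IN, DIR_OUT } PortDir;

typedef struct { int id; } Node;   /* computation resource */
typedef struct { int id; } Link;   /* communication resource */

/* A port is a triple (node, link-or-environment, direction);
 * link == NULL stands for the NPI environment Env. */
typedef struct {
    const Node *node;
    const Link *link;   /* NULL means Env */
    PortDir     dir;
} Port;

/* A port belongs to P^in / P^out iff it faces the environment. */
static int is_env_input(const Port *p)  { return p->link == NULL && p->dir == DIR_IN; }
static int is_env_output(const Port *p) { return p->link == NULL && p->dir == DIR_OUT; }
```

An NPI would then carry arrays of these nodes, links, and ports; the behavior set S has no such direct encoding, since it is a mathematical intersection of behaviors rather than a data structure.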
The basic services provided by an NPI are called Communication Services (CSs). A CS consists of a sequence of message exchanges through the NPI from its input to its output ports. A CS can be accessed by NPI users through the invocation of send and receive primitives whose instances are modeled as events. An NPI API consists of the set of methods that are invoked by the NPI users to access the CSs. For the definition of an NPI API it is essential to specify not only the service primitives but also the type of CS they provide access to (e.g., reliable send, out-of-order delivery, etc.). Formally, a CS is a tuple (P̄^in, P̄^out, M, E, h, g,

releaseData (port)
releaseRoom (port)
The acquire and release functions acquire/release a single token at a time. Supporting vector operations for these interface types would result in a complex interface: for example, it would expose the wrap-around in the channel buffer or would require a vector of references to be returned. Since tasks must still be able to acquire more than one token, these acquire functions acquire the first un-acquired token and change the state of the channel, unlike the reacquire functions of RB and RN. The release functions release the oldest acquired token on port. The interface types are named DBI and DNI, with D for direct, B for blocking, N for non-blocking, and I for in-order, as tokens are released in the same order as they are acquired. These interface types can be implemented efficiently on shared-memory architectures [7] and are suited for software tasks that process coarse-grain tokens.

25.4.1.5 Interface Type DBO and DNO

In some cases tasks do not finish the processing of data in the same order as the data was acquired. In particular, when large tokens are used, it should be possible to release a token as soon as a task is finished with it. For this purpose TTL offers the DBO and DNO interface types (O for out-of-order). The only difference from the DBI and DNI interface types is in the release functions:

releaseData (port, &token)
releaseRoom (port, &token)
The token reference allows the task to specify which token should be released. The out-of-order release supports efficient use of memory at the cost of a more complex implementation of the channel.
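Unlike the in-order types, which can simply free the oldest acquired slot, an out-of-order channel must track which individual slots have been released, since the oldest acquired token may still be held. The following is a minimal sketch of that bookkeeping under our own naming assumptions; it models the idea, not the TTL implementation:

```c
#include <assert.h>
#include <stddef.h>

#define SLOTS 4

typedef struct {
    int data[SLOTS];
    int held[SLOTS];   /* 1 if the slot is currently acquired */
} OooChannel;

/* Acquire the first un-acquired slot, as with the in-order types. */
static int *ooo_acquire(OooChannel *c)
{
    for (int i = 0; i < SLOTS; i++)
        if (!c->held[i]) { c->held[i] = 1; return &c->data[i]; }
    return NULL;  /* channel full */
}

/* Release by token reference: any held slot may be freed, in any order. */
static void ooo_release(OooChannel *c, int *token)
{
    ptrdiff_t i = token - c->data;
    assert(i >= 0 && i < SLOTS && c->held[i]);
    c->held[i] = 0;
}
```

The per-slot flag is the extra implementation cost the text alludes to: freed "holes" between held slots must be remembered so their memory can be reused.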
25.4.2 TTL Multi-Tasking Interface

To support different forms of multi-tasking, TTL offers different ways for tasks to interact with the scheduler. To this end, TTL supports three task types.

The task type process is for tasks that have their own (virtual) thread of execution and that do not explicitly interact with the scheduler. This task type is suited for tasks that have their private processing
resource or that rely on the platform infrastructure to perform task switching and state saving implicitly. For example, this task type is well suited for software tasks executing on an OS. The task type co-routine is for cooperative tasks that interact explicitly with the scheduler at points in their execution where task switching is acceptable. For this purpose TTL offers a suspend function. This task type may be used to reduce the task-switching overhead by allowing a task to suspend itself at points where only little state needs to be saved. The task type actor is for fire-and-exit tasks that perform a finite amount of computation and then return to the scheduler, similar to a function call. Unless explicitly saved, state is lost upon return. This task type may be used for a set of tasks that have to be scheduled statically.
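The actor task type pairs naturally with a static, table-driven scheduler: each entry runs to completion and returns, and any state that must survive a firing is kept explicitly by the task. A minimal sketch under these assumptions (the actor names echo the IQ/IZZ example used later; the scheduler is our own, not TTL's):

```c
#include <assert.h>

/* Actor tasks: finite work per firing; state that must survive a firing
 * is kept explicitly, here in file-scope variables. */
static int iq_count, izz_count;

static void iq_actor(void)  { iq_count++;  }
static void izz_actor(void) { izz_count++; }

typedef void (*Actor)(void);

/* Static schedule: the sequence of firings is fixed at design time. */
static const Actor schedule[] = { iq_actor, izz_actor, iq_actor };

static void run_schedule(void)
{
    for (unsigned i = 0; i < sizeof schedule / sizeof schedule[0]; i++)
        schedule[i]();   /* fire-and-exit: each call returns to the scheduler */
}
```

Because every firing returns to the scheduler, no per-task stack or context switch is needed, which is what makes this task type attractive for statically scheduled sets of tasks.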
25.4.3 TTL APIs

The TTL interface is available both as a C++ and as a C API. The use of C++ gives cleaner descriptions of task interfaces, owing to C++ support for templates and function overloading. We use C to link to software compilers for embedded processors and to hardware synthesizers, since most of them do not support C++ as an input language. For both the C++ API and the C API we offer a generic run-time environment, which can be used for functional modeling and verification of TTL application models.
25.5 Multiprocessor Mapping

In this section we present a systematic approach to mapping applications efficiently onto multiprocessors. The key advantage of TTL is that it provides a smooth transition from application development to application implementation. In our approach, we rewrite the source code of applications to improve efficiency. We focus on source code transformations for multiprocessor architectures, taking into account the costs of memory usage, synchronization cycles, data transfer cycles, and address generation cycles. We do not consider algorithmic transformations, because these transformations are application-specific; typically, application developers perform them. We also do not consider code transformations for single target processors, because these transformations are processor-specific. We assume that processor-specific compilers and synthesizers support these transformations, although in today's practice programmers also write processor-specific C. In the remainder of this section we present methods and tools to transform source code. First we present source code transformations to illustrate the advantages of using TTL. Next we present tools that we developed to automate these transformations.
25.5.1 Source Code Transformation

We use a simple example to illustrate the use of TTL. The example consists of an inverse quantization (IQ) task that produces data for an inverse zigzag (IZZ) task; see Figure 25.4. We focus on the interaction between these two tasks. The TTL interface supports different inter-task communication interface types that provide a trade-off between abstraction and efficiency. We illustrate this by means of code fragments. To save space, we indicate scopes by means of indentation rather than curly braces.

25.5.1.1 Optimization for Single Interface Types

The most abstract and easy-to-use interface type is CB, which combines synchronization and data transfer in write and read functions. Figure 25.5 shows a fragment of the IQ task that reads input (Line 08),
FIGURE 25.4 IQ and IZZ example.
Design and Programming of Embedded Multiprocessors
25-9
01 void IQ::main()
02   while (true)
03     for (int j=0; j
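The flavor of the CB interface can also be conveyed by a self-contained sketch in which write and read combine synchronization and data transfer, so the task body stays free of buffer management. The FIFO model, function names, and the dequantization step below are our own assumptions, not the TTL code from Figure 25.5:

```c
#include <assert.h>

#define CAP 64

/* Stand-in CB channel: write and read combine synchronization and data
 * transfer (a real implementation would block when full/empty). */
typedef struct { int buf[CAP]; int head, tail; } Fifo;

static void cb_write(Fifo *f, int v) { f->buf[f->tail++ % CAP] = v; }
static int  cb_read (Fifo *f)        { return f->buf[f->head++ % CAP]; }

/* IQ-like producer: dequantize coefficients and write them to the
 * channel feeding IZZ; the step size of 2 is purely illustrative. */
static void iq_produce(Fifo *out, const int *coef, int n)
{
    for (int j = 0; j < n; j++)
        cb_write(out, coef[j] * 2);
}
```

Because synchronization is hidden inside cb_write and cb_read, the same task body can later be retargeted to the more efficient acquire/release interface types without changing its control structure.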