RECONFIGURABLE COMPUTING
The Morgan Kaufmann Series in Systems on Silicon
Series Editor: Wayne Wolf, Georgia Institute of Technology

The Designer’s Guide to VHDL, Second Edition
  Peter J. Ashenden
The System Designer’s Guide to VHDL-AMS
  Peter J. Ashenden, Gregory D. Peterson, and Darrell A. Teegarden
Modeling Embedded Systems and SoCs
  Axel Jantsch
ASIC and FPGA Verification: A Guide to Component Modeling
  Richard Munden
Multiprocessor Systems-on-Chips
  Edited by Ahmed Amine Jerraya and Wayne Wolf
Functional Verification
  Bruce Wile, John Goss, and Wolfgang Roesner
Customizable and Configurable Embedded Processors
  Edited by Paolo Ienne and Rainer Leupers
Networks-on-Chips: Technology and Tools
  Edited by Giovanni De Micheli and Luca Benini
VLSI Test Principles & Architectures
  Edited by Laung-Terng Wang, Cheng-Wen Wu, and Xiaoqing Wen
Designing SoCs with Configured Processors
  Steve Leibson
ESL Design and Verification
  Grant Martin, Andrew Piziali, and Brian Bailey
Aspect-Oriented Programming with e
  David Robinson
Reconfigurable Computing: The Theory and Practice of FPGA-Based Computation
  Edited by Scott Hauck and André DeHon

Coming Soon . . .

System-on-Chip Test Architectures
  Edited by Laung-Terng Wang, Charles Stroud, and Nur Touba
Verification Techniques for System-Level Design
  Masahiro Fujita, Indradeep Ghosh, and Mukul Prasad
RECONFIGURABLE COMPUTING
THE THEORY AND PRACTICE OF FPGA-BASED COMPUTATION
Edited by Scott Hauck and André DeHon
AMSTERDAM • BOSTON • HEIDELBERG • LONDON • NEW YORK • OXFORD • PARIS • SAN DIEGO • SAN FRANCISCO • SYDNEY • TOKYO
Morgan Kaufmann is an imprint of Elsevier
Publisher: Denise E. M. Penrose
Senior Acquisitions Editor: Charles B. Glaser
Publishing Services Manager: George Morrison
Project Manager: Marilyn E. Rash
Assistant Editors: Michele Cronin, Matthew Cater
Copyeditor: Dianne Wood
Proofreader: Jodie Allen
Indexer: Steve Rath
Cover Image: © istockphoto
Typesetting: diacriTech
Illustration Formatting: diacriTech
Interior Printer: Maple-Vail Book Manufacturing Group
Cover Printer: Phoenix Color Corp.
Morgan Kaufmann Publishers is an imprint of Elsevier.
30 Corporate Drive, Suite 400, Burlington, MA 01803-4255

This book is printed on acid-free paper.

Copyright © 2008 by Elsevier Inc. All rights reserved.
Designations used by companies to distinguish their products are often claimed as trademarks or registered trademarks. In all instances in which Morgan Kaufmann Publishers is aware of a claim, the product names appear in initial capital or all capital letters. Readers, however, should contact the appropriate companies for more complete information regarding trademarks and registration.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means—electronic, mechanical, photocopying, scanning, or otherwise—without prior written permission of the publisher.

Permissions may be sought directly from Elsevier’s Science & Technology Rights Department in Oxford, UK: phone: (+44) 1865 843830, fax: (+44) 1865 853333, e-mail: [email protected]. You may also complete your request on-line via the Elsevier homepage (http://elsevier.com), by selecting “Support & Contact” then “Copyright and Permission” and then “Obtaining Permissions.”

Library of Congress Cataloging-in-Publication Data
Reconfigurable computing: the theory and practice of FPGA-based computation/edited by Scott Hauck, André DeHon.
p. cm. — (Systems on silicon)
Includes bibliographical references and index.
ISBN 978-0-12-370522-8 (alk. paper)
1. Adaptive computing systems. 2. Field-programmable gate arrays. I. Hauck, Scott. II. DeHon, André.
QA76.9.A3R43 2008 2007029773

For information on all Morgan Kaufmann publications, visit our Web site at www.mkp.com or www.books.elsevier.com.

Printed in the United States
08 09 10 11 12 10 9 8 7 6 5 4 3 2 1
CONTENTS

List of Contributors
Preface
Introduction

Part I: Reconfigurable Computing Hardware

1 Device Architecture
  1.1 Logic—The Computational Fabric
    1.1.1 Logic Elements
    1.1.2 Programmability
  1.2 The Array and Interconnect
    1.2.1 Interconnect Structures
    1.2.2 Programmability
    1.2.3 Summary
  1.3 Extending Logic
    1.3.1 Extended Logic Elements
    1.3.2 Summary
  1.4 Configuration
    1.4.1 SRAM
    1.4.2 Flash Memory
    1.4.3 Antifuse
    1.4.4 Summary
  1.5 Case Studies
    1.5.1 Altera Stratix
    1.5.2 Xilinx Virtex-II Pro
  1.6 Summary
  References

2 Reconfigurable Computing Architectures
  2.1 Reconfigurable Processing Fabric Architectures
    2.1.1 Fine-grained
    2.1.2 Coarse-grained
  2.2 RPF Integration into Traditional Computing Systems
    2.2.1 Independent Reconfigurable Coprocessor Architectures
    2.2.2 Processor + RPF Architectures
  2.3 Summary and Future Work
  References

3 Reconfigurable Computing Systems
  3.1 Early Systems
  3.2 PAM, VCC, and Splash
    3.2.1 PAM
    3.2.2 Virtual Computer
    3.2.3 Splash
  3.3 Small-scale Reconfigurable Systems
    3.3.1 PRISM
    3.3.2 CAL and XC6200
    3.3.3 Cloning
  3.4 Circuit Emulation
    3.4.1 AMD/Intel
    3.4.2 Virtual Wires
  3.5 Accelerating Technology
    3.5.1 Teramac
  3.6 Reconfigurable Supercomputing
    3.6.1 Cray, SRC, and Silicon Graphics
    3.6.2 The CMX-2X
  3.7 Non-FPGA Research
  3.8 Other System Issues
  3.9 The Future of Reconfigurable Systems
  References

4 Reconfiguration Management
  4.1 Reconfiguration
  4.2 Configuration Architectures
    4.2.1 Single-context
    4.2.2 Multi-context
    4.2.3 Partially Reconfigurable
    4.2.4 Relocation and Defragmentation
    4.2.5 Pipeline Reconfigurable
    4.2.6 Block Reconfigurable
    4.2.7 Summary
  4.3 Managing the Reconfiguration Process
    4.3.1 Configuration Grouping
    4.3.2 Configuration Caching
    4.3.3 Configuration Scheduling
    4.3.4 Software-based Relocation and Defragmentation
    4.3.5 Context Switching
  4.4 Reducing Configuration Transfer Time
    4.4.1 Architectural Approaches
    4.4.2 Configuration Compression
    4.4.3 Configuration Data Reuse
  4.5 Configuration Security
  4.6 Summary
  References

Part II: Programming Reconfigurable Systems

5 Compute Models and System Architectures
  5.1 Compute Models
    5.1.1 Challenges
    5.1.2 Common Primitives
    5.1.3 Dataflow
    5.1.4 Sequential Control
    5.1.5 Data Parallel
    5.1.6 Data-centric
    5.1.7 Multi-threaded
    5.1.8 Other Compute Models
  5.2 System Architectures
    5.2.1 Streaming Dataflow
    5.2.2 Sequential Control
    5.2.3 Bulk Synchronous Parallelism
    5.2.4 Data Parallel
    5.2.5 Cellular Automata
    5.2.6 Multi-threaded
    5.2.7 Hierarchical Composition
  References

6 Programming FPGA Applications in VHDL
  6.1 VHDL Programming
    6.1.1 Structural Description
    6.1.2 RTL Description
    6.1.3 Parametric Hardware Generation
    6.1.4 Finite-state Machine Datapath Example
    6.1.5 Advanced Topics
  6.2 Hardware Compilation Flow
    6.2.1 Constraints
  6.3 Limitations of VHDL
  References

7 Compiling C for Spatial Computing
  7.1 Overview of How C Code Runs on Spatial Hardware
    7.1.1 Data Connections between Operations
    7.1.2 Memory
    7.1.3 If-then-else Using Multiplexers
    7.1.4 Actual Control Flow
    7.1.5 Optimizing the Common Path
    7.1.6 Summary and Challenges
  7.2 Automatic Compilation
    7.2.1 Hyperblocks
    7.2.2 Building a Dataflow Graph for a Hyperblock
    7.2.3 DFG Optimization
    7.2.4 From DFG to Reconfigurable Fabric
  7.3 Uses and Variations of C Compilation to Hardware
    7.3.1 Automatic HW/SW Partitioning
    7.3.2 Programmer Assistance
  7.4 Summary
  References

8 Programming Streaming FPGA Applications Using Block Diagrams in Simulink
  8.1 Designing High-performance Datapaths Using Stream-based Operators
  8.2 An Image-processing Design Driver
    8.2.1 Converting RGB Video to Grayscale
    8.2.2 Two-dimensional Video Filtering
    8.2.3 Mapping the Video Filter to the BEE2 FPGA Platform
  8.3 Specifying Control in Simulink
    8.3.1 Explicit Controller Design with Simulink Blocks
    8.3.2 Controller Design Using the Matlab M Language
    8.3.3 Controller Design Using VHDL or Verilog
    8.3.4 Controller Design Using Embedded Microprocessors
  8.4 Component Reuse: Libraries of Simple and Complex Subsystems
    8.4.1 Signal-processing Primitives
    8.4.2 Tiled Subsystems
  8.5 Summary
  References

9 Stream Computations Organized for Reconfigurable Execution
  9.1 Programming
    9.1.1 Task Description Format
    9.1.2 C++ Integration and Composition
  9.2 System Architecture and Execution Patterns
    9.2.1 Stream Support
    9.2.2 Phased Reconfiguration
    9.2.3 Sequential versus Parallel
    9.2.4 Fixed-size and Standard I/O Page
  9.3 Compilation
  9.4 Runtime
    9.4.1 Scheduling
    9.4.2 Placement
    9.4.3 Routing
  9.5 Highlights
  References

10 Programming Data Parallel FPGA Applications Using the SIMD/Vector Model
  10.1 SIMD Computing on FPGAs: An Example
  10.2 SIMD Processing Architectures
  10.3 Data Parallel Languages
  10.4 Reconfigurable Computers for SIMD/Vector Processing
  10.5 Variations of SIMD/Vector Computing
    10.5.1 Multiple SIMD Engines
    10.5.2 A Multi-SIMD Coarse-grained Array
    10.5.3 SPMD Model
  10.6 Pipelined SIMD/Vector Processing
  10.7 Summary
  References

11 Operating System Support for Reconfigurable Computing
  11.1 History
  11.2 Abstracted Hardware Resources
    11.2.1 Programming Model
  11.3 Flexible Binding
    11.3.1 Install Time Binding
    11.3.2 Runtime Binding
    11.3.3 Fast CAD for Flexible Binding
  11.4 Scheduling
    11.4.1 On-demand Scheduling
    11.4.2 Static Scheduling
    11.4.3 Dynamic Scheduling
    11.4.4 Quasi-static Scheduling
    11.4.5 Real-time Scheduling
    11.4.6 Preemption
  11.5 Communication
    11.5.1 Communication Styles
    11.5.2 Virtual Memory
    11.5.3 I/O
    11.5.4 Uncertain Communication Latency
  11.6 Synchronization
    11.6.1 Explicit Synchronization
    11.6.2 Implicit Synchronization
    11.6.3 Deadlock Prevention
  11.7 Protection
    11.7.1 Hardware Protection
    11.7.2 Intertask Communication
    11.7.3 Task Configuration Protection
  11.8 Summary
  References

12 The JHDL Design and Debug System
  12.1 JHDL Background and Motivation
  12.2 The JHDL Design Language
    12.2.1 Level-1 Design: Primitive Instantiation
    12.2.2 Level-2 Design: Using the Logic Class and Its Provided Methods
    12.2.3 Level-3 Design: Programmatic Circuit Generation (Module Generators)
    12.2.4 JHDL Is a Structural Design Language
    12.2.5 JHDL Is a Programmatic Circuit Design Language
  12.3 The JHDL CAD System
    12.3.1 Testbenches in JHDL
    12.3.2 The cvt Class
  12.4 JHDL’s Hardware Mode
  12.5 Advanced JHDL Capabilities
    12.5.1 Dynamic Testbenches
    12.5.2 Behavioral Synthesis
    12.5.3 Advanced Debugging Capabilities
  12.6 Summary
  References

Part III: Mapping Designs to Reconfigurable Platforms

13 Technology Mapping
  13.1 Structural Mapping Algorithms
    13.1.1 Cut Generation
    13.1.2 Area-oriented Mapping
    13.1.3 Performance-driven Mapping
    13.1.4 Power-aware Mapping
  13.2 Integrated Mapping Algorithms
    13.2.1 Simultaneous Logic Synthesis, Mapping
    13.2.2 Integrated Retiming, Mapping
    13.2.3 Placement-driven Mapping
  13.3 Mapping Algorithms for Heterogeneous Resources
    13.3.1 Mapping to LUTs of Different Input Sizes
    13.3.2 Mapping to Complex Logic Blocks
    13.3.3 Mapping Logic to Embedded Memory Blocks
    13.3.4 Mapping to Macrocells
  13.4 Summary
  References

FPGA Placement

14 Placement for General-purpose FPGAs
  14.1 The FPGA Placement Problem
    14.1.1 Device Legality Constraints
    14.1.2 Optimization Goals
    14.1.3 Designer Placement Directives
  14.2 Clustering
  14.3 Simulated Annealing for Placement
    14.3.1 VPR and Related Annealing Algorithms
    14.3.2 Simultaneous Placement and Routing with Annealing
  14.4 Partition-based Placement
  14.5 Analytic Placement
  14.6 Further Reading and Open Challenges
  References

15 Datapath Composition
  15.1 Fundamentals
    15.1.1 Regularity
    15.1.2 Datapath Layout
  15.2 Tool Flow Overview
  15.3 The Impact of Device Architecture
    15.3.1 Architecture Irregularities
  15.4 The Interface to Module Generators
    15.4.1 The Flow Interface
    15.4.2 The Data Model
    15.4.3 The Library Specification
    15.4.4 The Intra-module Layout
  15.5 The Mapping
    15.5.1 1:1 Mapping
    15.5.2 N:1 Mapping
    15.5.3 The Combined Approach
  15.6 Placement
    15.6.1 Linear Placement
    15.6.2 Constrained Two-dimensional Placement
    15.6.3 Two-dimensional Placement
  15.7 Compaction
    15.7.1 Selecting HWOPs for Compaction
    15.7.2 Regularity Analysis
    15.7.3 Optimization Techniques
    15.7.4 Building the Super-HWOP
    15.7.5 Discussion
  15.8 Summary and Future Work
  References

16 Specifying Circuit Layout on FPGAs
  16.1 The Problem
  16.2 Explicit Cartesian Layout Specification
  16.3 Algebraic Layout Specification
    16.3.1 Case Study: Batcher’s Bitonic Sorter
  16.4 Layout Verification for Parameterized Designs
  16.5 Summary
  References

17 PathFinder: A Negotiation-based, Performance-driven Router for FPGAs
  17.1 The History of PathFinder
  17.2 The PathFinder Algorithm
    17.2.1 The Circuit Graph Model
    17.2.2 A Negotiated Congestion Router
    17.2.3 The Negotiated Congestion/Delay Router
    17.2.4 Applying A* to PathFinder
  17.3 Enhancements and Extensions to PathFinder
    17.3.1 Incremental Rerouting
    17.3.2 The Cost Function
    17.3.3 Resource Cost
    17.3.4 The Relationship of PathFinder to Lagrangian Relaxation
    17.3.5 Circuit Graph Extensions
  17.4 Parallel PathFinder
  17.5 Other Applications of the PathFinder Algorithm
  17.6 Summary
  References

18 Retiming, Repipelining, and C-slow Retiming
  18.1 Retiming: Concepts, Algorithm, and Restrictions
  18.2 Repipelining and C-slow Retiming
    18.2.1 Repipelining
    18.2.2 C-slow Retiming
  18.3 Implementations of Retiming
  18.4 Retiming on Fixed-frequency FPGAs
  18.5 C-slowing as Multi-threading
  18.6 Why Isn’t Retiming Ubiquitous?
  References

19 Configuration Bitstream Generation
  19.1 The Bitstream
  19.2 Downloading Mechanisms
  19.3 Software to Generate Configuration Data
  19.4 Summary
  References

20 Fast Compilation Techniques
  20.1 Accelerating Classical Techniques
    20.1.1 Accelerating Simulated Annealing
    20.1.2 Accelerating PathFinder
  20.2 Alternative Algorithms
    20.2.1 Multiphase Solutions
    20.2.2 Incremental Place and Route
  20.3 Effect of Architecture
  20.4 Summary
  References

Part IV: Application Development

21 Implementing Applications with FPGAs
  21.1 Strengths and Weaknesses of FPGAs
    21.1.1 Time to Market
    21.1.2 Cost
    21.1.3 Development Time
    21.1.4 Power Consumption
    21.1.5 Debug and Verification
    21.1.6 FPGAs and Microprocessors
  21.2 Application Characteristics and Performance
    21.2.1 Computational Characteristics and Performance
    21.2.2 I/O and Performance
  21.3 General Implementation Strategies for FPGA-based Systems
    21.3.1 Configure-once
    21.3.2 Runtime Reconfiguration
    21.3.3 Summary of Implementation Issues
  21.4 Implementing Arithmetic in FPGAs
    21.4.1 Fixed-point Number Representation and Arithmetic
    21.4.2 Floating-point Arithmetic
    21.4.3 Block Floating Point
    21.4.4 Constant Folding and Data-oriented Specialization
  21.5 Summary
  References

22 Instance-specific Design
  22.1 Instance-specific Design
    22.1.1 Taxonomy
    22.1.2 Approaches
    22.1.3 Examples of Instance-specific Designs
  22.2 Partial Evaluation
    22.2.1 Motivation
    22.2.2 Process of Specialization
    22.2.3 Partial Evaluation in Practice
    22.2.4 Partial Evaluation of a Multiplier
    22.2.5 Partial Evaluation at Runtime
    22.2.6 FPGA-specific Concerns
  22.3 Summary
  References

23 Precision Analysis for Fixed-point Computation
  23.1 Fixed-point Number System
    23.1.1 Multiple-wordlength Paradigm
    23.1.2 Optimization for Multiple Wordlength
  23.2 Peak Value Estimation
    23.2.1 Analytic Peak Estimation
    23.2.2 Simulation-based Peak Estimation
    23.2.3 Summary of Peak Estimation
  23.3 Wordlength Optimization
    23.3.1 Error Estimation and Area Models
    23.3.2 Search Techniques
  23.4 Summary
  References

24 Distributed Arithmetic
  24.1 Theory
  24.2 DA Implementation
  24.3 Mapping DA onto FPGAs
  24.4 Improving DA Performance
xiv
    24.5 An Application of DA on an FPGA
    References

25 CORDIC Architectures for FPGA Computing
    25.1 CORDIC Algorithm
        25.1.1 Rotation Mode
        25.1.2 Scaling Considerations
        25.1.3 Vectoring Mode
        25.1.4 Multiple Coordinate Systems and a Unified Description
        25.1.5 Computational Accuracy
    25.2 Architectural Design
    25.3 FPGA Implementation of CORDIC Processors
        25.3.1 Convergence
        25.3.2 Folded CORDIC
        25.3.3 Parallel Linear Array
        25.3.4 Scaling Compensation
    25.4 Summary
    References

26 Hardware/Software Partitioning
    26.1 The Trend Toward Automatic Partitioning
    26.2 Partitioning of Sequential Programs
        26.2.1 Granularity
        26.2.2 Partition Evaluation
        26.2.3 Alternative Region Implementations
        26.2.4 Implementation Models
        26.2.5 Exploration
    26.3 Partitioning of Parallel Programs
        26.3.1 Differences among Parallel Programming Models
    26.4 Summary and Directions
    References

Part V: Case Studies of FPGA Applications

27 SPIHT Image Compression
    27.1 Background
    27.2 SPIHT Algorithm
        27.2.1 Wavelets and the Discrete Wavelet Transform
        27.2.2 SPIHT Coding Engine
    27.3 Design Considerations and Modifications
        27.3.1 Discrete Wavelet Transform Architectures
        27.3.2 Fixed-point Precision Analysis
        27.3.3 Fixed Order SPIHT
    27.4 Hardware Implementation
        27.4.1 Target Hardware Platform
        27.4.2 Design Overview
        27.4.3 Discrete Wavelet Transform Phase
        27.4.4 Maximum Magnitude Phase
        27.4.5 The SPIHT Coding Phase
    27.5 Design Results
    27.6 Summary and Future Work
    References

28 Automatic Target Recognition Systems on Reconfigurable Devices
    28.1 Automatic Target Recognition Algorithms
        28.1.1 Focus of Attention
        28.1.2 Second-level Detection
    28.2 Dynamically Reconfigurable Designs
        28.2.1 Algorithm Modifications
        28.2.2 Image Correlation Circuit
        28.2.3 Performance Analysis
        28.2.4 Template Partitioning
        28.2.5 Implementation Method
    28.3 Reconfigurable Static Design
        28.3.1 Design-specific Parameters
        28.3.2 Order of Correlation Tasks
        28.3.3 Reconfigurable Image Correlator
        28.3.4 Application-specific Computation Unit
    28.4 ATR Implementations
        28.4.1 A Dynamically Reconfigurable System
        28.4.2 A Statically Reconfigurable System
        28.4.3 Reconfigurable Computing Models
    28.5 Summary
    References

29 Boolean Satisfiability: Creating Solvers Optimized for Specific Problem Instances
    29.1 Boolean Satisfiability Basics
        29.1.1 Problem Formulation
        29.1.2 SAT Applications
    29.2 SAT-solving Algorithms
        29.2.1 Basic Backtrack Algorithm
        29.2.2 Improving the Backtrack Algorithm
    29.3 A Reconfigurable SAT Solver Generated According to an SAT Instance
        29.3.1 Problem Analysis
        29.3.2 Implementing a Basic Backtrack Algorithm with Reconfigurable Hardware
        29.3.3 Implementing an Improved Backtrack Algorithm with Reconfigurable Hardware
    29.4 A Different Approach to Reduce Compilation Time and Improve Algorithm Efficiency
        29.4.1 System Architecture
        29.4.2 Performance
        29.4.3 Implementation Issues
    29.5 Discussion
    References

30 Multi-FPGA Systems: Logic Emulation
    30.1 Background
    30.2 Uses of Logic Emulation Systems
    30.3 Types of Logic Emulation Systems
        30.3.1 Single-FPGA Emulation
        30.3.2 Multi-FPGA Emulation
        30.3.3 Design-mapping Overview
        30.3.4 Multi-FPGA Partitioning and Placement Approaches
        30.3.5 Multi-FPGA Routing Approaches
    30.4 Issues Related to Contemporary Logic Emulation
        30.4.1 In-circuit Emulation
        30.4.2 Coverification
        30.4.3 Logic Analysis
    30.5 The Need for Fast FPGA Mapping
    30.6 Case Study: The VirtuaLogic VLE Emulation System
        30.6.1 The VirtuaLogic VLE Emulation System Structure
        30.6.2 The VirtuaLogic Emulation Software Flow
        30.6.3 Multiported Memory Mapping
        30.6.4 Design Mapping with Multiple Asynchronous Clocks
        30.6.5 Incremental Compilation of Designs
        30.6.6 VLE Interfaces for Coverification
        30.6.7 Parallel FPGA Compilation for the VLE System
    30.7 Future Trends
    30.8 Summary
    References

31 The Implications of Floating Point for FPGAs
    31.1 Why Is Floating Point Difficult?
        31.1.1 General Implementation Considerations
        31.1.2 Adder Implementation
        31.1.3 Multiplier Implementation
    31.2 Floating-point Application Case Studies
        31.2.1 Matrix Multiply
        31.2.2 Dot Product
        31.2.3 Fast Fourier Transform
    31.3 Summary
    References

32 Finite Difference Time Domain: A Case Study Using FPGAs
    32.1 The FDTD Method
        32.1.1 Background
        32.1.2 The FDTD Algorithm
        32.1.3 FDTD Applications
        32.1.4 The Advantages of FDTD on an FPGA
    32.2 FDTD Hardware Design Case Study
        32.2.1 The WildStar-II Pro FPGA Computing Board
        32.2.2 Data Analysis and Fixed-point Quantization
        32.2.3 Hardware Implementation
        32.2.4 Performance Results
    32.3 Summary
    References

33 Evolvable FPGAs
    33.1 The POE Model of Bioinspired Design Methodologies
    33.2 Artificial Evolution
        33.2.1 Genetic Algorithms
    33.3 Evolvable Hardware
        33.3.1 Genome Encoding
    33.4 Evolvable Hardware: A Taxonomy
        33.4.1 Extrinsic Evolution
        33.4.2 Intrinsic Evolution
        33.4.3 Complete Evolution
        33.4.4 Open-ended Evolution
    33.5 Evolvable Hardware Digital Platforms
        33.5.1 Xilinx XC6200 Family
        33.5.2 Evolution on Commercial FPGAs
        33.5.3 Custom Evolvable FPGAs
    33.6 Conclusions and Future Directions
    References

34 Network Packet Processing in Reconfigurable Hardware
    34.1 Networking with Reconfigurable Hardware
        34.1.1 The Motivation for Building Networks with Reconfigurable Hardware
        34.1.2 Hardware and Software for Packet Processing
        34.1.3 Network Data Processing with FPGAs
        34.1.4 Network Processing System Modularity
    34.2 Network Protocol Processing
        34.2.1 Internet Protocol Wrappers
        34.2.2 TCP Wrappers
        34.2.3 Payload-processing Modules
        34.2.4 Payload Processing with Regular Expression Scanning
        34.2.5 Payload Scanning with Bloom Filters
    34.3 Intrusion Detection and Prevention
        34.3.1 Worm and Virus Protection
        34.3.2 An Integrated Header, Payload, and Queuing System
        34.3.3 Automated Worm Detection
    34.4 Semantic Processing
        34.4.1 Language Identification
        34.4.2 Semantic Processing of TCP Data
    34.5 Complete Networking System Issues
        34.5.1 The Rack-mount Chassis Form Factor
        34.5.2 Network Control and Configuration
        34.5.3 A Reconfiguration Mechanism
        34.5.4 Dynamic Hardware Plug-ins
        34.5.5 Partial Bitfile Generation
        34.5.6 Control Channel Security
    34.6 Summary
    References

35 Active Pages: Memory-centric Computation
    35.1 Active Pages
        35.1.1 DRAM Hardware Design
        35.1.2 Hardware Interface
        35.1.3 Programming Model
    35.2 Performance Results
        35.2.1 Speedup over Conventional Systems
        35.2.2 Processor–Memory Nonoverlap
        35.2.3 Summary
    35.3 Algorithmic Complexity
        35.3.1 Algorithms
        35.3.2 Array-Insert
        35.3.3 LCS (Two-dimensional Dynamic Programming)
        35.3.4 Summary
    35.4 Exploring Parallelism
        35.4.1 Speedup over Conventional
        35.4.2 Multiplexing Performance
        35.4.3 Processor Width Performance
        35.4.4 Processor Width versus Multiplexing
        35.4.5 Summary
    35.5 Defect Tolerance
    35.6 Related Work
    35.7 Summary
    References

Part VI: Theoretical Underpinnings and Future Directions

36 Theoretical Underpinnings
    36.1 General Computational Array Model
    36.2 Implications of the General Model
        36.2.1 Instruction Distribution
        36.2.2 Instruction Storage
    36.3 Induced Architectural Models
        36.3.1 Fixed Instructions (FPGA)
        36.3.2 Shared Instructions (SIMD Processors)
    36.4 Modeling Architectural Space
        36.4.1 Raw Density from Architecture
        36.4.2 Efficiency
        36.4.3 Caveats
    36.5 Implications
        36.5.1 Density of Computation versus Description
        36.5.2 Historical Appropriateness
        36.5.3 Reconfigurable Applications
    References

37 Defect and Fault Tolerance
    37.1 Defects and Faults
    37.2 Defect Tolerance
        37.2.1 Basic Idea
        37.2.2 Substitutable Resources
        37.2.3 Yield
        37.2.4 Defect Tolerance through Sparing
        37.2.5 Defect Tolerance with Matching
    37.3 Transient Fault Tolerance
        37.3.1 Feedforward Correction
        37.3.2 Rollback Error Recovery
    37.4 Lifetime Defects
        37.4.1 Detection
        37.4.2 Repair
    37.5 Configuration Upsets
    37.6 Outlook
    References

38 Reconfigurable Computing and Nanoscale Architecture
    38.1 Trends in Lithographic Scaling
    38.2 Bottom-up Technology
        38.2.1 Nanowires
        38.2.2 Nanowire Assembly
        38.2.3 Crosspoints
    38.3 Challenges
    38.4 Nanowire Circuits
        38.4.1 Wired-OR Diode Logic Array
        38.4.2 Restoration
    38.5 Statistical Assembly
    38.6 nanoPLA Architecture
        38.6.1 Basic Logic Block
        38.6.2 Interconnect Architecture
        38.6.3 Memories
        38.6.4 Defect Tolerance
        38.6.5 Design Mapping
        38.6.6 Density Benefits
    38.7 Nanoscale Design Alternatives
        38.7.1 Imprint Lithography
        38.7.2 Interfacing
        38.7.3 Restoration
    38.8 Summary
    References

Index
LIST OF CONTRIBUTORS

Rajeevan Amirtharajah, Department of Electrical and Computer Engineering, University of California–Davis, Davis, California (Chapter 24)
Vaughn Betz, Altera Corporation, San Jose, California (Chapter 14)
Robert W. Brodersen, Department of Electrical Engineering and Computer Science, University of California–Berkeley, Berkeley, California (Chapter 8)
Timothy J. Callahan, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania (Chapter 7)
Eylon Caspi, Tabula, Inc., Santa Clara, California (Chapter 9)
Chen Chang, Department of Mathematics and Department of Electrical Engineering and Computer Sciences, University of California–Berkeley, Berkeley, California (Chapter 8)
Mark L. Chang, Electrical and Computer Engineering, Franklin W. Olin College of Engineering, Needham, Massachusetts (Chapter 1)
Wang Chen, Department of Electrical and Computer Engineering, Northeastern University, Boston, Massachusetts (Chapter 32)
Young H. Cho, Open Acceleration Systems Research, Chatsworth, California (Chapter 28)
Michael Chu, DRC Computer, Sunnyvale, California (Chapter 9)
Katherine Compton, Department of Electrical and Computer Engineering, University of Wisconsin–Madison, Madison, Wisconsin (Chapters 4 and 11)
Jason Cong, Department of Computer Science, California NanoSystems Institute, University of California–Los Angeles, Los Angeles, California (Chapter 13)
George A. Constantinides, Department of Electrical and Electronic Engineering, Imperial College, London, United Kingdom (Chapter 23)
André DeHon, Department of Electrical and Systems Engineering, University of Pennsylvania, Philadelphia, Pennsylvania (Chapters 5, 6, 7, 9, 11, 36, 37, and 38)
Chris Dick, Advanced Systems Technology Group, DSP Division of Xilinx, Inc., San Jose, California (Chapter 25)
Carl Ebeling, Department of Computer Science and Engineering, University of Washington, Seattle, Washington (Chapter 17)
Ken Eguro, Department of Electrical Engineering, University of Washington, Seattle, Washington (Chapter 20)
Diana Franklin, Computer Science Department, California Polytechnic State University, San Luis Obispo, California (Chapter 35)
Thomas W. Fry, Samsung, Global Strategy Group, Seoul, South Korea (Chapter 27)
Maya B. Gokhale, Lawrence Livermore National Laboratory, Livermore, California (Chapter 10)
Steven A. Guccione, Cmpware, Inc., Austin, Texas (Chapters 3 and 19)
Scott Hauck, Department of Electrical Engineering, University of Washington, Seattle, Washington (Chapters 20 and 27)
K. Scott Hemmert, Computation, Computers, Information and Mathematics Center, Sandia National Laboratories, Albuquerque, New Mexico (Chapter 31)
Randy Huang, Tabula, Inc., Santa Clara, California (Chapter 9)
Brad L. Hutchings, Department of Electrical and Computer Engineering, Brigham Young University, Provo, Utah (Chapters 12 and 21)
Nachiket Kapre, Department of Computer Science, California Institute of Technology, Pasadena, California (Chapter 6)
Andreas Koch, Department of Computer Science, Embedded Systems and Applications Group, Technische Universität of Darmstadt, Darmstadt, Germany (Chapter 15)
Miriam Leeser, Department of Electrical and Computer Engineering, Northeastern University, Boston, Massachusetts (Chapter 32)
John W. Lockwood, Department of Computer Science and Engineering, Washington University in St. Louis, St. Louis, Missouri; and Department of Electrical Engineering, Stanford University, Stanford, California (Chapter 34)
Wayne Luk, Department of Computing, Imperial College, London, United Kingdom (Chapter 22)
Sharad Malik, Department of Electrical Engineering, Princeton University, Princeton, New Jersey (Chapter 29)
Yury Markovskiy, Department of Electrical Engineering and Computer Sciences, University of California–Berkeley, Berkeley, California (Chapter 9)
Margaret Martonosi, Department of Electrical Engineering, Princeton University, Princeton, New Jersey (Chapter 29)
Larry McMurchie, Synplicity Corporation, Sunnyvale, California (Chapter 17)
Brent E. Nelson, Department of Electrical and Computer Engineering, Brigham Young University, Provo, Utah (Chapters 12 and 21)
Peichen Pan, Magma Design Automation, Inc., San Jose, California (Chapter 13)
Oliver Pell, Department of Computing, Imperial College, London, United Kingdom (Chapter 22)
Stylianos Perissakis, Department of Electrical Engineering and Computer Sciences, University of California–Berkeley, Berkeley, California (Chapter 9)
Laura Pozzi, Faculty of Informatics, University of Lugano, Lugano, Switzerland (Chapter 9)
Brian C. Richards, Department of Electrical Engineering and Computer Sciences, University of California–Berkeley, Berkeley, California (Chapter 8)
Eduardo Sanchez, School of Computer and Communication Sciences, École Polytechnique Fédérale de Lausanne; and Reconfigurable and Embedded Digital Systems Institute, Haute École d’Ingénierie et de Gestion du Canton de Vaud, Lausanne, Switzerland (Chapter 33)
Lesley Shannon, School of Engineering Science, Simon Fraser University, Burnaby, BC, Canada (Chapter 2)
Satnam Singh, Programming Principles and Tools Group, Microsoft Research, Cambridge, United Kingdom (Chapter 16)
Greg Stitt, Department of Computer Science and Engineering, University of California–Riverside, Riverside, California (Chapter 26)
Russell Tessier, Department of Computer and Electrical Engineering, University of Massachusetts, Amherst, Massachusetts (Chapter 30)
Keith D. Underwood, Computation, Computers, Information and Mathematics Center, Sandia National Laboratories, Albuquerque, New Mexico (Chapter 31)
Andres Upegui, Logic Systems Laboratory, School of Computer and Communication Sciences, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland (Chapter 33)
Frank Vahid, Department of Computer Science and Engineering, University of California–Riverside, Riverside, California (Chapter 26)
John Wawrzynek, Department of Electrical Engineering and Computer Sciences, University of California–Berkeley, Berkeley, California (Chapters 8 and 9)
Nicholas Weaver, International Computer Science Institute, Berkeley, California (Chapter 18)
Joseph Yeh, Lincoln Laboratory, Massachusetts Institute of Technology, Lexington, Massachusetts (Chapter 9)
Peixin Zhong, Department of Electrical and Computer Engineering, Michigan State University, East Lansing, Michigan (Chapter 29)
PREFACE

In the two decades since field-programmable gate arrays (FPGAs) were introduced, they have radically changed the way digital logic is designed and deployed. By marrying the high performance of application-specific integrated circuits (ASICs) and the flexibility of microprocessors, FPGAs have made possible entirely new types of applications. This has helped FPGAs supplant both ASICs and digital signal processors (DSPs) in some traditional roles. To make the most of this unique combination of performance and flexibility, designers need to be aware of both hardware and software issues. Thus, an FPGA user must think not only about the gates needed to perform a computation but also about the software flow that supports the design process. The goal of this book is to help designers become comfortable with these issues, and thus be able to exploit the vast opportunities possible with reconfigurable logic.

We have written Reconfigurable Computing as a tutorial and as a reference on the wide range of concepts that designers must understand to make the best use of FPGAs and related reconfigurable chips—including FPGA architectures, FPGA logic applications, and FPGA CAD tools—and the skills they must have for optimizing a computation. It is targeted particularly toward those who view FPGAs not just as cheap, slow ASIC gates or as a means of prototyping before the “real” hardware is created, but are interested in evaluating or embracing the substantial advantages reprogrammable devices offer over other technologies. However, readers who focus primarily on ASIC- or CPU-based implementations will learn how FPGAs can be a useful addition to their normal skill set. For some traditional designers this book may even serve as an entry point into a completely new way of handling their design problems.

Because we focus on both hardware and software systems, we expect readers to have a certain level of familiarity with each technology. On the hardware side, we assume that readers have a basic knowledge of digital logic design, including understanding concepts such as gates (including multiplexers, flip-flops, and RAM), binary number systems, and simple logic optimization. Knowledge of hardware description languages, such as Verilog or VHDL, is also helpful. We also assume that readers have basic knowledge of computer programming, including simple data structures and algorithms. In sum, this book is appropriate for most readers with a background in electrical engineering, computer science, or computer engineering. It can also be used as a text in an upper-level undergraduate or introductory graduate course within any of these disciplines.

No one book can hope to cover every possible aspect of FPGAs exhaustively. Entire books could be (and have been) written about each of the concepts that are discussed in the individual chapters here. Our goal is to provide a good working knowledge of these concepts, as well as abundant references for those who wish to dig deeper.
Reconfigurable Computing: The Theory and Practice of FPGA-Based Computation is divided into six major parts: hardware, programming, compilation/mapping, application development, case studies, and future trends. Once the introduction has been read, the parts can be covered in any order. Alternatively, readers can pick and choose which parts they wish to cover. For example, a reader who wants to focus on CAD for FPGAs might skip hardware and application development, while a reader who is interested mostly in the use of FPGAs might focus primarily on application development. Part V is made up of self-contained overviews of specific, important applications, which can be covered in any order or can be sprinkled throughout a course syllabus. The part introduction lists the chapters and concepts relevant to each case study and so can be used as a guide for the reader or instructor in selecting relevant examples.
One final consideration is an explanation of how this book was written. Some books are created by a single author or a set of coauthors who must stretch to cover all aspects of a given topic. Alternatively, an edited text can bring together contributors from each of the topic areas, typically by bundling together standalone research papers. Our book is a bit of a hybrid. It was constructed from an overall outline developed by the primary authors, Scott Hauck and André DeHon. The chapters on the chosen topics were then written by noted experts in these areas, and were carefully edited to ensure their integration into a cohesive whole. Our hope is that this brings the benefits of both styles of traditional texts, with the reader learning from the main experts on each topic, yet still delivering a well-integrated text.
Acknowledgments
While Scott and André handled the technical editing, this book also benefited from the careful help of the team at Elsevier/Morgan Kaufmann. Wayne Wolf first proposed the concept of this book to us. Chuck Glaser, ably assisted by Michele Cronin and Matthew Cater, was instrumental in resurrecting the project after it had languished in the concept stage for several years and in pushing it through to completion. Just as important were the efforts of the production group at Elsevier/Morgan Kaufmann who did an excellent job of copyediting, proofreading, integrating text and graphics, laying out, and all the hundreds of little details crucial to bringing a book together into a polished whole. This was especially true for a book like this, with such a large list of contributors. Specifically, Marilyn E. Rash helped drive the whole production process and was supported by Dianne Wood, Jodie Allen, and Steve Rath. Without their help there is no way this monumental task ever would have been finished. A big thank you to all.
Scott Hauck
André DeHon
INTRODUCTION In the computer and electronics world, we are used to two different ways of performing computation: hardware and software. Computer hardware, such as application-specific integrated circuits (ASICs), provides highly optimized resources for quickly performing critical tasks, but it is permanently configured to only one application via a multimillion-dollar design and fabrication effort. Computer software provides the flexibility to change applications and perform a huge number of different tasks, but is orders of magnitude worse than ASIC implementations in terms of performance, silicon area efficiency, and power usage. Field-programmable gate arrays (FPGAs) are truly revolutionary devices that blend the benefits of both hardware and software. They implement circuits just like hardware, providing huge power, area, and performance benefits over software, yet can be reprogrammed cheaply and easily to implement a wide range of tasks. Just like computer hardware, FPGAs implement computations spatially, simultaneously computing millions of operations in resources distributed across a silicon chip. Such systems can be hundreds of times faster than microprocessor-based designs. However, unlike in ASICs, these computations are programmed into the chip, not permanently frozen by the manufacturing process. This means that an FPGA-based system can be programmed and reprogrammed many times. Sometimes reprogramming is merely a bug fix to correct faulty behavior, or it is used to add a new feature. Other times, it may be carried out to reconfigure a generic computation engine for a new task, or even to reconfigure a device during operation to allow a single piece of silicon to simultaneously do the work of numerous special-purpose chips. However, merging the benefits of both hardware and software does come at a price. 
FPGAs provide nearly all of the benefits of software flexibility and development models, and nearly all of the benefits of hardware efficiency—but not quite. Compared to a microprocessor, these devices are typically several orders of magnitude faster and more power efficient, but creating efficient programs for them is more complex. Typically, FPGAs are useful only for operations that process large streams of data, such as signal processing, networking, and the like. Compared to ASICs, they may be 5 to 25 times worse in terms of area, delay, and performance. However, while an ASIC design may take months to years to develop and have a multimillion-dollar price tag, an FPGA design might only take days to create and cost tens to hundreds of dollars. For systems that do not require the absolute highest achievable performance or power efficiency, an FPGA’s development simplicity and the ability to easily fix bugs and upgrade functionality make them a compelling design alternative. For many tasks, and particularly for beginning electronics designers, FPGAs are the ideal choice.
FIGURE I.1 An abstract view of an FPGA; logic cells are embedded in a general routing structure.
Figure I.1 illustrates the internal workings of a field-programmable gate array, which is made up of logic blocks embedded in a general routing structure. This array of logic gates is the G and A in FPGA. The logic blocks contain processing elements for performing simple combinational logic, as well as flip-flops for implementing sequential logic. Because the logic units are often just simple memories, any Boolean combinational function of perhaps five or six inputs can be implemented in each logic block. The general routing structure allows arbitrary wiring, so the logical elements can be connected in the desired manner. Because of this generality and flexibility, an FPGA can implement very complex circuits. Current devices can compute functions on the order of millions of basic gates, running at speeds in the hundreds of Megahertz. To boost speed and capacity, additional, special elements can be embedded into the array, such as large memories, multipliers, fast-carry logic for arithmetic and logic functions, and even complete microprocessors. With these predefined, fixed-logic units, which are fabricated into the silicon, FPGAs are capable of implementing complete systems in a single programmable device. The logic and routing elements in an FPGA are controlled by programming points, which may be based on antifuse, Flash, or SRAM technology. For reconfigurable computing, SRAM-based FPGAs are the preferred option, and in fact are the primary style of FPGA devices in the electronics industry as a whole. In these devices, every routing choice and every logic function is controlled by a simple memory bit. With all of its memory bits programmed, by way of a configuration file or bitstream, an FPGA can be configured to implement the user’s desired function. Thus, the configuration can be carried out quickly and
without permanent fabrication steps, allowing customization at the user’s electronics bench, or even in the final end product. This is why FPGAs are field programmable, and why they differ from mask-programmable devices, which have their functionality fixed by masks during fabrication. Because customizing an FPGA merely involves storing values to memory locations, similarly to compiling and then loading a program onto a computer, the creation of an FPGA-based circuit is a simple process of creating a bitstream to load into the device (see Figure I.2). Although there are tools to do this from software languages, schematics, and other formats, FPGA designers typically start with an application written in a hardware description language (HDL) such as Verilog or VHDL. This abstract design is optimized to fit into the FPGA’s available logic through a series of steps: Logic synthesis converts high-level logic constructs and behavioral code into logic gates, followed by technology mapping to separate the gates into groupings that best match the FPGA’s logic resources. Next, placement assigns the logic groupings to specific logic blocks and routing determines the interconnect resources that will carry the user’s signals. Finally, bitstream generation creates a binary file that sets all of the FPGA’s programming points to configure the logic blocks and routing resources appropriately. After a design has been compiled, we can program the FPGA to perform a specified computation simply by loading the bitstream into it. Typically either a host microprocessor/microcontroller downloads the bitstream to the device, or an EPROM programmed with the bitstream is connected to the FPGA’s configuration port. Either way, the appropriate bitstream must be loaded every time the FPGA is powered up, as well as any time the user wants to change the circuitry when it is running. Once the FPGA is configured, it operates as a custom piece of digital logic. 
Because of the FPGA’s dual nature, combining the flexibility of software with the performance of hardware, an FPGA designer must think differently from designers who use other devices. Software developers typically write sequential programs that exploit a microprocessor’s ability to rapidly step through a series of instructions. In contrast, a high-quality FPGA design requires thinking about spatial parallelism, that is, simultaneously using multiple resources spread across a chip to yield a huge amount of computation. Hardware designers have an advantage because they already think in terms of hardware implementations; even so, the flexibility of FPGAs gives them new opportunities generally not available in ASICs and other fixed devices. Field-programmable gate array designs can be rapidly developed and deployed, and even reprogrammed in the field with new functionality. Thus, they do not demand the huge design teams and validation efforts required for ASICs. Also, the ability to change the configuration, even when the device is running, yields new opportunities, such as computations that optimize themselves to specific demands on a second-by-second basis, or even time multiplexing a very large design onto a much smaller FPGA. However, because FPGAs are noticeably slower and have lower capacity than ASICs, designers must carefully optimize their design to the target device.
FIGURE I.2 A typical FPGA mapping flow: Source Code, Logic Synthesis, Technology Mapping, Placement, Routing, Bitstream Generation, Bitstream.
FPGAs are a very flexible medium, with unique opportunities and challenges. The goal of Reconfigurable Computing: The Theory and Practice of FPGA-Based Computation is to introduce all facets of FPGA-based systems, both positive and problematic. It is organized into six major parts:
Part I introduces the hardware devices, covering both generic FPGAs and those specifically optimized for reconfigurable computing (Chapters 1 through 4).
Part II focuses on programming reconfigurable computing systems, considering both their programming languages and programming models (Chapters 5 through 12).
Part III focuses on the software mapping flow for FPGAs, including each of the basic CAD steps of Figure I.2 (Chapters 13 through 20).
Part IV is devoted to application design, covering ways to make the most efficient use of FPGA logic (Chapters 21 through 26). This part can be viewed as a finishing school for FPGA designers because it highlights ways in which application development on an FPGA is different from both software programming and ASIC design.
Part V is a set of case studies that show complete applications of reconfigurable logic (Chapters 27 through 35).
Part VI contains more advanced topics, such as theoretical models and metrics for reconfigurable computing, as well as defect and fault tolerance and the possible synergies between reconfigurable computing and nanotechnology (Chapters 36 through 38).
As the 38 chapters that follow will show, the challenges that FPGAs present are significant. However, the effort entailed in surmounting them is far outweighed by the unique opportunities these devices offer to the field of computing technology.
PART I
RECONFIGURABLE COMPUTING HARDWARE
At a fundamental level, reconfigurable computing is the process of best exploiting the potential of reconfigurable hardware. Although a complete system must include compilation software and high-performance applications, the best place to begin to understand reconfigurable computing is at the chip level, as it is the abilities and limitations of chips that crucially influence all of a system’s steps. However, the reverse is true as well—reconfigurable devices are designed primarily as a target for the applications that will be developed, and a chip that does not efficiently support important applications, or that cannot be effectively targeted by automatic design mapping flows, will not be successful. Reconfigurable computing has been driven largely by the development of commodity field-programmable gate arrays (FPGAs). Standard FPGAs are somewhat of a mixed blessing for this field. On the one hand, they represent a source of commodity parts, offering cheap and fast programmable silicon on some of the most advanced fabrication processes available anywhere. On the other hand, they are not optimized for reconfigurable computing for the simple reason that the vast majority of FPGA customers use them as cheap, low-quality application-specific integrated circuits (ASICs) with rapid time to market. Thus, these devices are never quite what the reconfigurable computing user might want, but they are close enough. Chapter 1 covers commercial FPGA architectures in depth, providing an overview of the underlying technology for virtually all generally available reconfigurable computing systems. Because FPGAs are not optimized toward reconfigurable computing, there have been many attempts to build better silicon devices for this community. Chapter 2 details many of them. 
The focus of the new architectures might be the inclusion of larger functional blocks to speed up important computations, tight connectivity to a host processor to set up a coprocessing model, fast reconfiguration features to reduce the time to change configurations, or other concepts. However, as of now, no such system is commercially viable, largely because
The demand for reconfigurable computing chips is much smaller than that for the FPGA community as a whole, reducing economies of scale.
FPGA manufacturers have access to cutting-edge fabrication processes, while reconfigurable computing chips typically are one to two process generations behind.
For these reasons, a reconfigurable computing chip is at a significant cost, performance, and electrical power-consumption disadvantage compared to a commodity FPGA. Thus, the architectural advantages of a reconfigurable computing-specific device must be huge to make up for the problems of reduced economies of scale and fabrication process lag. It seems likely that eventually a company with a reconfigurable computing-specific chip will be successful; however, so far there appear to have been only failures.
Although programmable chips are important, most reconfigurable computing users need more. A real system generally requires large memories, input/output (I/O) ports to hook to various data streams, microprocessors or microprocessor interfaces to coordinate operation, and mechanisms for configuring and reconfiguring the device. Chapter 3 considers such complete systems, chronicling the development of reconfigurable computing boards.
Chapters 1 through 3 present a good overview of most reconfigurable systems hardware, but one topic requires special consideration: the reconfiguration subsystems within devices. In the first FPGAs, configuration data was loaded slowly and sequentially, configuring the entire chip for a given computation. For glue logic and ASIC replacement, this was sufficient because FPGAs needed to be configured only once, at power-up; however, in many situations the device may need to be reconfigured more often. In the extreme, a single computation might be broken into multiple configurations, with the FPGA loading new configurations during the normal execution of that circuit. In this case, the speed of reconfiguration is important. Chapter 4 focuses on the configuration memory subsystems within an FPGA, considering the challenges of fast reconfiguration and showing some ways to greatly improve reconfiguration speed.
CHAPTER 1
DEVICE ARCHITECTURE
Mark L. Chang
Electrical and Computer Engineering
Franklin W. Olin College of Engineering
The best race car drivers understand how their cars work. The best architects know how carpenters, bricklayers, and electricians do their jobs. And the best programmers know how the hardware they are programming does computation. Knowing how your device works, “down to the metal,” is essential for efficient utilization of available resources. In this chapter, we take a look inside the package to discover the basic hardware elements that make up a typical field-programmable gate array (FPGA). We’ll talk about how computation happens in an FPGA—from the blocks that do the computation to the interconnect that shuttles data from one place to another. We’ll talk about how these building blocks fit together in terms of FPGA architecture. And, of course, because programmability (as well as reprogrammability) is part of what makes an FPGA so useful, we’ll spend some time on that, too. Finally, we’ll take an in-depth look at the architectures of some commercially available FPGAs in Section 1.5, Case Studies. We won’t be covering many of the research architectures from universities and industry—we’ll save that for later. We also won’t be talking much about how you successfully program these things to make them useful parts of a computational platform. That, too, is later in the book. What you will learn is what’s “under the hood” of a typical commercial FPGA so that you will become more comfortable using it as a platform for solving problems and performing computations. The first step in our journey starts with how computation in an FPGA is done.
1.1
LOGIC—THE COMPUTATIONAL FABRIC Think of your typical desktop computer. Inside the case, among other things, are storage and communication devices (hard drives and network cards), memory, and, of course, the central processing unit, or CPU, where most of the computation happens. The FPGA plays a similar role in a reconfigurable computing platform, but we’re going to break it down. In very general terms, there are only two types of resources in an FPGA: logic and interconnect. Logic is where we do things like arithmetic, 1+1=2, and logical functions, if (ready) x=1 else x=0. Interconnect is how we get data (like the
results of the previous computations) from one node of computation to another. Let’s focus on logic first.
1.1.1 Logic Elements
From your digital logic and computer architecture background, you know that any computation can be represented as a Boolean equation (and in some cases as a Boolean equation where inputs are dependent on past results; don’t worry, FPGAs can hold state, too). In turn, any Boolean equation can be expressed as a truth table. From these humble beginnings, we can build complex structures that can do arithmetic, such as adders and multipliers, as well as decision-making structures that can evaluate conditional statements, such as the classic if-then-else. Combining these, we can describe elaborate algorithms simply by using truth tables. From this basic observation of digital logic, we see the truth table as the computational heart of the FPGA. More specifically, one hardware element that can easily implement a truth table is the lookup table, or LUT. From a circuit implementation perspective, an N-input LUT can be formed simply from a 2^N:1 (2^N-to-one) multiplexer and a 2^N-bit memory: the N inputs drive the multiplexer's select lines, and each memory bit holds the output for one row of the truth table. From the perspective of our previous discussion, a LUT simply enumerates a truth table. Therefore, using LUTs gives an FPGA the generality to implement arbitrary digital logic. Figure 1.1 shows a typical N-input lookup table that we might find in today’s FPGAs. In fact, almost all commercial FPGAs have settled on the LUT as their basic building block. The LUT can compute any function of N inputs by simply programming the lookup table with the truth table of the function we want to implement. As shown in the figure, if we wanted to implement a 3-input exclusive-or (XOR) function with our 3-input LUT (often referred to as a 3-LUT), we would assign values to the lookup table memory such that the pattern of select bits chooses the correct row’s “answer.” Thus, every “row” would yield a result of 0 except in the four cases where the XOR of the three select lines yields 1.
FIGURE 1.1 A 3-LUT schematic (a) and the corresponding 3-LUT symbol and truth table (b) for a logical XOR. The lookup table memory holds 0 1 1 0 1 0 0 1 for select patterns 000 through 111.
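To make the idea of a LUT "enumerating a truth table" concrete, here is a minimal software sketch. The function names make_lut and lut_read are our own illustrative choices, and treating the first input as the most significant select bit is an assumption, not something fixed by the hardware.

```python
from itertools import product


def make_lut(func, n_inputs):
    """Build the LUT memory by evaluating func on all 2**n_inputs patterns.

    Patterns are enumerated from 00...0 to 11...1, with the first
    argument of func acting as the most significant bit (an assumption).
    """
    return [func(*bits) & 1 for bits in product((0, 1), repeat=n_inputs)]


def lut_read(memory, *inputs):
    """Model the multiplexer: the inputs select one stored memory bit."""
    address = 0
    for bit in inputs:
        address = (address << 1) | (bit & 1)
    return memory[address]


# Program a 3-LUT with the truth table of a 3-input XOR.
xor3_memory = make_lut(lambda a, b, c: a ^ b ^ c, 3)
print(xor3_memory)                     # [0, 1, 1, 0, 1, 0, 0, 1]
print(lut_read(xor3_memory, 1, 0, 1))  # 0
```

Changing the function passed to make_lut changes only the stored bits; lut_read, like the multiplexer it stands in for, stays the same. That is the sense in which the LUT is programmable.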
Of course, more complicated functions, and functions of a larger number of inputs, can be implemented by aggregating several lookup tables together. For example, one can organize a single 3-LUT into an 8 × 1 ROM, and if the values of the lookup table are reprogrammable, an 8 × 1 RAM. But the basic building block, the lookup table, remains the same. Although the LUT has more or less been chosen as the smallest computational unit in commercially available FPGAs, the size of the lookup table in each logic block has been widely investigated [1]. On the one hand, larger lookup tables would allow for more complex logic to be performed per logic block, thus reducing the wiring delay between blocks as fewer blocks would be needed. However, the penalty paid would be slower LUTs, because of the requirement of larger multiplexers, and an increased chance of waste if not all of the functionality of the larger LUTs were to be used. On the other hand, smaller lookup tables may require a design to consume a larger number of logic blocks, thus increasing wiring delay between blocks while reducing per-logic-block delay. Current empirical studies have shown that the 4-LUT structure makes the best trade-off between area and delay for a wide range of benchmark circuits. Of course, as FPGA computing evolves into wider arenas, this result may need to be revisited. In fact, as of this writing, Xilinx has released the Virtex-5 SRAM-based FPGA with a 6-LUT architecture. The question of the number of LUTs per logic block has also been investigated [2], with empirical evidence suggesting that grouping more than one 4-LUT into a single logic block may improve area and delay. Many current commercial FPGAs incorporate a number of 4-LUTs into each logic block to take advantage of this observation. Investigations into both LUT size and number of LUTs per block begin to address the larger question of computational granularity in an FPGA.
On one end of the spectrum, the rather simple structure of a small lookup table (e.g., 2-LUT) represents fine-grained computational capability. Toward the other end, coarse-grained, one can envision larger computational blocks, such as full 8-bit arithmetic logic units (ALUs), more typical of CPUs. As in the case of lookup table sizing, finer-grained blocks may be more adept at bit-level manipulations and arithmetic, but require combining several to implement larger pieces of logic. Contrast that with coarser-grained blocks, which may be better suited to datapath-oriented computations that work with standard “word” sizes (8/16/32 bits) but are wasteful when implementing very simple logical operations. Current industry practice has been to strike a balance in granularity by using rather fine-grained 4-LUT architectures and augmenting them with coarser-grained heterogeneous elements, such as multipliers, as described in the Extended Logic Elements section later in this chapter. Now that we have chosen the logic block, we must ask ourselves if this is sufficient to implement all of the functionality we want in our FPGA. Indeed, it is not. With just LUTs, there is no way for an FPGA to maintain any sense of state, and therefore we are prohibited from implementing any form of sequential, or state-holding, logic. To remedy this situation, we will add a simple single-bit storage element in our base logic block in the form of a D flip-flop.
FIGURE 1.2 A simple lookup table logic block. (Labels in the figure: 4-input LUT; flip-flop D, Q, and CLK.)
Now our logic block looks something like Figure 1.2. The output multiplexer selects a result either from the function generated by the lookup table or from the stored bit in the D flip-flop. In reality, this logic block bears a very close resemblance to those in some commercial FPGAs.
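A small behavioral sketch can help fix the picture of this logic block in mind. The class below is our own illustration, not a model of any vendor's block: the name LogicBlock, the input ordering, and the choice that the flip-flop captures the LUT output on each clock are all assumptions.

```python
class LogicBlock:
    """Behavioral sketch: a 4-LUT, a D flip-flop, and an output multiplexer."""

    def __init__(self, lut_memory, use_flip_flop, ff_init=0):
        if len(lut_memory) != 16:
            raise ValueError("a 4-LUT needs exactly 16 memory bits")
        self.lut_memory = list(lut_memory)
        self.use_flip_flop = use_flip_flop  # output multiplexer select
        self.q = ff_init                    # current flip-flop state

    def lut_out(self, a, b, c, d):
        """Combinational LUT output; a is the most significant select bit."""
        return self.lut_memory[(a << 3) | (b << 2) | (c << 1) | d]

    def output(self, a, b, c, d):
        """Output multiplexer: stored bit or direct LUT result."""
        return self.q if self.use_flip_flop else self.lut_out(a, b, c, d)

    def clock(self, a, b, c, d):
        """Rising clock edge: the flip-flop captures the LUT output."""
        self.q = self.lut_out(a, b, c, d)


# A toggle circuit: the LUT computes NOT d, and d is fed back from Q.
toggle = LogicBlock([1 - (i & 1) for i in range(16)], use_flip_flop=True)
states = []
for _ in range(4):
    states.append(toggle.output(0, 0, 0, toggle.q))
    toggle.clock(0, 0, 0, toggle.q)
print(states)  # [0, 1, 0, 1]
```

With use_flip_flop set to False, the same block behaves as pure combinational logic, which is the other setting of the output multiplexer.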
1.1.2 Programmability
Looking at our logic block in Figure 1.2, it is a simple task to identify all the programmable points. These include the contents of the 4-LUT, the select signal for the output multiplexer, and the initial state of the D flip-flop. Most current commercial FPGAs use volatile static-RAM (SRAM) bits connected to configuration points to configure the FPGA. Thus, simply writing a value to each configuration bit sets the configuration of the entire FPGA. In our logic block, the 4-LUT would be made up of 16 SRAM bits, one per output; the multiplexer would use a single SRAM bit; and the D flip-flop initialization value could also be held in a single SRAM bit. How these SRAM bits are initialized in the context of the rest of the FPGA will be the subject of later sections.
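The bit count above (16 + 1 + 1 = 18 configuration bits per logic block) can be sketched as a tiny pack-and-unpack routine. The bit ordering used here, LUT bits first, then the multiplexer select, then the flip-flop initial value, is purely an assumption for illustration; real bitstream formats are device-specific, and pack_config and unpack_config are hypothetical names.

```python
CONFIG_BITS = 16 + 1 + 1  # LUT contents + mux select + flip-flop init


def pack_config(lut_bits, mux_select, ff_init):
    """Pack one logic block's configuration into an 18-bit integer."""
    if len(lut_bits) != 16:
        raise ValueError("a 4-LUT needs exactly 16 memory bits")
    word = 0
    for bit in list(lut_bits) + [mux_select, ff_init]:
        word = (word << 1) | (bit & 1)
    return word


def unpack_config(word):
    """Recover (lut_bits, mux_select, ff_init) from an 18-bit integer."""
    bits = [(word >> shift) & 1 for shift in range(CONFIG_BITS - 1, -1, -1)]
    return bits[:16], bits[16], bits[17]


# A 4-input AND, registered output, flip-flop initialized to 0.
and4 = [0] * 15 + [1]
word = pack_config(and4, mux_select=1, ff_init=0)
print(format(word, "018b"))  # 000000000000000110
print(unpack_config(word) == (and4, 1, 0))  # True
```

Writing such a word into the block's SRAM cells is all that "programming" the block amounts to in this simplified view; routing configuration bits are not modeled here.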
1.2 THE ARRAY AND INTERCONNECT
With the LUT and D flip-flop, we begin to define what is commonly known as the logic block, or function block, of an FPGA. Now that we have an understanding of how computation is performed in an FPGA at the single logic block level, we turn our focus to how these computation blocks can be tiled and connected together to form the fabric that is our FPGA. Current popular FPGAs implement what is often called island-style architecture. As shown in Figure 1.3, this design has logic blocks tiled in a two-dimensional array and interconnected in some fashion. The logic blocks form the islands and “float” in a sea of interconnect. With this array architecture, computations are performed spatially in the fabric of the FPGA. Large computations are broken into 4-LUT-sized pieces and mapped into physical logic blocks in the array. The interconnect is configured to route signals between logic blocks appropriately. With enough logic blocks, we can make our FPGAs perform any kind of computation we desire.
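As a hedged illustration of breaking a larger computation into LUT-sized pieces, the sketch below builds a ripple-carry adder from two small truth tables per bit: one for the sum and one for the carry. Each table uses only three inputs, so it would fit in a 4-LUT with one input unused. The names SUM_LUT, CARRY_LUT, lut3, and ripple_add are our own, and the Python loop stands in for the carry wires that the interconnect would provide.

```python
# Truth tables for a 1-bit full adder, indexed by (a, b, carry_in).
SUM_LUT = [a ^ b ^ c
           for a in (0, 1) for b in (0, 1) for c in (0, 1)]
CARRY_LUT = [(a & b) | (a & c) | (b & c)
             for a in (0, 1) for b in (0, 1) for c in (0, 1)]


def lut3(memory, a, b, c):
    """Read one bit from an 8-entry LUT memory; a is the most significant bit."""
    return memory[(a << 2) | (b << 1) | c]


def ripple_add(x, y, width):
    """Add two width-bit numbers using two LUT lookups per bit position."""
    carry = 0
    result = 0
    for i in range(width):
        a = (x >> i) & 1
        b = (y >> i) & 1
        result |= lut3(SUM_LUT, a, b, carry) << i
        carry = lut3(CARRY_LUT, a, b, carry)  # routed to the next bit position
    return result, carry


print(ripple_add(11, 6, 4))  # (1, 1): 11 + 6 = 17 = carry-out 1, sum 0001
```

In this sketch a 4-bit adder occupies eight LUTs. In hardware all of them exist at once, side by side, which is the spatial style of computation the text describes.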
FIGURE 1.3 The island-style FPGA architecture, showing logic blocks surrounded by interconnect. The interconnect shown here is not representative of structures actually used.
1.2.1 Interconnect Structures
Figure 1.3 does not tell the whole story. The interconnect structure shown is not representative of any structures used in actual FPGAs, but is more of a cartoon placeholder. This section introduces the interconnect structures present in many of today’s FPGAs, first by considering a small area of interconnection and then expanding out to understand the need for different styles of interconnect. We start with the simplest case of nearest-neighbor communication. Nearest neighbor Nearest-neighbor communication is as simple as it sounds. Looking at a 2 × 2 array of logic blocks in Figure 1.4, one can see that the only needs in this neighborhood are input and output connections in each direction: north, south, east, and west. This allows each logic block to communicate directly with each of its immediate neighbors. Figure 1.4 is an example of one of the simplest routing architectures possible. While it may seem nearly degenerate, it has been used in some (now obsolete) commercial FPGAs. Of course, although this is a simple solution, this structure suffers from severe delay and connectivity issues. Imagine, instead of a 2 × 2 array, a 1024 × 1024 array. With only nearest-neighbor connectivity, the delay scales linearly with distance because the signal must go through many cells (and many switches) to reach its final destination. From a connectivity standpoint, without the ability to bypass logic blocks in the routing structure, all routes that are more than a single hop away require
FIGURE 1.4 Nearest-neighbor connectivity.
traversing a logic block. With only one bidirectional pair in each direction, this limits the number of logic block signals that may cross. Signals that are passing through must not overlap signals that are being actively consumed and produced. Because of these limitations, the nearest-neighbor structure is rarely used exclusively, but it is almost always available in current FPGAs, often augmented with some of the techniques that follow. Segmented As we add complexity, we begin to move away from the pure logic block architecture that we’ve developed thus far. Most current FPGA architectures look less like Figure 1.3 and more like Figure 1.5. In Figure 1.5 we introduce the connection block and the switch box. Here the routing structure is more generic and meshlike. The logic block accesses nearby communication resources through the connection block, which connects logic block input and output terminals to routing resources through programmable switches, or multiplexers. The connection block (detailed in Figure 1.6) allows logic block inputs and outputs to be assigned to arbitrary horizontal and vertical tracks, increasing routing flexibility. The switch block appears where horizontal and vertical routing tracks converge as shown in Figure 1.7. In the most general sense, it is simply a matrix of programmable switches that allow a signal on a track to connect to another track. Depending on the design of the switch block, this connection could be, for example, to turn the corner in either direction or to continue straight. The design of switch blocks is an entire area of research by itself and has produced many varied designs that exhibit varying degrees of connectivity and efficiency [3–5]. A detailed discussion of this research is beyond the scope of this book. With this slightly modified architecture, the concept of a segmented interconnect becomes more clear. 
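To make the notion of a switch block as "a matrix of programmable switches" more tangible, here is a small sketch. It assumes one particular, simple topology, often called a disjoint or subset switch block, in which track i on one side can connect only to track i on the other three sides. The text does not say that Figure 1.7 uses this topology, and the SwitchBlock class and its method names are hypothetical.

```python
from itertools import combinations

SIDES = ("north", "east", "south", "west")


class SwitchBlock:
    """Sketch of a disjoint switch block (an assumed topology)."""

    def __init__(self, tracks_per_side):
        self.tracks = tracks_per_side
        # One configuration bit per possible connection, all initially off.
        self.switches = {
            frozenset({(s1, t), (s2, t)}): 0
            for t in range(tracks_per_side)
            for s1, s2 in combinations(SIDES, 2)
        }

    def connect(self, side_a, side_b, track):
        """Turn on the switch joining the same track on two different sides."""
        key = frozenset({(side_a, track), (side_b, track)})
        if key not in self.switches:
            raise ValueError("no programmable switch between these endpoints")
        self.switches[key] = 1

    def is_connected(self, side_a, side_b, track):
        key = frozenset({(side_a, track), (side_b, track)})
        return self.switches.get(key, 0) == 1


sb = SwitchBlock(tracks_per_side=4)
sb.connect("west", "east", 2)    # continue straight on track 2
sb.connect("west", "north", 0)   # turn the corner on track 0
print(len(sb.switches))                      # 24 configuration bits
print(sb.is_connected("east", "west", 2))    # True
print(sb.is_connected("west", "south", 2))   # False
```

With four tracks per side this topology needs 6 × 4 = 24 switches. Richer topologies offer more routing flexibility at the cost of more switches and configuration bits, which is the trade-off the switch block literature explores.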
Nearest-neighbor routing can still be accomplished, albeit through a pair of connect blocks and a switch block. However, for
FIGURE 1.5 An island-style architecture with connect blocks (CB) and switch boxes to support more complex routing structures. (The difference in relative sizes of the blocks is for visual differentiation.)
signals that need to travel longer distances, individual segments can be switched together in a switch block to connect distant logic blocks together. Think of it as a way to emulate long signal paths that can span arbitrary distances. The result is a long wire that actually comprises shorter “segments.” This interconnect architecture alone does not radically improve on the delay characteristics of the nearest-neighbor interconnect structure. However, the introduction of connection blocks and switch boxes separates the interconnect from the logic, allowing long-distance routing to be accomplished without consuming logic block resources. To improve on our structure, we introduce longer-length wires. For instance, consider a wire that spans one logic block as being of length-1 (L1). In some segmented routing architectures, longer wires may be present to allow signals to travel greater distances more efficiently. These segments may be
FIGURE 1.6 Detail of a connection block.
FIGURE 1.7 An example of a common switch block architecture.
length-4 (L4), length-8 (L8), and so on. The switch blocks (and perhaps more embedded switches) become points where signals can switch from shorter to longer segments. This feature allows signal delay to be less than O(N) when covering a distance of N logic blocks by reducing the number of intermediate switches in the signal path. Figure 1.8 illustrates augmenting the single-segment interconnect with two additional lengths: direct-connect between logic blocks and length-2 (L2) lines. The direct-connect lines leave general routing resources free for other uses, and L2 lines allow signals to travel longer distances for roughly the same amount of switch delay. This interconnect architecture closely matches that of the Xilinx XC4000 series of commercial FPGAs.
Hierarchical
A slightly different approach to reducing the delay of long wires is to arrange the interconnect hierarchically. Consider the structure in Figure 1.9. At the lowest level of hierarchy, 2 × 2 arrays of logic blocks are grouped together as a single cluster.
FIGURE 1.8 Local (direct) connections and L2 connections augmenting a switched interconnect.
FIGURE 1.9 Hierarchical routing used by long wires to connect clusters of logic blocks.
Within this block, local, nearest-neighbor routing is all that is available. In turn, a 2 × 2 cluster of these clusters is formed that encompasses 16 logic blocks. At this level of hierarchy, longer wires at the boundary of the smaller 2 × 2 clusters connect each cluster of four logic blocks to the other clusters in the higher-level grouping. This is repeated in higher levels of hierarchy, with larger clusters and longer wires. The pattern of interconnect just described exploits the assumption that a well-designed (and well-placed) circuit has mostly local connections and only a limited number of connections that need to travel long distances. By providing fewer resources at the higher levels of hierarchy, this interconnect architecture remains area-efficient while preserving some long-length wires to minimize the delay of signals that need to cross large distances. As in the segmented architecture, the connection points that connect one level of routing hierarchy to another can be anywhere in the interconnect structure. New points in the existing switch blocks may be created, or completely independent
switching sites elsewhere in the interconnect can be created specifically for the purpose of moving between hierarchy levels.
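To make the benefit of longer segments concrete, the following minimal Python sketch counts how many segments (and therefore, roughly, how many switch points) a signal uses to cover a distance of N logic blocks. It rests on several simplifying assumptions of ours: delay is dominated by switch traversals, every listed segment length is available wherever it is needed, each length evenly divides the next longer one, and the router greedily takes the longest segment that fits. The name `switch_hops` is hypothetical and chosen only for illustration.

```python
def switch_hops(distance, segment_lengths):
    """Count the segments used to span `distance` logic blocks.

    Greedy longest-first selection; this is optimal only under the
    assumption that each segment length divides the next longer one.
    """
    hops = 0
    remaining = distance
    for length in sorted(segment_lengths, reverse=True):
        hops += remaining // length   # use as many of this length as fit
        remaining %= length           # cover the rest with shorter wires
    if remaining:
        raise ValueError("segment lengths cannot cover the distance exactly")
    return hops


# Compare three hypothetical routing fabrics over the same distances.
for n in (8, 16, 64):
    only_l1 = switch_hops(n, [1])
    segmented = switch_hops(n, [1, 4, 8])
    doubling = switch_hops(n, [1, 2, 4, 8, 16, 32, 64])
    print(n, only_l1, segmented, doubling)
```

Under these assumptions, a fabric with only L1 segments needs 64 hops to cross 64 blocks, adding L4 and L8 segments reduces that to 8, and a hierarchy of doubling lengths needs a single hop. A fixed set of segment lengths improves the constant factor, whereas lengths that keep doubling are what give the logarithmic behavior.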
1.2.2 Programmability
As with the logic blocks in a typical commercial FPGA, each switch point in the interconnect structure is programmable. Within the connection block, programmable multiplexers select which routing track each logic block’s input and output terminals map to; in the switch block, the junction between vertical and horizontal routing tracks is switched through a programmable switch; and, finally, switching between routing tracks of different segment lengths or hierarchy levels is accomplished, again through programmable switches. For all of these programmable points, as in the logic block, modern FPGAs use SRAM bits to hold the user-defined configuration values. More discussion of these configuration bits comes later in this chapter.
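As a rough illustration of a programmable switch point, the sketch below models a connection-block multiplexer whose select lines are driven by stored configuration bits. The binary-encoded select and the function names are assumptions made for this example; real devices may instead use one configuration bit per pass transistor or other encodings.

```python
def config_bits_needed(num_tracks):
    """Configuration bits for a binary-encoded select over `num_tracks` tracks."""
    return max(1, (num_tracks - 1).bit_length())


def connection_mux(track_values, config_bits):
    """Return the value on the track chosen by the configuration bits.

    `config_bits` is a list of 0/1 values, most significant bit first,
    standing in for the SRAM cells that hold the user's configuration.
    """
    index = 0
    for bit in config_bits:
        index = (index << 1) | bit
    return track_values[index]


# Four routing tracks currently carrying these logic values.
tracks = [0, 1, 1, 0]
# Configuration bits "10" connect the logic block input to track 2.
selected = connection_mux(tracks, [1, 0])
print(config_bits_needed(len(tracks)), selected)
```

The signal values on the tracks change every clock cycle, while the configuration bits are written once when the device is programmed and then held constant.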
1.2.3 Summary
Programmable routing resources are the natural counterpart to the logic resources in an FPGA. Where the logic performs the arithmetic and logical computations, the interconnection fabric takes the results output from logic blocks and routes them as inputs to other logic blocks. By tiling logic blocks together and connecting them through a series of programmable interconnects as described here, an FPGA can implement complex digital circuits. The true nature of spatial computing is realized by spreading the computation across the physical area of an FPGA. Today’s commercial FPGAs typically use bits of each of these interconnect architectures to provide a smooth and flexible set of routing resources. In actual implementation, segmentation and hierarchy may not always exhibit the logarithmic scaling seen in our examples. In modern FPGAs, the silicon area consumed by interconnect greatly dominates the area dedicated to logic. Anecdotally, 90 percent of the available silicon is interconnect whereas only 10 percent is logic. With this imbalance, it is clear that interconnect architecture is increasingly important, especially from a delay perspective.
1.3 EXTENDING LOGIC
With a logic block like the one shown in Figure 1.2, tiled in a two-dimensional array with a supporting interconnect structure, we can implement any combinational and sequential logic. Our only constraint is area in terms of the number of available logic blocks. While this is comprehensive, it is far from optimal. In this section, we investigate how FPGA architects have augmented this simple design to increase performance.
1.3.1 Extended Logic Elements
Modern FPGA interconnect architectures have matured to include much more than simple nearest-neighbor connectivity to give increased performance for
common applications. Likewise, the basic logic elements have been augmented to increase performance for common operations such as arithmetic functions and data storage.
Fast carry chain
One fundamental operation that the FPGA is likely to perform is an addition. From the basic logic block, it is apparent that we can implement a full-adder structure with two logic blocks given at least a 3-LUT. One logic block is configured to compute the sum, and one is configured to compute the carry. Cascading N pairs of logic blocks together will yield a simple N-bit full adder. As you may already know from digital arithmetic, the critical path of this type of addition comes not from the computation of the sum bits but rather from the rippling of the carry signal from lower-order bits to higher-order bits (see Figure 1.10). This path starts with the low-order primary inputs, goes through the logic block, out into the interconnect, into the adjacent logic block, and so on. Delay is accumulated at every switch point along the way. One clever way to increase speed is to shortcut the carry chain between adjacent logic blocks. We can accomplish this by providing a dedicated, minimally switched path from the output of the logic block computing the carry signal to the adjacent higher-order logic block pair. This carry chain will not need to be routed on the general interconnect network. By adding a minimal amount of overhead (wires), we dramatically speed up the addition operation. This feature does force some constraints on the spatial layout of a multibit addition. If, for instance, the dedicated fast carry chain only goes vertically, along columns of logic blocks, all additions must be oriented along the carry chain to take advantage of this dedicated resource.
Additionally, to save switching area, the dedicated carry chain may not be a bidirectional path, which further restricts the physical layout to be oriented vertically and dictates the order of the bits relative to one another. The fast carry-chain of the Xilinx XC4000E is shown in Figure 1.11. Note that the bidirectional fast carry-chain wires are arranged along the columns while the horizontal lines are unidirectional. This allows large adder structures to be placed in a zig-zag pattern in the array and still make use of the dedicated carry-chain interconnect.
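The two-LUT-per-bit adder described above can be sketched in a few lines of Python. This is a behavioral model under our own assumptions (3-input LUTs stored as 8-entry truth tables, unsigned operands), not a description of any particular device; the helper names are hypothetical.

```python
def make_lut3(func):
    """Build the 8-entry truth table that would be loaded into a 3-LUT."""
    return [func((i >> 2) & 1, (i >> 1) & 1, i & 1) for i in range(8)]


def lut3(table, a, b, c):
    """Look up the output of a configured 3-LUT."""
    return table[(a << 2) | (b << 1) | c]


# One logic block computes the sum bit, the other the carry-out bit.
SUM_LUT = make_lut3(lambda a, b, cin: a ^ b ^ cin)
CARRY_LUT = make_lut3(lambda a, b, cin: (a & b) | (cin & (a ^ b)))


def ripple_add(a, b, width, cin=0):
    """Add two unsigned `width`-bit values by cascading LUT pairs."""
    carry = cin
    total = 0
    for i in range(width):            # bit i must wait for the carry from bit i-1
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        total |= lut3(SUM_LUT, ai, bi, carry) << i
        carry = lut3(CARRY_LUT, ai, bi, carry)
    return total, carry


print(ripple_add(5, 3, 4))
print(ripple_add(15, 1, 4))
```

The sequential dependence of `carry` inside the loop is the software analogue of the critical path: each bit position cannot finish until the carry from the previous position arrives, which is why a dedicated fast path for that one signal pays off.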
FIGURE 1.10 A simple 4-bit full adder.
FIGURE 1.11 The Xilinx XC4000E fast carry chain. (Source: Adapted from [6], Figure 11, p. 6-18.)
The fast carry-chain logic is now commonplace in commercial FPGAs, with the physical design constraints at this point completely abstracted away by the tools provided by manufacturers. The success of this optimization relies on the toolset’s ability to identify additions in the designer’s circuit description and then use the dedicated logic. With today’s tools, this kind of optimization is nearly transparent to the end user.
Multipliers
If addition is commonplace in algorithms, multiplication is certainly not rare. Several implementations are available if we wish to use general logic block resources to build our multipliers. From the area-efficient iterative shift-accumulate method to the area-consumptive array multiplier, we can use logic blocks to either compute additions or store intermediate values. While we can certainly implement a multiplication, we can do so only with a large delay penalty, or a large logic block footprint, depending on our implementation. In essence, our logic blocks aren’t very efficient at performing a multiplication. Instead of doing it with logic blocks, why not build real multipliers outside, but still connected to, the general FPGA fabric? Then, instead of inefficiently using simple LUTs to implement a multiply, we can route the values that need to be multiplied to actual multipliers implemented in silicon. How does this save space and time? Recall that FPGAs trade speed and power for configurability when compared to their ASIC (application-specific integrated circuit) counterparts. If you asked a VLSI designer to implement a fast multiplier out of transistors
any way she wanted, it would take up far less silicon area, be much faster, and consume less power than we could ever manage using LUTs. The result is that, for a small price in silicon area, we can offload the otherwise area-prohibitive multiplication onto dedicated hardware that does it much better. Of course, just like fast carry chains, multipliers impose important design considerations and physical constraints, but we add one more option for computation to our palette of operations. It is now just a matter of good design and good tools to make an efficient design. Like fast carry chains, multipliers are commonplace in modern FPGAs.
RAM
Another area that has seen some customization beyond the general FPGA fabric is in the area of on-chip data storage. While logic blocks can individually provide a few bits of storage via the lookup table structure (and, in aggregate, many bits), they are far from an efficient use of FPGA resources. Like the fast carry chain and the “hard” multiplier, FPGA architectures have given their users generous amounts of on-chip RAM that can be accessed from the general FPGA fabric. Static RAM cells are extremely small and, when physically distributed throughout the FPGA, can be very useful for many algorithms. By grouping many static RAM cells into banks of memory, designers can implement large ROMs for extremely fast lookup table computations and constant-coefficient operations, and large RAMs for buffering, queuing, and basic scratch use, all with the convenience of a simple clocking strategy and the speed gained by avoiding off-chip communication to an external memory. Today’s FPGAs provide anywhere from kilobits to megabits of dedicated RAM.
Processor blocks
Tying all these blocks together, most commercial FPGAs now offer entire dedicated processors in the FPGA, sometimes even more than one.
In a general sense, FPGAs are extremely efficient at implementing raw computational pipelines, exploiting nonstandard bit widths, and providing data and functional parallelism. The inclusion of dedicated CPUs recognizes the fact that algorithm flows that are very procedural and contain a high degree of branching do not lend themselves readily to acceleration using FPGAs. Entire CPU blocks can now be found in high-end FPGA devices. At the time of this writing, these CPUs are on the scale of 300 MHz PowerPC devices, complete, without floating-point units. They are capable of running an entire embedded operating system, and some are even able to reprogram the FPGA fabric around them. The CPU cores are not nearly as easily exploited as the carry chains, multipliers, and on-chip RAMs, but they represent a distinct shift toward making FPGAs more “platform”-oriented. With a traditional CPU on board (and perhaps up to four), a single FPGA can serve nearly as an entire “system-on-a-chip”—the holy grail of system integrators and embedded device manufacturers. With standard programming languages and toolchains available to developers, an entire project might indeed be implemented with a single-chip solution, dramatically reducing cost and time to market.
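Returning to the multiplier trade-off described earlier in this section, the iterative shift-accumulate method can be sketched as follows. This is a behavioral model under our own assumptions (unsigned operands, one partial product added per cycle, a single adder reused every cycle); the function name and the cycle counter are hypothetical and serve only to expose the time cost.

```python
def shift_accumulate_multiply(a, b, width):
    """Multiply unsigned `a` by the low `width` bits of unsigned `b`.

    One adder is reused once per bit of `b`, so the area is small but
    the result takes `width` cycles to produce.
    """
    accumulator = 0
    cycles = 0
    for i in range(width):
        if (b >> i) & 1:
            accumulator += a << i    # add the shifted multiplicand
        cycles += 1                  # every bit of b costs one cycle
    return accumulator, cycles


product, cycles = shift_accumulate_multiply(13, 11, 4)
print(product, cycles)
```

An array multiplier would instead instantiate every one of these additions in parallel, trading the `width` cycles for roughly `width` adders of logic. A dedicated multiplier block avoids both penalties.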
1.3.2 Summary
In the end, modern commercially available FPGAs provide a rich variety of basic, and not so basic, computational building blocks. With much more than simple lookup tables, the task for the FPGA architect is to decide in what proportion to provide these resources and how they should be connected. The task of the hardware designer is then to fully understand the capabilities of the target FPGAs to create designs that exploit their potential. The common thread among these extended logical elements is that they provide critical functionality that cannot be implemented very efficiently in the general FPGA fabric. As much as the technology drives FPGA architectures, applications provide a much needed push. If multiplies were rare, it wouldn’t make sense to waste silicon space on a “hard” multiplier. As FPGAs become more heterogeneous in nature, and become useful computational platforms in new application domains, we can expect to see even more varied blocks in the next generation of devices.
1.4 CONFIGURATION
One of the defining features of an FPGA is its ability to act as “blank hardware” for the end user. Providing more performance than pure software implementations on general-purpose processors, and more flexibility than a fixed-function ASIC solution, relies on the FPGA being a reconfigurable device. In this section, we will discuss the different approaches and technologies used to provide programmability in an FPGA. Each configurable element in an FPGA requires 1 bit of storage to maintain a user-defined configuration. For a simple LUT-based FPGA, these programmable locations generally include the contents of the logic block and the connectivity of the routing fabric. Configuration of the FPGA is accomplished through programming the storage bits connected to these programmable locations according to user definitions. For the lookup tables, this translates into filling them with 1s and 0s. For the routing fabric, programming enables and disables switches along wiring paths. The configuration can be thought of as a flat binary file whose contents map, bit for bit, to the programmable bits in the FPGA. This bitstream is generated by the vendor-specific tools after a hardware design is finalized. While its exact format is generally not publicly known, the larger the FPGA, the larger the bitstream becomes. Of course, there are many known methods for storing a single bit of binary information. We discuss the most popular methods used for FPGAs next.
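The idea of a flat bitstream that maps bit for bit onto programmable locations can be illustrated with a toy model. The layout below (16 bits of 4-LUT contents followed by one bit per routing switch) is entirely hypothetical; as noted above, real bitstream formats are vendor-specific and generally not public.

```python
LUT_INPUTS = 4
LUT_BITS = 1 << LUT_INPUTS   # a 4-LUT stores 16 truth-table bits


def parse_toy_bitstream(bitstream):
    """Split a flat list of 0/1 values into LUT contents and switch states.

    Hypothetical layout: the first 16 bits fill one 4-LUT, and every
    remaining bit enables (1) or disables (0) one routing switch.
    """
    if len(bitstream) < LUT_BITS:
        raise ValueError("bitstream too short to fill the LUT")
    lut_table = list(bitstream[:LUT_BITS])
    switch_states = [bool(bit) for bit in bitstream[LUT_BITS:]]
    return lut_table, switch_states


def evaluate_lut(lut_table, a, b, c, d):
    """Read the configured LUT: the inputs simply address the stored bits."""
    return lut_table[(a << 3) | (b << 2) | (c << 1) | d]


# Configure the LUT as a 4-input AND and set three routing switches.
and4_contents = [0] * 15 + [1]
bitstream = and4_contents + [1, 0, 1]
lut_table, switches = parse_toy_bitstream(bitstream)
print(evaluate_lut(lut_table, 1, 1, 1, 1), switches)
```

Changing the function from AND to, say, XOR requires no change to the evaluation logic, only different stored bits, which is the essence of configurability.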
1.4.1 SRAM
As discussed in previous sections, the most widely used method for storing configuration information in commercially available FPGAs is volatile static RAM, or SRAM. This method has been made popular because it provides fast and infinite reconfiguration in a well-known technology.
Drawbacks to SRAM come in the form of power consumption and data volatility. Compared to the other technologies described in this section, the SRAM cell is large (6–12 transistors) and dissipates significant static power because of leakage current. Another significant drawback is that SRAM does not maintain its contents without power, which means that at power-up the FPGA is not configured and must be programmed using off-chip logic and storage. This can be accomplished with a nonvolatile memory store to hold the configuration and a micro-controller to perform the programming procedure. While this may seem to be a trivial task, it adds to the component count and complexity of a design and prevents the SRAM-based FPGA from being a truly single-chip solution.
1.4.2 Flash Memory
Although less popular than SRAM, several families of devices use Flash memory to hold configuration information. Flash memory is different from SRAM in that it is nonvolatile and can only be written a finite number of times. The nonvolatility of Flash memory means that the data written to it remains when power is removed. In contrast with SRAM-based FPGAs, the FPGA remains configured with user-defined logic even through power cycles and does not require extra storage or hardware to program at boot-up. In essence, a Flash-based FPGA can be ready immediately. A Flash memory cell can also be made with fewer transistors compared to an SRAM cell. This design can yield lower static power consumption as there are fewer transistors to contribute to leakage current. Drawbacks to using Flash memory to store FPGA configuration information stem from the techniques necessary to write to it. As mentioned, Flash memory has a limited write cycle lifetime and often has slower write speeds than SRAM. The number of write cycles varies by technology, but is typically hundreds of thousands to millions. Additionally, most Flash write techniques require higher voltages compared to normal circuits; they require additional off-chip circuitry or structures such as charge pumps on-chip to be able to perform a Flash write.
1.4.3 Antifuse
A third approach to achieving programmability is antifuse technology. Antifuse, as its name suggests, is a metal-based link that behaves the opposite of a fuse. The antifuse link is normally open (i.e., unconnected). A programming procedure that involves either a high-current programmer or a laser melts the link to form an electrical connection across it, in essence creating a wire or a short circuit between the antifuse endpoints. Antifuse has several advantages and one clear disadvantage, which is that it is not reprogrammable. Once a link is fused, it has undergone a physical transformation that cannot be reversed. FPGAs based on this technology are generally considered one-time programmable (OTP). This severely limits their flexibility in terms of reconfigurable computing and nearly eliminates this technology for use in prototyping environments.
However, there are some distinct advantages to using antifuse in an FPGA platform. First, the antifuse link can be made very small, compared to the large multi-transistor SRAM cell, and does not require any transistors. This results in very low propagation delays across links and zero static power consumption, as there is no longer any transistor leakage current. Antifuse links are also not susceptible to high-energy radiation particles that induce errors known as single-event upsets, making them more likely candidates for space and military applications.
1.4.4 Summary
There are several well-known methods for storing user-defined configuration data in an FPGA. We have reviewed the three most common in this section. Each has its strengths and weaknesses, and all can be found in current commercial FPGA products. Regardless of the technology used to store or convey configuration data, the idea remains the same. From vendor-specific tools, a device-specific programming bitstream is created and used either to program an SRAM or Flash memory, or to describe the pattern of antifuse links to be used. In the end, the user-defined configuration is reflected in the FPGA, bringing to reality part of the vision of reconfigurable computing.
1.5 CASE STUDIES
If you’ve read everything thus far, the FPGA should no longer seem like a magical computational black box. In fact, you should have a good grasp of the components that make up modern commercial FPGAs and how they are put together. In this section, we’ll take it one step further and solidify the abstractions by taking a look at two real commercial architectures, the Altera Stratix and the Xilinx Virtex-II Pro, and linking the ideas introduced earlier in this chapter with concrete industry implementations. Although these devices represent near-current technologies, having been introduced in 2002, they are not the latest generation of devices from their respective manufacturers. The reason for choosing them over more cutting-edge examples is in part due to the level of documentation available at the time of this writing. As is often the case, detailed architecture information is not available as soon as a product is released and may never be available depending on the manufacturer. Finally, the devices discussed here are much more complex than we have space to describe. The myriad ways modern devices can be used to perform computation and the countless hardware and software features that allow you to create powerful and efficient designs are all part of a larger, more advanced dialog. So if something seems particularly interesting, we encourage you to grab a copy of the device handbook(s) and dig a little deeper.
FIGURE 1.12 Altera Stratix block diagram, showing the M512 RAM blocks, DSP blocks, M4K RAM blocks, and M-RAM block. (Source: Adapted from [7], Chapter 2, p. 2-2.)
1.5.1 Altera Stratix
We begin by taking a look at the Altera Stratix FPGA. Much of the information presented here is adapted from the July 2005 edition of the Altera Stratix Device Handbook (available online at http://www.altera.com). The Stratix is an SRAM-based island-style FPGA containing many heterogeneous computational elements. The basic logical tile is the logic array block (LAB), which consists of 10 logic elements (LEs). The LABs are tiled across the device in rows and columns with a multilevel interconnect bringing together logic, memory, and other resources. Memory is provided through TriMatrix memory structures, which consist of three memory block sizes (M512, M4K, and M-RAM), each with its own unique properties. Additional computational resources are provided in DSP blocks, which can efficiently perform multiplication and accumulation. These resources are shown in a high-level block diagram in Figure 1.12.
Logic architecture
The smallest logical block in the array is the LE, shown in Figure 1.13. The general architecture of the LE is very similar to the structure that we introduced earlier: a single 4-LUT function generator and a programmable register as a state-holding element. In the Altera LE, you can see additional components to facilitate driving the interconnect (right side of Figure 1.13), setting and clearing the programmable register, choosing from several programmable clocks, and propagating the carry chain.
FIGURE 1.13 Simplified Altera Stratix logic element. (Source: Adapted from [7], Chapter 2, p. 2-5.)
Because the LEs are simple structures that may appear tens of thousands of times in a single device, Altera groups them into LABs. The LAB is then the basic structure that is tiled into an array and connected via the routing structure. Each LAB consists of 10 LEs, all LE carry chains, LAB-wide control signals, and several local interconnection lines. In the largest device, the EP1S80, there are 101 LAB rows and 91 LAB columns, yielding a total of 79,040 LEs. This is fewer than would be expected given the number of rows and columns because of the presence of the TriMatrix memory structures and DSP blocks embedded in the array. As shown in Figure 1.14, the LAB structure is dominated, at least conceptually, by interconnect. The local interconnect allows LEs in the same LAB to send signals to one another without using the general interconnect. Neighboring LABs, RAM blocks, and DSP blocks can also drive the local interconnect through direct links. Finally, the general interconnect (both horizontal and vertical channels) can drive the local interconnect. This high degree of connectivity is the lowest level of a rich, multilevel routing fabric. The Stratix has three types of memory blocks—M512, M4K, and M-RAM— collectively dubbed TriMatrix memory. The largest distinction between these blocks is their size and number in a given device. Generally speaking, they can be configured in a number of ways, including single-port RAM, dual-port RAM, shift-register, FIFO, and ROM table. These memories can optionally include parity bits and have registered inputs and outputs. The M512 RAM block is nominally organized as a 32 × 18-bit memory; the M4K RAM as a 128 × 36-bit memory; and the M-RAM as a 4K × 144-bit memory. Additionally, each block can be configured for a variety of widths depending on the needs of the user. The different-sized memories throughout the array provide
FIGURE 1.14 Simplified Altera Stratix LAB structure. (Source: Adapted from [8], Chapter 2, p. 2-4.)
an efficient mapping of variable-sized memory designs to the device. In total, on the EP1S80 there are over 7 million memory bits available for use, divided into 767 M512 blocks, 364 M4K blocks, and 9 M-RAM blocks. The final element of logic present in the Altera Stratix is the DSP block. Each device has two columns of DSP blocks that are designed to help implement DSP-type functions, such as finite-impulse response (FIR) and infinite-impulse response (IIR) filters and fast Fourier transforms (FFT), without using the general logic resources of the LEs. The common computational function required in these operations is often a multiplication and an accumulation. Each DSP block can be configured by the user to support a single 36 × 36-bit multiplication, four 18 × 18-bit multiplications, or eight 9 × 9-bit multiplications, in addition to an optional accumulation phase. In the EP1S80, there are 22 total DSP blocks.
Routing architecture
The Altera Stratix provides an interconnect system dubbed MultiTrack that connects all the elements just discussed using routing lines of varying fixed lengths. Along the row (horizontal) dimension, the routing resources include direct connections left and right between blocks (LABs, RAMs, and DSP) and interconnects of lengths 4, 8, and 24 that traverse either 4, 8, or 24 blocks left and right, respectively. A detailed depiction of an R4 interconnect at a single
FIGURE 1.15 Simplified Altera Stratix MultiTrack interconnect. (Source: Adapted from [7], Chapter 2, p. 2-14.)
LAB is shown in Figure 1.15. The R4 interconnect shown spans 4 blocks, left to right. The relative sizing of blocks in the Stratix allows the R4 interconnect to span four LABs; three LABs and one M512 RAM; two LABs and one M4K RAM; or two LABs and one DSP block, in either direction. This structure is repeated for every LAB in the row (i.e., every LAB has its own set of dedicated R4 interconnects driving left and right). R4 interconnects can drive C4 and C16 interconnects to propagate signals vertically to different rows. They can also drive R24 interconnects to efficiently travel long distances. The R8 interconnects are identical to the R4 interconnects except that they span 8 blocks instead of 4 and only connect to R8 and C8 interconnects. By design, the R8 interconnect is faster than two R4 interconnects joined together. The R24 interconnect provides the fastest long-distance interconnection. It is similar to the R4 and R8 interconnects, but does not connect directly to the LAB local interconnects. Instead, it is connected to row and column interconnects at every fourth LAB and only communicates to LAB local interconnects through R4 and C4 routes. R24 interconnections connect with all interconnection routes except L8s.
In the column (vertical) dimension, the resources are very similar. They include LUT chain and register chain direct connections and interconnects of lengths 4, 8, and 16 that traverse 4, 8, or 16 blocks up and down, respectively. The LAB local interconnects found in row routing resources are mirrored through LUT chain and register chain interconnects. The LUT chain connects the combinatorial output of one LE to the fast input of the LE directly below it without consuming general routing resources. The register chain connects the register output of one LE to the register input of another LE to implement fast shift registers. Finally, although this discussion was LAB-centric, all blocks connect to the MultiTrack row and column interconnect using a direct connection similar to the LAB local connection interfaces. These direct connection blocks also support fast direct communication to neighboring LABs.
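The EP1S80 resource figures quoted in this section can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers stated in the text, with one assumption of ours: that "4K" in the M-RAM organization means 4,096 words. The variable names are hypothetical.

```python
# LAB grid versus the actual logic element count quoted for the EP1S80.
LAB_ROWS, LAB_COLS, LES_PER_LAB = 101, 91, 10
TOTAL_LES = 79_040

grid_positions = LAB_ROWS * LAB_COLS           # sites if every one held a LAB
full_grid_les = grid_positions * LES_PER_LAB   # LEs such a full grid would give
labs_present = TOTAL_LES // LES_PER_LAB        # LABs implied by the LE count

# TriMatrix memory: (number of blocks, bits per block including parity).
MEMORY_BLOCKS = {
    "M512": (767, 32 * 18),
    "M4K": (364, 128 * 36),
    "M-RAM": (9, 4096 * 144),   # assumes 4K means 4,096 words
}
total_memory_bits = sum(count * bits for count, bits in MEMORY_BLOCKS.values())

# DSP blocks: each supports four 18 x 18 or eight 9 x 9 multiplications.
DSP_BLOCKS = 22
max_18x18_multipliers = DSP_BLOCKS * 4
max_9x9_multipliers = DSP_BLOCKS * 8

print(grid_positions, labs_present, full_grid_les - TOTAL_LES)
print(total_memory_bits)
print(max_18x18_multipliers, max_9x9_multipliers)
```

The difference between the 9,191 grid positions and the 7,904 LABs implied by the LE count is consistent with the statement that memory and DSP blocks occupy part of the array, and the memory total agrees with the "over 7 million" bits cited above.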
1.5.2 Xilinx Virtex-II Pro
Launched and shipped right behind the Altera Stratix, the Xilinx Virtex-II Pro FPGA was the flagship product of Xilinx, Inc. for much of 2002 and 2003. A good deal of the information that is presented here is adapted from “Module 2 (Functional Description)” of the October 2005 edition of Xilinx Virtex-II Pro™ and Virtex-II Pro X™ Platform FPGA Handbook (available at http://www.xilinx.com). The Virtex-II Pro is an SRAM-based island-style FPGA with several heterogeneous computational elements interconnected through a complex routing matrix. The basic logic tile is the configurable logic block (CLB), consisting of four slices and two 3-state buffers. These CLBs are tiled across the device in rows and columns with a segmented, hierarchical interconnect tying all the resources together. Dedicated memory blocks, SelectRAM+, are spread throughout the device. Additional computational resources are provided in dedicated 18 × 18-bit multiplier blocks.
Logic architecture
The smallest piece of logic from the perspective of the interconnect structure is the CLB. Shown in Figure 1.16, it consists of four equivalent slices organized into two columns of two slices each with independent carry chains and a common shift chain. Each slice connects to the general routing fabric through a configurable switch matrix and to each other in the CLB through a fast local interconnect. Each slice comprises primarily two 4-LUT function generators, two programmable registers for state holding, and fast carry logic. The slice also contains extra multiplexers (MUXFx and MUXF5) to allow a single slice to be configured for wide logic functions of up to eight inputs. A handful of other gates provide extra functionality in the slice, including an XOR gate to complete a 2-bit full adder in a single slice, an AND gate to improve multiplier implementations in the logic fabric, and an OR gate to facilitate implementation of sum-of-products chains.
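To see how a 2:1 multiplexer such as MUXF5 can widen the logic a slice implements, consider the standard Shannon expansion of a 5-input function into two 4-input halves. The sketch below is a simplified behavioral model under our own assumptions, not the actual Virtex-II Pro slice schematic; the function names are hypothetical.

```python
from itertools import product


def split_into_lut4s(func5):
    """Shannon-expand a 5-input function on its last input.

    Returns two 16-entry truth tables: one for e = 0 and one for e = 1.
    """
    lut_e0 = [func5(a, b, c, d, 0) for a, b, c, d in product((0, 1), repeat=4)]
    lut_e1 = [func5(a, b, c, d, 1) for a, b, c, d in product((0, 1), repeat=4)]
    return lut_e0, lut_e1


def slice_eval(lut_e0, lut_e1, a, b, c, d, e):
    """Evaluate both 4-LUTs, then let the fifth input pick one result."""
    index = (a << 3) | (b << 2) | (c << 1) | d
    f0 = lut_e0[index]
    f1 = lut_e1[index]
    return f1 if e else f0   # the role played by the 2:1 multiplexer


def parity5(a, b, c, d, e):
    """Example 5-input function: odd parity."""
    return a ^ b ^ c ^ d ^ e


lut_e0, lut_e1 = split_into_lut4s(parity5)
print(slice_eval(lut_e0, lut_e1, 1, 0, 1, 1, 0))
```

Repeating the same trick with further multiplexers across slices is one way to reach still wider functions, at the cost of one extra multiplexer delay per added input.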
FIGURE 1.16 Xilinx Virtex-II Pro configurable CLB. (Source: Adapted from [8], Figure 32, p. 35.)
In the largest Virtex-II Pro device, the XC2VP100, there are 120 rows and 94 columns of CLBs. This translates into 44,096 individual slices and 88,192 4-LUTs, comparable to the largest Stratix device.
In addition to these general configurable logic resources, the Virtex-II Pro provides dedicated RAM in the form of block SelectRAM+. Organized into multiple columns throughout the device, each block SelectRAM+ provides 18 Kb of independently clocked, true dual-port synchronous RAM. It supports a variety of configurations, including single- and dual-port access in various aspect ratios. In the largest device there are 444 blocks of block SelectRAM+ organized into 16 columns, yielding a total of 8,183,808 bits of memory.
Complementing the general logic resources are a number of 18 × 18-bit 2’s complement signed multiplier blocks. Like the DSP blocks in the Altera Stratix, these multiplier structures are designed for DSP-type operations, including FIR, IIR, FFT, and others, which often require multiply-accumulate structures. As shown in Figure 1.17, each 18 × 18 multiplier block is closely associated with an 18 Kb block SelectRAM+. The use of the multiplier/block SelectRAM+ memory, with an accumulator implemented in LUTs, allows the implementation of efficient multiply-accumulate structures. Again, in the largest device, just as with block SelectRAM+, there are 16 columns yielding a total of 444 18 × 18-bit multiplier blocks.
Finally, the Virtex-II Pro has one unique feature that has been carried into newer products and can also be found in competing Altera products. Embedded
FIGURE 1.17 Virtex-II Pro multiplier/block SelectRAM+ organization. (Source: Adapted from [8], Figure 53, p. 48.)
in the silicon of the FPGA, much like the multiplier and block SelectRAM+ structures, are up to four IBM PowerPC 405-D5 CPU cores. These cores can operate at 300+ MHz and communicate with surrounding CLB fabric, block SelectRAM+, and general interconnect through dedicated interface logic. On-chip memory (OCM) controllers allow the PowerPC core to use block SelectRAM+ as small instruction and data memories if no off-chip memories are available. The presence of a complete, standard microprocessor that has the ability to interface at a very low level with general FPGA resources allows unique, system-on-a-chip designs to be implemented with only a single FPGA device. For example, the CPU core can execute housekeeping tasks that are neither time-critical nor well suited to implementation in LUTs.
Routing architecture
The Xilinx Virtex-II Pro provides a segmented, hierarchical routing structure that connects to the heterogeneous fabric of elements through a switch matrix block. The routing resources (dubbed Active Interconnect) are physically located in horizontal and vertical routing channels between each switch matrix and look quite different from the Altera Stratix interconnect structures. The routing resources available between any two adjacent switch matrix rows or columns are shown in Figure 1.18, with the switch matrix block shown in black. These resources include, from top to bottom, the following:
■ 24 long lines that span the full height and width of the device.
■ 120 hex lines that route to every third or sixth block away in all four directions.
■ 40 double lines that route to every first or second block away in all four directions.
■ 16 direct connect routes that route to all immediate neighbors.
■ 8 fast-connect lines in each CLB that connect LUT inputs and outputs.
FIGURE 1.18 Xilinx Virtex-II Pro routing resources. (Source: Adapted from [7], Figure 54, p. 45.)
1.6
SUMMARY
This chapter presented the basic inner workings of FPGAs. We introduced the basic idea of lookup table computation, explained the need for dedicated computational blocks, and described common interconnection strategies. We learned how these devices maintain generality and programmability while providing performance through dedicated hardware blocks. We investigated a number of ways to program and maintain user-defined configuration information. Finally, we tied it all together with brief overviews of two popular commercial architectures, the Altera Stratix and the Xilinx Virtex-II Pro. Now that we have introduced the basic technology that serves as the foundation of reconfigurable computing, we will begin to build on the FPGA to create
reconfigurable devices and systems. The following chapters will discuss how to efficiently conceptualize computations spatially rather than procedurally, and the algorithms necessary to go from a user-specified design to configuration data. Finally, we’ll look into some application domains that have successfully exploited the power of reconfigurable computing.
References
[1] J. Rose, A. El Gamal, A. Sangiovanni-Vincentelli. Architecture of field-programmable gate arrays. Proceedings of the IEEE 81(7), July 1993.
[2] P. Chow, et al. The design of an SRAM-based field-programmable gate array—Part 1: Architecture. IEEE Transactions on VLSI Systems 7(2), June 1999.
[3] H. Fan, J. Liu, Y. L. Wu, C. C. Cheung. On optimum switch box designs for 2-D FPGAs. Proceedings of the 38th ACM/SIGDA Design Automation Conference (DAC), June 2001.
[4] H. Fan, J. Liu, Y. L. Wu, C. C. Cheung. On optimal hyperuniversal and rearrangeable switch box designs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 22(12), December 2003.
[5] H. Schmidt, V. Chandra. FPGA switch block layout and evaluation. IEEE International Symposium on Field-Programmable Gate Arrays, February 2002.
[6] Xilinx, Inc. Xilinx XC4000E and XC4000X Series Field-Programmable Gate Arrays, Product Specification (Version 1.6), May 1999.
[7] Altera Corp. Altera Stratix™ Device Handbook, July 2005.
[8] Xilinx, Inc. Xilinx Virtex-II Pro™ and Virtex-II Pro X™ Platform FPGA Handbook, October 2005.
CHAPTER 2
RECONFIGURABLE COMPUTING ARCHITECTURES
Lesley Shannon
School of Engineering Science
Simon Fraser University
There has been considerable research into possible reconfigurable computing architectures. Alternatives range from systems constructed using standard off-the-shelf field-programmable gate arrays (FPGAs) to systems constructed using custom-designed chips. Standard FPGAs benefit from the economies of scale; however, custom chips promise a higher speed and density for custom-computing tasks. This chapter explores different design choices made for reconfigurable computing architectures and how these choices affect both operation and performance. Questions we will discuss include:
■ Should the reconfigurable fabric be instantiated as a separate coprocessor or integrated as a functional unit (see Instruction augmentation subsection of Section 5.2.2)?
■ What is the appropriate granularity (Chapter 36) for the reconfigurable fabric?
Computing applications generally consist of both control flow and dataflow. General-purpose processors have been designed with a control plane and a data plane to support these two requirements. All reconfigurable computers have a reconfigurable fabric component that is used to implement at least a portion of the dataflow component of an application. In this discussion, the reconfigurable fabric in its entirety will be referred to as the reconfigurable processing fabric, or RPF. The RPF may be statically or dynamically reconfigurable, where a static RPF is only configured between application runs and a dynamic RPF may be updated during an application’s execution. In general, the reconfigurable fabric is relatively symmetrical and can be broken down into similar tiles or cells that have the same functionality. These blocks will be referred to as processing elements, or PEs. Ideally, the RPF is used to implement computationally intensive kernels in an application that will achieve significant performance improvement from the pipelining and parallelism available in the RPF. The kernels are called virtual instruction configurations, or VICs, and we will discuss possible RPF architectures for implementing them in the following section.
2.1
RECONFIGURABLE PROCESSING FABRIC ARCHITECTURES
One of the defining characteristics of a reconfigurable computing architecture is the type of reconfigurable fabric used in the RPF. Different systems have quite different granularities. They range from fine-grained fabrics that manipulate data at the bit level similarly to commercial FPGA fabrics, to coarse-grained fabrics that manipulate groups of bits via complex functional units such as ALUs (arithmetic logic units) and multipliers. The remainder of this section will provide examples of these architectures, highlighting their advantages and disadvantages.
2.1.1
Fine-grained
Fine-grained architectures offer the benefit of allowing designers to implement bit manipulation tasks without wasting reconfigurable resources. However, for large and complex calculations, numerous fine-grained PEs are required to implement a basic computation. This results in much slower clock rates than are possible if the calculations could be mapped to fewer, coarse-grained PEs. Fine-grained architectures may also limit the number of VICs that can be concurrently stored in the RPF because of capacity limits.
Garp’s nonsymmetrical RPF
The BRASS Research Group designed the Garp reconfigurable processor as a MIPS processor and on-chip cache combined with an RPF [14]. The RPF is composed of an array of PEs, as shown in Figure 2.1. Unlike most RPF architectures, not all of the PEs (drawn as rounded squares in the array) are the same. There is one control PE in each row (illustrated as the dark gray square in the leftmost column) that provides communication between the RPF and external resources. For example, the control block can be used to generate an interrupt for the main processor or to initiate memory transactions. The remaining PEs (illustrated as light gray squares) in the array are used for data processing and modeled after the configurable logic blocks (CLBs) in the Xilinx 4000 series [13]. The number of columns of PEs is fixed at 24, with the middle 16 PEs dedicated to providing memory access for the RPF. The 3 extra PEs on the left and the 4 extra PEs on the right in Figure 2.1 are used for operations such as overflow, error checking, status checking, and wider data sizes. The number of rows in the RPF is not fixed by the architecture, but is typically at least 32 [13]. A wire network is provided between rows and columns, but the only way to switch wires is through logic blocks, as there are no connections from one wire to another.
Each PE operates at the bit level on two bits of data, performing the same operation on both bits based on the assumption that a large fraction of most configurations will be used for multibit operations. By creating identical configurations for both bits, the configuration size and time can be reduced but only at the expense of flexibility [13]. The loading of configurations into an RPF with a fine-grained fabric is extremely costly relative to coarse-grained architectures. For example, each PE
FIGURE 2.1 Garp’s RPF architecture. (Source: Adapted from [13].)
in Garp’s RPF requires 64 configuration bits (8 bytes) to specify the sources of inputs, the PE’s function, and any wires to be driven by the PE [13]. So, if there are only 32 rows in the RPF, 6144 bytes are required to load the configuration. While this may not seem significant given that the configuration bitstream of a commercial FPGA is on the order of megabytes (MB), it is considerable relative to a traditional CPU’s context switch. For example, if the bit path to external memory from the Garp is assumed to be 128 bits, loading the full configuration takes 384 sequential memory accesses. Garp’s RPF architecture supports partial array configuration and is dynamically reconfigurable during application execution (i.e., a dynamic RPF). Garp’s RPF architecture allows only one VIC to be stored on the RPF at a time. However, up to four different full RPF VIC configurations can be stored in the on-chip cache [13]. The VICs can then be swapped in and out of the RPF as they are needed for the application. The loading and execution of configurations on the reconfigurable array is always under the control of a program running on the main (MIPS) processor. When the main processor initiates a computation on the RPF, an iteration counter in the RPF is set to a predetermined value. The configuration executes until the iteration counter reaches zero, at which point the RPF stalls. The MIPS-II instruction set has been extended to provide the necessary support to the RPF [13].
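The configuration-size figures quoted above follow from simple arithmetic, which the short Python sketch below reproduces. The function name garp_config_cost and its parameter names are illustrative; the default values (24 PEs per row, 64 configuration bits per PE, a 128-bit path to memory) are the ones stated in the text, and the sketch assumes a full-array load with no partial configuration.

```python
def garp_config_cost(rows, pes_per_row=24, bits_per_pe=64, bus_width_bits=128):
    """Return (config_bytes, memory_accesses) for a full-array configuration."""
    total_bits = rows * pes_per_row * bits_per_pe
    config_bytes = total_bits // 8
    # Ceiling division: a partially filled final transfer still costs an access.
    accesses = -(-total_bits // bus_width_bits)
    return config_bytes, accesses

# 32 rows x 24 PEs x 8 bytes = 6144 bytes; 49,152 bits / 128 bits = 384 accesses.
print(garp_config_cost(32))  # (6144, 384)
```

Because the row count is not fixed by the architecture, the same function shows how the load cost grows linearly with the number of rows.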
Originally, the user was required to write configurations in a textual language that is similar to an assembler. The user had to explicitly assign data and operations to rows and columns. This source code was fed through a program called the configurator to generate a representation for the configuration as a collection of bits in a text file. The rest of the user’s source code could then be written in C, where the configuration was referenced using a character array initializer. This required some further assembly language programming to invoke the Garp instructions that interfaced with the reconfigurable array. Since then, considerable compiler work has been done on this architecture, and the user is now able to program the entire application in a high-level language (HLL) [14] (see Chapter 7).
2.1.2
Coarse-grained
For the purpose of this discussion, we describe coarse-grained architectures as those that use a bus interconnect and PEs that perform more than just bitwise operations, such as ALUs and multipliers. Examples include PipeRench and RaPiD (which is discussed later in this chapter).
PipeRench
The PipeRench RPF architecture [6], as shown in Figure 2.2, is an ALU-based system with a specialized reconfiguration strategy (Chapter 4). It is used as a coprocessor to a host microprocessor for most applications, although applications such as PGP and JPEG can be run on PipeRench in their entirety [8]. The architecture was designed in response to concerns that standard FPGAs do not provide reasonable forward compatibility, compilation time, or sufficient hardware to implement large kernels in a scalable and portable manner [6]. The PipeRench RPF uses pipelined configuration, first described by Goldstein et al. [6], where the reconfigurable fabric is divided into physical pipeline stages that can be reconfigured individually. Thus, the resulting RPF architecture is both partially and dynamically reconfigurable. PipeRench’s compiler is able to compile the static design into a set of “virtual” stages such that each virtual stage can be mapped to any physical pipeline stage in the RPF. The complete set of virtual stages can then be mapped onto the actual number of physical stages available in the pipeline. Figure 2.3 illustrates how the virtual pipeline stages of an application can be mapped onto a PipeRench architecture with three physical pipeline stages. A pipeline stage can be loaded during each cycle, but all cyclic dependencies must fit within a single stage. This limits the types of computations the array can support, because many computations contain cycles with multiple operations. Furthermore, since configuration of a pipeline stage can occur concurrently with execution of another pipeline stage, there is no performance degradation due to reconfiguration.
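The mapping of virtual stages onto a smaller number of physical stages can be illustrated with a small Python sketch. It assumes a simple policy that is not spelled out in the text: one stage is loaded per cycle, physical stages are reused in round-robin order, and virtual stages are issued in program order, wrapping around after the last one. The function name stripe_schedule is hypothetical, and the model records only which virtual stage is resident on each physical stage; it does not separate the configuration cycle from the execution cycles.

```python
def stripe_schedule(num_virtual, num_physical, num_cycles):
    """Return, per cycle, the virtual stage resident on each physical stage.

    Entry [c][p] is None while physical stage p has not been loaded yet.
    """
    schedule = []
    for cycle in range(num_cycles):          # cycle 0 is the first cycle
        row = []
        for p in range(num_physical):
            if cycle < p:
                row.append(None)             # not configured yet
            else:
                # Index of the most recent load that targeted physical stage p.
                load = cycle - ((cycle - p) % num_physical)
                row.append(load % num_virtual)
        schedule.append(row)
    return schedule

# Five virtual stages on three physical stages, observed for six cycles.
for cycle, row in enumerate(stripe_schedule(5, 3, 6), start=1):
    print(cycle, row)
```

Under these assumptions, physical stage 0 holds virtual stage 0 for three cycles and then virtual stage 3, while physical stage 2 wraps around to virtual stage 0 in the sixth cycle.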
A row of PEs is used to create a physical stage of the pipeline, also called a physical stripe, as shown in Figure 2.2. The configuration word, or VIC, used to configure a physical stripe is also known as a virtual stripe. Before a physical
FIGURE 2.2 PipeRench architecture: PEs and interconnect. (Source: Adapted from [6].)
stripe is configured with a new virtual stripe, the state of the present virtual stripe, if any, must be stored outside the fabric so it can be restored when the virtual stripe is returned to the fabric. The physical stripes are all identical so that any virtual stripe can be placed onto any physical stripe in the pipeline. The interconnect between adjacent stripes is a full crossbar, which enables the output of any PE in one stage to be used as the input of any PE in the adjacent stage [6]. The PEs for PipeRench are composed of an ALU and a pass register file. The pass register file is required as there can be no unregistered data transmitted over the interconnect network between stripes, creating pipelined interstripe connections. One register in the pass register file is specifically dedicated to intrastripe feedback. An 8-bit PE granularity was chosen to optimize the performance of a suite of kernels [6]. It has been suggested that reconfigurable fabric is well suited to stream-based functions (see Chapter 5, Section 5.1.2) and custom instructions [6]. Although
FIGURE 2.3 The virtual pipeline stages of an application (a). The light gray blocks represent the configuration of a pipeline stage; the dark gray blocks represent its execution. The mapping of virtual pipeline stages to three physical pipeline stages (b). The physical pipeline stages are labeled each cycle with the virtual pipeline stage being executed. (Source: Adapted from [6].)
the first version of PipeRench was implemented as an attached processor, the next was designed as a coprocessor so that it would be more tightly coupled with the host processor [6]. However, the developers of PipeRench argue against making the RPF a functional unit on the host processor. They state that this could “restrict the applicability of the reconfigurable unit by disallowing state to be stored in the fabric and in some cases by disallowing direct access to memory, essentially eliminating their usefulness for stream-based processing” [6]. PipeRench uses a set of CAD tools to synthesize a stripe based on the parameters N, B, and P, where N is the number of PEs in the stripe, B is the width in bits of each PE, and P is the number of registers in a PE’s pass register file. By adjusting these parameters, PipeRench’s creators were able to choose a set of values that provides the best performance according to a set of benchmarks [6]. Their CAD tools are able to achieve an acceptable placement of the stripes on the architecture, but fail to achieve a reasonable interconnect routing, which has to be optimized by hand. The user also has to describe the kernels to be executed on the PipeRench architecture using the Dataflow Intermediate Language (DIL), a single-assignment C-like language created for the architecture. DIL is intended for use by programmers and as an intermediate language for any high-level language compiler
that targets PipeRench architectures [6]. Obviously, applications have to be recompiled, and probably even redesigned, to run on PipeRench.
2.2
RPF INTEGRATION INTO TRADITIONAL COMPUTING SYSTEMS
Whereas the RPF in a reconfigurable computing device dictates the programmable logic resources, a full reconfigurable computing system typically also has a microprocessor, memory, and possibly other structures. One defining characteristic of reconfigurable computing chips is the integration, or lack of integration, of the RPF with a host CPU. As shown in Figure 2.4, there are multiple ways to integrate an RPF into a computing system’s memory hierarchy. The different memory components of the system are drawn as shaded rectangles, where the darker shading indicates a tighter coupling of the memory component to the processor. The types of RPF integration for these computing systems are illustrated as rounded
FIGURE 2.4 Possible locations for the RPF in the memory hierarchy. (Source: Adapted from [6].)
rectangles, where the darker shading indicates a tighter coupling of the RPF to the processor. Some systems have the RPF as a separate processor [2–7]; however, most applications require a microprocessor somewhere to handle complex control. In fact, some separate reconfigurable computing platforms are actually defined to include a host processor that interfaces with the RPF [1]. Unfortunately, when the RPF is integrated into the computing system as an independent coprocessor, the limited bandwidth between CPU and reconfigurable logic can be a significant performance bottleneck. Other systems include an RPF as an extra functional unit coupled with a more traditional processor core on one chip [8–24]. How tightly the RPF is coupled with the processor’s control plane varies.
2.2.1
Independent Reconfigurable Coprocessor Architectures
Figure 2.5 illustrates a reconfigurable computing architecture with an independent RPF [1–7]. In these systems, the RPF has no direct data transfer links to the processor. Instead, all data communication takes place through main memory. The host processor, or a separate configuration controller, loads a configuration into the RPF and places operands for the VIC into the main memory. The RPF can then perform the computation and return the results back to main memory. Since independent coprocessor RPFs are separate from the traditional processor, the integration of the RPF into existing computer systems is simplified. Unfortunately, this also limits the bandwidth and increases the latency of transmissions between the RPF and traditional processing systems. For this reason, independent coprocessor RPFs are well suited only to applications where the RPF can act independently from the processor. Examples include data-streaming applications with significant digital signal processing, such as multimedia applications like image compression and decompression, and encryption.
RaPiD
One example of an RPF coprocessor is the Reconfigurable Pipelined Datapaths [4], or RaPiD, class of architectures. RaPiD’s RPF can be used as an independent
FIGURE 2.5 A reconfigurable computing system with an independent reconfigurable coprocessor.
coprocessor or integrated with a traditional computing system as shown in Figure 2.5. RaPiD is designed for applications that have very repetitive pipelined computations that are typically represented as nested loops [5]. The underlying architecture is comparable to a superscalar processor with numerous PEs and instruction generation decoupled from external memory but with no cache, no centralized register file, and no crossbar interconnect, as shown in Figure 2.6. Memory access is controlled by the stream generator, which uses first-in-first-out buffers (FIFOs), or streams (Chapter 5, Sections 5.1.2 and 5.2.1), to obtain and transfer data from external memory via the memory interface, as shown in Figure 2.7. Each stream has an associated address generator, and the individual address patterns are generated statically at compile time [5]. The actual reads and writes
FIGURE 2.6 A block diagram of the RaPiD architecture. (Source: Adapted from [5].)
FIGURE 2.7 RaPiD’s stream generator. (Source: Adapted from [5].)
from the FIFOs are triggered by instruction bits at runtime. If the datapath’s required input data is not available (i.e., the input FIFO is empty) or if the output data cannot be stored (i.e., the output FIFO is full), then the datapath will stall. Fast access to memory is therefore important to limit the number of stalls that occur. Using a fast static RAM (SRAM), combined with techniques such as interleaving and out-of-order memory accesses, reduces the probability of having to stall the datapath [5]. The actual architecture of RaPiD’s datapath is determined at fabrication time and is dictated by the class of applications that will be using the RaPiD RPF. This is done by varying the PE structure and the data width, and by choosing between fixed-point or floating-point data for numerical operations. The ability to change the PE’s structure is fundamental to RaPiD architectures, with the complexity of the PE ranging from a simple general-purpose register to a multioutput Booth-encoded multiplier with a configurable shifter [5]. The RaPiD datapath consists of numerous PEs, as shown in Figure 2.8. The creators of RaPiD chose to benchmark an architecture with a rather complex PE consisting of ALUs, RAMs, general-purpose registers, and a multiplier to provide reasonable performance [5]. The coarse-grained architecture was chosen because it theoretically allows simpler programming and better density [5]. Furthermore, the datapath can be dynamically reconfigured (i.e., a dynamic RPF) during the application’s execution. Instead of using a crossbar interconnect, the PEs are connected by a more area-efficient linear-segmented bus structure and bus connectors, as shown in Figure 2.8. The linear bus structure significantly reduces the control overhead, from the 95 to 98 percent required by FPGAs to 67 percent [5]. Since
FIGURE 2.8 An overview of RaPiD’s datapath. (Source: Adapted from [5].)
the processor performance was benchmarked for a rather complex PE, the datapath was composed of only 16 PEs [5]. Each operation performed in the datapath is determined by a set of control bits, and the outputs are a data word plus status bits. These status bits enable data-dependent control. There are both hard control bits and soft control bits. As the hard control bits are for static configuration and are field programmable via SRAM bits, they are time consuming to set. They are normally initialized at the beginning of an application and include the tristate drivers and the programmable routing bus connectors, which can also be programmed to include pipelined delays for the datapath. The soft control bits can be dynamically configured because they are generated efficiently and affect multiplexers and ALU operations. Approximately 25 percent of the control bits are soft [5]. The instruction generator generates soft control bits in the form of VICs for the configurable control plane, as shown in Figure 2.9. The RaPiD system is built around the assumption that there is regularity in the computations. In other words, most of its processing time is spent within nested loops, as opposed to initialization, boundary processing, or completion [5], so the soft control bits are generated by a small programmable controller as a short instruction word (i.e., a VIC). The programmable controller is optimized to execute nested loop structures. For each nested loop, the user’s original code is statically compiled to remove all conditionals on loop variables and expanded to generate static instructions for loops [5]. The innermost loop can then often be packed into a single VIC with a count indicating how many times the VIC should be issued. One VIC can also be used to control more than one operation in more than one pipeline stage [5]. Figure 2.10(a) shows a snippet of code that includes conditional statements (if and for). 
This same functionality is shown in terms of static instructions in Figure 2.10(b). As there are often parallel loop nests in applications, the instruction generator has multiple programmable controllers running in parallel (see Figure 2.9) [5]. Although this causes synchronization concerns, the appropriate status bits exist to provide the necessary handshaking. The VICs from each controller are
FIGURE 2.9 RaPiD’s instruction generator. (Source: Adapted from [5].)
for (i=0; i
muxsel_inverted_sig, c => sela_sig );
-- another instantiation of the and gate
and_inst_b : andgate
  port map ( a => b, b => mux_sel, c => selb_sig );
or_inst : orgate
  port map ( a => sela_sig, b => selb_sig, c => c );
end;
1. VHDL files typically start by including the IEEE library and certain important packages like std_logic_1164 (Listing 6.1, lines 1–2) that permit the use of type std_logic and Boolean operations on it. Additional packages such as std_logic_arith and std_logic_unsigned are often included for supporting arithmetic operations.
2. The VHDL description of a hardware module requires an entity declaration (Listing 6.1, lines 6–13) that specifies the interface of the module with the outside world. It is an enumeration of the interface ports. The declaration also provides additional information about the ports such as their direction (in/out), data type, bit width, and endianness. An entity declaration in VHDL is analogous to an interface definition in Java or a function header declaration in C.
3. Almost all VHDL signals and ports use the data type std_logic and std_logic_vector. These data types define how VHDL models electrical behavior of signals, which we discuss in the Multivalued logic subsection of Section 6.1.5. The vector std_logic_vector allows declaration of buses that are bundled together. We will see its use in a subsequent example.
4. While an entity specifies the interface of a hardware module, its internal structure and function are enclosed within the architecture definition (Listing 6.1, lines 16–83).
5. In a structural description of a module, the constituent submodules are declared, instantiated, and connected to each other. Each submodule needs to be first declared in the component declaration (Listing 6.1, lines 20–25). This is merely a copy of the entity declaration where only the submodule’s interface is specified. Once the components are declared, they can then be instantiated (Listing 6.1, lines 54–58). Each instance of the component is unique, and a component can have multiple instances (Listing 6.1, lines 61 and 69). The instantiated components are connected to each other via internal signals by a process called port mapping (Listing 6.1, lines 55–58). Port mapping is performed on a signal-by-signal basis using the => symbol. It is analogous to assembling a set of integrated circuits (ICs) on a breadboard and wiring up the connections between the IC pins using jumper wires. Observe the similarity between the schematic representation of the multiplexer and the structural VHDL in the example.
6. Notice in the example that the component for the AND gate is reused for each AND gate in the design (Listing 6.1, lines 61–66 and 69–74). This is one of the benefits of a structural representation: it permits reuse of existing code for recurring design elements and helps reduce total code size.
7. The submodules used in Listing 6.1 are primitives supported in the vendor library. In a larger design that is a collection of several multiplexers, the different multiplexers can be declared, instantiated, and connected to each other as required. A design can have several such levels of structural hierarchy. Hierarchy is a fairly common technique for design composition.
6.1.2 RTL Description
The multiplexer’s RTL description can be specified much more succinctly than its corresponding structural representation. In RTL, logic is organized as transformations on data bits between register stages. By selecting the number of pipeline stages wisely, the designer can create a high-performance, high-speed hardware implementation, and by carefully deciding the degree of resource sharing, the size of the mapped design can be controlled as well. RTL provides the designer with sufficient low-level control to allow her to create an implementation that meets her specifications. For the VHDL description, we still need the logical equations that define the multiplexer, but these can now be represented directly as equations, from
which a synthesis tool infers the actual gates. The tool tries to choose the gates on the basis of user-specified design criteria such as high speed or small area. Listing 6.2 shows how to write a 4-input multiplexer with registered outputs (Listing 6.1 simply showed a 2-input multiplexer without a register).

1. As before, we start with the package and entity declarations (Listing 6.2, lines 6–18).
2. The RTL description of the VHDL entity is enclosed in the architecture block (Listing 6.2, lines 20–52). The logic equations and registers that are part of the RTL description are written here. Earlier, we used the architecture block to write the structural port-mapping statements.

Listing 6.2 RTL for a 4-input multiplexer.

-- library and package includes
library ieee;
use ieee.std_logic_1164.all;

-- entity declaration for the 4-input multiplexer
entity mux4 is
port (
  clk     : in std_logic;
  reset   : in std_logic;
  a       : in std_logic;
  b       : in std_logic;
  c       : in std_logic;
  d       : in std_logic;
  -- notice the use of the type vector.
  mux_sel : in std_logic_vector (1 downto 0);
  e       : out std_logic
);
end;
-- RTL description of the multiplexer is defined here
architecture rtl of mux4 is
-- internal signals used in the multiplexer are
-- declared here before use
signal e_c : std_logic;
-- indicates start of the actual RTL code
begin
-- concurrent signal assignment
-- the multiplexer functionality is described
-- at a level above gates
e_c [...]

[...] b, out1 => xor1_out);
xor2: xorg port map (in1 => xor1_out, in2 => cin, out1 => sum);
and1: andg port map (in1 => a, in2 => b, out1 => and1_out);
or1: org port map (in1 => a, in2 => b, out1 => or1_out);
and2: andg port map (in1 => cin, in2 => or1_out, out1 => and2_out);
or2: org port map (in1 => and1_out, in2 => and2_out, out1 => cout);
end structural;
FIGURE 16.5 An example of explicit layout in VHDL.
Chapter 16 Specifying Circuit Layout on FPGAs
Explicit layout works well for small circuits that are not parameterized and for VHDL and Verilog descriptions that do not make use of statements like for . . . generate. In parameterized circuits, layout specifications become quite complex: location specifications turn into layout calculation expressions that are difficult to comprehend. Because layout specifications are string attributes, one has the extra complexity of performing integer index calculations and then converting the results into their string representation. This is often too tedious to be practical. The difficulty of working with explicit Cartesian layout specifications has led to the development of various systems to specify layout at a higher level of abstraction.
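The index-to-string step described above can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the "XmYn" location string, the helper names rloc and adder_column_rlocs, and the choice of two adder bits per slice are hypothetical and are not taken from the text.

```python
def rloc(x: int, y: int) -> str:
    """Format a relative location as a string (assumed "XmYn" form)."""
    return f"X{x}Y{y}"


def adder_column_rlocs(n_bits: int, x: int = 0, bits_per_slice: int = 2) -> dict:
    """Compute one location string per bit of a vertically stacked adder.

    The integer arithmetic (i // bits_per_slice) is the layout calculation;
    rloc() is the conversion to the string attribute value.
    """
    return {f"bit{i}": rloc(x, i // bits_per_slice) for i in range(n_bits)}


locations = adder_column_rlocs(4)
```

Even in this tiny case the placement rule is hidden inside an index expression, which is the readability problem the paragraph describes.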
16.3 ALGEBRAIC LAYOUT SPECIFICATION

Algebraic layout specification typically does not involve Cartesian coordinates. Instead, one specifies the geometric relationship between one circuit and another. These specifications (or constraints) are gathered together, and a deterministic layout can then be calculated. Techniques such as this have been shown to work for parameterized circuits, circuits with irregular layouts, and recursively defined circuit layouts. Such descriptions are also slightly less tightly coupled to a specific FPGA architecture or family.

In this section we describe how algebraic layout specifications work in the Lava system [1]. Several other systems are based on similar principles. Lava is based on the concept of circuit combinators, which are calculations that take circuits as inputs and deliver a circuit as a result; essentially, they are procedures that compute on circuit descriptions. One important design decision in Lava is the coupling of the description of circuit behavior and that of circuit layout by using circuit combinators that compose both behavior and layout. This works well when the circuit layout description can use the same patterns as those of the circuit behavior. When this is not the case, one can directly use Cartesian coordinates.

One important combinator is the serial composition combinator. This combinator, written as an infix operator >->, takes two circuits R and S as arguments and delivers a circuit comprising R with its output connected to the input of S. Furthermore, R is laid out to the left of S, which matches a left-to-right dataflow. Figure 16.6 shows the composition of an AND2 and an INV gate. Each gate or circuit starts life in its own coordinate system. The basic gates each have a height and width of one unit. The serial composition combinator sees that the circuit on the left has a width of one and then translates the circuit on the right by one unit. These algebraic descriptions can be arbitrarily nested.
When the system needs to produce a VHDL or EDIF netlist, the algebraic specifications are computed and a netlist that contains RLOCs is automatically generated. Notice, now that layout has been combined with behavior, that there is a need for several kinds of serial composition combinators, such as one for right-to-left dataflow [...]

FIGURE 16.6 Layout calculation.
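The layout calculation performed by serial composition can be sketched in Python. This is a simplified model rather than Lava itself: the Circuit record, the gate and serial helpers, and the requirement that cell names be unique are assumptions made for the illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class Circuit:
    fn: Callable                       # behavior: input value -> output value
    cells: Dict[str, Tuple[int, int]]  # relative placement: name -> (x, y)
    width: int
    height: int


def gate(name: str, fn: Callable) -> Circuit:
    """A basic gate is one unit tile at the origin of its own coordinates."""
    return Circuit(fn, {name: (0, 0)}, 1, 1)


def serial(r: Circuit, s: Circuit) -> Circuit:
    """Sketch of r >-> s: feed r's output into s and place s right of r."""
    cells = dict(r.cells)
    for name, (x, y) in s.cells.items():
        cells[name] = (x + r.width, y)   # translate s by the width of r
    return Circuit(lambda v: s.fn(r.fn(v)), cells,
                   r.width + s.width, max(r.height, s.height))


and2 = gate("and2", lambda ab: ab[0] & ab[1])
inv = gate("inv", lambda a: 1 - a)
nand2 = serial(and2, inv)                # AND2 >-> INV
```

Because serial returns another Circuit, compositions nest: composing nand2 with a further gate translates that gate by the combined width of two units.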
Figure 16.7 shows the layout produced by the Lava circuit expression AND2 >-> FD clk, which serially composes an AND2 gate with an FD component (a flip-flop). In the Xilinx device, a LUT–flip-flop pair is called a slice. AND2 and a flip-flop (FD) each have a width and height of one unit, or slice, causing the FD flip-flop to be mapped to a slice to the right of the slice containing the function generator for the AND2 gate. Such a process is very inefficient. To allow circuits to be composed but mapped to the same location we can use the serial overlay operator, written as >|>. This is illustrated on the right side of Figure 16.7 and shows both the AND2 gate and the FD flip-flop mapped to the same location.

FIGURE 16.7 The overlay combinator: (a) AND2 >-> FD clk; (b) AND2 >|> FD clk.

The circuit tiles presented so far have only one-dimensional dataflow. Four-sided tiles allow us to specify dataflow horizontally and vertically. Rather than introduce a new basic tile, a 4-sided tile can be represented in terms of a 2-sided tile. This is done by considering the 4-sided tile as a function that maps a pair of input values to a pair of output values. Each element of each pair corresponds to a face of the tile, as shown in Figure 16.8. We can now define a below combinator, which places one tile below another (r below s is shown in the middle of the figure). The col combinator replicates a tile vertically (col 4 r is shown on the right of the figure).

FIGURE 16.8 Four-sided tiles.

A concrete example of the col combinator is shown in Figure 16.9. The col combinator acts on a 1-bit adder circuit that takes a pair as input (the carry-in [cin] and another pair of values to be added) and delivers a pair as its output (the sum and the carry-out [cout]). It will connect the carry-out of each stage to the carry-in of the next stage. Furthermore, it will vertically stack the 1-bit adders. The actual FPGA layout produced for col 8 oneBitAdder is shown in Figure 16.10. In this case the automatic placement tools would have produced the same layout because the carry chain would have constrained a vertical alignment for the circuit. Through combinations of these regular abutment techniques, very complex but regular circuits can be efficiently created.

FIGURE 16.9 A col 4 1-bit adder.

FIGURE 16.10 FPGA layout of col 8 oneBitAdder.
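The wiring performed by col can be sketched in Python. The sketch assumes the tile convention described in the text, input (cin, (a, b)) and output (sum, cout); the function names and the rows table that records one row per stage are hypothetical and stand in for Lava's layout information.

```python
def one_bit_adder(inputs):
    """Full adder tile: (cin, (a, b)) -> (sum, cout)."""
    cin, (a, b) = inputs
    part = a ^ b
    return part ^ cin, (a & b) | (cin & part)


def col(n, tile):
    """Sketch of col n tile: chain carries upward and stack tiles in rows."""
    def stacked(inputs):
        carry, lefts = inputs
        if len(lefts) != n:
            raise ValueError("expected one left input per tile")
        outs = []
        for left in lefts:               # bottom tile first
            out, carry = tile((carry, left))
            outs.append(out)
        return outs, carry
    rows = {i: (0, i) for i in range(n)}  # stage i sits at relative (0, i)
    return stacked, rows


def add(n, x, y, cin=0):
    """Use col n one_bit_adder as an n-bit ripple-carry adder."""
    adder, _rows = col(n, one_bit_adder)
    bits = [((x >> i) & 1, (y >> i) & 1) for i in range(n)]
    sums, cout = adder((cin, bits))
    return sum(s << i for i, s in enumerate(sums)) + (cout << n)
```

For example, add(4, 9, 8) evaluates the four stacked tiles from the bottom up and yields 17, with the final carry-out supplying the fifth bit.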
16.3.1 Case Study: Batcher’s Bitonic Sorter
This section presents the layout specification of a high-speed parallel sorter that would have been difficult to lay out using explicit Cartesian coordinates. We show how to build complex structures incrementally by composing the layout of subcomponents using simple operators. The use of hierarchy achieves complex layout structures that would have been difficult or tedious to produce otherwise and impossible to produce in a compositional manner.

The objective is to build a parallel sorter from a parallel merger, as shown in Figure 16.11. A parallel merger takes two sublists of numbers where each sublist is sorted and produces a completely sorted list of numbers as its output. All inputs and outputs are shifted in, in parallel rather than serially. Furthermore, for performance reasons the sorter should have the same floorplan as shown in the figure.

This parallel sorter uses a two-sorter as its building block, which is shown fully placed in Figure 16.12. This circuit has left-to-right dataflow. Although the >=> combinator is also a serial composition combinator, it does not have any layout semantics because it is used to compose wiring circuits (which are not subject to layout directives). The two-sorter in Figure 16.12 has been carefully designed to have a rectangular footprint because we will want to tile many of these circuits together vertically and horizontally to produce a compact and high-performance sorter network.

Another important combinator we will use in our sorter design is the two combinator, which makes two copies of a circuit r, one of which works on the bottom half of the input and the other on the top half of the input, as illustrated in Figure 16.13. Furthermore, the second copy of r should be placed vertically on top of the first copy. The two combinator can be defined as

two r = halve >-> par [r,r] >-> unhalve

which says halve the input, use two copies of r in parallel (stacked vertically) on the halved input, and then take the result and unhalve it.
FIGURE 16.11 The recursive structure of a sorter.

FIGURE 16.12 Two-sorter layout and behavior specification: twoSorter clk = fork2 >-> fsT comparator >-> condSwap clk.

FIGURE 16.13 The two-combinator.
Interleave (ilv) is another combining form that uses two copies of the same circuit. This combinator has the property that the bottom circuit processes the inputs at even positions and the top circuit processes the inputs at odd positions. It can be defined as

ilv r = unriffle >-> two r >-> riffle

An instance of ilv r for an 8-input bus is shown in Figure 16.14. The related evens combinator chops the input list into pairs and then applies a copy of the same circuit to each pair. Given these ingredients, we can give a recursive description of a parallel merger butterfly circuit:

bfly r 1 = r
bfly r n = ilv (bfly r (n-1)) >-> evens r
A bitonic merger of degree 3 is shown in Figure 16.15, which not only describes how to compose the behavior of elements to form a merger circuit, but also specifies the layout of the merger circuit using algebraic layout specifications. This circuit is a bitonic merger that can merge its inputs as long as one half of the input is sorted in the opposite order from the other half, as shown in the figure.

FIGURE 16.14 The ilv combinator.

FIGURE 16.15 A bitonic merger.
FIGURE 16.16 Sorter recursion and layout for 8 inputs.
Now that we have our merger, we can recursively unfold the pictorial specification of the sorter layout to produce the design and layout in Figure 16.16 (for 8 inputs). This layout can be specified using the following combinators:

sortB cmp 1 = cmp
sortB cmp n = two (sortB cmp (n-1)) >-> pair >-> snD reverse >-> unpair >-> butterfly cmp n

In the figure the description uses two subsorters to produce a bitonic input for a merger (shown on the right). The 8-input description can be evaluated to produce an EDIF or VHDL netlist containing RLOC specifications for every gate. The FPGA layout of a degree-5 sorter (32 inputs) with 16-bit numbers on a Xilinx Virtex-II device is shown in Figure 16.17. Figure 16.18 shows the result for the same netlist with the layout information removed. The netlist with the layout information leads to an implementation that is approximately 50 percent faster, and for a 64-input sorter the speed improvement is 75 percent. The case study just outlined shows how a complicated and recursive layout can be described in a feasible manner using algebraic layout combinators rather than explicit Cartesian coordinates.
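The behavior of the sorter recursion can be checked with a Python sketch that models each combinator as a function on lists. It models behavior only, not layout. The snake_case names are hypothetical, and reading pair >-> snD reverse >-> unpair as "reverse the second half of the list" is an assumption about those Lava operators.

```python
def two_sorter(xs):
    """Two-sorter building block: order a pair of values."""
    a, b = xs
    return [min(a, b), max(a, b)]


def halve(xs):
    mid = len(xs) // 2
    return xs[:mid], xs[mid:]


def two(r):
    """two r: apply r to the bottom half and to the top half."""
    def run(xs):
        lo, hi = halve(xs)
        return r(lo) + r(hi)
    return run


def unriffle(xs):
    return xs[0::2] + xs[1::2]           # even positions, then odd positions


def riffle(xs):
    lo, hi = halve(xs)
    return [v for pair in zip(lo, hi) for v in pair]


def ilv(r):
    """ilv r = unriffle >-> two r >-> riffle."""
    return lambda xs: riffle(two(r)(unriffle(xs)))


def evens(r):
    """Apply r to each adjacent pair of the input list."""
    return lambda xs: [v for i in range(0, len(xs), 2) for v in r(xs[i:i + 2])]


def bfly(r, n):
    """bfly r 1 = r; bfly r n = ilv (bfly r (n-1)) >-> evens r."""
    if n == 1:
        return r
    inner, outer = ilv(bfly(r, n - 1)), evens(r)
    return lambda xs: outer(inner(xs))


def sort_b(cmp, n):
    """Sort 2**n values: two subsorters, reverse the top half, then merge."""
    if n == 1:
        return cmp
    sub, merge = two(sort_b(cmp, n - 1)), bfly(cmp, n)
    def run(xs):
        lo, hi = halve(sub(xs))
        return merge(lo + hi[::-1])      # bitonic input for the merger
    return run


sorter8 = sort_b(two_sorter, 3)
```

Applying sorter8 to [5, 2, 7, 1, 8, 3, 6, 4] passes through the bitonic intermediate list [1, 2, 5, 7, 8, 6, 4, 3] before the degree-3 merger produces the sorted result.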
FIGURE 16.17 The sorter FPGA layout (32 16-bit inputs).

16.4 LAYOUT VERIFICATION FOR PARAMETERIZED DESIGNS

A common problem with parameterized layout descriptions (especially those based on Cartesian coordinates) is that designer errors can produce bad layouts that cannot be realized on the target FPGAs; for example, the layout specification may try to map too many logic gates into the same location. Such errors make the production of IP cores that rely on layout very difficult and time consuming. For a nonparameterized design, this is not much of an issue: The developer can check if the design maps, places, and routes. However, for a parameterized design it is usually impractical to check every possible combination of parameters to ensure that each one leads to a valid layout.

A recent, interesting approach for layout verification uses theorem provers to statically analyze and formally verify that a design is free of layout errors. This is the approach taken by Pell [2] in his Quartz declarative block composition system, which uses a special hardware description notation that can be formally analyzed with the Isabelle theorem prover. The Quartz system works on algebraic layout combinators similar to those presented in the previous section. The Quartz system verifies layout correctness by checking for validity, containment, and intersection. Validity ensures that the size function of a block always evaluates to a positive result. Containment ensures that for all parameter values all subblocks stay within the bounding box of the overall circuit. The intersection property checks for badly overlapping blocks.
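The three properties can be made concrete with a Python sketch that tests them for one parameter value at a time. This is not how Quartz works: Quartz proves the properties for all parameter values with a theorem prover, whereas the sketch below only samples them. The block representation (x, y, width, height) and all function names are assumptions made for the illustration.

```python
def check_layout(blocks, bbox):
    """Report validity, containment, and intersection errors.

    blocks maps a block name to (x, y, width, height); bbox is the
    (width, height) of the enclosing circuit.
    """
    errors = []
    for name, (x, y, w, h) in blocks.items():
        if w <= 0 or h <= 0:
            errors.append(("validity", name))
        if x < 0 or y < 0 or x + w > bbox[0] or y + h > bbox[1]:
            errors.append(("containment", name))
    names = sorted(blocks)
    for i, p in enumerate(names):
        for q in names[i + 1:]:
            px, py, pw, ph = blocks[p]
            qx, qy, qw, qh = blocks[q]
            if px < qx + qw and qx < px + pw and py < qy + qh and qy < py + ph:
                errors.append(("intersection", (p, q)))
    return errors


def column(n):
    """Parameterized generator: n unit tiles stacked vertically."""
    return {f"tile{i}": (0, i, 1, 1) for i in range(n)}, (1, n)


def buggy_column(n):
    """Same generator with an index error that maps two tiles to one row."""
    return {f"tile{i}": (0, i // 2, 1, 1) for i in range(n)}, (1, n)
```

Sampling column(n) for a few values of n finds no errors, while buggy_column(4) reports overlapping tiles; the limitation is that sampling can never cover every parameter value, which is the gap a static proof closes.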
FIGURE 16.18 The sorter with layout information removed.

16.5 SUMMARY

User specification of the layout of circuits for FPGAs is sometimes necessary to meet performance requirements, to reduce area, or to facilitate dynamic reconfiguration. While a user-defined layout is impractical for many complete designs because of complexity or time-to-market constraints, optimizing the most critical blocks of a circuit can have significant benefits, especially for reusable IP blocks and vendor libraries. Some vendor tools provide the ability to specify the layout of gates or composite blocks through either absolute or relative Cartesian coordinates. However, these tools are tedious to use and error prone, particularly for parameterized circuits. Various systems have adopted algebraic layout specifications that use geometric relationships between blocks instead of coordinate values. Such descriptions work well for irregular and recursive layouts, as demonstrated by the recursive parallel sorter in this chapter. However, one may still specify illegal layouts for parameterized circuits, and no satisfactory technique exists for finding them. A promising approach is the use of theorem provers to statically analyze algebraic layout descriptions to ensure that they have no layout errors for any combination of parameter values.
References

[1] P. Bjesse, K. Claessen, M. Sheeran, S. Singh. Lava: Hardware design in Haskell. International Conference on Functional Programming (ICFP), Springer-Verlag, 1998.
[2] O. Pell. Verification of FPGA layout generators in higher order logic. Journal of Automated Reasoning 37(1–2), August 2006.
[3] P. J. Roxby, S. Singh. Rapid construction of partial configuration datastreams from high level constructs using JBits. Field Programmable Logic (FPL), Springer-Verlag, 2001.
[4] S. Singh. Death of the RLOC. Field-Programmable Custom Computing Machines (FCCM), April 2000.
CHAPTER 17

PATHFINDER: A NEGOTIATION-BASED, PERFORMANCE-DRIVEN ROUTER FOR FPGAS

Larry McMurchie, Synplicity Corporation
Carl Ebeling, Department of Computer Science and Engineering, University of Washington
Routing is a crucial step in the mapping of circuits to field-programmable gate arrays (FPGAs). For large circuits that utilize many FPGA resources, it can be very difficult and time consuming to successfully route all of the signals. Additionally, the performance of the mapped circuit depends on routing critical and near-critical paths with minimum interconnect delays. One disadvantage of FPGAs is that they are slower than their ASIC counterparts, so it is important to squeeze out every possible nanosecond of delay in the routing. The first goal, a complete routing of all signals, is difficult to achieve in FPGAs because of the hard constraints on routing resources. Unlike ASICs and printed circuit boards (PCBs), FPGAs have a fixed amount of interconnect. The usual approach in placement is to minimize the wiring resources anticipated for routing signals. Although this reduces the overall demand for resources, signals inevitably compete for the same resources during routing. The challenge is to find a way to allocate resources so that all signals can be routed. The second goal, minimizing delay, requires the use of minimum-delay routes for signals, which can be expensive in terms of routing resources, especially for high-fanout signals. Thus, the solution to the entire routing problem requires the simultaneous solution of two interacting and often competing subproblems. Early solutions to the FPGA routing problem were based on the considerable literature on routing in the context of ASICs and gate arrays. The problem of routing FPGAs bears a considerable resemblance to the problem of global routing for custom integrated circuit design, where signals are assigned to channels. However, the two problems differ in several fundamental respects. First, routing resources in FPGAs are discrete and scarce while they are relatively continuous in custom integrated circuits (ICs). For this reason FPGAs require an integrated approach using both global and detailed routing. 
A second difference is that global routing for custom ICs is based on an undirected graph embedded in Cartesian space (i.e., a two-dimensional grid). In FPGAs the switches are often directional, and the routing resources connect arbitrary (but fixed) locations,
requiring a directed graph that may not be embedded in Cartesian space. Both of these distinctions are important, as they prevent direct application of much of the previous work in routing.

By far, the most common approach to global routing of custom ICs is a shortest-path algorithm with obstacle avoidance. By itself, this technique usually yields many unroutable nets that must be rerouted by hand. A plethora of rip-up and retry approaches have been proposed to remedy this deficiency [1–3]. The basic problem with rip-up and retry is that the success of a route is dependent not just on the choice of nets to reroute but also on the order in which rerouting is done. Delay is usually factored into the standard rip-up and retry approach by ordering the nets to be routed such that critical nets are routed most directly [4–6].

To make the FPGA routing problem tractable, nearly all of the routing schemes in the literature incorporate features of the underlying architecture. Palczewski [7] describes a maze router with rip-up and reroute targeting the Xilinx 4000 series. In this work the structure of the plane-parallel switchbox in the 4000 series is exploited in conjunction with an A∗ search. Brown et al. [4] employ an architecture model consisting of channels, switchboxes, connection matrices, and logic blocks. A global router balances channel densities and a detailed router generates families of explicit paths within channels to resolve congestion. These approaches, as well as others, obtain some of their success by exploiting the features of a particular architecture model. The problem is that new architectures become constrained by the restrictions of such existing routing algorithms.
17.1 THE HISTORY OF PATHFINDER

PathFinder was used initially in the development of the Triptych FPGA architecture [8–10]. In fact, Triptych, with its heavy reliance on effective placement and routing tools, was a catalyst for the development of the PathFinder algorithm, a perfect example of “necessity being the mother of invention.” As part of an FPGA architecture exploration tool called Emerald [11], PathFinder was also employed in the design of an FPGA being developed by IBM in the mid-1990s. This was particularly appropriate because PathFinder is inherently architecture independent. That experience showed that PathFinder was indeed an improvement over other FPGA routers available at the time.

The PathFinder algorithm was adopted and carefully implemented by Betz and Rose in the very popular versatile place and route (VPR) FPGA tool suite [12, 13], which has been widely used for academic and industry research. The Toronto place-and-route challenge [14] was established as a way to compare different FPGA placement and routing algorithms. Since the contest was established in 1997, the champion has been either VPR’s implementation of PathFinder or SCPathFinder, implemented at the University of California–Santa Cruz. Although companies are reluctant to divulge the details of their design tools, it is clear that some version of the PathFinder algorithm is currently used by virtually all commercial FPGA routers.
17.2 THE PATHFINDER ALGORITHM

17.2.1 The Circuit Graph Model

One of the key features of PathFinder is its architecture independence, which derives from the use of a simple underlying graph representation of FPGA architectures. This model allows PathFinder to be adapted to virtually any architecture and thus used to explore new architectures with very little startup cost. Once an architecture has been decided on, PathFinder can be specialized to it for improved results and performance.

The routing resources in an FPGA and their connections are represented by the directed graph $G = (V, E)$. The set of vertices $V$ corresponds to the electrical nodes or wires in the FPGA architecture, and the edges $E$ correspond to the switches that connect these nodes. An example of this graph model is shown in Figure 17.1 for a version of the Triptych FPGA cell. Note that devices are represented only implicitly by the wires connected to their terminals. That is, routing from one device terminal to another is routing between the wires connected to those terminals.

Associated with each node $n$ in the architecture is a base cost $b_n$ that represents the relative cost of using that node. This cost is typically proportional to the length of the wire, although other measures like capacitance or number of fanins and fanouts are also possible. Each node also has a delay $d_n$, which may or may not be the same as $b_n$. Given a signal $i$ in a circuit mapped onto the FPGA, the signal net $N_i$ is the set of terminals, including the source terminal $s_i$ and sinks $t_{ij}$. $N_i$ forms a subset of $V$. A solution to the routing problem for signal $i$ is the directed routing tree $RT_i$ embedded in $G$ and connecting the source $s_i$ to all of its sinks $t_{ij}$.
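One possible in-memory form of this graph model is sketched below in Python. The class and field names are hypothetical, and defaulting the delay to the base cost when none is given is an assumption made only to keep the example short.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RoutingGraph:
    base_cost: Dict[str, float] = field(default_factory=dict)    # b_n per node
    delay: Dict[str, float] = field(default_factory=dict)        # d_n per node
    fanouts: Dict[str, List[str]] = field(default_factory=dict)  # switches

    def add_node(self, name, base_cost, delay=None):
        """Add a wire (vertex) with its base cost and delay."""
        self.base_cost[name] = base_cost
        self.delay[name] = base_cost if delay is None else delay
        self.fanouts.setdefault(name, [])

    def add_switch(self, src, dst):
        """Add a directed edge: a switch that drives dst from src."""
        self.fanouts[src].append(dst)


@dataclass
class Net:
    source: str        # s_i
    sinks: List[str]   # t_ij


g = RoutingGraph()
for wire, cost in [("lut_out", 1.0), ("long_wire", 4.0), ("lut_in", 1.0)]:
    g.add_node(wire, cost)
g.add_switch("lut_out", "long_wire")
g.add_switch("long_wire", "lut_in")
net = Net(source="lut_out", sinks=["lut_in"])
```

Devices do not appear in the structure at all; as in the text, a net is expressed purely in terms of the wires attached to device terminals.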
FIGURE 17.1 The circuit for a Triptych FPGA cell is represented in PathFinder by the graph at the right.

17.2.2 A Negotiated Congestion Router

We assume that the reader is familiar with Dijkstra’s shortest-path graph algorithm [15–17], which is at the core of many routing algorithms. Note that in our formulation costs are associated with nodes, not edges. This changes the basic shortest-path algorithm only slightly by redefining the cost of a path from node $n_i$ to node $n_j$ as the sum of the node costs along the path, including the starting and ending nodes. Routing algorithms differ primarily in the cost function applied to the routing resources and in how individual applications of the shortest-path algorithm are used to successfully route all the signals of a netlist onto the graph representing the architecture.

We ignore the issue of fanout in our initial presentation and assume that each signal is a simple route from source to a single sink. A naive routing algorithm proceeds by applying the shortest-path algorithm to each signal in order, with the cost of a node defined as

$c_n = b_n$    (17.1)
Resources already used by previous routes are not available to later routes. It is clear that the order in which signals are routed is crucial, as later routes have many fewer available routing resources. Some algorithms perform rip-up and retry when later routes cannot find a path. Selected early routes that are blocking are ripped up and rerouted later, in essence adaptively changing the order in which signals are routed. The very simple example in Figure 17.2 shows how this naive algorithm can fail. There are three signals, 1, 2, and 3, to be routed from the sources $S_1$, $S_2$, and $S_3$ to their respective sinks $D_1$, $D_2$, and $D_3$. The ovals represent partial paths through one or more nodes, annotated with the associated costs. Ignoring congestion, the minimum-cost path for each signal would use node B. If the naive obstacle avoidance routing scheme is used, the order in which the signals are routed becomes crucial: Routing in the order 1, 2, 3 fails, and the minimum-cost routing solution will be found only when starting with signal 2.

FIGURE 17.2 First-order congestion.
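The order dependence of the naive router can be reproduced with a short Python sketch. The graph below is a hypothetical stand-in for Figure 17.2, not a transcription of it: the connectivity, the base costs, and the function names are all assumptions. Costs sit on nodes, as in equation 17.1.

```python
import heapq

# Hypothetical graph: signal 2 can only reach its sink through node B.
FANOUTS = {"S1": ["A", "B"], "S2": ["B"], "S3": ["B", "C"],
           "A": ["D1"], "B": ["D1", "D2", "D3"], "C": ["D3"]}
BASE = {"S1": 1, "S2": 1, "S3": 1, "A": 3, "B": 1, "C": 3,
        "D1": 1, "D2": 1, "D3": 1}
NETS = {1: ("S1", "D1"), 2: ("S2", "D2"), 3: ("S3", "D3")}


def shortest_path(source, sink, blocked):
    """Dijkstra with node costs c_n = b_n; nodes in blocked are obstacles."""
    heap = [(BASE[source], source, [source])]
    done = set()
    while heap:
        total, node, path = heapq.heappop(heap)
        if node == sink:
            return path
        if node in done:
            continue
        done.add(node)
        for nxt in FANOUTS.get(node, []):
            if nxt not in done and nxt not in blocked:
                heapq.heappush(heap, (total + BASE[nxt], nxt, path + [nxt]))
    return None                          # no path avoids the obstacles


def naive_route(order):
    """Route signals in the given order; used nodes become obstacles."""
    used, routes = set(), {}
    for sig in order:
        path = shortest_path(*NETS[sig], blocked=used)
        routes[sig] = path
        if path is not None:
            used.update(path)
    return routes
```

With the order (1, 2, 3), signal 1 takes the cheap node B and signal 2 is left without a route; with the order (2, 1, 3), all three signals are routed.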
This problem can be solved by introducing negotiated congestion avoidance, first suggested by Nair [18], which extends the cost of using a given node $n$ in a route to

$c_n = b_n \cdot p_n$    (17.2)

where $b_n$ is the base cost of using $n$, and $p_n$ is a function of the number of other signals presently using $n$ ($p_n$ is often called the “present-sharing” term). Note that in the naive router, $p_n = 1$ if no other signals are using $n$, and infinity otherwise. In the negotiated congestion algorithm, $p_n$ is set initially to 1 and all signals are routed. This allows each signal to be routed as if no other signals were present. The cost of sharing is then increased, and all nets are ripped up and rerouted in turn. This iterative process continues, with the cost of sharing increasing at each iteration until all signals have been successfully routed. The idea is that the cost of a congested node will increase and that signals that have other alternatives will eventually find other paths, leaving the node to the signal that needs it most. $p_n$ is a function of the iteration $i$ and the number $k$ of signals sharing a node. The definition of $p_n$ is a key tuning parameter of PathFinder.

The negotiated congestion avoidance algorithm solves the problem of Figure 17.2. During the first iteration, $p_n$ is initialized to 1, and consequently no penalty is imposed for the use of $n$ regardless of how many signals occupy it. Thus, in the first iteration all three signals share B. When the sharing function $p_n$ increases sufficiently, signal 1 will find that a route through node A gives a lower cost than a route through the congested node B. During an even later iteration signal 3 will find that a route through node C gives a lower cost than that through B. This scheme of negotiation for routing resources depends on a relatively gradual increase in the cost of sharing nodes. If the increase is too abrupt, signals may be forced to take high-cost routes that lead to other congestion. Just as in the standard rip-up and retry scheme, the ordering becomes important.
While iterative negotiated congestion routing with the cost function of equation 17.2 can optimally route simple “first-order” routing problems like that in Figure 17.2, it fails on more complex “second-order” routing problems like that shown in Figure 17.3. Again we need to route three signals, one from each source to the corresponding sink. Let us first consider this example from the standpoint of obstacle avoidance with rip-up and retry. Assume that we start with the routing order (1, 2, 3). Signal 1 routes through node B, and signals 2 and 3 share node C. For rip-up and retry to succeed, both signals 1 and 2 would have to be rerouted, with signal 2 rerouted first. Because signal 1 does not use a congested node, determining that it needs to be rerouted is in general difficult.

This second-order congestion problem cannot be solved using $p_n$ alone. Signal 2 will never choose node B because the present sharing costs for nodes B and C are the same, with B used by signal 1 and C used by signal 3. Since the path through C is cheaper, it is always chosen. PathFinder solves this by extending the cost function with a “history” term, $h_n$:

$c_n = (b_n + h_n) \cdot p_n$    (17.3)
FIGURE 17.3 Second-order congestion.
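The iterate-until-no-sharing loop with the cost function of equation 17.3 can be sketched in Python for single-sink signals. Several choices are assumptions made for the illustration and are not PathFinder's tuned definitions: the present-sharing term is taken as 1 + pres_fac * k for k other users, pres_fac grows by pres_step per iteration, the history term grows by hist_step per excess signal, and the three-signal graph is a hypothetical first-order example.

```python
import heapq

FANOUTS = {"S1": ["A", "B"], "S2": ["B"], "S3": ["B", "C"],
           "A": ["D1"], "B": ["D1", "D2", "D3"], "C": ["D3"]}
BASE = {"S1": 1, "S2": 1, "S3": 1, "A": 3, "B": 1, "C": 3,
        "D1": 1, "D2": 1, "D3": 1}
NETS = {"sig1": ("S1", "D1"), "sig2": ("S2", "D2"), "sig3": ("S3", "D3")}


def shortest_path(fanouts, cost, source, sink):
    """Dijkstra where cost(n) is the price of using node n."""
    heap = [(cost(source), source, [source])]
    done = set()
    while heap:
        total, node, path = heapq.heappop(heap)
        if node == sink:
            return path
        if node in done:
            continue
        done.add(node)
        for nxt in fanouts.get(node, []):
            if nxt not in done:
                heapq.heappush(heap, (total + cost(nxt), nxt, path + [nxt]))
    raise ValueError("sink unreachable")


def negotiated_route(fanouts, base, nets, max_iters=50,
                     pres_step=0.5, hist_step=0.2):
    history = {n: 0.0 for n in base}     # h_n
    users = {n: set() for n in base}     # signals currently using each node
    routes, pres_fac = {}, 0.0           # pres_fac = 0 gives p_n = 1 at first
    for _ in range(max_iters):
        for sig, (src, dst) in nets.items():
            for n in routes.pop(sig, []):            # rip up the old route
                users[n].discard(sig)

            def cost(n):                             # c_n = (b_n + h_n) * p_n
                return (base[n] + history[n]) * (1.0 + pres_fac * len(users[n]))

            routes[sig] = shortest_path(fanouts, cost, src, dst)
            for n in routes[sig]:
                users[n].add(sig)
        shared = [n for n in base if len(users[n]) > 1]
        if not shared:
            return routes
        for n in shared:                             # remember congestion
            history[n] += hist_step * (len(users[n]) - 1)
        pres_fac += pres_step                        # sharing gets costlier
    raise RuntimeError("no congestion-free routing found")


routes = negotiated_route(FANOUTS, BASE, NETS)
```

On this graph all three signals initially share node B; as sharing becomes more expensive, signal 1 moves to A and signal 3 moves to C, leaving B to signal 2, which has no alternative.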
Unlike pn , hn “remembers” the congestion that has occurred on node n during previous routing iterations. That is, the history term is updated after each routing iteration; any node shared by multiple signals has its history term increased by some amount. The effect of hn is to permanently increase the cost of using congested nodes so that routes through other nodes are attempted. Without this term, as soon as signals stop sharing a node, its cost drops to the base cost and it again becomes attractive. This leads to oscillations where signals switch back and forth between nodes but never resolve the congestion problem. The addition of the history term is a key difference between PathFinder and Nair’s routing algorithm [18]. The term hn allows the problem in Figure 17.3 to be routed successfully. On each iteration that node C is shared, hn is increased slightly. When signal 2 switches to using node B, the cost of node C remains elevated. Now the history cost of node B rises because it is shared by signals 1 and 2. Eventually signal 1 will route through node A. Note that, depending on the base costs and how pn and hn are defined, signal 2 may switch back and forth between nodes B and C several times before the history costs of both are sufficiently high to force signal 1 onto node A. The history term hn is updated whenever a node n has shared signals. The size of δh , the amount by which hn is increased, and how this depends on k, the number of sharing signals, are tunable parameters. If δh is too small, many iterations may be required to resolve the congestion; if it is too large, some solutions may not be found. Additionally, the relationship between pn and hn is very important. For example, it can be important to give the history term a chance to solve congestion before forcing the issue with pn . The details of the Negotiated Congestion algorithm are given in Figure 17.4. The while loop at line 2 executes the routing iterations until a solution has been
17.2 The PathFinder Algorithm
 1  iteration ← 0
 2  While shared resources exist
 3    iteration ← iteration + 1
 4    Loop over all signals i (signal router)
 5      Rip up routing tree RTi
 6      RTi ← si
 7      Loop until all sinks tij have been found
 8        Initialize priority queue PQ to RTi at cost 0
 9        Loop until new tij is found
10          Remove lowest cost node m from PQ
11          Loop over fanouts n of node m
12            Add n to PQ at cost Pim + cn
13          end loop
14        end loop
15        Loop over nodes n in path tij to si (backtrace)
16          Update cn
17          Add n to RTi
18        end loop
19      end loop
20    end loop
21    Loop over all nodes ni shared by multiple signals
22      hi ← hi + δ(k)
23    end loop
24  end while
FIGURE 17.4 Negotiated Congestion algorithm.
found. The signal router loop at line 4 iterates over all signals in the netlist, ripping up and rerouting the nets one at a time. The routing tree RTi is the set of nodes used to route signal i. To reroute a signal, the routing tree is reset to be just the signal’s source. The priority queue is used to implement the breadth-first search of Dijkstra’s algorithm. At each iteration of the loop of line 9, the lowest-cost node is taken from the priority queue. It is generally best to order the nodes with the same cost according to when they were inserted into the queue, with the newest nodes being extracted first. The cost used when inserting a new node in the priority queue at line 12 is
$$P_{im} + c_n \tag{17.4}$$
where Pim is the cost of the current partial path from the source, and cn is the cost of using node n.
A signal is routed one sink at a time using Dijkstra’s breadth-first algorithm. When the search finds a sink, the nodes on the path from the source to it are added to RTi. This is done by back-tracing the search path to the source. The search is then restarted with the priority queue being initialized with all the nodes already in RTi. In this way, all the nodes on routes to previously found sinks are used as potential sources for routes to subsequent sinks. This algorithm for constructing the routing tree is similar to Prim’s algorithm for determining a minimum spanning tree over an undirected graph, and it is identical to one
Chapter 17
PathFinder: A Negotiation-based, Performance-driven Router
suggested by Takahashi and Matsuyama [19] for constructing a tree embedded in an undirected graph. The quality of the points chosen by the algorithm is an open question for directed graphs; however, finding optimum (or even near-optimum) points is not essential for the router to be successful in adjusting costs to eliminate congestion.
The VPR router [12] reduces the cost of reinitializing the priority queue for each fanout by observing that for large-fanout nets, most of the paths found in searching for the previous fanout remain valid, especially if the segment added to the routing tree is relatively small. Thus, the search continues from the previous state after the new segment has been added to the routing tree. Because of the way Dijkstra’s algorithm ignores nodes after they have been visited once, this optimization must be implemented carefully to avoid expensive routing trees for high-fanout nets.
Other algorithms for forming the fanout tree are possible. For example, there are times when routing to the most distant sink first results in a better routing tree.
At the end of each iteration, the history cost of each node shared by multiple signals is updated. The δ added to the history cost is generally a function of k, the number of signals sharing the node.
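The loop structure of Figure 17.4 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `pathfinder`, the parameters `h_step` and `p_step`, the unit node capacity, and the cost form c_n = (b_n + h_n) * p_n with a present-sharing factor that grows each iteration are all assumptions made here. The small graph is hypothetical, loosely in the spirit of the second-order congestion example, and every sink is assumed to be reachable.

```python
import heapq
from itertools import count

def pathfinder(edges, base, signals, h_step=0.5, p_step=0.5, max_iters=50):
    """Route every signal; all nodes have capacity 1. Returns (trees, iterations)."""
    history = {n: 0.0 for n in base}      # h_n, only ever increases
    trees = {}                            # signal -> set of nodes in RT_i

    def users(n):
        return sum(1 for tree in trees.values() if n in tree)

    for iteration in range(1, max_iters + 1):
        p_fac = p_step * iteration        # present-sharing penalty grows each pass
        for sig, (src, sinks) in signals.items():
            trees.pop(sig, None)          # rip up RT_i
            tree = {src}                  # RT_i <- s_i
            for sink in sinks:
                tie = count()
                best = {n: 0.0 for n in tree}
                prev = {}
                pq = [(0.0, next(tie), n) for n in sorted(tree)]
                while pq:
                    cost, _, m = heapq.heappop(pq)
                    if m == sink:
                        break
                    if cost > best[m]:
                        continue          # stale queue entry
                    for n in edges.get(m, ()):
                        # assumed cost form: c_n = (b_n + h_n) * p_n
                        c_n = (base[n] + history[n]) * (1.0 + p_fac * users(n))
                        if cost + c_n < best.get(n, float("inf")):
                            best[n] = cost + c_n
                            prev[n] = m
                            heapq.heappush(pq, (cost + c_n, next(tie), n))
                n = sink                  # backtrace from the sink into the tree
                while n not in tree:
                    tree.add(n)
                    n = prev[n]
            trees[sig] = tree
        shared = [n for n in base if users(n) > 1]
        if not shared:
            return trees, iteration
        for n in shared:                  # h_n <- h_n + delta(k)
            history[n] += h_step * (users(n) - 1)
    raise RuntimeError("congestion not resolved within max_iters")

# Hypothetical example: signal 3 can only use C, signal 2 can use B or C,
# and signal 1 can use A or B, with A the most expensive node.
edges = {"s1": ["A", "B"], "s2": ["B", "C"], "s3": ["C"],
         "A": ["t1"], "B": ["t1", "t2"], "C": ["t2", "t3"]}
base = {"s1": 1, "s2": 1, "s3": 1, "t1": 1, "t2": 1, "t3": 1,
        "A": 3, "B": 2, "C": 1}
signals = {"sig1": ("s1", ["t1"]), "sig2": ("s2", ["t2"]), "sig3": ("s3", ["t3"])}
trees, iterations = pathfinder(edges, base, signals)
```

In this example the only congestion-free solution places signal 1 on A, signal 2 on B, and signal 3 on C, so the history costs on C and then B have to build up before signal 1 gives up its cheaper route.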
17.2.3 The Negotiated Congestion/Delay Router
To introduce delay into the Negotiated Congestion algorithm, we redefine the cost of using node n when routing a signal from si to tij as
$$C_n = A_{ij}\, d_n + (1 - A_{ij})\, c_n \tag{17.5}$$
where cn is defined in equation 17.3 and Aij is the slack ratio:
$$A_{ij} = \frac{D_{ij}}{D_{max}} \tag{17.6}$$
where Dij is the delay of the longest delay (register–register) path containing the signal segment (si, tij), and Dmax is the maximum delay over all paths (i.e., the critical-path delay). Thus, 0 < Aij ≤ 1. (This standard definition of slack ratio is easily extended to include circuit inputs and outputs with timing constraints as well as circuits with multiple clocks.) Because path delay is made up of both device and wire delay, and the router can only control the wire delay, a more accurate formulation for Aij is
$$A_{ij} = \frac{D_{ij} - D_{dev_{ij}}}{D_{max} - D_{dev_{ij}}} \tag{17.7}$$
where Ddevij is the path delay from node i to node j attributable to devices, and Dij − Ddevij is thus the wire delay on the path from node i to node j. With equation 17.7, paths with the same path delay but greater wire delay pay more attention to delay and less to congestion.
The first term of equation 17.5 is the delay-sensitive term; the second term is congestion sensitive. Equations 17.5, 17.6, and 17.7 are the keys to providing the appropriate mix of minimum-cost and minimum-delay trees. If a particular source/sink pair lies on the critical-path, then Aij = 1 and the cost of node n
is just the delay term; hence a minimum-delay route is used and congestion is ignored. In practice, Aij is limited to a maximum value such as 0.9 or 0.95 so that congestion is not completely ignored. If a source/sink pair belongs to a path whose delay is much smaller than the critical-path, then Aij is small and the congestion term dominates, resulting in a route that avoids congestion at the expense of extra delay.
To accommodate delay, the basic Negotiated Congestion algorithm of Figure 17.4 is changed as follows. For the first iteration, all Aij are initialized to 1 and minimum-delay routes are found for every signal. This yields the smallest possible critical-path delay. All Aij are recomputed after every routing iteration using the critical-path delay and the delays incurred by signals on that iteration.
The sinks of each signal are now routed in decreasing Aij order. This allows the most timing-constrained sinks to determine the coarse structure of the routing tree with no interference from less constrained paths. The priority queue (line 8 in Figure 17.4) is initialized by inserting each node of RTi with the cost $A_{ij} \sum_k d_k$, where the nk are nodes on the path from the source ni to node nj. This initializes the nodes already in the partial routing tree with the weighted path delay from the source.
The router completes when no more shared resources exist. Note that by recalculating all Aij, we have kept a tight rein on the critical-path. Over the course of the routing iterations, the critical-path increases only to the extent required to resolve congestion. This approach is fundamentally different from other schemes [4, 5] that attempt to resolve congestion first and then reduce delay by rerouting critical nets.
The PathFinder algorithm is particularly powerful for asymmetric architectures that have a range of slow and fast wires. By making the slower wires lower cost, the negotiation algorithm automatically assigns critical signals to the fast wires as needed and noncritical signals to the slow wires.
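As a small worked illustration of equations 17.5 through 17.7, the sketch below computes a capped slack ratio, blends delay and congestion into a node cost, and orders sinks by criticality. The function names, the `cap` default of 0.95, and the numeric values are assumptions chosen for the example; the sketch also assumes the critical-path delay is strictly greater than the device delay.

```python
def slack_ratio(path_delay, critical_delay, device_delay=0.0, cap=0.95):
    """A_ij per equation 17.7 (equation 17.6 when device_delay is 0), limited to cap."""
    a_ij = (path_delay - device_delay) / (critical_delay - device_delay)
    return min(a_ij, cap)

def delay_aware_cost(a_ij, d_n, c_n):
    """Equation 17.5: weighted mix of node delay d_n and congestion cost c_n."""
    return a_ij * d_n + (1.0 - a_ij) * c_n

def order_sinks(sink_slack):
    """Sinks in decreasing A_ij order, most timing-critical first."""
    return sorted(sink_slack, key=sink_slack.get, reverse=True)

# A fast but congested node: delay 2, congestion cost 8.
critical = delay_aware_cost(slack_ratio(10.0, 10.0), d_n=2.0, c_n=8.0)
relaxed = delay_aware_cost(slack_ratio(2.0, 10.0), d_n=2.0, c_n=8.0)
sink_order = order_sinks({"t1": 0.2, "t2": 0.95, "t3": 0.6})
```

For the critical connection the node costs about 2.3, so its low delay dominates; for the relaxed connection the same node costs about 6.8, so the router is pushed toward less congested alternatives.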
17.2.4 Applying A* to PathFinder
Dijkstra’s shortest-path algorithm performs an expensive breadth-first search of the graph. This search has an $O(n^2)$ running time for two-dimensional circuit structures, where n is the length of the path. The A∗ heuristic [20] is a technique that uses additional information about the cost of paths in the graph to bound the size of the search. The cost of a partial path becomes the cost of the partial path plus the estimated cost from the end of the partial path to the destination. If this estimated cost is a lower bound on the actual cost, then the search will provide an optimal solution. If the estimated cost is accurate, then the search becomes a depth-first search with $O(n)$ running time.
In applying A∗ to PathFinder, both the cost and the delay of paths in the graph must be estimated. We modify equation 17.4 as follows:
$$C_n = P_{im} + A_{ij}\,(d_n + D_{est_{nj}}) + (1 - A_{ij})\,(c_n + C_{est_{nj}}) \tag{17.8}$$
where Destnj and Cestnj are the estimated delay and cost, respectively, of the minimum-delay route from n to sink j.
To use the A∗ heuristic, the router must know the destination in order to determine the estimated cost. Instead of letting the breadth-first router find the closest destination when there are multiple fanouts, the path length estimates are used to sort the fanouts from closest to furthest and the routing is performed in this order.
In many FPGAs, such as those that are standard island style, the cost and delay of routes can be estimated based on the locations of the source and destination using the geometry of the layout. A more general and accurate method is to use the shortest-path algorithm to create a complete “distance table” that contains the cost estimate of the minimum-delay route from every node to all potential sinks. This is only feasible, however, for relatively small architectures or for coarse-grained architectures that have many fewer nodes than fine-grained FPGAs. To reduce the table size, clustering can be used and estimates stored for the cost/delay between clusters [21]. If the cost/delay between two clusters is taken as the minimum cost/delay between any two nodes in the two clusters, it represents a true lower bound. Clustering has been reported to reduce the size of the distance table by a factor of 100 while slowing the search only by a factor of 2 [21].
In the early iterations of PathFinder, when sharing is ignored, the full advantage of A∗ is obtained. That is, if the cost/delay estimates are accurate, a depth-first search is achieved. As the cost of sharing rises, however, the cost estimates, which do not include the sharing costs, become less and less accurate and the search becomes less efficient.
In experiments with PathFinder and A∗, Swartz et al. [22] used a multiplicative direction factor α to inflate the path estimate. In effect, α determines how aggressively the router drives toward the target sink. An α of 1.0 corresponds to true A∗ and is guaranteed to find the shortest source/sink connection. Swartz et al. determined that an α of 1.5 gave the best results for large circuits, with no measurable degradation in the quality of the resulting routing. However, note that the cost function had only a congestion term and no delay term. Tessier also experimented with accelerating routing with even more aggressive use of the A∗ search [23, 24].
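The effect of the direction factor can be seen in a small congestion-only search, sketched below under several assumptions: the routing fabric is a plain 4-connected grid, every node costs at least 1 so that Manhattan distance is a lower bound, and the function name `astar_route` and its parameters are hypothetical. Ties are broken in favor of the newest queue entry, following the ordering advice given for the priority queue.

```python
import heapq
from itertools import count

def astar_route(width, height, src, dst, node_cost=None, alpha=1.0):
    """Return (path_cost, nodes_expanded) from src to dst on a 4-connected grid."""
    node_cost = node_cost or {}           # per-node cost overrides, default 1.0

    def estimate(n):                      # Manhattan distance to the sink
        return abs(n[0] - dst[0]) + abs(n[1] - dst[1])

    order = count()
    best = {src: 0.0}
    # Entries are (priority, tie, cost_so_far, node); the negated counter makes
    # the newest of several equal-priority nodes come out first.
    pq = [(alpha * estimate(src), -next(order), 0.0, src)]
    expanded = 0
    while pq:
        _, _, cost, m = heapq.heappop(pq)
        if cost > best[m]:
            continue                      # stale entry
        expanded += 1
        if m == dst:
            return cost, expanded
        x, y = m
        for n in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if not (0 <= n[0] < width and 0 <= n[1] < height):
                continue
            new_cost = cost + node_cost.get(n, 1.0)
            if new_cost < best.get(n, float("inf")):
                best[n] = new_cost
                priority = new_cost + alpha * estimate(n)
                heapq.heappush(pq, (priority, -next(order), new_cost, n))
    return None                           # sink unreachable

cost_exact, expanded_exact = astar_route(6, 6, (0, 0), (4, 3), alpha=1.0)
cost_fast, expanded_fast = astar_route(6, 6, (0, 0), (4, 3), alpha=1.5)
```

With α = 1.0 the returned cost is the true shortest-path cost. With α > 1 the estimate is no longer a lower bound, so optimality is not guaranteed in general, which is the trade-off the direction factor makes in exchange for a more directed search.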
17.3 ENHANCEMENTS AND EXTENSIONS TO PATHFINDER
Many research papers have discussed extensions and optimizations of the PathFinder algorithm. First and foremost is the work by Betz and Rose on VPR [12], which for the past eight years has been a widely used vehicle for academic and industrial research into FPGA architectures and CAD. We discuss here some of the more salient ideas that have been applied to PathFinder.
17.3.1 Incremental Rerouting
A common optimization suggested in the original PathFinder paper [8] is to limit the rip-up and rerouting of signals in an iteration only to those that use shared resources. Intuitively, this reduces the amount of “wasted” effort that
goes into rerouting signals that always take the same path. The argument is that if a signal does not use a shared resource, it will take the same path as it did before, because history costs can only rise and thus no other path can become cheaper. This argument fails where pn becomes smaller as sharing signals reroute around a congested node. Experience shows that this optimization increases the number of routing iterations, but reduces the total running time substantially, with negligible impact on the quality of the solution found.
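The selection step of incremental rerouting is simple to express. The sketch below assumes routing trees are stored as sets of node names keyed by signal and that every node has the same capacity; the function name `signals_to_reroute` and the example data are hypothetical.

```python
from collections import Counter

def signals_to_reroute(trees, capacity=1):
    """Return the signals whose routing tree touches an overused node."""
    usage = Counter(n for tree in trees.values() for n in tree)
    overused = {n for n, k in usage.items() if k > capacity}
    return [sig for sig, tree in trees.items() if tree & overused]

trees = {
    "sig1": {"s1", "B", "t1"},
    "sig2": {"s2", "C", "t2"},
    "sig3": {"s3", "C", "t3"},
}
to_rip_up = signals_to_reroute(trees)   # sig1 is left in place
```

Only the signals returned here would be ripped up in the next iteration. As noted above, a signal left in place could in principle have found a cheaper path once pn drops on a node it had avoided, which is why this is an optimization rather than an exact equivalent.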
17.3.2 The Cost Function
There are many ways to tune PathFinder for specific architectures or to achieve specific goals. Many variations of the cost function have been described that change how the three cost terms bn, pn, and hn are computed and combined. The essential feature of the cost function is that hn is a function of the history of the congestion of the node and that pn is a function of the current congestion. The rates at which hn and pn increase can be tuned; increasing them quickly, for example, decreases the number of iterations required but also decreases the quality of the solution. The history term may include a decay function on the assumption that the more recent history is more valid than the distant past. This is particularly important when PathFinder is used in an integrated place-and-route tool [21, 25].
The PathFinder cost function can also be modified to include both short-path and long-path delay terms [26]. For long paths, delay is minimized by using the PathFinder cost function. For short paths, however, the cost function is changed to find a path with a target delay, not the minimum delay. This changes the underlying shortest-path problem considerably and requires an accurate “lookahead” function that predicts the remaining delay to the destination so that the router can opportunistically add the appropriate extra delay.
17.3.3 Resource Cost
Determining the base cost of routing resources is harder than it appears. The shortest-path algorithm attempts to minimize the total cost of a solution, so minimizing the cost should also minimize congestion. The typical cost function used by routers is the length of the wire, which is a good heuristic for typical architectures where the number of available wires is inversely proportional to their individual lengths.
A better heuristic is to base the cost of a wire on the expected routing demand for it. This can be approximated by routing a set of placed benchmarks onto an architecture and measuring wire by wire the routing demand. Another method is to perform a large number of random routes using a typical Rent’s wirelength distribution through the architecture and again measuring the overall use of each wire. In this formulation, wire costs are initialized to 1, raised à la PathFinder according to wire usage, and converge to some constant value.
Delay is an approximation that is often used for cost as it is typically closely related to wirelength and relative demand. It also simplifies the cost function for the integrated congestion and delay router.
17.3.4 The Relationship of PathFinder to Lagrangian Relaxation
The PathFinder algorithm is very similar to Lagrangian relaxation for finding an optimal routing subject to congestion and delay constraints [27–29]. In Lagrangian relaxation, the constraints are relaxed by multiplying them by a vector of Lagrangian multipliers and adding them to the objective function to be minimized. The solution to a Lagrangian formulation with a specific set of Lagrangian multipliers provides an approximate solution to the original minimization problem. An iterative procedure that modifies the Lagrangian multipliers is used to find increasingly better solutions. A subgradient method is used to update the multipliers. Intuitively, the multipliers are increased or decreased depending on the extent to which the corresponding constraint is satisfied. A Lagrangian relaxation method proceeds somewhat differently from the PathFinder algorithm. The multipliers operate much like PathFinder’s history term, but there is no corresponding present-sharing term pn . While the history term is monotonically nondecreasing, the Lagrangian multipliers can both increase and decrease depending on how well the corresponding constraint is satisfied. The amount by which the multipliers are adjusted in Lagrangian relaxation is also decreased with each iteration.
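The contrast between the two update rules can be made concrete with a toy comparison on a single node. Both functions below are illustrative assumptions, not formulas taken from the cited work: `history_update` uses a fixed increment per excess signal, and `multiplier_update` uses a projected subgradient step with a step size of `step0 / iteration`, which is one common diminishing schedule.

```python
def history_update(h, users, capacity=1, delta=0.5):
    """PathFinder-style history: rises while the node is overused, never falls."""
    return h + delta * max(0, users - capacity)

def multiplier_update(lam, users, iteration, capacity=1, step0=0.5):
    """Subgradient step on the capacity constraint with a diminishing step size.
    The multiplier can fall when the constraint has slack, but not below zero."""
    step = step0 / iteration
    return max(0.0, lam + step * (users - capacity))

usage_per_iteration = [3, 2, 1, 0, 0]     # hypothetical occupancy of one node
h, lam = 0.0, 0.0
h_trace, lam_trace = [], []
for iteration, users in enumerate(usage_per_iteration, start=1):
    h = history_update(h, users)
    lam = multiplier_update(lam, users, iteration)
    h_trace.append(h)
    lam_trace.append(lam)
```

In this trace the history value levels off once the node stops being shared and stays there, while the multiplier starts to fall as soon as the node is underused.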
17.3.5 Circuit Graph Extensions
The simple circuit graph model is very general, but there are some specific circuit structures that require extensions. This section describes some solutions for these.
Symmetric device inputs
Lookup tables (LUTs) are the prime example of FPGA devices whose pins are “permutable.” That is, the inputs to a LUT can be swapped arbitrarily by permuting the table’s contents. Other devices like adders also have symmetric inputs. In the simple graph model, a signal is routed to a specific input terminal and there is no way to specify a route to one of a set of terminals. Symmetric inputs are easily accommodated in the graph model by adding “pseudo-multiplexers” on the inputs of the LUT. These are shown as dashed nodes at the top of Figure 17.5. Signal sinks can be arbitrarily assigned to the LUT inputs and routed in the usual way. After the routing solution has been found, the pseudo-multiplexers are removed and implemented “virtually” by permuting the LUT table contents appropriately. In the example of Figure 17.5, the signals a, b, and c are routed to the LUT inputs A, B, and C, respectively, using the pseudo-multiplexers as shown with bold lines. This routing is then used to permute the LUT inputs as shown on the right by modifying the LUT contents.
De-multiplexers
A de-multiplexer is a device that can connect its input to at most one of several outputs. Each output connection is represented as an edge in the circuit graph shown in Figure 17.6. Wire fanout, of course, is not constrained, and there is no way in the graph model to specify a constraint on the number of fanouts that can be used. This case is handled by a special counter that counts the number of the edges that are used. If more than one edge is being used, the
FIGURE 17.5 Symmetric device inputs are handled by inserting pseudo-multiplexers.
FIGURE 17.6 De-multiplexers are handled by negotiating for the fanouts of the de-multiplexer.
de-multiplexer is being shared in much the same way that wires can be shared by signals. A PathFinder cost function can be applied with both a sharing and a history component so that the single fanout used is determined by means of negotiation.
Bidirectional switches
Edges in the graph model, which represent connections, are directional. This models multiplexer-based architectures directly. Transistors that are often used to construct configurable interconnects are bidirectional. These bidirectional switches simply translate to two directional edges in the graph. The router uses at most one of the edges, which induces a logical direction on the switch. That is, when a switch is turned on in a configuration, it is being driven by an output from one side to the other.
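Two of these extensions reduce to small bookkeeping routines on the directed graph, sketched here under assumed data structures: the graph is a dictionary from node to a set of successors, and a routing solution is a set of used (source, destination) edges. The function names are hypothetical.

```python
def add_bidirectional_switch(edges, a, b):
    """Model a pass-transistor switch between a and b as two directed edges."""
    edges.setdefault(a, set()).add(b)
    edges.setdefault(b, set()).add(a)

def switch_conflict(used_edges, a, b):
    """True if a routing drives the switch in both directions at once."""
    return (a, b) in used_edges and (b, a) in used_edges

def demux_overuse(demux, used_edges):
    """Number of de-multiplexer fanout edges in use beyond the one allowed."""
    fanouts = {dst for src, dst in used_edges if src == demux}
    return max(0, len(fanouts) - 1)

edges = {}
add_bidirectional_switch(edges, "w1", "w2")
used = {("w1", "w2"), ("d", "o1"), ("d", "o2")}
overuse = demux_overuse("d", used)        # two fanouts used, one too many
```

A nonzero overuse count could then feed the same present-sharing and history terms that are applied to shared wires, so that the single fanout is settled by negotiation.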
17.4 PARALLEL PATHFINDER
A typical large FPGA design has many thousands of signals. If separate signals could be routed in parallel, the degree of parallelism would be limited only by
the number of signals to be routed and the number of processors available. The difficulty, of course, is that the route taken by each signal depends on the knowledge of other signal routes, as routing resources cannot be shared. Although parallel implementations of global standard cell routers exist, the problem for FPGAs becomes much harder because the routing resources are discrete and fixed. Because the routing of separate signals in an FPGA is tightly coupled, it might appear that a parallel approach to routing FPGAs would not be possible given that knowledge of other signal locations is necessary to find a feasible route. This is the case in a typical maze router, which uses rip-up and reroute to resolve conflicts. In PathFinder, however, there is no restriction on the number of signals that can occupy a resource simultaneously during routing. Instead, the cost of using congested resources is the mechanism used to resolve resource conflicts. If the congestion costs are decentralized in a parallel environment, the concerns are how and when they will be updated and whether the update method will be acceptable in terms of the number of processors effectively utilized and the quality of the resulting routing. In Chan et al. [30] a distributed memory multiprocessor implementation of the PathFinder algorithm is described. Each processor has a private local memory and is connected in a network. Processors communicate with each other by sending and receiving messages via Unix socket communication. A complete copy of the routing resource graph, including first- and second-order congestion costs, is kept and maintained by each processor. The signals in a netlist to be routed are statically assigned to processors such that each processor has about the same number of sinks to be routed. No attempt is made to assign signals to processors based on locality. Processors route signals asynchronously and thus communicate updated congestion costs asynchronously. 
There is no guarantee of the order or the timing of the arrival of such congestion cost updates, resulting in a source of indeterminism. Processors are allowed to proceed to successive iterations without waiting for others, although a limit of a few iterations of separation is generally employed. It is conceded that, because of latency, this parallel routing algorithm may not converge. Imagine a scenario in which two signals being routed by two different processors vie for the same resource. Message latency or merely concurrency may cause the two signals to oscillate between routing iterations, because each processor knows where the other processor’s signal was in the last iteration but not in the current one. Such cases generally occur during the last iterations of a route. At that point, Chan and colleagues [30] reduce the multiprocessor implementation to a single-processor implementation in order to resolve the congestion. This parallel implementation was tested on a set of benchmarks ranging from 118 to 1542 signal nets on the Xilinx 4000 architecture. Speedups ranged from 1.6 to 2.2 times for two processors and 2.3 to 3.8 times for four processors. For nearly all benchmarks, no additional speedups are obtained for more than four processors. The performance of the benchmarks (in terms of delay or clock rate) was shown to vary minimally with increasing numbers of processors.
This initial implementation of a parallel form of PathFinder is significant in that it demonstrates appreciable speedups while employing a rather simple computational framework. Because of the inherent approximations of congestion cost and its gradual increase, PathFinder exhibits good qualities for parallelism in a framework where congestion costs are communicated asynchronously, as they become available. It may result (as shown by Chan et al. [30]) in an increased number of iterations to converge, but it is able to employ multiple loosely connected processors to good advantage.
17.5 OTHER APPLICATIONS OF THE PATHFINDER ALGORITHM
PathFinder has been used to incrementally reroute signals around faults in cluster-based FPGAs [31]. This rerouting uses the accumulated history costs acquired by the initial routing to quickly find a new routing solution when nodes and edges in the circuit graph have been removed because of faults.
QuickRoute [32] extends PathFinder to handle pipelined routing structures. The key idea in QuickRoute is to change Dijkstra’s shortest-path algorithm to allow nodes to be visited more than once, by paths with different latencies. This causes many more overlapping paths to be explored, but the negotiated congestion avoidance of PathFinder still performs well.
Several groups have applied PathFinder to the problem of scheduling the communication in computing graphs to coarse-grained architectures or multiprocessors [33–35]. In this application of PathFinder, the routing becomes a space–time problem.
17.6 SUMMARY
The widespread use of PathFinder by commercial FPGA routers and university research efforts alike is a testimonial to its robustness. Several key facets of the algorithm make it attractive. However, its primary advantage is the iterative nature of resolving congestion, using both current and historical resource use in the formulation of the cost function. By very gradually increasing cost due to both usages, the routing search space is thoroughly explored. Routing with other objective functions, delay in particular, is easily integrated into the cost function. A primary feature implicit in PathFinder (that distinguishes it from previous efforts) is the allowance of physically infeasible intermediate states (for example, shared resources) while converging to a physically feasible final state. Finally, by being grounded in a directed graph representation, PathFinder is very adaptable to changing FPGA architectures as well as other problems that can be abstracted to a directed graph.
In the future we see the routing problem as being an increasingly dominant hurdle in the use of FPGAs with millions of resources. To reduce the runtime, more investigation will be required to effectively parallelize PathFinder, making
use of additional computational resources. Given the growing focus on other objectives such as power consumption, it is likely that we will see experimentation with other cost function formulations as well.
Acknowledgments
We wish to thank Gaetano Borriello for initial discussions about routing when PathFinder was being applied to the Triptych architecture, and Steven Yee for his help in constructing detailed descriptions of the Xilinx architectures. We also thank Pak Chan and Martine Schlag for sharing the results on parallel PathFinder.
References
[1] W. A. Dees, R. J. Smith. Performance of interconnection rip-up and reroute strategies. Design Automation Conference, 1981.
[2] R. Linsker. An iterative-improvement penalty-function-driven wire routing system. IBM J. Res. Development 28(5), 1984.
[3] J. Cohn, D. Garrod, R. Rutenbar, L. Carley. Koan/anagram II: New tools for device-level analog placement and routing. IEEE Journal of Solid-State Circuits 26(3), 1991.
[4] S. Brown, J. Rose, Z. Vranesic. A detailed router for field-programmable gate arrays. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 11(5), 1992.
[5] J. Frankle. Iterative and adaptive slack allocation for performance-driven layout and FPGA routing. Design Automation Conference, 1992.
[6] M. J. Alexander, J. P. Cohoon, J. L. Ganley, G. Robins. An architecture-independent approach to FPGA routing based on multi-weighted graphs. Proceedings of the Conference on European Design Automation, 1994.
[7] M. Palczewski. Plane parallel A∗ maze router and its application to FPGAs. Design Automation Conference, 1992.
[8] L. McMurchie, C. Ebeling. A negotiation-based performance-driven router for FPGAs. Proceedings of the 1995 ACM Third International Symposium on Field-Programmable Gate Arrays, 1995.
[9] G. Borriello, C. Ebeling, S. Hauck, S. Burns. The triptych FPGA architecture. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 3(4), 1995.
[10] C. Ebeling, L. McMurchie, S. Hauck, S. Burns. Placement and routing tools for the triptych FPGA. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 3(4), 1995.
[11] D. C. Cronquist, L. McMurchie. Emerald: An architecture-driven tool compiler for FPGAs. Proceedings of the Fourth ACM International Symposium on Field-Programmable Gate Arrays, 1996.
[12] V. Betz, J. Rose. VPR: A new packing, placement and routing tool for FPGA research. Proceedings of the Seventh International Workshop on Field-Programmable Logic and Applications. Springer-Verlag, 1997.
[13] V. Betz, J. Rose, A. Marquardt. Architecture and CAD for deep-submicron FPGAs. Kluwer Academic, 1999.
[14] V. Betz. The FPGA place-and-route challenge (www.eecg.toronto.edu/vaughn/challenge/challenge.html).
[15] E. W. Dijkstra. A note on two problems in connexion with graphs. Numerische Mathematik 1(1), December 1959.
[16] E. Moore. The shortest path through a maze. International Symposium on the Theory of Switching, April 1959.
[17] C. Y. Lee. An algorithm for path connections and its applications. IRE Transactions on Electronic Computers 10, September 1961.
[18] R. Nair. A simple yet effective technique for global wiring. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 6(2), 1987.
[19] H. Takahashi, A. Matsuyama. An approximate solution for the Steiner problem in graphs. Math. Japonica 24(6), 1980.
[20] P. Hart, N. Nilsson, B. Raphael. A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions on Systems Science and Cybernetics, 1968.
[21] A. Sharma. Place and Route Techniques for FPGA Architecture Advancement, Ph.D. thesis, University of Washington, 2005.
[22] J. S. Swartz, V. Betz, J. Rose. A fast routability-driven router for FPGAs. Proceedings of the ACM/SIGDA Sixth International Symposium on Field-Programmable Gate Arrays, 1998.
[23] R. G. Tessier. Negotiated A∗ routing for FPGAs. Fifth Canadian Workshop on Field-Programmable Logic, 1998.
[24] R. G. Tessier. Fast Place and Route Approaches for FPGAs, Ph.D. thesis, MIT, 1999.
[25] A. Sharma, S. Hauck, C. Ebeling. Architecture-adaptive routability-driven placement for FPGAs. International Conference on Field-Programmable Logic and Applications, 2005.
[26] R. Fung, V. Betz, W. Chow. Simultaneous short-path and long-path timing optimization for FPGAs. IEEE/ACM International Conference on Computer Aided Design, 2004.
[27] S. Lee, Y. Cheon, M. D. F. Wong. A min-cost flow based detailed router for FPGAs. International Conference on Computer-Aided Design, 2003.
[28] S. Lee, M. Wong. Timing-driven routing for FPGAs based on Lagrangian relaxation. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 22(4), 2003.
[29] M. M. Ozdal, M. D. F. Wong. Simultaneous escape routing and layer assignment for dense PCBs. Proceedings of the 2004 IEEE/ACM International Conference on Computer-Aided Design, 2004.
[30] P. K. Chan, M. D. F. Schlag, C. Ebeling, L. McMurchie. Distributed-memory parallel routing for field-programmable gate arrays. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 19(8), August 2000.
[31] V. Lakamraju, R. Tessier. Tolerating operational faults in cluster-based FPGAs. Proceedings of the 2000 ACM/SIGDA Eighth International Symposium on Field-Programmable Gate Arrays, 2000.
[32] S. Li, C. Ebeling. QuickRoute: A fast routing algorithm for pipelined architectures. IEEE International Conference on Field-Programmable Technology, 2004.
[33] B. Mei, S. Vernalde, D. Verkest, H. De Man, R. Lauwereins. Exploiting loop-level parallelism on coarse-grained reconfigurable architectures using modulo scheduling. Design, Automation and Test in Europe, 2003.
[34] J. Cook, L. Baugh, D. Gottlieb, N. Carter. Mapping computation kernels to clustered programmable reconfigurable processors. IEEE International Conference on Field-Programmable Technology, 2003.
[35] L.-Y. Lin, C.-Y. Wang, P.-J. Huang, C.-C. Chou, J.-Y. Jou. Communication-driven task binding for multiprocessor with latency insensitive network-on-chip. Proceedings of the 2005 Conference on Asia South Pacific Design Automation, 2005.
CHAPTER 18
RETIMING, REPIPELINING, AND C-SLOW RETIMING
Nicholas Weaver
International Computer Science Institute
Although pipelining is a huge benefit in field-programmable gate array (FPGA) designs, and may be required on some FPGA fabrics [5, 10, 12], it is often difficult for a designer to manage and balance pipeline stages and to insert the necessary delays to meet design requirements. Leiserson et al. [4] were the first to propose retiming, an automatic process to relocate pipeline stages to balance a design. Their algorithm, in $O(n^2 \lg n)$ time, can rebalance a design so that the critical path is optimally pipelined. In addition, two modifications, repipelining and C-slow retiming, can add additional pipeline stages to a design to further improve the critical path.
The key idea is simple: If the number of registers around every cycle in the design does not change, the end-to-end semantics do not change. Thus, retiming attempts to satisfy two primary constraints: All paths longer than the desired critical path are registered, and the number of registers around every cycle is unchanged.
This optimization is useful for conventional FPGAs but absolutely essential for fixed-frequency FPGA architectures, which are devices that contain large numbers of registers and are designed to operate at a fixed, but very high, frequency, often by pipelining the interconnect as well as the computation. To meet the array’s fixed frequency, a design must ensure that every path is properly registered. Repipelining or C-slow retiming enables a design to be transformed to meet this constraint. Without automated repipelining or C-slow retiming, the designer must manually ensure that all pipeline constraints are met by the design.
Retiming operates by determining an optimal placement for existing registers, while repipelining and C-slowing add registers before the retiming process begins. After retiming, the design should be optimally (or near-optimally) balanced, with no pipeline stage requiring significantly more time than any other stage.
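The register-count invariant can be checked mechanically. The sketch below assumes the standard Leiserson and Saxe formulation, which is not spelled out in the text here: each node v is given an integer lag r(v), and an edge from u to v carrying w registers ends up with w + r(v) − r(u) registers. The function names and the three-node loop are hypothetical.

```python
def retime(edges, lag):
    """Apply node lags to a list of (u, v, registers) edges.
    Raises ValueError if any edge would need a negative register count."""
    retimed = []
    for u, v, w in edges:
        w_r = w + lag.get(v, 0) - lag.get(u, 0)
        if w_r < 0:
            raise ValueError(f"edge {u}->{v} would have {w_r} registers")
        retimed.append((u, v, w_r))
    return retimed

def registers_on_cycle(edges, cycle):
    """Total registers along a cycle given as a list of nodes in order."""
    weight = {(u, v): w for u, v, w in edges}
    pairs = zip(cycle, cycle[1:] + cycle[:1])
    return sum(weight[pair] for pair in pairs)

# Hypothetical feedback loop a -> b -> c -> a holding two registers.
loop = [("a", "b", 1), ("b", "c", 0), ("c", "a", 1)]
moved = retime(loop, {"b": -1})     # push the register on a->b forward through b
before = registers_on_cycle(loop, ["a", "b", "c"])
after = registers_on_cycle(moved, ["a", "b", "c"])
```

Because every lag is added once and subtracted once around any cycle, the totals before and after are equal no matter which lags are chosen; only the position of the registers within the loop changes.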
Section 18.1 describes the basic retiming operation and the retiming algorithm and its semantics. Then Section 18.2 discusses repipelining and C-slowing: two different techniques for adding registers. Repipelining improves feedforward designs by adding additional pipelining stages, while C-slowing creates
an interleaved design by replacing every register with a sequence of C registers. Both of these transformations increase throughput but also increase latency. Section 18.3 surveys the various implementations, beginning with Leiserson’s original algorithm and concluding with both academic and commercial tools. Section 18.4 discusses implementing retiming for fixed-frequency arrays. Unlike general FPGAs, fixed-frequency FPGAs require retiming in order to match user designs with architectural constraints. Finally, Section 18.5 discusses an interesting side effect of C-slowing: the creation of interleaved, multi-threaded architectures. We conclude in Section 18.6 with a discussion of the reasons that retiming is not a ubiquitous optimization in FPGA tool flows.
18.1
RETIMING: CONCEPTS, ALGORITHM, AND RESTRICTIONS The goal of retiming is to move the pipeline registers in a design into the optimal position. Figure 18.1 shows a trivial example. In this design, the nodes represent logic delays (a), with the inputs and outputs passing through mandatory, fixed registers. The critical path is 5, and the input and output registers cannot be moved. Figure 18.1(b) shows the same graph after retiming. The critical path is reduced from 5 to 4, but the I/O semantics have not changed, as three cycles are still required for a datum to proceed from input to output. As can be seen, the initial design has a critical path of 5 between the internal register and the output. If the internal register could be moved forward, the critical path would be shortened to 4. However, the feedback loop would then be incorrect. Thus, in addition to moving the register forward, another register would need to be added to the feedback loop, resulting in the final design. Additionally, even if the last node is removed, it could never have a critical path lower than 4 because of the feedback loop. There is no mechanism that can reduce the critical path of a single-cycle feedback loop by moving registers: Only additional registers can speed such a design. Retiming’s objective is to automate this process: For a graph representing a circuit, with combinational delays as nodes and integer weights on the edges, find a new assignment of edge weights that meets a targeted critical path or fail if the critical path cannot be met. Leiserson’s retiming algorithm is guaranteed to find such an assignment, if it exists, that both minimizes the critical path and ensures that around every loop in the design the number of registers always remains the same. It is this second constraint, ensuring that all feedback loops
FIGURE 18.1 A small graph before retiming (a) and the same graph after retiming (b). In both panels the nodes between in and out carry delays of 1, 1, 1, 1, 2, and 2.
TABLE 18.1 The constraint system used by the retiming process

Condition                                  Constraint
Normal edge from u → v                     r(u) − r(v) ≤ w(e)
Edge from u → v must be registered         r(u) − r(v) ≤ w(e) − 1
Edge from u → v can never be registered    r(u) − r(v) ≤ 0 and r(v) − r(u) ≤ 0
Critical paths must be registered          r(u) − r(v) ≤ W(u, v) − 1 for all u, v such that D(u, v) > P
are unchanged, which ensures that retiming doesn’t change the semantics of the circuit.
In Table 18.1, r(u) is the lag computed for each node (which is used to determine the final number of registers on each edge), w(e) is the initial number of registers on an edge, W(u, v) is the minimum number of registers between u and v, and D(u, v) is the critical path between u and v.
Leiserson’s algorithm takes the graph as input and then adds an additional node representing the external world, with appropriate edges added to account for all I/Os. This additional node is necessary to ensure that the circuit’s global I/O semantics are unchanged by retiming. Two matrices are then calculated, W and D, that represent the number of registers and critical path between every pair of nodes in the graph. These matrices are necessary because retiming operates by ensuring that at least one register exists on every path that is longer than the critical path in the design. Each node also has a lag value r that is calculated by the algorithm and used to change the number of registers that will be placed on any given edge. Conventional retiming does not change the design semantics: All input and output timings remain unchanged while minor design constraints are imposed on the use of FPGA features. More details and formal proofs of correctness can be found in Leiserson’s original paper [4]. The algorithm works as follows:
1. Start with the circuit as a directed graph. Every node represents a computational element, with each element having a computational delay. Each edge can have zero or more registers as a weight w. Add an additional dummy node with 0 delay, with an edge from every output and to every input. This additional node is to ensure that from every input to every output the number of registers is unchanged and therefore the data input to output timing is unaffected.
2. Calculate W and D. W(u, v) is the minimum number of registers on any path from u to v, and D(u, v) is the critical path (the largest delay) among the paths from u to v that carry exactly W(u, v) registers. This requires solving the all-pairs shortest-path problem, for which the optimal algorithm, by Dijkstra, requires O(n² lg n) time. This dominates the asymptotic running time of the algorithm.
3. Choose a target critical path and create the constraints, as summarized in Table 18.1. Each node has a lag value r, which will eventually specify the change in the number of registers between each node. Initialize all nodes to have a lag of 0.
4. Since all constraints are pairwise integer inequalities, the Bellman–Ford constraint solver is guaranteed to find a solution if one exists or to terminate if not. The Bellman–Ford algorithm performs N iterations (N = the number of constraints to solve). In each iteration, every constraint is examined. If a constraint is already satisfied, nothing happens. Otherwise, r(u) or r(v) is decremented to meet the particular constraint. Once an iteration occurs where no values change, the algorithm has found a solution. If there is no solution, after N iterations the algorithm terminates with a failure.
5. If the constraint solver fails to find a solution, or a tighter critical path is desired, choose a new critical path and return to step 3.
6. With the final set of lags, a new register count w′ is computed for each edge: w′(e) = w(e) − r(u) + r(v).
A graphical example of the algorithm’s results is shown in Figure 18.1. The initial graph has a critical path of 5, which is clearly nonoptimal. After retiming, the graph has a critical path of 4, but the I/O semantics have not changed, as any input will still require three cycles to affect the output.
To determine whether a critical path P can be achieved, the retiming algorithm creates a series of constraints to calculate the lag on each node (Table 18.1). The primary constraints ensure correctness: No edge will have a negative number of registers, while every cycle will always contain the original number of registers. All I/O passes through the intermediate node, ensuring that input and output timings do not change. These constraints can be modified so that a particular line will contain no registers, or a mandatory minimum number of registers, to meet architectural constraints without changing the complexity of the equations. But it is the final constraint, that all critical paths above a predetermined delay P are registered, that gives this optimization its effectiveness.
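The steps above can be sketched in a few dozen lines of Python. This is a minimal illustration under stated assumptions, not the implementation of any tool discussed in this chapter: the graph is a small hypothetical example (it is not the graph of Figure 18.1), the function names are invented for the sketch, W and D are computed with Floyd–Warshall for brevity instead of the faster method mentioned in step 2, and only the "normal edge" and "critical path" rows of Table 18.1 are generated.

```python
INF = float("inf")

def wd_matrices(delays, edges):
    """W[u, v]: fewest registers on any u->v path; D[u, v]: worst delay among those paths."""
    nodes = list(delays)
    # Path cost is the pair (registers, -delay of nodes left behind), compared lexicographically.
    dist = {u: {v: (0, 0) if u == v else (INF, 0) for v in nodes} for u in nodes}
    for u, v, w in edges:
        dist[u][v] = min(dist[u][v], (w, -delays[u]))
    for k in nodes:  # Floyd-Warshall, chosen for brevity rather than speed
        for i in nodes:
            for j in nodes:
                via = (dist[i][k][0] + dist[k][j][0], dist[i][k][1] + dist[k][j][1])
                if via[0] < INF and via < dist[i][j]:
                    dist[i][j] = via
    W = {(u, v): dist[u][v][0] for u in nodes for v in nodes}
    D = {(u, v): delays[v] - dist[u][v][1] for u in nodes for v in nodes}
    return W, D

def solve(nodes, constraints):
    """Bellman-Ford over constraints r[u] - r[v] <= c; returns lags or None."""
    r = {n: 0 for n in nodes}
    for _ in range(len(nodes) + 1):
        changed = False
        for u, v, c in constraints:
            if r[u] - r[v] > c:        # violated: lower r[u] just enough
                r[u] = r[v] + c
                changed = True
        if not changed:
            return r
    return None                        # still changing: no solution exists

def retime(delays, edges, period):
    """Return edges with new register counts meeting `period`, or None."""
    W, D = wd_matrices(delays, edges)
    cons = [(u, v, w) for u, v, w in edges]            # no edge goes negative
    cons += [(u, v, W[u, v] - 1) for (u, v) in W       # long paths get a register
             if W[u, v] < INF and D[u, v] > period]
    r = solve(list(delays), cons)
    if r is None:
        return None
    return [(u, v, w - r[u] + r[v]) for u, v, w in edges]

def critical_path(delays, edges):
    """Longest register-free path delay (assumes no register-free cycle)."""
    best = dict(delays)
    for _ in delays:
        for u, v, w in edges:
            if w == 0:
                best[v] = max(best[v], best[u] + delays[v])
    return max(best.values())

# Hypothetical circuit: "host" is the zero-delay dummy node of step 1.
delays = {"host": 0, "a": 1, "b": 1, "c": 2, "d": 2}
edges = [("host", "a", 1), ("a", "b", 0), ("b", "c", 1),
         ("c", "d", 0), ("d", "host", 1), ("c", "a", 1)]

for target in (1, 2, 4):
    print(target, retime(delays, edges, target))
```

Trying several targets in a loop, as in the last lines, corresponds to step 5. Because the host node closes every input-to-output path into a cycle, the same cycle invariant that protects feedback loops also protects the I/O timing.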
If the constraint system has a solution, the new lag assignments for all nodes will allocate registers properly to meet the critical path P. But if there is no solution, there cannot be an assignment of registers that meets P. Thus, the common usage is to find the minimum P where the constraints are all met. In general, multiple constraint-solving attempts are made to search for the minimum critical path P. The solution found for that minimum P defines the final retimed design.
There are two ways to speed up this process. First, if the Bellman–Ford algorithm can find a solution, it usually converges very quickly. Thus, if there is no solution that satisfies P, it is usually effective to abandon the Bellman–Ford algorithm early after 0.1N iterations rather than N iterations. This seems to have no impact on the quality of results, yet it can greatly speed up searching for the minimum P that can be satisfied in the design.
A second optimization is to use the last computed set of constraints as a starting point. In conventional retiming, the Bellman–Ford process is invoked multiple times to find the lowest satisfiable critical path. In contrast, fixed-frequency repipelining or C-slow retiming uses Bellman–Ford to discover the minimum number of additional registers needed to satisfy the constraints. In both cases,
keeping the last failed or successful solution in the data structure provides a starting point that can significantly speed up the process if a solution exists. Retiming in this way imposes only minimal design limitations: Because it applies only to synchronous circuits, there can be no asynchronous resets or similar elements. A synchronous global reset imposes too many constraints to allow effective retiming. Local synchronous resets and enables only produce small, self loops that have no effect on the correct operation of the algorithm. Most other design features can be accommodated simply by adding appropriate constraints. For example, an FPGA with a tristate bus cannot have registers placed on this bus. A constraint that says that all edges crossing the bus can never be registered (r(u) − r(v) ≤ 0 and r(v) − r(u) ≤ 0) ensures this. Likewise, an embedded memory with a mandatory output flip-flop can have a constraint (r(u) − r(v) ≤ w(e) − 1) that ensures that at least one register is placed on this output. Memories themselves can be retimed similarly to any other element in the design, with dual-ported memories treated as a single node for retiming purposes. Memories that are synthesized with a negative clock edge (to create the design illusion of asynchronicity) can be either unchanged or switched to operate on the positive edge with constraints to mandate the placement of registers. Some FPGA designs have registers with predefined initial values. If retiming is allowed to move these registers, the proper initial values must be calculated such that the circuit still produces the same behavior. In an ASIC model, all flip-flops start in an undefined state, and the designer must create a small state machine in order to reset the design. FPGAs, however, have all flip-flops start in a known, user-defined state, and when a dedicated global reset is applied the flip-flops are reset to it. This has serious implications in retiming. 
If the decision is made to utilize the ASIC model, retiming is free to safely ignore initial conditions because explicit reset logic in state machines will still operate correctly, as reflected in the I/O semantics. However, without the ability to violate the initial conditions with an ASIC-style model, retiming quality often suffers as additional logic is required or limits are placed on where flip-flops may be moved in a design. In practice, performing retiming with initial conditions is NP-hard. Cong and Wu [3] have developed an algorithm that computes initial states by restricting the design to forward retiming only so that it propagates the information and registers forward throughout the computation. This is because solving initial states for all registers moved forward is straightforward, but backward movement is NP-hard as it reduces to satisfiability.
Additionally, global set/reset imposes a huge constraint on retiming. An asynchronous set/reset can never be retimed (retiming cannot modify an asynchronous circuit), while a synchronous set/reset just imposes too high a fanout.
An important question is how to deal with multiple clocks. If the interfaces between the clock domains are registered by clocks from both domains, it is a simple process to retime the domains separately, with mandatory registers on the domain crossings; the constraints placed on the I/Os ensure correct and consistent timing through the interface. Yet without this design constraint, retiming across multiple clock domains is very hard, and there does not appear to be any clean automatic solution.
Table 18.2 shows the results of a particular retiming tool [13], targeting the Xilinx Virtex family of FPGAs, on four benchmark circuits: an AES core, a Smith/Waterman systolic cell, a synthetic microprocessor datapath, and the LEON-I synthesized SPARC core. This tool does not use a perfectly accurate delay model and has to place registers after retiming, so it sometimes creates slightly suboptimal results.

TABLE 18.2 The results of retiming four benchmarks

Benchmark            Unretimed   Automatically retimed
AES core             48 MHz      47 MHz
Smith/Waterman       43 MHz      40 MHz
Synthetic datapath   51 MHz      54 MHz
LEON processor       23 MHz      25 MHz

The biggest problem with retiming is that it is of limited benefit to a well-balanced design. As mentioned earlier, if the clock cycle is defined by a single-cycle feedback loop, retiming can never improve the design, as moving the register around the feedback loop produces no effect. Thus, for example, the Smith/Waterman example in Table 18.2 does not benefit from retiming. The Smith/Waterman benchmark design consists of a series of repeated identical systolic cells that implement the Smith/Waterman sequence alignment algorithm. The cells each contain a single-cycle feedback loop, which cannot be optimized. The AES encryption algorithm also consists of a single-cycle feedback loop. In this case, the initial design used a negative-edge BlockRAM to implement the S-boxes, which the retiming tool converted to a positive-edge memory with a “must register” constraint.
Nevertheless, retiming can still be a benefit if the design consists of multiple feedback loops (such as the synthetic microprocessor datapath or the LEON SPARC-compatible microprocessor core) or an initially unbalanced pipeline. Still, for well-designed circuits, even complex ones, retiming is often only a slight benefit, as engineers have considerable experience designing reasonably optimized feedback loops.
The key benefit to retiming occurs when more registers can be added to the design along the critical path. We will discuss two techniques, repipelining and C-slow retiming, which first add a large number of registers that general retiming can then move into the optimal location.
18.2
REPIPELINING AND C-SLOW RETIMING
The biggest limitation of retiming is that it simply cannot improve a design beyond the design-dependent limit produced by an optimal placement of registers along the critical path. As mentioned earlier, if the critical path is defined by a single-cycle feedback loop, retiming will completely fail as an optimization. Likewise, if a design is already well balanced, changing the register placement produces no improvement. As was seen in the four reasonably optimized benchmarks (refer to Table 18.2), this is often the case.
Repipelining and C-slow retiming are transformations designed to add registers in a predictable manner that a designer can account for, which retiming can then move to optimize the design. Repipelining adds registers to the beginning or end of the design, changing the pipeline latency but no other semantics. C-slow retiming creates an interleaved design by replacing every register with a sequence of C registers.
18.2.1
Repipelining
Repipelining is a minor extension to retiming that can increase the clock frequency for feedforward computations at the cost of additional latency through more pipeline registers. Unlike C-slow retiming, repipelining is only beneficial when a computation’s critical path contains no feedback loops. Feedforward computations, those that contain no feedback loops, are commonly seen in DSP kernels and other tasks. For example, the discrete cosine transform (DCT), the fast Fourier transform (FFT), and finite impulse response filters (FIRs) can all be constructed as feedforward pipelines. Repipelining is derived from retiming in one of two ways, both of which create semantically equivalent results. The first involves adding additional pipeline stages to the start of the computation and allowing retiming to rebalance the delays and create an absolute number of additional stages. The second involves decoupling the inputs and outputs to allow the retimer to add additional pipelining. Although these techniques operate in slightly different ways, they both provide extra registers for the retimer to then move and they produce roughly equivalent results. If the designer wishes to add P pipeline stages to a design, all inputs simply have P delays added before retiming proceeds. Because retiming will develop an optimum placement for the resulting design, the new design contains P additional pipeline stages that are scattered throughout the computation. If a CAD tool supports retiming but not repipelining, the designer can simply add the registers to the input of the design manually and let the tool determine the optimum placement. Another option is to simply remove the cycle between all outputs and inputs, with additional constraints to ensure that all outputs share an output lag, with all inputs sharing a different input lag. This way, the inputs and outputs are all synchronized but retiming can add an arbitrary number of additional pipeline registers between them. 
To place a limit on these registers, an additional constraint must be added to ensure that for a single I/O pair no more than P pipeline registers are added. Depending on the other constraints in the retiming process, this may add fewer than P additional pipeline stages, but will never add more than P.
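For a purely feedforward chain, the effect of adding P input registers and then retiming can be sketched directly. The sketch assumes a linear chain of stages with integer delays, bounded by fixed input and output registers, where the retimer may place the added registers between any two adjacent stages. Under those assumptions the best achievable period is the smallest maximum segment delay over all ways of cutting the chain into at most P + 1 pieces. The function names and the delay list are illustrative only; this is the special case, not a general retimer.

```python
def registers_needed(stage_delays, period):
    """Fewest internal registers so that no register-free run exceeds `period`."""
    if max(stage_delays) > period:
        return None                    # a single stage is already too slow
    cuts, run = 0, 0
    for d in stage_delays:
        if run + d > period:           # place a register before this stage
            cuts += 1
            run = 0
        run += d
    return cuts

def repipeline(stage_delays, extra_registers):
    """Smallest period reachable after adding `extra_registers` pipeline stages."""
    lo, hi = max(stage_delays), sum(stage_delays)
    while lo < hi:                     # binary search over integer periods
        mid = (lo + hi) // 2
        if registers_needed(stage_delays, mid) <= extra_registers:
            hi = mid
        else:
            lo = mid + 1
    return lo

chain = [1, 1, 1, 1, 2, 2]             # hypothetical stage delays
for p in range(4):
    print(f"P={p}: period {repipeline(chain, p)}, latency grows by {p} cycles")
```

Each added stage shortens the period but costs one more cycle of latency, and once the period reaches the slowest single stage, further registers no longer help.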
Repipelining adds additional cycles of latency to the design, but otherwise retains the rest of the circuit’s behavior. Thus, it produces the same results and the same relative timing on the outputs (e.g., if input B is supposed to be presented three cycles after input A, or output C is produced two cycles after output D, these relative timings remain unchanged). It is only the data-in to data-out timing that is affected. Unfortunately, repipelining can only improve feedforward designs or designs where the feedback loop is not on the critical path. If performance is limited by a feedback loop, repipelining offers no benefit over normal retiming.
Repipelining is designed to improve throughput, but will almost always make overall latency worse. Although the increased pipelining will boost the clock rate (and thus reduce some of the delay from unbalanced clocked paths), the delay from additional flip-flops on the input-to-output paths typically overwhelms this improvement and the resulting design will take longer to produce a result for an individual input. This is a fundamental trade-off in repipelining and C-slow retiming. While ordinary retiming improves both latency and throughput, repipelining and C-slow retiming generally improve throughput at the cost of additional latency due to the additional pipeline stages required.
18.2.2
C-slow Retiming
Unlike repipelining, C-slow retiming can enhance designs that contain feedback loops. C-slowing enhances retiming simply by replacing every register with a sequence of C separate registers before retiming occurs; the resulting design operates on C distinct execution tasks. Because all registers are duplicated, the computation proceeds in a round-robin fashion, as illustrated in Figure 18.2. In this example, which is 2-slow, the design interleaves between two computations. On the first clock cycle, it accepts the first input for the first stream of execution. On the second clock cycle, it accepts the first input for the second stream, and on the third it accepts the second input for the first stream. Because of the interleaved nature of the design, the two streams of execution will never interfere. On odd clock cycles, the first stream of execution accepts input; on even clock cycles, the second stream accepts input.
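The round-robin behavior can be illustrated with a small behavioral model. This is a sketch only: the feedback logic `step_function` is a made-up stand-in for the combinational logic in a single-register feedback loop, and the model covers the C-slowing step alone, before any retiming moves the registers.

```python
from collections import deque

def step_function(state, x):
    """Hypothetical combinational logic inside the feedback loop."""
    return (3 * state + x) % 1000

def run_original(inputs, reset=0):
    """One state register: each input sees the previous cycle's state."""
    state, outputs = reset, []
    for x in inputs:
        state = step_function(state, x)
        outputs.append(state)
    return outputs

def run_c_slow(interleaved_inputs, c, reset=0):
    """Same logic, with the state register replaced by a chain of c registers."""
    chain = deque([reset] * c)         # chain[0] currently feeds the logic
    outputs = []
    for x in interleaved_inputs:
        new_state = step_function(chain[0], x)
        chain.popleft()                # every register shifts by one position
        chain.append(new_state)
        outputs.append(new_state)
    return outputs

stream_a = [1, 2, 3]
stream_b = [10, 20, 30]
interleaved = [x for pair in zip(stream_a, stream_b) for x in pair]
out = run_c_slow(interleaved, c=2)
print(out[0::2] == run_original(stream_a), out[1::2] == run_original(stream_b))
```

Each stream sees exactly the state it produced C cycles earlier, which is why the two streams never interfere.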
FIGURE 18.2 The example from Figure 18.1, converted to 2-slow operation (a). The critical path remains unchanged, but the design now operates on two independent streams in a round-robin fashion. The design retimed (b). By taking advantage of the extra flip-flops, the critical path has been reduced from 5 to 2.
The easiest way to utilize a C-slowed block is to simply multiplex and de-multiplex C separate datastreams. However, a more sophisticated interface may be desired depending on the application (as described in Section 18.5). One possible interface is to register all inputs and outputs of a C-slowed block. Because of the additional edges retiming creates to track I/Os and to ensure a consistent interface, every stream of execution presents all outputs at the same time, with all inputs registered on the next cycle. If part of the design is C-slowed, but all parts operate on the same clock, the result can be retimed as a complete whole and still preserve all other semantics. One way to think of C-slowing is as a threaded design, with an overall system clock and with each stream having a “stream clock” of 1/C—each stream is completely independent. However, C-slowing imposes some more significant FPGA design constraints, as summarized in Table 18.3. Register clock enables and resets must be expressed as logic features, since each independent thread must have an independent reset or enable. Thus, they can remain features in the design but cannot be implemented by current FPGAs using native enables and resets. Other specialized features, such as Xilinx SRL16s (a mode where a LUT is used as a 16-bit shift register), cannot be utilized in a C-slow design for the same reason. One important challenge is how to properly C-slow memory blocks. In cases where the C-slowed design is used to support N independent computations, one needs the illusion that each stream of execution is completely independent and unchanged. To create this illusion, the memory capacity must be increased by a factor of C, with additional address lines driven by a thread counter. This ensures that each stream of execution enjoys a completely separate memory space. 
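A minimal behavioral sketch of such a memory follows. The class name, the single-port interface, and the write-before-read ordering are assumptions made for the illustration; only the idea of a thread counter supplying the extra address bits comes from the text.

```python
class CSlowedMemory:
    """Single-port memory enlarged by a factor of C, banked by a thread counter."""

    def __init__(self, depth, c):
        self.depth = depth
        self.c = c
        self.cells = [0] * (depth * c)   # capacity grows by a factor of C
        self.thread = 0                  # advances by one every clock cycle

    def _physical(self, addr):
        # The thread counter drives the additional high-order address lines.
        return self.thread * self.depth + addr

    def clock(self, addr, write_data=None):
        """One clock cycle for the current stream: optional write, then read."""
        phys = self._physical(addr)
        if write_data is not None:
            self.cells[phys] = write_data
        value = self.cells[phys]
        self.thread = (self.thread + 1) % self.c
        return value

mem = CSlowedMemory(depth=4, c=2)
mem.clock(1, write_data=111)             # stream 0 writes its address 1
mem.clock(1, write_data=222)             # stream 1 writes its own address 1
print(mem.clock(1), mem.clock(1))        # each stream reads back its own value
```

Both streams use logical address 1, yet neither observes the other’s data.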
For dual-ported memories, this potentially enables a greater freedom in retiming: The two ports can have different lags as long as the difference in lag is less than C. After retiming, the difference is added to the appropriate port’s thread counter, which ensures that each stream of execution will read and write to both ports in order while enabling slightly more freedom for retiming to proceed.
C-slowing normally guarantees that all streams view independent memories. However, a designer may desire shared memory common to all streams. Such memories could be embedded in a design, but the designer would need to consider how multiple streams would affect the semantics and would need to notify any automatic tool to treat the memory in a special manner. Beyond this, there are no other semantic effects imposed by C-slow retiming.

TABLE 18.3 The effects of various FPGA features on retiming, repipelining, and C-slowing

FPGA feature                    Effect on retiming      Effect on repipelining   Effect on C-slowing
Asynchronous global set/reset   Forbidden               Forbidden                Forbidden
Synchronous global set/reset    Effectively forbidden   Effectively forbidden    Forbidden
Asynchronous local set/reset    Forbidden               Forbidden                Forbidden
Synchronous local set/reset     Allowed                 Allowed                  Express as logic
Clock enables                   Allowed                 Allowed                  Express as logic
Tristate buffers                Allowed                 Allowed                  Allowed
Memories                        Allowed                 Allowed                  Increase size
SRL16                           Allowed                 Allowed                  Express as logic
Multiple clock domains          Design restrictions     Design restrictions      Design restrictions

C-slowing significantly improves throughput, but it can only apply to tasks where there are at least C independent threads of execution and where throughput is the primary goal. The reason is that C-slowing makes the latency substantially worse. This trade-off brings up a fundamental observation: Latency is a property of the design and computational fabric whereas throughput is a property derived from cost. Both repipelining and C-slow retiming can be applied only when there is sufficient task-level parallelism, in the form of either a feedforward pipeline (repipelining) or independent tasks (C-slowing).
Table 18.4 shows the difference that C-slowing can make in four designs. While the retiming tool alone was unable to improve the AES or Smith/Waterman designs, C-slowing substantially increased throughput, improving the clock rate by 80–95 percent! However, latency for individual tasks was made worse, resulting in significantly slower clock rates for individual tasks.
Latency can be improved only up to a given point for a design through conventional retiming. Once the latency limit is met, no amount of optimization, save a major redesign or an improvement in the FPGA fabric, has any effect. This often appears in cryptographic contexts, where feedback mode–based encryption (such as CFB) requires the complete processing of each block before the next can be processed. In contrast, throughput is actually a part of a throughput/cost metric: throughput/area, throughput/dollar, or throughput/joule. This is because independent task throughput can be added via replication, creating independent modules that perform the same function, as well as C-slowing.
When sufficient parallelism exists, and costs are not constrained, simply throwing more resources at the problem is sufficient to improve the design to meet desired goals.
One open question on C-slowing is its effect in a low-power environment. Higher throughput, achieved through high-speed clocking, naturally increases the power consumption of a design, just as replicating units for higher throughput increases power consumption. In both cases, if lower power is desired, the higher-throughput design can be modified to save power by reducing the clock rate and operating voltage.
Unlike the replicated case, the question of whether a C-slowed design would offer power savings if both frequency and voltage were reduced is highly design and usage dependent. Although the finer pipelining allows the frequency and the voltage to be scaled back to a significant degree while maintaining throughput, the activity factor of each signal may now be considerably higher. Because each of the C streams of execution is completely independent, it is safe to assume that every wire will probably have a significantly higher activity factor that increases power consumption. Whether the initial design before C-slowing has a comparable activity factor is highly input and design dependent. If the initial design’s activity factor is low, C-slowing will significantly increase power consumption. But if that factor is high, C-slowing will not increase it. Thus, although the C-slowing transformation may have a minor effect on worst-case power (and can even result in significant savings through voltage scaling), the impact on average-case power may be substantial.

TABLE 18.4 The effect of C-slowing on four benchmarks

Benchmark             Initial clock   C-factor   C-slow clock   Stream clock
AES encryption        48 MHz          4-slow     87 MHz         21 MHz
Smith/Waterman        43 MHz          3-slow     84 MHz         28 MHz
Synthetic datapath    51 MHz          3-slow     91 MHz         30 MHz
LEON processor core   23 MHz          2-slow     46 MHz         23 MHz
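The relationship among the columns of Table 18.4 can be reproduced with a few lines of arithmetic. The clock figures below are copied from the table; the helper name `summarize` is invented for this sketch, and the assumption that the stream clock is the C-slow clock divided by C and truncated to whole megahertz is inferred from the table’s values, not stated by the source.

```python
benchmarks = {
    # name: (initial clock MHz, C, C-slow clock MHz), copied from Table 18.4
    "AES encryption": (48, 4, 87),
    "Smith/Waterman": (43, 3, 84),
    "Synthetic datapath": (51, 3, 91),
    "LEON processor core": (23, 2, 46),
}

def summarize(initial_mhz, c, c_slow_mhz):
    """Return (stream clock MHz, clock-rate gain in percent, per-stream slowdown)."""
    stream_mhz = c_slow_mhz // c                     # assumed truncation
    gain_percent = (c_slow_mhz / initial_mhz - 1) * 100
    slowdown = initial_mhz / (c_slow_mhz / c)        # > 1: each stream runs slower
    return stream_mhz, gain_percent, slowdown

for name, row in benchmarks.items():
    stream, gain, slowdown = summarize(*row)
    print(f"{name}: stream {stream} MHz, gain {gain:.0f}%, slowdown {slowdown:.2f}x")
```

The gain column is the aggregate throughput improvement across all C streams, while the slowdown column is the price paid by each individual stream.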
18.3
IMPLEMENTATIONS OF RETIMING
Three significant academic retiming tools have been developed for FPGAs. The first, by Cong and Wu [3], combines retiming with technology mapping. This approach enables retiming to occur before placement without adding undue constraints on the placer, because the retimed registers are packed with their associated logic. The disadvantage is a lack of precision, as delays can only be crudely estimated before placement. This tool is unsuitable for significant C-slowing, which creates significantly more registers that can pose problems with logic packing and placement.
The second tool, developed by Singh and Brown [6], combines retiming with placement, operating by modifying the placement algorithm to be aware that retiming is occurring and then modifying the retiming portion to enable permutation of the placement as retiming proceeds. Singh and Brown demonstrate how the combination of placement and retiming performs significantly better than retiming either before or after placement. The simplified FPGA model used by Singh and Brown has a logic block where the flip-flop cannot be used independently of the LUT, constraining the ability of postplacement retiming to allocate new registers. Thus, the need to permute the placement to allocate registers is significantly exacerbated in their target architecture.
The third tool, developed by Weaver et al. [13], performs retiming after placement but before routing, taking advantage of the (mostly) independent register operation available on Xilinx FPGAs. (It would not apply to most Altera FPGAs.) It also supports C-slowing.
Some commercial HDL synthesis tools, notably the Synopsys FPGA compiler [9] and Synplify [8], also support retiming. Because this retiming occurs fairly early in the mapping and optimization processes, it suffers from a lack of precision regarding placement and routing delays. The Amplify tool [10] can produce a higher-quality retiming because it contains placement information. Since these tools attempt to maintain the FPGA model of initial conditions, both on startup and in the face of a global reset signal, considerable logic is added to the design.
18.4
RETIMING ON FIXED-FREQUENCY FPGAs
Fixed-frequency FPGAs differ from conventional FPGAs in that they have an intrinsic clock rate and commonly include pipelined interconnect and other design features to enable very high-speed operations. However, this fixed frequency demands a design modification to support the pipeline stages it requires.
Retiming for fixed-frequency FPGAs, unlike that for their conventional counterparts, does not require the creation of a global critical path constraint, as simply ensuring that all local requirements are met guarantees that the final design meets the architecture’s required delay constraints. Instead, retiming attempts to solve these local constraints by ensuring that every path through the interconnect meets the delay requirements inherent in the FPGA. Once these local constraints are met, the final design will operate at the FPGA’s intrinsic clock frequency.
Because there are no longer any global constraints, the W and D matrices are not created. Solving only a set of local constraints requires linear, not quadratic, memory and O(n²), rather than O(n² lg n), execution time. This speeds the process considerably. Additionally, only a single invocation of the constraint solver is necessary to determine whether the current level of pipelining can meet the constraints imposed by the target architecture.
Unfortunately, most designs do not possess sufficient pipelining to meet these constraints, instead requiring a significant level of repipelining or C-slow retiming to do so. The level necessary can be discovered in two ways. The first approach is simply to allow the user to specify a desired level of repipelining or C-slowing. The retiming system then adds the specified number of delays and attempts to solve the system. If a solution is discovered, it is used.
Otherwise, the user is notified that the design must be repipelined or retimed to a greater degree to meet the array’s clock cycle. The second approach requires searching to find the minimal level of repipelining or C-slowing necessary to meet the constraints. Although this necessitates multiple iterations of the constraint solver, fixed-frequency retiming only requires local constraints. Without having to check the global constraints, this process proceeds quickly. The resulting level of repipelining or C-slowing is then reported to the user. Fixed-frequency FPGAs require retiming considerably later in the tool flow. It is impossible to create a valid retiming until routing delays are known. Since the constraints required invariably depend on placement, the final retiming process must occur afterwards. Some arrays, such as HSRA [10], have deterministic routing structures that enable retiming to be performed either before or after routing. Other interconnect structures, such as SFRA [12], lack deterministic routing and require that retiming be performed only after routing.
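The search for a minimal level of C-slowing can be sketched in a few lines. The following is only an illustration under stated assumptions, not the algorithm of any particular tool. It assumes a hypothetical netlist model in which each edge `(u, v, regs, required)` carries `regs` registers before C-slowing and must carry at least `required` registers to satisfy the array's pipelined interconnect. Each local requirement then becomes a difference constraint on the retiming lags, and a Bellman-Ford style relaxation (used here as a generic constraint solver) either finds lags or reports that the current level of C-slowing is insufficient. The names `solve_local_constraints` and `minimal_c_slow` are invented for this example.

```python
def solve_local_constraints(num_nodes, edges, c):
    """Try to retime a C-slowed design so every edge meets its local requirement.

    edges: list of (u, v, regs, required) tuples (hypothetical netlist model).
    Returns a list of retiming lags r, or None if the constraints are infeasible.
    """
    # After C-slowing and retiming, edge u -> v holds c*regs + r[v] - r[u] registers.
    # Requiring that to be >= required gives r[u] - r[v] <= c*regs - required,
    # a difference constraint solvable by shortest-path relaxation.
    lags = [0] * num_nodes  # as if a virtual source reached every node at cost 0
    for _ in range(num_nodes):
        changed = False
        for u, v, regs, required in edges:
            bound = c * regs - required
            if lags[v] + bound < lags[u]:
                lags[u] = lags[v] + bound
                changed = True
        if not changed:
            return lags  # converged: every local constraint holds
    return None  # still relaxing after num_nodes passes: negative cycle


def minimal_c_slow(num_nodes, edges, max_c=64):
    """Search upward for the smallest C whose local constraints can be met."""
    for c in range(1, max_c + 1):
        lags = solve_local_constraints(num_nodes, edges, c)
        if lags is not None:
            return c, lags
    return None
```

For a two-node loop that holds a single register, and whose two edges need two registers and one register respectively, `minimal_c_slow(2, [(0, 1, 1, 2), (1, 0, 0, 1)])` settles on 3-slowing, since the loop as a whole must hold three registers.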
Finally, the fact that fixed-frequency arrays may use considerably more pipelining than conventional arrays makes retiming registers a significant architectural feature. Because these delay chains [10], either on inputs or on outputs, are programmable, the array can implement longer ones. A common occurrence after aggressive C-slow retiming is a design with several signals requiring considerable delay. Therefore, dedicated resources to implement these features are effectively required to create a viable fixed-frequency FPGA.
18.5 C-SLOWING AS MULTI-THREADING
There have been numerous multi-threaded architecture designs, but all share a common theme: increasing system throughput by enabling multiple streams of execution, or threads, to operate simultaneously. These architectures generally fall into four classes: context switching always without bypassing (HEP [7] and Tera [2]), context switching on event (Intel IXP) [14], interleaved multi-threaded, and simultaneous multi-threaded (SMT) [11]. The goal of all of them is to increase system throughput by operating on multiple streams of execution. The general concept of C-slow retiming can be applied to highly complex designs, including microprocessors. Unlike C-slowing a simple FIR filter bank or an encryption algorithm, C-slowing a microprocessor is not a simple matter of inserting registers and balancing delays. Nevertheless, the changes necessary are comparatively small and the benefits substantial: producing a simple, statically scheduled, higher clock rate, multi-threaded architecture that is semantically equivalent to an interleaved multi-threaded architecture, alternating between a fixed number of threads in a round-robin fashion to create the illusion of a multiprocessor system. C-slowing requires three minor architectural changes: enlarging and modifying the register file and TLB, replacing the cache and memory interface, and slightly modifying the interrupt semantics. Beyond that, it is simply a matter of replacing every pipeline register in both the control logic and the datapath with C registers and then moving the registers to balance the delays, as is traditional in the C-slow retiming transformation and can be performed by an automatic tool. The resulting design, as expected, has full multi-threaded semantics and improved throughput because of a significantly higher clock rate. Figure 18.3 shows how this transformation can operate.
The biggest complication in C-slowing a microprocessor is selecting the implementation semantics for the various memories throughout the design. The first type keeps the traditional C-slow semantics of complete independence, where each thread sees a completely independent view, usually by duplication. This applies to the register file and most of the state registers in the system. This occurs automatically if C-slowing is performed by a tool, because it represents the normal semantics for C-slowed memory. The second is completely shared memory, where every thread sees the same memory, such as the caches and main memory of the system. Most such memories exist in the non-C-slowed portion and so are unaffected by an automatic tool.
Chapter 18
Retiming, Repipelining, and C-slow Retiming
FIGURE 18.3 A traditional five-stage microprocessor pipeline, and its conversion to 3-slow operation.
The third is dynamically shared, where a hardware thread ID or a software thread context ID is tagged to each entry, with only the valid tags used. This breaks the automatic C-slow semantics and is best employed for branch predictors and similar caches. Such memories need to be constructed manually, but offer potential efficiency advantages as they do not need to increase in size. Because they cannot be constructed automatically they may be subject to interference or synergistic effects between threads. The biggest architectural changes are to the register file: It needs to be increased by a factor of C, with a hardware thread counter to select which group of registers is being accessed. Now each thread will see an independent set of registers, with all reads and writes for the different threads going to separate memory locations. Apart from the thread selection and natural enlargement, the only piece remaining is to pipeline the register access. If necessary, the
C independently accessed sections can be banked so that the register file can operate at a higher clock frequency. Naturally, this linearly increases the size of the register file, but pipelining the new larger file is not difficult since each thread accesses a disjoint register set, allowing staggered access to the banks if desired. This matches the automatic memory transformations that C-slowing creates: increasing the size and ensuring that each task has an independent view of memory. To maintain the illusion that the different threads are running on completely different processors, it is important that each thread have an independent translation of memory. The easiest solution is to apply the same transformations to the TLB that were applied to the register file: increasing the size by C, with each thread accessing its own set, and pipelining access. Again, this is the natural result of applying the C-slow semantics from an automatic tool. The other option is to tag each TLB entry. The interference effect may be significant if the associativity or size of the TLB is low. In such a case, and considering the generally small size of most TLBs, increasing the size (although perhaps by less than a factor of C) is advisable. Software thread ID tags are preferable to hardware ID tags because they reduce the cost of context switching if a shared TLB is used and may also provide some synergistic effects. In either case, a shared TLB requires interlocking between TLB writes to prevent synchronization bugs. If the caches are physically addressed, it is simply a matter of pipelining access to improve throughput without splitting memory. Because of the interlocked execution of the threads and the pipelined nature of the modified caches, no additional coherency mechanisms are required except to interlock any existing test-and-set or atomic read/write instructions between the threads to ensure that each instruction has time to be completed. 
Such cache modifications occur outside the C-slow semantics, suggesting that the cache needs to be changed manually. This means that the cache and memory controller must be manually updated to support pipelined access from the distinct threads, and must exist outside of the C-slowed core itself. Unfortunately, virtually addressed caches are significantly more complicated: They require that each tag include thread ownership (to prevent one thread from viewing another’s version of memory) and that a record of virtual-to-physical mappings be maintained to ensure coherency between threads. These complications suggest that a physically addressed cache would be superior when C-slowing a microprocessor to produce a simple multi-threaded design. A virtually addressed cache is one of the few structures that do not have a natural C-slow representation or that can easily exist outside a C-slowed core. The rest of the machine state registers, being both loaded and read, are automatically separated by the C-slow transformation. This ensures that each thread will have a completely independent set of machine registers. Combined with the distinct registers and TLB tagging, each thread will see an independent processor. The only other portion that needs to be changed is the interrupt semantics. Just as the rest of the control logic is pipelined, with control registers duplicated,
the same transformations need to be applied to the interrupt logic. Thus, every external interrupt is interpreted by the rules corresponding to every virtual processor running in the pipeline. Yet, since the control registers are duplicated, the OS can enforce policies where different interrupts are handled by different execution streams. Similarly, internally driven interrupts (such as traps or watchdog timers), when C-slowed, are independent between threads, as C-slowing ensures that each thread sees only its own interrupts. In this way, the OS can ensure that one virtual thread receives one set of externally sourced interrupts while another receives a different set. This also suggests that interrupts be presented to all threads of execution, enabling each thread (or even multiple threads) to service the appropriate interrupt. The resulting design has full multi-threaded semantics, with each of C threads being independent. Because C-slowing can improve the clock rate (by two times in the case of the LEON benchmark), this can easily and substantially improve the throughput of a very complex design.
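The register file change described in this section can be modeled in a few lines. This is a behavioral sketch only, assuming a simple round-robin hardware thread counter and ignoring the pipelining and banking of the real structure. The class name `CSlowRegisterFile` and its methods are hypothetical.

```python
class CSlowRegisterFile:
    """Register file enlarged by a factor of C and indexed by a thread counter."""

    def __init__(self, c, regs_per_thread):
        self.c = c
        self.regs_per_thread = regs_per_thread
        # Storage grows linearly with C: one disjoint register set per thread.
        self.storage = [0] * (c * regs_per_thread)
        self.thread = 0  # hardware thread counter

    def _index(self, reg):
        if not 0 <= reg < self.regs_per_thread:
            raise IndexError("register number out of range")
        # The thread counter selects the group; reg selects within the group.
        return self.thread * self.regs_per_thread + reg

    def read(self, reg):
        return self.storage[self._index(reg)]

    def write(self, reg, value):
        self.storage[self._index(reg)] = value

    def tick(self):
        """Advance one clock cycle; the next thread in round-robin order runs."""
        self.thread = (self.thread + 1) % self.c
```

With `c=3`, three consecutive cycles that each write register 1 land in three separate storage locations, so each thread later reads back only its own value.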
18.6
WHY ISN’T RETIMING UBIQUITOUS?
An interesting question is why retiming is not heavily used in FPGA tool flows. Although some FPGA vendors [1] and CAD vendors [8] support retiming, it is not universally available, and even when it is, it is usually optional. There are three major factors that limit the general adoption of retiming: It interacts poorly with many critical FPGA features; it can only optimize poor implementations yet is not a substitute for good implementation; and it is computationally intensive. As mentioned earlier, retiming does not work well with initial conditions or global resets, features that FPGA designers have traditionally relied on. Likewise, BlockRAMs, hardware clock enables, and other features can pin registers, limiting the ability of a retiming tool to move them. For these reasons, many FPGA designs cannot be effectively retimed. A related observation is that retiming helps only poor designs and, moreover, only fixes one common deficiency of a poor design, not all of them. Additionally, a designer with enough savvy to work around the limitations of retiming will probably produce a naturally well-balanced design. Finally, although retiming is a polynomial time algorithm, it is still superlinear. As designs continue to grow in size, O(n^2 lg(n)) can still be too long for many uses. This is especially problematic as the Moore’s Law scaling for FPGAs is currently greater than that for single-threaded microprocessors.
References
[1] Altera Quartus II EDA (http://www.altera.com/).
[2] R. Alverson, D. Callahan, D. Cummings, B. Koblenz, A. Porterfield, B. Smith. The Tera computer system. Proceedings of the 1990 International Conference on Supercomputing, 1990.
[3] J. Cong, C. Wu. Optimal FPGA mapping and retiming with efficient initial state computation. Design Automation Conference, 1998.
[4] C. Leiserson, F. Rose, J. Saxe. Optimizing synchronous circuitry by retiming. Third Caltech Conference on VLSI, March 1983.
[5] H. Schmit. Incremental reconfiguration for pipelined applications. Proceedings of the IEEE Symposium on Field-Programmable Gate Arrays for Custom Computing Machines, April 1997.
[6] D. P. Singh, S. D. Brown. Integrated retiming and placement for field-programmable gate arrays. Tenth ACM International Symposium on Field-Programmable Gate Arrays, 2002.
[7] B. J. Smith. Architecture and applications of the HEP multiprocessor computer system. Advances in laser scanning technology. SPIE Proceedings 298, Society for Photo-Optical Instrumentation Engineers, 1981.
[8] Synplify Pro (http://www.synplicity.com//products//synplifypro//index.html).
[9] Synopsys, Inc. Synopsys FPGA Compiler II (http://www.synopsys.com).
[10] W. Tsu, K. Macy, A. Joshi, R. Huang, N. Walker, T. Tung, O. Rowhani, V. George, J. Wawrzynek, A. DeHon. HSRA: High-speed, hierarchical synchronous reconfigurable array. Proceedings of the International Symposium on Field-Programmable Gate Arrays, February 1999.
[11] D. M. Tullsen, S. J. Eggers, H. M. Levy. Simultaneous multi-threading: Maximizing on-chip parallelism. Proceedings 22nd Annual International Symposium on Computer Architecture, June 1995.
[12] N. Weaver, J. Hauser, J. Wawrzynek. The SFRA: A corner-turn FPGA architecture. Twelfth International Symposium on Field-Programmable Gate Arrays, 2004.
[13] N. Weaver, Y. Markovskiy, Y. Patel, J. Wawrzynek. Postplacement C-slow retiming for the Xilinx-Virtex FPGA. Eleventh ACM International Symposium on Field-Programmable Gate Arrays, 2003.
[14] Intel Corporation. The Intel IXP network processor. Intel Technology Journal 6(3), August 2002.
CHAPTER
19
CONFIGURATION BITSTREAM GENERATION Steven A. Guccione Cmpware, Inc.
While a reconfigurable logic device shares some of the characteristics of a fixed hardware device and some of a programmable instruction set processor, the details of the underlying architecture and how it is programmed are what distinguish these machines. Both a reconfigurable logic device and an instruction set processor are programmable by “software,” but the internal organization and use of this software are quite different. In an instruction set processor, the programming is a set of binary codes that are incrementally fed into the device during operation. These codes actually carry out a form of reconfiguration inside the processor. The arithmetic and logic unit(s) (ALU) is configured to perform a requested function and various control multiplexers (MUXes) that control the internal flow of data are set. In the instruction set machine, these hardware components are relatively small and fixed and the system is reconfigured on a cycle-by-cycle basis. The processor itself changes its internal logic and routing on every cycle based on the input of these binary codes. In a processor, the binary codes—the processor’s machine language—are fairly rigid and correspond to sequential “instructions.” The sequence of these instructions to implement a program is often generated by some higher-level automatic tool such as a high-level language (HLL) compiler from a language such as Java, C, or C++. But they may, in reality, come from any source. What is important is that the collection of binary data fits this rigid format. The collection of binary data goes by many names, most typically an “executable” file or even more generally a “binary program.” A reconfigurable logic device, or field-programmable gate array (FPGA), is based on a very different structure than that of an instruction set machine. It is composed of a two-dimensional array of programmable logic elements joined together by some programmable interconnection network. 
The most significant difference between FPGA and the instruction set architecture is that the FPGA is typically intended to be programmed as a complete unit, with the various internal components acting together in parallel. While the structure of its binary programming (or configuration) data is every bit as rigid as that of an instruction set processor, the data are used spatially rather than sequentially. In other words, the binary data used to program the reconfigurable logic device are loaded into the device’s internal units before the device is placed
in its operating mode, and typically, no changes are made to the data while the device is operating. There are some significant exceptions to this rule: The configuration data may in fact be changed while a device is operational, but this is somewhat akin to “self-modifying code” in instruction set architectures. This is a very powerful technique, but carries with it significant challenges. The collection of binary data used to program the reconfigurable logic device is most commonly referred to as a “bitstream,” although this is somewhat misleading because the data are no more bit oriented than that of an instruction set processor and there is generally no “streaming.” While in an instruction set processor the configuration data are in fact continuously streamed into the internal units, they are typically loaded into the reconfigurable logic device only once during an initial setup phase. For historical reasons, the somewhat undescriptive “bitstream” has become the standard term. As much as the binary instruction set interface describes and defines the architecture and functionality of the instruction set machine, the structure of the reconfigurable logic configuration data bitstream defines the architecture and functionality of the FPGA. Its format, however, currently suffers from a somewhat interesting handicap. While the format of the programming data of instruction set architectures is freely published, this is almost never the case with reconfigurable logic devices. Almost all of them that are sold by major manufacturers are based on a “closed” bitstream architecture. The underlying structure of the data in the configuration bitstream is regarded by these companies as a trade secret for reasons that are historical and not entirely clear. In the early days of reconfigurable logic devices, the underlying architecture was also a trade secret, so publishing the configuration bitstream format would have given too many clues about it. 
It is presumed that this was to keep competitors from taking ideas about an architecture, or perhaps even “cloning” it and providing a hardware-compatible device. It also may have reassured nervous FPGA users that, if the bitstream format was a secret, then presumably their logic designs would be difficult to reverse-engineer. While theft and cloning of device hardware do not appear to be a potential problem today, bitstream formats are still, perhaps out of habit alone, treated as trade secrets by the major manufacturers. This is a shame because it prohibits interesting experimentation with new tools and techniques by third parties. But this is perhaps only of interest to a very small number of people. The vast majority of users of commercial reconfigurable logic devices are happy to use the vendor-supplied tools and have little or no interest in the device’s internal structure as long as the logic design functions as specified. However, for those interested in the architecture of reconfigurable logic devices, trade secrecy is an important subject. While exact examples from popular industry devices are not possible because of this secrecy, much is publicly known about the underlying architectures, the general way a bitstream is generated, and how it operates when loaded into a device.
19.1
THE BITSTREAM
The bitstream spatially represents the configuration data of a large collection of small, relatively simple hardware components. Thus, we can identify these components and discuss the ways in which the bitstream is used to produce a working digital circuit in a reconfigurable logic device. Although there is really no limit to the types of units possible in a reconfigurable logic device, two basic structures make up the microarchitecture of most modern FPGAs. These are the lookup table (LUT) and the switch box. The LUT is essentially a very small memory element, typically with 16 bits of bit-oriented storage. Some early FPGAs used smaller 8-bit LUTs, and other more exotic architectures used non-LUT structures. In general, however, the vast majority of commercial FPGA devices sold over the last decade use the 16-bit LUT as a primary logic building block. The functionality of LUTs is very simple. Binary data are loaded into them to produce some Boolean function. In the case of the 16-bit LUT, there are four inputs, which can produce any arbitrary 4-input Boolean logic function. For instance, to provide the AND function of all four inputs, each bit in the memory except the bit at address A(1,1,1,1) is loaded with a binary 0 and the A(1,1,1,1) bit is loaded with a 1. The address inputs of the LUT are used as the inputs to the logic function, with the output of the LUT providing the output of the logic function. Figure 19.1 illustrates this mapping of a 2-input LUT to a 2-input AND gate. While the LUTs provide the logic for the circuit, the switch boxes provide the interconnection. These switch boxes are typically made up of multiplexers in various regular configurations. These multiplexers are controlled by bits of memory that select the inputs and send them to the multiplexer’s outputs. Figure 19.2 shows a typical configurable interconnect element constructed using a multiplexer.
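The way a LUT turns stored bits into a logic function can be illustrated with a short sketch. It assumes a convention, chosen only for this example, in which input 0 is the least significant address bit; real devices define their own bit ordering. The helper names `lut_bits` and `lut_eval` are hypothetical.

```python
def lut_bits(func, num_inputs):
    """Compute the memory contents that make a LUT implement func.

    Assumption for this sketch: input i drives address bit i (input 0 is the LSB).
    """
    bits = []
    for addr in range(2 ** num_inputs):
        inputs = [(addr >> i) & 1 for i in range(num_inputs)]
        bits.append(1 if func(*inputs) else 0)
    return bits


def lut_eval(bits, *inputs):
    """Evaluate a configured LUT: the inputs form an address into the stored bits."""
    addr = sum(value << i for i, value in enumerate(inputs))
    return bits[addr]


# A 4-input AND: every stored bit is 0 except the one at address (1,1,1,1).
and4 = lut_bits(lambda a, b, c, d: a & b & c & d, 4)
# A 2-input AND, matching the contents 0, 0, 0, 1 of the 2-input example.
and2 = lut_bits(lambda a, b: a & b, 2)
```

Changing only the stored bits, for example to those produced by `lut_bits(lambda a, b: a ^ b, 2)`, turns the same structure into an XOR gate, which is the sense in which the LUT contents are configuration data.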
The multiplexer inputs in Figure 19.2 are controlled by two memory elements that are set during configuration. They select which input value is sent to the output. By connecting large numbers of elements of this type, an interconnection
FIGURE 19.1 A 2-input LUT configured as an AND gate. (The LUT, with inputs a and b and output out, holds the bits 0, 0, 0, 1 and is shown beside the equivalent AND gate.)
FIGURE 19.2 A configurable 4-input multiplexer used in routing. (The multiplexer has inputs in0 through in3 and output out; its select lines cfg0 and cfg1 are driven by two D flip-flops loaded through cfgin0 and cfgin1.)
network of the kind typically used to construct modern reconfigurable logic devices can be made. In various topologies, the outputs of the multiplexers in the switch boxes feed the address inputs of the LUTs; the outputs of the LUTs, in turn, feed the inputs of the switch box multiplexers. This provides a basic reprogrammable architecture capable of producing arbitrary logic functions, as well as the ability to interconnect these functions in a variety of ways. How complex a circuit a given reconfigurable logic device can implement is based on both the number of LUTs and the size and complexity of the interconnection fabric. In fact, the topology of the interconnect fabric and the implementation of the switch boxes is perhaps the defining characteristic of an FPGA architecture. Older FPGAs had a limited silicon area and few metal layers to supply wires. For this reason, the LUTs were typically “islands” of logic, with the interconnect wires running in the “channels” between them. Where these channels intersected were the switch boxes. How many wires to use and how to configure the switch boxes were the main work of the FPGA architect. Balancing the cost of more wires with the needs of typical digital circuits was important to making a cost-effective device that would be commercially successful. Covering as many potential circuit designs as possible at as high a speed as possible, but with the smallest silicon area, is still the challenge FPGA device architects must confront. In later silicon process generations, however, more metal layers were available, which resulted in a much higher ratio of wires to logic in FPGAs. Where older generations of FPGAs often had a scarcity of interconnection resources, more modern FPGA devices seldom encounter circuits they are unable to implement because of a lack of routing resources.
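The interplay of routing multiplexers and LUTs can be sketched as a toy logic cell. This is an illustration under assumed conventions, not a model of any real device: the cell has two 4-input routing multiplexers feeding a 2-input LUT, and the layout of the configuration (`lut`, `mux_a`, `mux_b`) and all function names are invented for the example.

```python
def routing_mux(inputs, cfg1, cfg0):
    """4-input routing multiplexer; the two configuration bits pick one input."""
    return inputs[(cfg1 << 1) | cfg0]


def lut(bits, addr_inputs):
    """LUT lookup; addr_inputs[0] is taken as the least significant address bit."""
    addr = sum(value << i for i, value in enumerate(addr_inputs))
    return bits[addr]


def logic_cell(config, wires):
    """Evaluate one toy cell: two routing muxes select wires that feed a 2-input LUT."""
    a = routing_mux(wires, *config["mux_a"])
    b = routing_mux(wires, *config["mux_b"])
    return lut(config["lut"], (a, b))


def flatten_config(config):
    """Concatenate the cell's configuration: the LUT data plus the mux select bits."""
    return list(config["lut"]) + list(config["mux_a"]) + list(config["mux_b"])


# Hypothetical configuration: AND of wire 0 (select bits 0,0) and wire 3 (select bits 1,1).
config = {"lut": [0, 0, 0, 1], "mux_a": (0, 0), "mux_b": (1, 1)}
```

Here `flatten_config(config)` yields eight bits, four for the LUT and two for each multiplexer. Rerouting the cell to different wires changes only the multiplexer bits, and changing its logic function changes only the LUT bits.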
And these wires now tend to run on top of the logic rather than in channels, which has led to higher circuit densities, a tighter integration between the switch boxes and the logic, and faster interconnect. The configuration bitstream data for the routing are essentially the multiplexer inputs in these switch boxes. The memory for these MUX inputs tends
to be individual memory elements such as flip-flops scattered around the device as needed, establishing the basic bitstream for the FPGA: the LUT data plus the bits to control the routing multiplexers. While the LUTs and switch boxes are the basic elements of modern FPGA devices, many other components are possible. One of the more popular is a configurable input/output block, or IOB. An IOB is typically connected to the end of one of the wires in the routing system on one side and to a physical device pin on the other. It is then configured to define the type of pin used by this device: either input or output. More complex IOBs can configure pin voltages and even parameters such as capacitance, and some even provide higher-level support for various serial communication protocols. Much like switch boxes, the configuration bitstream data for the IOBs are some collection of bits used to set flip-flops within them to select these features. In addition to IOBs, other, more special-purpose units have turned up in later generations of FPGA devices. Two prominent examples are block memory and multiplier units. Block memory (BlockRAM) consists of relatively large RAM units that are usually on the order of 1K bits but can be implemented in any number of ways. The actual data bits may be part of the bitstream, which initializes the BlockRAM upon power-up. To reduce the size of the bitstream, however, these data may be absent and internal circuitry may be required to reset and initialize the BlockRAM. In addition to the internal data, the BlockRAM is typically interfaced to the switch boxes in various ways. Its location and interfacing to the interconnection network is a major architectural decision in modern reconfigurable logic device design. Because the multiplication function has become more popular in FPGA designs and because FPGAs are so inefficient at implementing such circuits, the addition of hardwired multiplier units into modern FPGA devices has been increasing.
These units typically have no internal state or configuration, but are interfaced to the interconnection network in a manner similar to the BlockRAM interface. As with the BlockRAM, where to locate these resources and how many to include are major architectural decisions that can have a large impact on the size and efficiency of modern FPGAs. Many other features also find control bits in the FPGA bitstream. Some of these are global control related to configuration and reconfiguration; others are ID codes and error-checking information such as cyclic redundancy check codes. How these features are implemented is very architecture dependent and can vary widely from device family to device family. One common feature is basic control for bit-level storage elements, often in the form of flip-flops on the LUT output. Various control bits often set circuit parameters such as the flip-flop type (D, JK, T) or the clock edge trigger type (rising or falling edge). The ability to change the flip-flop into a transparent D-type latch is also a popular option. Each of these bits also contributes to the configuration data, with one set of flip-flop configuration settings per LUT being typical. Finally, while the items just discussed are the major standard units used to construct modern FPGA devices and define the configuration bitstream, there
TABLE 19.1 Configuration bitstream sizes

Year   Device     Bits
1986   XC2018     18 Kbits
1988   XC3090     64 Kbits
1990   XC4013     248 Kbits
1994   XC4025     422 Kbits
1996   XC4028     668 Kbits
1998   XCV1000    6.1 Mbits
2000   XCV3200    16 Mbits
2003   XC2V8000   29 Mbits
is no limit to the types of circuits and configurations possible. For example, an interest in analog FPGAs has resulted in unique architectures to perform analog signal processing. Also, some coarser-grained reconfigurable logic devices have moved up in granularity from LUTs to ALUs, and these devices have somewhat different bitstream structures. Other architectures have gone in the other direction toward extremely fine-grained architectures. One notable device, the Xilinx XC6200, has a logic cell that is essentially a 2-input multiplexer. The balance of routing and logic in these devices has made them less attractive than coarser-grained devices, but they have not been reevaluated in the context of the denser routing available with newer multilayer metal processes and so may yet have some promise. As FPGA devices themselves have grown, so has the size of the configuration bitstreams. In fact, bitstream size can be a reasonable gauge of the size and complexity of the underlying device, which can be useful because it is a single number that is readily available. Table 19.1 gives some representative sizes of various bitstreams from members of the Xilinx family of FPGAs and the approximate dates they were introduced.
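As a rough worked example of using bitstream size as a gauge, the figures in Table 19.1 can be turned into an average doubling time. The sketch below assumes that 1 Kbit means 1,000 bits and 1 Mbit means 1,000,000 bits, and it treats the approximate introduction years as exact, so the result is only indicative. The function name `doubling_time_years` is hypothetical.

```python
import math

# (year, device, size in bits) taken from Table 19.1.
# Assumption: Kbits and Mbits are decimal (10**3 and 10**6 bits).
BITSTREAMS = [
    (1986, "XC2018", 18e3),
    (1988, "XC3090", 64e3),
    (1990, "XC4013", 248e3),
    (1994, "XC4025", 422e3),
    (1996, "XC4028", 668e3),
    (1998, "XCV1000", 6.1e6),
    (2000, "XCV3200", 16e6),
    (2003, "XC2V8000", 29e6),
]


def doubling_time_years(first, last):
    """Average time for bitstream size to double between two table entries."""
    (year0, _, bits0), (year1, _, bits1) = first, last
    return (year1 - year0) / math.log2(bits1 / bits0)


overall = doubling_time_years(BITSTREAMS[0], BITSTREAMS[-1])
```

Between the first and last rows the size grows by a factor of roughly 1,600 over 17 years, which works out to a doubling time of about 1.6 years under these assumptions.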
19.2
DOWNLOADING MECHANISMS
The FPGA configuration bitstream is typically saved externally in a nonvolatile memory such as an EPROM. The data are usually loaded into the device shortly after the initial power-up sequence, most often bit-serially. (This loading mechanism may be the reason that many engineers perceive the configuration data as a “stream of bits.”) The reason for serial loading is primarily one of cost and convenience. Since there is usually no particular hurry in loading the FPGA configuration data on power-up, using a single physical device pin for this data is the simplest, cheapest approach. Once the data are fully loaded, this pin may even be put into service as a standard I/O pin, thus preventing the configuration downloading mechanism from consuming valuable I/O resources on the device. A serial configuration download is the norm, but some FPGA devices have a parallel download mode that typically permits the use of eight I/O pins to
download configuration data in parallel. This may be helpful for designs that use an 8-bit memory device and for applications where reprogramming is common and speed is important—often the case when an FPGA is controlled by a host processor in a coprocessor arrangement. As with the serial approach, the pins may be returned to regular I/O duty once downloading is complete. One place where such high-bandwidth configuration is useful is in the device test in the factory. Testing FPGA devices after manufacture can be a very expensive task, mostly because of time spent attached to the test equipment. Thus, decreasing the configuration download time by a factor of eight may result in the FPGA manufacturer requiring substantially fewer pieces of test equipment, which can result in a significant cost savings during manufacture. Anecdotal evidence suggests that high-speed download is driven mostly by increased test efficiency and not by any customer requirements related to runtime reconfiguration. One type of device that is based on nonvolatile memory bears mention here. Rather than using RAM and flip-flops as the internal logic and control, commercially available devices from companies such as Actel use nonvolatile Flash-style internal configuration memory. These devices are programmed once and do not require reloading of configuration data on power-up, which can be important in systems that must be powered-up quickly. Such devices also tend to be more resistant to soft errors that can occur in volatile RAM devices. This makes them especially popular in harsh environments such as space and military applications.
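The difference between serial and 8-bit parallel download can be made concrete with a small calculation. The sketch below assumes one transfer per configuration clock cycle and ignores any protocol overhead. The 10 MHz clock is a hypothetical figure chosen for illustration, not a device specification, and `download_time_seconds` is an invented helper.

```python
def download_time_seconds(bitstream_bits, clock_hz, pins=1):
    """Time to load a bitstream when `pins` bits are transferred per clock cycle."""
    cycles = -(-bitstream_bits // pins)  # ceiling division: a partial word costs a cycle
    return cycles / clock_hz


# Assumed example values: a 29 Mbit bitstream (decimal megabits) and a 10 MHz clock.
BITS = 29_000_000
CLOCK_HZ = 10_000_000

serial_time = download_time_seconds(BITS, CLOCK_HZ, pins=1)
parallel_time = download_time_seconds(BITS, CLOCK_HZ, pins=8)
```

Under these assumptions the serial load takes 2.9 seconds and the 8-pin load about 0.36 seconds. The factor of eight matters little at power-up but adds up quickly when a device is configured many times, as in factory test or coprocessor use.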
19.3
SOFTWARE TO GENERATE CONFIGURATION DATA
The software used to generate configuration bitstream data for FPGA devices is perhaps some of the most complex available. It usually consists of many layers of functionality and can run on the largest workstations for hours or even days to produce the output for a single design. While the details of this software are beyond the scope of this chapter, some of the ways in which the software generates this bitstream will be briefly discussed in this section. The top-level input to the FPGA design software is most often a hardware description language (HDL) or a graphical circuit design created with a schematic capture package. This representation is usually then translated into a low-level description more closely related to the implementation technology. A common choice for this intermediate format is EDIF (Electronic Design Interchange Format). This translation is fairly generic and such tools are widely available from a variety of software vendors. The EDIF description is still not suitable for directly programming the reconfigurable logic device. In the typical FPGA, the underlying circuit must be “mapped” onto the array of LUTs and switch boxes. While the actual implementation may vary, the two basic processes for getting such abstract circuit descriptions into a physical representation of FPGA configuration data are placement/routing and mapping. Figure 19.3 shows the basic flow of this process. Mapping refers to taking general logic descriptions and converting them into the bits used to fill in a LUT. This is sometimes referred to as “packing,” because
several small logic gates are often “packed” into a single LUT. There is also a notion of placement that decides which LUT should receive the data, but this may also be considered a part of the mapping process. Once the values for the LUTs have been decided, software can begin to decide how to interconnect the LUTs in a process called “routing.” There are many algorithms of varying sophistication to perform routing, and factors such as circuit timing may be taken into account in the process. The result of the routing procedure is eventually used to supply the configuration data for the switch boxes. Of course, this description is highly simplified, and mapping and routing can take place in various interleaved phases and can be optimized in a wide variety of ways. Still, this is the essential process used to produce the configuration bitstream.
FIGURE 19.3 The tool flow for producing the configuration bitstream (stages shown: HDL, Verilog/VHDL compiler, EDIF, map, place, route, binary configuration data).
Finally, data for configuring the IOBs are typically input in some form that is aware of the particular package being used for the FPGA device. Once all of these data have been defined and collected, they can be written out to a single file containing the configuration bitstream.
As mentioned, FPGA configuration bitstream formats have almost always been proprietary. For this reason, the only tools available to perform bitstream generation tasks have been those supplied by the device manufacturer. The one notable exception is the Xilinx XC6200, which had an “open” bitstream. One of the XC6200’s software tools was an application program interface (API) that permitted users to create configuration data or even to directly alter the configuration of an XC6200 in operation mode. Some of this technology was transferred to more mainstream Xilinx FPGAs and is available from Xilinx as a toolkit called JBits. JBits is a Java API into the configuration bitstream for the XC4000 and Virtex device families.
With JBits, the actual values on LUTs and switch box settings, as well as all other microarchitectural components, could be directly programmed. While the control data could be used to produce a traditional bitstream file, they could also be accessed directly and changed dynamically. The JBits API not only permitted dynamic reconfiguration of the FPGA but also permitted third-party tools to be built for these devices for the first time. JBits was very popular with researchers and users with exotic design requirements, but it never achieved popular use as a mainstream tool, although many of its related toolkit components, including the debug tool and partial reconfiguration support, have found their way into more mainstream software.
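As a rough illustration of the mapping step described in this section, the sketch below "packs" three small gates into the 16 configuration bits of a single 4-input LUT by evaluating the logic for every input combination. The function names, the example logic, and the bit ordering (first input as the least significant address bit) are assumptions made for illustration only; real devices use their own, typically unpublished, ordering.

```python
def lut_bits(func, num_inputs=4):
    """Return the truth table of `func` as a list of 0/1 values.

    Entry i holds the output when the inputs, read as a binary number
    with the first argument as the least significant bit, equal i.
    """
    bits = []
    for address in range(2 ** num_inputs):
        inputs = [(address >> k) & 1 for k in range(num_inputs)]
        bits.append(int(bool(func(*inputs))))
    return bits


def packed_logic(a, b, c, d):
    # Three small gates packed into one LUT: (a AND b) XOR (c OR d).
    return (a & b) ^ (c | d)


# The 16 values that would be written into the LUT's configuration memory.
table = lut_bits(packed_logic)

# The same values as a single 16-bit word (bit i = table[i]).
config_word = sum(bit << i for i, bit in enumerate(table))
```

Any 4-input function, however many gates it originally contained, reduces to one such 16-entry table, which is why several gates can share a LUT.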
19.4
SUMMARY While the generation of bitstream data to configure an FPGA device is a very common activity, there has been very little information available on the details of either the configuration bitstream or the underlying FPGA architecture. Thus, the FPGA can best be viewed as a collection of microarchitecture components, chiefly LUTs and switch boxes. These components are configured by writing data to the LUT values and to control memories associated with the switch boxes. Setting these bits to various values results in custom digital circuits. A variety of tools and techniques are used to program reconfigurable logic devices, but all must eventually produce the relatively small configuration “bitstream” data the devices require. This data is in as rigid a format as any binary execution data for a microprocessor, but this format is typically proprietary and unpublished. While direct examination of actual commercial bitstream data is largely impossible, the general structure and the microarchitecture components configured by this data can be examined, at least in the abstract.
References
[1] Xilinx, Inc. Virtex Data Sheet, Xilinx, Inc., 1998.
[2] S. A. Guccione, D. Levi, P. Sundararajan. JBits: A Java-based interface for reconfigurable computing. Second Annual Military and Aerospace Applications of Programmable Devices and Technologies Conference (MAPLD), Laurel, MD, September 1999.
[3] E. Lechner, S. A. Guccione. The Java environment for reconfigurable computing. Proceedings of the Seventh International Workshop on Field-Programmable Logic and Applications, September 1997.
[4] Xilinx, Inc. XAPP151: Virtex Series Configuration Architecture User Guide (version 1.7) (http://direct.xilinx.com/bvdocs/appnotes/xapp151.pdf), October 20, 2004.
[5] P. Alfke. FPGA Configuration Guidelines (version 1.1) (http://direct.xilinx.com/bvdocs/appnotes/xapp090.pdf), November 24, 1997.
[6] Xilinx, Inc. XC6200 Field-Programmable Gate Arrays, Xilinx, Inc., 1997.
[7] V. Betz, J. Rose. VPR: A new packing, placement, and routing tool for FPGA research. Proceedings of the Seventh International Workshop on Field-Programmable Logic and Applications, September 1997.
[8] Xilinx, Inc. JBits 2.8 SDK for Virtex, Xilinx, Inc., 1999.
CHAPTER 20
FAST COMPILATION TECHNIQUES
Ken Eguro, Scott Hauck
Department of Electrical Engineering, University of Washington
Most users rely on sophisticated CAD tools to implement their circuits on field-programmable gate arrays (FPGAs). Unfortunately, since each of these tools must perform reasonably complex optimization, the entire process can take a long time. Although fairly slow compilation is fine for the majority of current FPGA users, there are many situations that demand more efficient techniques. Looking into the future, we see that faster CAD tools will become necessary for many different reasons.
FPGA scaling. Modern reconfigurable devices have a much larger capacity compared to those from even a few years ago, and this trend is expected to continue. To handle the dramatic increase in problem size, while maintaining current usability and compilation times, smarter and more efficient techniques are required.
Hardware prototyping and logic emulation systems. These are very large multi-FPGA systems used for design verification during the development of other complex hardware devices such as next-generation processors. They present a challenging CAD problem both because of the sheer number of FPGAs in the system and because the compilation time for the design is part of the user’s debug cycle. That is, the CAD tool time directly affects the usability of the system as a whole.
Instance-specific design. Instance-specific designs are applications where a given circuit can only solve one particular occurrence of a problem. Because of this, every individual hardware implementation must be created and mapped as the problems are presented. Thus, the true solution time for any specific example includes the netlist compilation time.
Runtime netlist compilation. Reconfigurable computing systems are often constructed with an FPGA or an array of FPGAs alongside a conventional processor. Multiple programs could be running in the system simultaneously, each potentially sharing the reconfigurable fabric. In some of the most aggressive systems, portions of a program are individually mapped to the FPGA while the instructions are in flight. This creates a need for almost real-time compilation techniques.
For each of these systems, the runtime of the CAD tools is a clear concern. In this chapter, we consider each scenario and cover techniques to accelerate the
various steps in the mapping flow. These techniques range from fairly cost-neutral optimizations that speed the CAD flow without greatly impacting circuit quality to more aggressive optimizations that can significantly accelerate compilation time but also appreciably degrade mapping quality.
FPGA scaling
The mere scaling of VLSI technology itself has created part of the burden for conventional FPGA CAD tools. Fulfilling Moore’s Law, improvements in lithography and manufacturing techniques have radically increased the capabilities of integrated circuits over the last four decades. Of course, just as these advancements have increased the performance of desktop computers, they have increased the logic capacity of FPGAs. Correspondingly, the size of desired applications has also increased. Because of this simultaneous scaling across the industry, reconfigurable devices and their applications become physically larger at approximately the same rate that general-purpose processors become faster. Unfortunately, this does not mean that the time required to compile a modern FPGA design on a modern processor stays the same. Over a particular period of time, desktop computers and compute servers will become twice as fast and, concurrently, FPGA architectures and user circuits will double in size. Since the complexity of many classical design compilation techniques scales super-linearly with problem size, however, the relative runtime for mapping contemporary applications using contemporary machines will naturally rise. To continue to provide reasonable design compilation time across multiple FPGA generations, changes must be made to prevent a gap between available computational power and netlist compilation complexity. However, although application engineers depend on compilation times of at most a few hours to meet fast production timelines, they also have expectations about the usable logic block density and achievable clock frequency for their applications.
Thus, any algorithmic improvements or architectural changes made to speed up the mapping process cannot come at the cost of dramatically increased critical-path timing or reduced mapping density.
Hardware prototyping and logic emulation systems
The issue of nonscalable compilation is even more obvious in large prototyping or logic emulation systems. These devices integrate multiple FPGAs into a single system, harnessing tens to thousands of devices. As Chapter 30 discusses in more detail, the fundamental size of typical circuits on these architectures suggests fast mapping techniques. However, even more critical, the compilation time of the netlists themselves may become a limiting factor in the basic usefulness of the entire system. Hardware prototyping is employed for many reasons. One of the greatest advantages of hardware emulation over software simulation is its extremely fast validation time. During the design and debug cycle of hardware development, hundreds of thousands of test vectors may be applied to ensure that a given implementation complies with design specifications. Although an FPGA-based prototyping system cannot be expected to achieve anywhere near the clock rate of the dedicated final product, the sheer volume of tests that need to be performed
every time a change is made to the system makes software simulation too slow to have inside the engineering design loop. That said, software simulation code can easily accommodate design updates and, more important, the changes have a predictable compilation time of minutes to hours, not hours to days. Still, since reconfigurable logic emulation systems maintain such a runtime advantage over software simulation, prototyping designers are willing to exchange some of the classical FPGA metrics of implementation quality, critical-path timing, and logical density for faster and more predictable compilation time.
Instance-specific design
Similar to logic emulation systems, the netlist compilation time of instance-specific circuits can greatly affect the overall value of an FPGA-based implementation. For example, although Boolean satisfiability is NP-complete, the massive parallelism offered by reconfigurable fabrics can often solve these problems extremely quickly, potentially on the order of milliseconds (see Chapter 29). Unfortunately, these FPGA implementations are equation-specific, so the time required to solve any given SAT problem is not determined by the vanishingly short runtime of the actual mapped circuit running on a reconfigurable device, but instead is dominated by the compilation time required to obtain the programming bitstream in the first place, potentially on the order of hours. Because of this reliance on netlist compilation, the Boolean satisfiability problem differs strongly from more traditional reconfigurable computing applications for two reasons. First, if we disregard compilation time, FPGA-based SAT solvers can obtain two to three orders of magnitude better performance than software-based solutions. Thus, the critical path and, by extension, the overall quality of the mapping in the classical sense are virtually irrelevant. As long as compilation results in any valid mapping, the vast majority of the performance benefit will be maintained.
While some effort is required to reliably produce routable circuits, we can make huge concessions in terms of circuit quality in the name of speeding compilation. Mappings that are produced quickly, even if the resulting circuits are slow, will still drastically improve the overall solution runtime. Second, features of the SAT problem itself suggest that application-specific approaches might be worthwhile. For example, because SAT solvers typically have very structured forms, fast SAT-specific CAD tools can be created. One possibility is the use of preplaced and prerouted SAT-specialized macros that simply need to be assembled together to create the overall system. To extend the concept of application-specialized tuning to its logical end, architectural changes can even be made to the reconfigurable fabric itself to make the device particularly amenable to simple, fast mapping techniques. That said, the large engineering effort this would involve must be weighed against the possible benefits.
Runtime netlist compilation
All reconfigurable computing systems have a certain amount of overhead that eats away at their performance benefit. Although kernel execution might be blindingly fast once started on the reconfigurable logic, its overall benefit is limited by the
need to profile operations, transfer data, and configure or reconfigure the FPGA. Reconfigurable computing systems that use dynamically compiled applications have the additional burden of runtime netlist compilation. These systems only map application kernels to the hardware during actual system execution, in the hope that runtime data, such as system loads, resource availability, and execution profiles, can improve the resultant speedups provided by the hardware. Their almost real-time requirements demand the absolutely fastest compilation techniques. Thus, even more so than instance-specific designs, these systems are only concerned with compilation speed.
Mapping stages
When evaluating mapping techniques for high-speed circuit compilation, we have to remember that the individual tools are part of a larger system. Therefore, any quality degradation in an early stage may not only limit the performance of the final mapping, but also make subsequent compilation problems more difficult. If these later mapping phases are more difficult, they may require a longer runtime, overwhelming the speedups achieved in earlier steps. For example, a poor-quality placement obtained very quickly will likely make the routing problem harder. Since we are interested in reducing the runtime of the compilation phase as a whole, we must ensure that we do not simply trade placement runtime for routing runtime. We may even run the risk of increasing total compilation time, since a very poor placement might be impossible to route, necessitating an additional placement and routing attempt.
Although logic synthesis, technology mapping, and logic block packing are considered absolutely necessary parts of a modern, general-use FPGA compiler flow, the majority of research into fast compilation has been focused on efficient placement and routing techniques.
Not only do the placement and routing phases make up a large portion of the overall mapping runtime, in some cases the other steps can be considered either unsuitable or unnecessary to accelerate. Sometimes high-level synthesis and technology mapping may be unnecessary because designs are assumed to be implemented in low-level languages, or it is assumed that they can be performed offline and thus outside the task’s critical path. Furthermore, although logic synthesis and technology mapping can be very difficult problems by themselves, they are also common to all hardware CAD tools—not just FPGA-based technologies. On the other hand, placement and routing tools for reconfigurable devices have to deal with architectural restrictions not present in conventional standard cell tools, and thus generally must be accelerated with unique approaches.
20.1
ACCELERATING CLASSICAL TECHNIQUES An obvious starting point to improve the runtime of netlist compilation is to make minor algorithmic changes to accelerate the classical techniques already in use. For example, simulated annealing placement has some obvious parameters that can be changed to reduce overall runtime. The initial annealing temperature
can be lowered, the freezing point can be increased, the cooling schedule can be accelerated, or the number of moves per iteration can be reduced. These approaches all tend to speed up the annealing, but at some cost to placement quality.
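The four knobs just listed can be seen in a minimal sketch of a simulated annealing placer. This is not the annealer of any particular tool: the fixed geometric cooling schedule, the half-perimeter wirelength cost, the block-to-block swap move, and all parameter names and default values are assumptions chosen to keep the example short.

```python
import math
import random


def wirelength(placement, nets):
    """Sum of half-perimeter bounding boxes over all nets."""
    total = 0
    for net in nets:
        xs = [placement[b][0] for b in net]
        ys = [placement[b][1] for b in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total


def anneal(nets, num_blocks, grid, initial_temp=10.0, freeze_temp=0.01,
           cooling_rate=0.9, moves_per_temp=200, seed=0):
    """Place blocks 0..num_blocks-1 on a grid x grid array (num_blocks <= grid**2)."""
    rng = random.Random(seed)
    sites = [(x, y) for x in range(grid) for y in range(grid)]
    rng.shuffle(sites)
    placement = {b: sites[b] for b in range(num_blocks)}
    cost = wirelength(placement, nets)
    temp, attempted = initial_temp, 0
    while temp > freeze_temp:                 # knob: freezing point
        for _ in range(moves_per_temp):       # knob: moves per iteration
            a, b = rng.sample(range(num_blocks), 2)
            placement[a], placement[b] = placement[b], placement[a]
            new_cost = wirelength(placement, nets)
            delta = new_cost - cost
            # Good moves are always kept; bad moves are kept with a
            # probability that shrinks as the temperature drops.
            if delta <= 0 or rng.random() < math.exp(-delta / temp):
                cost = new_cost
            else:
                placement[a], placement[b] = placement[b], placement[a]
            attempted += 1
        temp *= cooling_rate                  # knob: cooling schedule
    return placement, cost, attempted
```

In this sketch, dividing `moves_per_temp` by 10 cuts the number of attempted moves by exactly 10, while lowering `initial_temp` or raising `freeze_temp` removes whole temperature steps from the run.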
20.1.1
Accelerating Simulated Annealing
Because of the adaptive nature of modern simulated annealing temperature schemes, any changes made to the structure of the cooling schedule itself can have unreliable runtime behavior. Not only have the settings of initial and final temperatures been carefully selected to thoroughly explore the solution space, but changing these values may dramatically affect final placement quality while still not guaranteeing satisfactorily shorter runtime. As described in Chapter 14, VPR updates the current temperature based on the fraction of moves accepted out of those attempted during a given iteration. Thus, decreasing the initial temperature cuts off the phase in which sweeping changes can easily occur early in the annealing. Simply starting the system at a lower initial temperature may cause the annealing to compensate by lingering longer at moderately high temperatures. Similarly, modifying the cooling schedule to migrate toward freezing faster fundamentally goes against the basic premise of simulated annealing itself. This will have an unpredictable, and likely undesirable, effect on solution quality. It is generally accepted that the most predictable way to scale simulated annealing effort is by manipulating the number of moves attempted per temperature iteration. For example, in VPR the number of moves in a given iteration is always based on the size of the input netlist: $O(n^{1.33})$. The annealing effort is simply adjusted by scaling up or down the multiplicative constant portion of this value. In VPR, the “fast” placement option simply divides the default value by 10, which in testing indeed reduces the overall placement time by a factor of 10 while affecting final circuit quality by less than 10 percent [3]. Furthermore, as shown by Mulpuri and Hauck [12], simply changing the number of moves per iteration allows a continuous and relatively predictable spectrum of placement effort versus placement quality results.
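The move-count rule above can be sketched as a one-line function. The exponent 4/3 is used here as a reading of the 1.33 quoted in the text, and the `effort` parameter with its default of 10 is a hypothetical stand-in for the tool's multiplicative constant, not a documented setting.

```python
def moves_per_temperature(num_blocks, effort=10.0):
    """Moves attempted at each temperature: effort * n^(4/3), at least 1."""
    return max(1, round(effort * num_blocks ** (4.0 / 3.0)))


default_moves = moves_per_temperature(1000)               # full effort
fast_moves = moves_per_temperature(1000, effort=10.0 / 10)  # "fast": constant divided by 10

# Doubling the netlist size multiplies the work per temperature by about 2^(4/3).
growth = moves_per_temperature(2000) / moves_per_temperature(1000)
```

Because only the constant changes, the effort scales linearly and predictably, while doubling the netlist raises the per-temperature work by roughly 2.5 times, which is the super-linear growth discussed in the chapter introduction.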
Haldar and colleagues [11] exploited a very similar phenomenon to reduce mapping time by distributing the simulated annealing effort across multiple processors. In the strictest sense, simulated annealing is very difficult to parallelize because it attempts sequential changes to a given placement in order to slowly improve the overall wirelength. To be most faithful to this process while attempting multiple changes simultaneously, different processors must try nonoverlapping changes to the system; otherwise, multiple processors may try to move the same block to two different locations or two different blocks to the same location. Not only is this type of coordination typically very difficult to enforce, it also generally requires a large amount of communication between processors. Since all processors begin each move operating on the same placement, they all must communicate any changes that are made after each step. However, a slightly less faithful but far simpler approach can take advantage of
the idea that reducing the number of moves attempted per temperature iteration can gracefully reduce runtime. In this case, all of the processors agree upon a single placement to begin a temperature iteration. At this point, though, each processor performs simulated annealing independently of the others. To reduce the overall runtime, given N processors, each only attempts 1/N of the originally intended moves per iteration. At the end of the iteration, the placements discovered by all of the processors are compared and the best one is broadcasted to the rest for use during the next iteration. This greatly reduces the communication overhead and produces nearly linear speedup for two to four processors while reducing placement quality by only 10 to 25 percent [11]. Wrighton and DeHon [19] also parallelized the simulated annealing process, but approached the problem in a completely different manner. In this case, instead of attempting to develop parallel software, they actually configure an FPGA to find its own placement for a netlist. They divide a large array into distinct processing elements that will each keep track of one node in a small netlist. In their testing, the logic required to trace the inputs and outputs of a single LUT required approximately 400 LUTs. Because every processing element represents the logic held at a single location in the array, a large emulation system consisting of approximately 400 FPGAs can place a netlist for one device at a time, or one large FPGA can place a netlist requiring approximately 1/400 of the array. Each processing element is responsible for keeping track of both the block in the netlist currently mapped to that location and the position of the sinks of the net sourced by this block. 
During a given timestep, each processing element determines the wirelength of its output net by evaluating the location of all of its sinks; the entire system is then perturbed in parallel by allowing each location to negotiate a possible swap with its neighbors. Just as in conventional simulated annealing, good moves are always accepted and bad moves are accepted with a probability dependent on the annealing temperature and how much worse the move makes the system as a whole. Similarly, although swaps can only be made one nearest neighbor to another, any block can eventually migrate to any other location in the array through multiple swaps. The system avoids having two blocks attempt to occupy the same location by always negotiating swaps pairwise. As shown in Figure 20.1, a block negotiates a swap with each of its neighbors in turn. Phases 1 and 2 may swap blocks to the left or right, while phases 3 and 4 may swap with a neighbor above or below. We should note that although very similar to the classical simulated annealing model, this arrangement does not necessarily calculate placement cost in the same way. The net bounding box calculated at each timestep cannot take into account the potential simultaneous movement of all the other blocks to which it is connected. That said, whatever inaccuracies might be introduced by this computation difference are relatively small. Of much greater importance is the problem caused by communication bandwidth. It is possible that in a given timestep every processing element decides to swap with its neighbor. If this is the case, the location of all sinks will change.
FIGURE 20.1 Swap negotiation in hardware-assisted placement. (Source: Based on an illustration in Wrighton and DeHon [19].)
FIGURE 20.2 Location update chain. (Source: Based on an illustration in Wrighton and DeHon [19].)
To keep completely consistent recordkeeping with conventional simulated annealing, this requires each processing element to notify its nets’ sources of the block’s new location. Of course, this creates a huge communication overhead. However, this can be avoided if the processing elements are allowed to calculate wirelength based on stale location information. As shown in Figure 20.2, instead of a huge broadcast each time a block is relocated, position information marches through the system in a linear fashion. As blocks are moved during the annealing process, new positions for each one are communicated to other blocks via a dedicated location update chain. Thus, if the system has N processing elements, it might take N clock cycles before all relevant processing elements see the new placement of that block. Since the
processing elements are still calculating further moves, this means up to N cycles of stale data. Because of these inaccuracies, compared with a fast VPR run, this hardware-based simulated annealing system generally requires 36 percent more routing tracks to implement the same circuits. However, it also is three to four orders of magnitude faster. As mentioned earlier, classical simulated annealing techniques have been very carefully tuned to produce high-quality placements. Most of the methodologies we have covered to accelerate simulated annealing rely on reducing the number of moves attempted. Thus, while they can produce reasonable placements quickly for current circuits, they do not necessarily perform well for all applications. Mulpuri and Hauck [12] demonstrated that, while we may be able to reduce the number of moves per temperature iteration by a factor of 10 with little effect on routability, if we continue to reduce the placement effort, the quality of the placement drops off severely. The conclusion to be drawn is that these acceleration approaches, although reasonable for dealing with FPGA scaling in the short term, are not a permanent solution. Applying them to ever larger netlists and devices will eventually lead to worse and worse placements, and, furthermore, they simply do not have the capability to produce usable placements quickly enough for either runtime netlist compilation or most instance-specific circuits. On the other hand, hardware-assisted simulated annealing seems far more promising. Although this technique introduces some inaccuracy in cost calculation because of both simultaneously negotiated moves and stale location information, the effect of these factors is relatively predictable. The error introduced by simultaneous moves will always be relatively small because all swaps are performed between nearest neighbors. Also, the error introduced by stale location information scales linearly with netlist size.
This means not only that such information will likely cause the placement quality to degrade gracefully but also that we can reduce this inaccuracy relatively easily by adding additional update paths, perhaps even a bidirectional communication network that quickly informs both forward and backward neighbors of a moved element. Since we hope that the majority of nets will cover a relatively small area, this should considerably reduce inaccurate cost calculation due to stale location information. These trade-offs make hardware-assisted annealing an interesting possibility. Although it may impose a significant quality cost, that cost may not grow with increased system capacity, and it may be one of the only approaches that provide the drastic speedups necessary for both runtime netlist compilation and instance-specific circuits. This may make it of particular interest for future nanotechnology systems (see Chapter 38).
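The pairwise swap negotiation of hardware-assisted annealing, described earlier in this section, can be sketched as a schedule that guarantees no location takes part in two swaps at once. The text states only that phases 1 and 2 are horizontal and phases 3 and 4 are vertical; which neighbor is paired in which phase (even-aligned pairs first, then odd-aligned pairs) is an assumption of this sketch, as is the function name.

```python
def swap_pairs(rows, cols, phase):
    """Disjoint nearest-neighbor pairs that may negotiate a swap in `phase` (1-4).

    Locations are (row, col) tuples. Phases 1 and 2 pair left/right
    neighbors; phases 3 and 4 pair neighbors above/below.
    """
    pairs = []
    if phase in (1, 2):
        start = phase - 1  # phase 1: columns (0,1),(2,3)...; phase 2: (1,2),(3,4)...
        for r in range(rows):
            for c in range(start, cols - 1, 2):
                pairs.append(((r, c), (r, c + 1)))
    elif phase in (3, 4):
        start = phase - 3  # phase 3: rows (0,1),(2,3)...; phase 4: (1,2),(3,4)...
        for c in range(cols):
            for r in range(start, rows - 1, 2):
                pairs.append(((r, c), (r + 1, c)))
    else:
        raise ValueError("phase must be 1, 2, 3, or 4")
    return pairs


# One full round on a 4 x 4 array: every adjacent pair gets exactly one turn.
schedule = {phase: swap_pairs(4, 4, phase) for phase in (1, 2, 3, 4)}
```

Since each location appears in at most one pair per phase, all swaps in a phase can be decided in parallel without two blocks ever claiming the same location.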
20.1.2
Accelerating PathFinder
Just as in placement, minor alterations can be made to classical routing algorithms to improve their runtime. Some extremely simple modifications may speed routing without affecting overall quality, or they may reduce routability in a graceful and predictable manner. Swartz et al. [15] suggest sorting the nets to be routed in order of decreasing fanout instead of simply arbitrarily. Although
high-fanout nets generally make up a small fraction of a circuit, they typically monopolize a large portion of the routing runtime. By routing these comparatively difficult nets first in a given iteration, they may be presented with the lowest congestion cost and thus take the most direct and easily found paths. Lower-fanout nets tend to be more localized, so they can deal with congestion more easily and their search time is comparatively smaller. This tends to speed overall routing, but since no changes are made to the actual search algorithm, it is not expected to affect routability. Conversely, Swartz et al. [15] also suggest scaling present sharing and history costs more quickly between routing iterations. As discussed in Chapter 17, PathFinder gradually increases the cost of using congested nodes to discourage sharing over multiple iterations. Increasing present sharing and history costs more aggressively emphasizes removing congestion over route exploration. This may potentially decrease achievable routability, but the system may converge on a legal routing more quickly. One of the most effective changes that can be made to conventional Dijkstra-based routing approaches is limiting the expansion of the search. Ignoring congestion, in most island-style FPGAs it is unnecessary for a given net to use routing resources outside the bounding box formed by its terminals. Of course, congestion must be resolved to obtain a feasible mapping, but given the routing-rich nature of modern reconfigurable devices, and assuming that routing is performed on a reasonable placement, the area formed by a net’s bounding box is most likely to be used. However, traditional Dijkstra searches expand from the source of a net evenly in all directions.
Given that the source of a 2-terminal net must lie on the edge of the bounding box, this is obviously wasteful since, again ignoring congestion costs, the search essentially progresses as concentric rings, most of which lie in the incorrect direction for finding the sink. As shown in Figure 20.3, it is unlikely that a useful route will require such a meandering path.
FIGURE 20.3 A conventional routing search wave.
If we would like to find
a route between blocks S and K, it is most likely that we will be able to find a direct route between them. Thus, we should direct the majority of our efforts upward and to the right before exploring downward or to the left. As described in Chapter 17, this is the motivation for adding A* enhancements to the PathFinder algorithm. However, this concept can be taken even further by formally preventing searches from extending very far beyond the net’s bounding box. According to Betz et al. [3], a reasonable fixed limitation can prevent an exploration from visiting routing channels more than three steps outside of a net’s bounding box. Although this technique may degrade routability under conditions of very high congestion, such situations may not be encountered. An architecture might have sufficient resources so that high-stress routing situations are never created, particularly in scenarios where the user is willing to reduce the amount of logic mapped to an FPGA to improve compilation runtimes.
Slightly more difficult to manage is the case of multi-terminal nets. Although the scope of a multisink search as a whole may be limited by the net’s bounding box, this only alleviates one source of typically unnecessary exploration. PathFinder generally sorts the sinks of a multi-terminal net by Manhattan distance. However, each time a sink is discovered, the search for the next sink is restarted based on the entire routing tree found up to that point. As shown in Figure 20.4, this creates a wide search ring that is explored and reexplored each time a new sink is discovered, which is particularly problematic for high-fanout nets. If we consider the new sink and the closest portion of the existing routing tree to be almost a 2-terminal net by itself, we can further reduce the amount of extraneous exploration. Swartz et al. [15] suggest splitting the bounding box of multi-terminal nets into gridlike bins.
As shown in Figure 20.5, after a sink is found, a new search is launched for the next furthest sink, but explorations are only started from the portion of the routing tree contained in the bin closest to the new target. In our example, after a route to K1 is found, only the portion
FIGURE 20.4 PathFinder exploration and multi-terminal nets (source S; sinks K1, K2, K3).
20.1 Accelerating Classical Techniques
FIGURE 20.5 Multi-terminal nets and region segmentation (source S; sinks K1, K2, K3).
of the existing path in the topmost bin is used to launch a search for K2. The process of restricting the initialization of the search is repeated to find a route to K3. This may result in slightly longer branches, but, again, it is not an issue in low-stress routing situations. Although potentially very effective, all of these techniques only attempt to improve the time required to route a single net. As described in Chapter 17, however, the PathFinder algorithm is relatively amenable to parallel processing. Chan et al. [7] showed that we can simply split the nets of a given circuit among multiple processors and allow each to route its nets mostly independently of the others. Similarly to what happens in parallel simulated annealing, complete faithfulness to the original PathFinder algorithm requires a large amount of communication bandwidth. This is because we have no guarantees that one processor will not attempt to route a signal on the same wire as another processor during a given iteration unless they are in constant communication with each other. However, because PathFinder already has a mechanism to discourage the overuse of routing resources between different nets over multiple iterations, such continuous communication is unnecessary. We can allow multiple processors to operate independently of one another for an entire routing iteration. When all processors have routed all of their nets, we can simply determine which nodes were accidentally shared by different processors and increase their present sharing and history costs appropriately. Just as it discourages sharing between nets in classical single-processor PathFinder, this gradually discourages sharing between different processors over multiple iterations. We are using the built-in conflict-resolution mechanism in a slightly different way, but this allows us to reduce the communication overhead considerably. 
That said, after we have resolved the large-scale congestion in the system, the last few routing iterations likely must be performed on a single processor using conventional PathFinder.
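The end-of-iteration bookkeeping described above can be sketched in a few lines. This is a minimal illustration rather than the implementation of Chan et al. [7]: the function name, the data layout, and the linear cost increment are all assumptions made for the example.

```python
from collections import Counter


def resolve_cross_processor_sharing(routes_by_processor, history_cost,
                                    history_increment=1.0):
    """Penalize routing nodes that different processors used in one iteration.

    routes_by_processor: one dict per processor, mapping net name to the list
        of routing-node ids on that net's path (hypothetical layout).
    history_cost: dict of node id -> accumulated history cost (updated in place).
    Returns a dict of node id -> present sharing (number of extra processors).
    """
    users = Counter()
    for routes in routes_by_processor:
        # Count a node once per processor. Sharing among nets on the same
        # processor is assumed to be handled by that processor's own run.
        nodes_used_here = set()
        for path in routes.values():
            nodes_used_here.update(path)
        users.update(nodes_used_here)

    present_sharing = {node: count - 1
                       for node, count in users.items() if count > 1}
    for node, overuse in present_sharing.items():
        # Assumed policy: history cost grows linearly with the overuse.
        history_cost[node] = (history_cost.get(node, 0.0)
                              + history_increment * overuse)
    return present_sharing
```

Each processor would call its own router for a full iteration before this function runs once, so the only communication is the exchange of finished routes at the iteration boundary.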
Overall, these techniques are extremely effective on modern FPGAs. Most of today’s reconfigurable architectures include a wealth of routing resources that are sufficient for a wide range of applications. Because of this, all of these approaches to accelerating PathFinder-style routing produce good results. Ordering of nets, fast growth of present sharing and history costs, and limiting the scope of exploration to net bounding boxes are common in modern FPGA routing tools. Unfortunately, however, they are still not fast enough for the most demanding applications such as runtime netlist compilation. Even the parallel technique outlined here has an unavoidable serial component. Thus, while such techniques may be adequate to produce results for next-generation FPGAs or hardware prototyping systems, they must be much faster if we are to make runtime netlist compilation practical.
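As one illustration of limiting exploration to a net's bounding box, the sketch below runs a shortest-path search on a unit-cost grid and refuses to expand any location more than `margin` steps outside the box spanned by the source and sink. The grid model, the function name `bounded_route`, and the `blocked` set are assumptions for the example. Only the default margin of three comes from the limit attributed to Betz et al. [3] earlier in this section.

```python
import heapq


def bounded_route(source, sink, blocked=frozenset(), margin=3):
    """Shortest path on a unit-cost grid, confined to the net's bounding box
    grown by `margin` steps on every side.

    source, sink: (x, y) tuples. blocked: set of unusable (x, y) locations.
    Returns the path as a list of (x, y), or None if no path exists in bounds.
    """
    x_lo = min(source[0], sink[0]) - margin
    x_hi = max(source[0], sink[0]) + margin
    y_lo = min(source[1], sink[1]) - margin
    y_hi = max(source[1], sink[1]) + margin

    frontier = [(0, source)]
    cost = {source: 0}
    came_from = {source: None}
    while frontier:
        dist, node = heapq.heappop(frontier)
        if node == sink:
            path = []
            while node is not None:
                path.append(node)
                node = came_from[node]
            return path[::-1]
        if dist > cost[node]:
            continue  # stale queue entry
        x, y = node
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if not (x_lo <= nxt[0] <= x_hi and y_lo <= nxt[1] <= y_hi):
                continue  # outside the enlarged bounding box: never explored
            if nxt in blocked:
                continue
            new_cost = dist + 1
            if new_cost < cost.get(nxt, float("inf")):
                cost[nxt] = new_cost
                came_from[nxt] = node
                heapq.heappush(frontier, (new_cost, nxt))
    return None
```

The trade-off discussed in the text is visible here: a detour that needs to leave the enlarged box is simply not found, which is acceptable only when congestion is low enough that such detours are rarely needed.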
20.2 ALTERNATIVE ALGORITHMS
Although classical mapping techniques have proven that they can achieve high-quality results, there is a limit to their acceleration through conventional means if we want to maintain acceptable quality for many applications. For example, in the case of placement the number of moves attempted in the inner loop of simulated annealing can only be reduced to a certain point before solution quality is no longer acceptable. While the runtime on a single processor can be cut by a factor of 10 with relatively little change in terms of routability or critical-path timing, even such modest degradation may not meet the most demanding design constraints. Furthermore, as discussed earlier, attempting to scale this technique beyond the 10x point generally results in markedly lower quality because the algorithm simply does not have sufficient time to adequately explore the solution space. To achieve further runtime improvements without resorting to potentially complex parallel implementations and without abandoning solution quality, we must make fundamental algorithmic changes.
20.2.1 Multiphase Solutions
One of the most popular ways to accelerate placement is to break the process into multiple phases, each handled by a different algorithm. Although many techniques use this method, a common thread among them all is that large-scale optimization is performed first by a fast but relatively imprecise algorithm. Slower, more accurate algorithms are reserved for local, small-scale refinement as a secondary step. A good example of this approach is shown in papers such as that by Xu and Kalid [20]. Here, the authors use a quadratic technique to obtain a rough placement and then work toward a better solution with a short simulated annealing phase. In quadratic placement, the connections between blocks in the netlist are converted into linear equations, any valid solution to which indicates the position of each block. A good placement solution is found by solving the matrix equations while attempting to minimize another function: the sum of the squared
wirelength for each net. Unfortunately, one of the problems with this approach is that, in order for the equations to be solved quickly, they must be unconstrained. Thus, the placements found directly from the quadratic solver will likely have many blocks that overlap. Xu and Kalid [20] identify these overlapping cells and, over multiple iterations, slowly add equations that force them to move apart. This is a comparatively fast process, but the additional placement legalization factors are added somewhat arbitrarily. Thus, although the quadratic placement might have gotten all of the blocks in roughly the correct area, there is still quite a bit of room for wirelength and timing improvements. In contrast, while simulated annealing produces very good results, much of the runtime is devoted to simply making sense of a random initial placement. By combining the two approaches, and starting a low-temperature annealing only after we obtain a reasonable initial placement from the quadratic solver phase, we can drastically reduce runtime and still maintain the majority of the solution quality. Similar approaches can substitute force-directed placement for large-scale optimization or completely greedy optimization for small-scale improvement [12]. Another way to quickly obtain relatively high-quality initial placements is with partitioning-based approaches. As mentioned in Chapter 14, although recursive bipartitioning can be performed very quickly, reducing the number of signals cut by the partitions is not necessarily the same thing as minimizing wirelength or critical path delay. A similar but more sophisticated method is also discussed in Chapter 14. In hierarchical placement, as described by Sankar and Rose [13], the logical resources of a reconfigurable architecture are roughly divided into K separate regions. Multiple clustering steps then assign the netlist blocks into groups of approximately the correct size for the K logical areas. 
At this point, the clusters themselves can be moved around via annealing, assuming that all of the blocks in a cluster are at the center of the region. This annealing can be performed very quickly since the number of clusters is relatively small compared to the number of logic blocks in the netlist. We can obtain a relatively good logic block-level placement by taking the cluster-level placement and decomposing it. Here, we can take each cluster in turn and arbitrarily place every block somewhere within the region assigned to it earlier. This initial placement can then be refined with a low-temperature annealing.
Purely mechanical clustering techniques are not the only way to group related logic together and obtain rough placements very quickly. In fact, the initial design specification itself holds valuable information concerning how the circuit is constructed and how it might best be laid out. Unfortunately, this knowledge is typically lost in the conventional tool flow. Regardless of whether they are using a high-level or low-level hardware description language, the organizational methods of humans naturally form top-level designs by connecting multiple large modules together. These large modules are, in turn, also created from lower-level modules. However, information about the overall design organization is generally not passed down through logical synthesis and technology mapping tools.
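The decomposition step of hierarchical placement described above, in which each cluster's blocks are placed arbitrarily inside the region assigned to that cluster, might look like the following sketch. It assumes rectangular regions given as inclusive corner coordinates. The function and parameter names are hypothetical and are not taken from Sankar and Rose [13].

```python
import random


def decompose_cluster_placement(cluster_region, cluster_blocks, seed=0):
    """Turn a cluster-level placement into a block-level initial placement.

    cluster_region: cluster name -> (x0, y0, x1, y1), an inclusive rectangle
        of logic block sites (assumed representation).
    cluster_blocks: cluster name -> list of block names in that cluster.
    Returns block name -> (x, y), with every block on its own site inside
    the region of its cluster.
    """
    rng = random.Random(seed)
    placement = {}
    for cluster, blocks in cluster_blocks.items():
        x0, y0, x1, y1 = cluster_region[cluster]
        sites = [(x, y) for x in range(x0, x1 + 1) for y in range(y0, y1 + 1)]
        if len(blocks) > len(sites):
            raise ValueError(f"cluster {cluster} does not fit in its region")
        # Arbitrary (random) choice of distinct sites within the region.
        for block, site in zip(blocks, rng.sample(sites, len(blocks))):
            placement[block] = site
    return placement
```

The result is only a starting point: the text's low-temperature annealing pass would then refine block positions within and across region boundaries.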
Packing, placement, and routing are typically performed on a completely flattened netlist of basic logic blocks. However, as suggested in works by Gehring and Ludwig [10] and Callahan et al. [6], for example, this innate hierarchy can, for most applications, suggest which pieces are heavily interconnected and should be kept close together during the mapping process. Furthermore, information about multiple instances of the same module can be used to speed the physical design process.
The datapath-oriented methodology described in Chapter 15 uses a closely related concept to help design highly structured computations. In datapath composition, the entire CAD toolflow, from initial algorithm specification to floorplanning to placement, is centered on building coarse-grained objects that have obvious, simple relationships to one another. The entire computation is built from regular, snap-together tiles that can be arranged in essentially the same order in which they appear in the input dataflow graph. Although many applications simply do not fit the restrictive nature of the datapath computation model, applications that can be implemented in this way benefit greatly from the highly regular structures these tools create.
There may not be as much regularity in most applications, but we can still use organizational information to accelerate both placement and routing. At the very least, such information provides some top-level hints to reasonable clustering boundaries and can be used to roughly floorplan large designs. In some sense, this is exactly the aim of hierarchical placement, although it attempts to accomplish this without any a priori knowledge. Extending this idea, for very large systems we can use these natural boundaries to create multiple, more or less independent top-level placement problems.
Even if we place each of the large system-level modules serially on a single processor, it is likely that, because of nonlinear growth in problem complexity, the total runtime will still be smaller than if we had performed one large, unified placement. We can also employ implicit organizational information on a smaller scale in a bottom-up fashion. For example, many modern FPGAs contain dedicated fast carry-chain logic between neighboring cells. To use these structures, however, the cells must be placed in consecutive vertical logic block locations. If we were to begin with a random initial placement for a multibit adder, we would probably not find the optimal single-column placement despite the fact that, based on higher-level information, the best organization is obvious. Such very common operations can be identified and then preplaced and routed with known good solutions. These blocks then become hard macros. Less common or larger calculations can be identified and turned into soft macros. As suggested by projects such as Tessier’s [17], using the high-level knowledge of macros within a hierarchical-style placement tool can improve runtime by a factor of up to 50 without affecting solution quality. Still, while macro identification can significantly improve placement runtime, its effect on routing runtime is likely negligible. Soft macros still need to be routed because each instance may be of a different shape. Furthermore, although hard macros do not need to be repeatedly routed, and may be relatively common, their nets represent a small portion of the overall runtime because
they are typically short and are simple to route. Rather, to substantially improve routing runtime we need to address the nets that consume the largest portion of the computational effort: high-fanout nets. As discussed earlier, multi-terminal nets present a host of problems for routers such as PathFinder. In many circuits, the routing time for one or two extremely high-fanout nets can be a significant portion of the overall routing runtime. However, this effort might be unnecessary since, even though these nets are ripped up and rerouted in every iteration, they go nearly everywhere within their bounding box. This means that virtually all legal routing scenarios will create a relatively even distribution of traffic within this region and none are markedly better than any other. For this reason, we can easily route these high-fanout nets once at the beginning of the routing phase and then exclude them from the conventional PathFinder run that follows without seriously affecting overall routability. At the very least, if we do not want to put these nets completely outside the control of PathFinder congestion resolution, we can rip up and reroute them less frequently, perhaps every other or every third iteration.
Regardless of how the placement and routing problem is divided into simpler subproblems, multiphase approaches are the most promising way to deal with the issues associated with FPGA technology scaling. Of course, when possible it is best to gather implicit hierarchical information directly from the source hardware description language specification. This not only allows us to create both hard and soft macros very easily, but gives strong hints regarding how large designs might be floorplanned. That said, we may not have information regarding high-level module organization. In these cases we can fall back on hierarchical or partitioning placement techniques to make subsequent annealing problems much more manageable.
All of these placement methodologies scale very well, and they represent algorithms that can solve the most pressing issues presented by growing reconfigurable devices and netlists. When applicable, constructive techniques, such as the datapath-oriented methodology described in Chapter 15, or macro-based approaches can be very useful for mapping hardware prototyping systems and instance-specific circuits. These methodologies naturally produce reasonable placements very quickly. Because hardware emulation systems and instance-specific circuits do not necessarily need optimal area or timing results, these techniques often produce placements that can be used directly without the need for subsequent refinement steps.
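To make the quadratic phase of a multiphase placer more concrete, the sketch below minimizes the sum of weighted squared wirelengths for two-terminal connections along one axis. Setting the derivative with respect to each movable block to zero gives the linear system mentioned in the text. For brevity the example solves it by repeatedly moving each block to the weighted average of its neighbors (a Gauss-Seidel iteration) instead of factoring a matrix. The one-dimensional setting, the function name, and the fixed pad positions are assumptions for the example and are not drawn from Xu and Kalid [20].

```python
def quadratic_place_1d(connections, fixed, sweeps=200):
    """Minimize sum(weight * (x_u - x_v)**2) over all connections.

    connections: list of (u, v, weight) two-terminal connections.
    fixed: name -> coordinate for immovable terminals such as I/O pads.
        Every movable block must reach a fixed terminal through connections.
    Returns name -> coordinate for every terminal and block.
    """
    neighbors = {}
    for u, v, w in connections:
        neighbors.setdefault(u, []).append((v, w))
        neighbors.setdefault(v, []).append((u, w))

    x = {name: float(fixed.get(name, 0.0)) for name in neighbors}
    movable = [name for name in neighbors if name not in fixed]
    for _ in range(sweeps):
        for name in movable:
            # Zero-gradient condition: block sits at the weighted mean
            # of the terminals it connects to.
            total_weight = sum(w for _, w in neighbors[name])
            x[name] = sum(w * x[other] for other, w in neighbors[name]) / total_weight
    return x


# A chain pad_l - a - b - pad_r spreads out evenly between the pads.
chain = quadratic_place_1d(
    [("pad_l", "a", 1), ("a", "b", 1), ("b", "pad_r", 1)],
    {"pad_l": 0.0, "pad_r": 3.0})

# Two blocks with identical connectivity land on the same coordinate,
# which is the overlap that a legalization step must later remove.
overlap = quadratic_place_1d(
    [("pad_l", "a", 1), ("a", "pad_r", 1), ("pad_l", "b", 1), ("b", "pad_r", 1)],
    {"pad_l": 0.0, "pad_r": 4.0})
```

The second call shows why the unconstrained solution cannot be used directly: nothing in the objective keeps `a` and `b` apart, so a spreading or annealing phase has to follow.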
20.2.2 Incremental Place and Route
Incremental placement and routing techniques attempt to reduce compilation time by combining and extending the same ideas exploited by multiphase compilation approaches: (1) begin with a known reasonable placement and (2) avoid ripping up and rerouting as many nets as possible. In many situations, multiple similar versions of a given circuit might be placed and routed several times. In the case of hardware emulation, for example, it is unlikely that large portions of the circuit will change between consecutive
designs. Far more likely is that small bug fixes or local modifications will be made to specific portions of the circuit, leaving the vast majority of the design completely unchanged. Incremental placement and routing methodologies identify those portions of a circuit that have not changed from a previous mapping and attempt to integrate the changed portions in the least disruptive manner. This allows successive design updates to be compiled very quickly and minimizes the likelihood of dramatic changes to the characteristics of the resultant mapping. The key to incremental mapping techniques is to modify an existing placement as little as possible while still finding good locations for newly introduced parts. The largest hurdle to this is merely finding a legal placement for all new blocks. If the changes reduce the overall size of the resulting circuit, any new logic blocks can simply fit into the void left by the old section. However, if the overall design becomes larger, the mapping process is more complex. Although the extra blocks can simply be dropped into any available location on the chip, this will probably result in poor timing and routability. Thus, incremental mapping techniques generally use simple algorithms to slightly move blocks and make vacant locations migrate toward the modified sections of the circuit. The most basic approaches, such as those described by Choy et al. [4], determine where the closest empty logic block locations are and then simply slide intervening blocks toward these vacancies to create space where it is needed. Singh and Brown [14] use a slightly more sophisticated approach that employs a stochastic hill-climbing methodology, similar to a restricted simulated annealing run. This algorithm takes into account where additional resources are needed, the estimated critical path of the circuit, and the estimated required wirelength. 
In this way, logic blocks along noncritical paths will preferentially be moved to make room for the added logic. Incremental techniques not only speed up the placement process, but can accelerate routing as well. Because so much of the placement is not disturbed, the nets associated with those logic blocks do not necessarily have to be rerouted. Initially, the algorithm can attempt to route only the nets associated with new or moved logic blocks. If this fails, or produces unacceptable timing results, the algorithm can slowly rip up nets that travel through congested or heavily used areas and try again. Either way, it will likely need to reroute only a very small portion of the overall circuit. Unfortunately, there are many situations in which we do not have the prior information necessary to use incremental mapping techniques. For example, the very first compilation of a netlist must be performed from scratch. Furthermore, it is a good idea to periodically perform a complete placement and routing run, because applying multiple local piecework changes, one on top of another, can eventually lead to disappointing global results. However, as mentioned earlier, incremental compilation is ideal for hardware prototyping systems because they are typically updated very frequently with minor changes. This behavior also occurs in many other development scenarios, which is why incremental compilation is a common technique to accelerate the engineering/debugging design loop.
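The two halves of an incremental update, opening a vacancy near the modified logic and then rerouting only the affected nets, can be sketched as follows. The example reduces the device to a single row of sites so that the sliding step is easy to follow. That reduction and both function names are assumptions for illustration, loosely modeled on the basic sliding idea attributed to Choy et al. [4] above rather than on their published algorithm.

```python
def slide_to_make_room(row, target):
    """Open a vacancy at index `target` in a row of logic block sites.

    row: list of block names, with None marking a vacant site.
    Blocks between `target` and the nearest vacancy each move one site
    toward that vacancy. Returns (new_row, moved_blocks).
    """
    vacancies = [i for i, block in enumerate(row) if block is None]
    if not vacancies:
        raise ValueError("row is full")
    hole = min(vacancies, key=lambda i: abs(i - target))
    step = 1 if hole > target else -1

    new_row = list(row)
    moved = []
    # Walk from the vacancy back toward the target, shifting blocks by one.
    for i in range(hole, target, -step):
        new_row[i] = new_row[i - step]
        moved.append(new_row[i])
    new_row[target] = None
    return new_row, moved


def nets_to_reroute(nets, changed_blocks):
    """Return the sorted names of nets touching any new or moved block.

    nets: net name -> iterable of block names on that net.
    """
    changed = set(changed_blocks)
    return sorted(name for name, pins in nets.items() if changed & set(pins))
```

For example, `slide_to_make_room(["a", "b", "c", None, "d"], 1)` shifts `b` and `c` one site to the right and leaves site 1 free for the new block. Only the nets attached to `b`, `c`, and the new block then need to be handed to the router, and every other route is kept.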
However, there are some situations in which it is very difficult to apply incremental approaches. For example, these techniques rely on the ability to determine which portions of a circuit do or do not change between design revisions. Not only can finding these similarities be a difficult problem in itself, but we must also be able to carefully control how high-level synthesis, technology mapping, and logic block packing are performed. These portions of the mapping process must be aware of when incremental placement and routing is going to be attempted, and of when major changes have been made to the netlist so that placement and routing should be attempted from scratch.
20.3 EFFECT OF ARCHITECTURE
Although we have considered many algorithmic changes that can improve compilation runtime, we should also consider the underlying reasons that the FPGA mapping problem is so difficult. Compared to standard cell designs, FPGAs are much more restrictive because the logic and routing are fixed. Technology mapping must target the lookup tables (LUTs) and small computational cores available on a given device, placement must deliver a legal arrangement that coincides with the array of provided logic blocks, and routing must contend with a fixed topology of communication resources. For these reasons, the underlying architecture of a reconfigurable device strongly affects the complexity of design compilation. For example, routing on a device that had an infinite number of extremely fast and flexible wires in the communication network would be easy. Every signal could simply take its shortest preferred path, and routing could be performed in a single pass of Dijkstra's algorithm. Furthermore, placement would also be obvious on such an architecture since even a completely arbitrary arrangement could meet design constraints. Granted, real-world physical limitations prevent us from developing such a perfect device, but we can reduce the necessary CAD effort with smart architectural design that emphasizes ease of compilation, potentially even over logic capacity and clock speed.
The Plasma architecture [2] is a good example of designing an FPGA explicitly for simple mapping. Plasma was developed as part of the Teramac project [1], an extremely large reconfigurable computing system slated to contain hundreds or thousands of individual FPGAs. Even given that a large design would be separated into smaller pieces that could be mapped onto individual FPGAs, contemporary commercial reconfigurable devices required tens of minutes to complete placement and routing for each chip.
To further compound this issue, even after placement was completed once, there was no guarantee that all of the signals could be successfully routed, so the entire process might have to be repeated. This meant that a design that utilized thousands of conventional FPGAs could require days or weeks of overall compilation time. For the Teramac system to be useful in applications such as hardware prototyping, in which design changes might be made on a daily or even hourly basis, mapping had to be orders of
magnitude faster. Thus, the Plasma FPGA architecture was designed explicitly with fast mapping in mind. Although Plasma differed from contemporary commercial FPGAs in several key ways, its most important distinction was high connectivity. Plasma was built from 6-input, 2-output logic blocks connected hierarchically by two levels of crossbars. As seen in Figure 20.6, logic blocks are separated into groups of 16 that are connected by a full crossbar that spans half the width of the chip. These groups are then connected to other groups by a central partial crossbar. The central vertical lines span a quarter of the height of the array, but have the capability to be connected together to span the entire distance. Since full crossbars would
FIGURE 20.6 The Plasma interconnect network (crossbars linked by partial crossbars).
have been prohibitively large, the developers used empirical testing to determine what level of connectivity was typically used in representative benchmarks. In addition to high internal connectivity, Plasma also contained an unusually large number of off-chip I/O pins. Although this extremely dense routing fabric consumed 90 percent of the overall area, and its reliance on very long wires reduced the maximum operating frequency considerably, placement and routing could reliably be performed on the order of seconds on existing workstations. Given Teramac’s target applications, the dramatic increase in compilation speed and the extremely consistent place and route success rate were considered to be more important than logical density or execution clock frequency.
Of course, not all applications can make such an extreme trade-off between ease of compilation and general usability metrics. However, manipulating the architecture of an FPGA does not necessarily require dramatically altering the characteristics of the device. For example, it is possible to make small changes to the interconnect to make routing simpler. One possibility is using a track domain architecture, which restricts the structure of the switch boxes in an island-style FPGA. As shown in Figure 20.7, the connectivity of an architecture’s switch boxes can affect routability. While each wire in both the top and bottom switch boxes has the same number of fanouts, the top switch box allows tracks to switch wire domains, eventually migrating to any track through multiple switch points.
FIGURE 20.7 Switch box style and routability (figure labels: 1, 2, L).
This allows a signal coming in on one wire on the left of the top architecture to reach all four wires exiting the right. However, the symmetric switch box shown on the bottom does not allow tracks to switch wire domains and forces a signal to travel along a single class of wire. This means that a signal coming in from the left of the bottom architecture can only reach two of the four wires exiting to the right. Although this may reduce the flexibility of the routing fabric somewhat [18], potentially requiring more wires to achieve the same level of routability [8], this effect is relatively minor. Even though we may need to increase the channel width of our architecture because of the restrictive nature of track domain switch boxes, routing on this type of FPGA can be dramatically faster than on more flexible systems. As shown by Cabral et al. [5], since the routing resources on track domain FPGAs are split into M different classes of wire, routing becomes a parallel problem. First, N processors are each assigned a small number of track domains from a given architecture. Then the nets from a circuit placed onto the architecture are simply split into N groups. Because each track domain is isolated from every other due to the nature of the architecture, each processor can perform normal PathFinder routing without fear that the paths found by one processor will interfere with the paths found by another. When a processor cannot route a signal on its allotted routing resources, it is given an additional unassigned track domain. Although load balancing between processors and track domains is somewhat of a problem, this technique has shown linear or even super-linear speedup with a very small penalty to routability. In this case, Cabral and colleagues [5] were able to solve the problems encountered by the parallel routing approaches that were discussed earlier by modifying the architecture itself. 
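The way isolated track domains remove the need for communication between routing processors can be illustrated with a deliberately small model. Here each net is reduced to an interval along a single routing channel, each track domain is a single track, and two nets may share a domain only if their intervals do not overlap. These simplifications, and all of the names, are assumptions for the example and do not describe the TDR algorithm of Cabral et al. [5] in detail. The outer loop runs serially, but each iteration touches only its own processor's domains, so in principle the iterations could run on separate processors with coordination needed only when a spare domain is granted.

```python
def route_by_track_domain(nets, num_domains, num_processors, domains_each=1):
    """Assign nets to track domains, one group of nets per processor.

    nets: net name -> (lo, hi) inclusive span along one channel (toy model).
    Returns net name -> track domain index.
    """
    if num_domains < num_processors * domains_each:
        raise ValueError("not enough track domains for the initial assignment")

    names = sorted(nets)
    groups = [names[p::num_processors] for p in range(num_processors)]
    spare = list(range(num_domains))
    owned = [[spare.pop(0) for _ in range(domains_each)]
             for _ in range(num_processors)]

    assignment = {}
    for p, group in enumerate(groups):      # independent work per processor
        used = {domain: [] for domain in owned[p]}   # domain -> routed spans
        for net in group:
            lo, hi = nets[net]
            for domain in owned[p]:
                if all(hi < a or lo > b for a, b in used[domain]):
                    break                    # fits on a domain we already own
            else:
                if not spare:
                    raise RuntimeError("no unassigned track domain left")
                domain = spare.pop(0)        # grant an additional domain
                owned[p].append(domain)
                used[domain] = []
            used[domain].append((lo, hi))
            assignment[net] = domain
    return assignment
```

Because no domain is ever owned by two processors, a conflict between nets routed by different processors cannot occur in this model, which is the property that the restricted switch boxes provide in hardware.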
Another way to modify the physical FPGA to speed routing is by offering specialized hardware to allow the device to route its own circuits. This is similar to the approach discussed earlier in which simulated annealing is implemented on a generic FPGA to accelerate the placement of its own circuits. DeHon et al. [9] suggest that by modifying the actual switch points internal to an FPGA, we can create a specialized FPGA that assists a host processor in PathFinder-like routing by performing its own Dijkstra searches. In this type of architecture, the switch points have additional hardware that gives them the ability to remember the inputs and outputs currently being used when the FPGA is put into a special compile-time-only “routing search” mode. After the placement of a given circuit is found, we configure the FPGA to perform routing on itself. This begins by clearing the occupancy markers on all of the switch points. During the routing phase, the host processor requests that each net in turn drive a signal from its source, which helps discover a path to each of its sinks. Every time this signal encounters a switching element, the switch allows the signal to propagate through unallocated resources but prevents it from continuing along occupied segments. In this way, the device explores all possible paths virtually instantaneously. When a route is found between the source and a sink, the switch point occupancy markers along this path are updated to reflect the “taken” status of these resources. When a route cannot be found for a given net, because all of the legal paths have been occupied
by earlier nets, the system simply victimizes a random previously routed path and rips it up until the blocked net can successfully route. Nets are continuously routed and ripped up in this round-robin fashion until all nets have been routed. Although this approach does not have the same sophistication as PathFinder, the experiments by DeHon and colleagues [9] show that hardware-assisted routing can obtain extremely similar track counts (only 1 to 2 additional tracks) with 4 to 6 orders of magnitude speedup in terms of runtime on the largest benchmarks.
Of course, modifying an FPGA architecture can involve a great deal of engineering effort. For example, while hardware-assisted routing is one of the few approaches that is fast enough to make runtime netlist compilation feasible, it involves completely redesigning the communication network. That said, not all of our architecture modifications need to be that drastic. For example, commercial FPGA manufacturers have already made modifications to their architectures that accelerate routing. As mentioned earlier, commercial FPGAs offer a resource-rich, flexible routing fabric to support a wide range of applications. Their high bandwidth and connectivity naturally make the routing problem simpler and much faster to solve. Following this logic, it seems natural that FPGAs might switch to track domain architectures in the future. While such devices require only minor layout changes that slightly affect overall system routability, they enable very simple parallel routing algorithms to be used. This becomes more and more important as reconfigurable devices scale and as multi-threaded and multicore processors gain popularity.
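The route-and-victimize loop of hardware-assisted routing can be mimicked in software to show its control flow, although doing so naturally gives up the speed that the hardware search provides. In the sketch below a breadth-first search stands in for the signal that propagates only through unoccupied switch points. The graph representation, the function names, the attempt limit, and the assumption that every source and sink is a dedicated pin are illustrative choices and are not details taken from DeHon et al. [9].

```python
import random
from collections import deque


def bfs_path(graph, source, sink, occupied):
    """Breadth-first search that never propagates through occupied nodes."""
    came_from = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == sink:
            path = []
            while node is not None:
                path.append(node)
                node = came_from[node]
            return path[::-1]
        for nxt in graph[node]:
            if nxt not in came_from and nxt not in occupied:
                came_from[nxt] = node
                queue.append(nxt)
    return None


def route_with_random_ripup(graph, nets, seed=0, max_attempts=1000):
    """Route two-terminal nets, ripping up random victims when blocked.

    graph: node -> list of neighboring nodes.
    nets: net name -> (source, sink); pins are assumed to be dedicated.
    Returns net name -> path (list of nodes).
    """
    rng = random.Random(seed)
    routed = {}
    occupied = {}                    # node -> net currently using it
    pending = deque(nets)
    attempts = 0
    while pending:
        attempts += 1
        if attempts > max_attempts:
            raise RuntimeError("gave up: attempt limit reached")
        net = pending.popleft()
        source, sink = nets[net]
        path = bfs_path(graph, source, sink, occupied)
        while path is None and routed:
            victim = rng.choice(sorted(routed))     # random earlier net
            for node in routed.pop(victim):
                occupied.pop(node, None)
            pending.append(victim)                  # victim is routed again later
            path = bfs_path(graph, source, sink, occupied)
        if path is None:
            raise RuntimeError(f"net {net} cannot be routed on an empty device")
        routed[net] = path
        for node in path:
            occupied[node] = net
    return routed
```

Unlike PathFinder, nothing here records why a net was blocked, so the loop relies on repeated random rip-ups to stumble onto a compatible set of routes. The attempt limit is a safeguard added for the software sketch.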
20.4 SUMMARY
In this chapter we explored many techniques to accelerate FPGA placement and routing. Ultimately, all of them have restrictions, benefits, and drawbacks. This means that our applications, architectures, and design constraints must dictate which methodologies can and should be used. Several of the approaches do not provide acceptable runtime given problem constraints, while some may not offer sufficient implementation quality. Some techniques may not scale adequately to address our issues, while we may not have the necessary information to use others.
FPGA scaling. Although classical block-level simulated annealing techniques have been the cornerstone of FPGA CAD tools for decades, these methodologies must eventually be replaced. Hierarchical and macro-based techniques seem to scale much more gracefully while preserving the large-scale characteristics of high-quality simulated annealing. On the other hand, routing will likely depend on PathFinder and other negotiated congestion techniques for quite some time. That said, for compilation time to keep pace given newer and larger devices, FPGA developers need to make some architectural changes that simplify the routing problem. Track domain
systems seem to be a natural solution given that modern desktops and workstations offer multiple types of parallel processing resources.
Hardware prototyping and logic emulation systems. While these systems benefit greatly from incremental mapping techniques, they still require fast place and route algorithms when compilation needs to be performed from scratch. Hardware-assisted placement seems an obvious choice that can take full advantage of the multichip arrays present in these large devices. Furthermore, since optimal critical-path timing is not essential and application source code is generally available to provide hierarchical information, datapath and macro-based approaches can be very effective.
Instance-specific designs. Datapath and macro-based approaches are even more important to instance-specific circuits because they cannot take advantage of many other techniques. However, the limited scope of these problems and the dramatic speedup made possible by these systems also make specialized architectures attractive. While the overhead imposed by architectures such as Plasma may not be practical for most commercial devices, these drawbacks are far less important to instance-specific circuits given the significant CAD tool benefits.
Runtime netlist compilation. Reconfigurable computing systems that require runtime netlist compilation present an incredibly demanding real-time compilation problem. Correspondingly, these systems require the most aggressive architectural approaches to make this possible. Radical system-wide modifications that provide huge amounts of routing resources significantly simplify the placement problem. However, just providing more bandwidth does not necessarily accelerate the routing process. These systems need to provide communication channels that either do not need to be negotiated or, through hardware-assisted routing, can automatically negotiate their own connections.
An open question is whether the advantages of runtime netlist compilation are worth the attendant costs and complexities they introduce.
References
[1] R. Amerson, R. Carter, W. Culbertson, P. Kuekes, G. Snider. Teramac—configurable custom computing. Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines, 1995.
[2] R. Amerson, R. Carter, W. Culbertson, P. Kuekes, G. Snider, L. Albertson. Plasma: An FPGA for million gate systems. Proceedings of ACM Symposium on Field-Programmable Gate Arrays, 1996.
[3] V. Betz, J. Rose, A. Marquardt. Architecture and CAD for Deep-Submicron FPGAs, Kluwer Academic, 1999.
[4] C. Choy, T. Cheung, K. Wong. Incremental layout placement modification algorithm. IEEE Transactions on Computer-Aided Design 15(4), April 1996.
[5] L. Cabral, J. Aude, N. Maculan. TDR: A distributed-memory parallel routing algorithm for FPGAs. Proceedings of International Conference on Field-Programmable Logic and Applications, 2002.
[6] T. Callahan, P. Chong, A. DeHon, J. Wawrzynek. Fast module mapping and placement for datapaths in FPGAs. Proceedings of ACM Symposium on Field-Programmable Gate Arrays, 1998.
[7] P. Chan, M.D.F. Schlag, C. Ebeling. Distributed-memory parallel routing for field-programmable gate arrays. IEEE Transactions on Computer-Aided Design 19(8), August 2000.
[8] Y. Chang, D. F. Wong, C. K. Wong. Universal switch modules for FPGA design. ACM Transactions on Design Automation of Electronic Systems 1(1), January 1996.
[9] A. DeHon, R. Huang, J. Wawrzynek. Hardware-assisted fast routing. Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines, 2002.
[10] S. Gehring, S. Ludwig. Fast integrated tools for circuit design with FPGAs. Proceedings of ACM Symposium on Field-Programmable Gate Arrays, 1998.
[11] M. Haldar, M. A. Nayak, A. Choudhary, P. Banerjee. Parallel algorithms for FPGA placement. Proceedings of the Great Lakes Symposium on VLSI, 2000.
[12] C. Mulpuri, S. Hauck. Runtime and quality trade-offs in FPGA placement and routing. Proceedings of ACM Symposium on Field-Programmable Gate Arrays, 2001.
[13] Y. Sankar, J. Rose. Trading quality for compile time: Ultra-fast placement for FPGAs. Proceedings of ACM Symposium on Field-Programmable Gate Arrays, 1999.
[14] D. Singh, S. Brown. Incremental placement for layout-driven optimizations on FPGAs. Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, 2002.
[15] J. Swartz, V. Betz, J. Rose. A fast routability-driven router for FPGAs. Proceedings of the ACM Symposium on Field-Programmable Gate Arrays, 1998.
[16] R. Tessier. Negotiated A* routing for FPGAs. Proceedings of the Canadian Workshop on Field-Programmable Devices, 1998.
[17] R. Tessier. Fast placement approaches for FPGAs. Transactions on Design Automation of Electronic Systems 7(2), April 2002.
[18] S. Wilton. Architecture and Algorithms for Field-Programmable Gate Arrays with Embedded Memory, Ph.D. thesis, University of Toronto, 1997.
[19] M. Wrighton, A. DeHon. Hardware-assisted simulated annealing with application for fast FPGA placement. Proceedings of the ACM Symposium on Field-Programmable Gate Arrays, 2003.
[20] Y. Xu, M.A.S. Khalid. QPF: Efficient quadratic placement for FPGAs. Proceedings of the International Conference on Field-Programmable Logic and Applications, 2005.
PART
IV
APPLICATION DEVELOPMENT
Creating an efficient FPGA-based computation is similar to creating any other hardware. A designer carefully optimizes his or her computation to the needs of the underlying technology, exploiting the parallelism available while meeting resource and performance constraints. These designs are typically written in a hardware description language (HDL), such as Verilog, and CAD tools are then used to create the final implementation.
Field-programmable gate arrays (FPGAs) do have unique constraints and opportunities that must be understood in order for this technology to be employed most effectively. The resource mix is fixed, and the devices are never quite fast enough or have high enough capacity for what we want to do. However, because the chips are reprogrammable we can change the system in response to bugs or functionality upgrades, or even change the computation as it executes.
Because of the unique restrictions and opportunities inherent in FPGAs, a set of approaches to application development have proven critical to exploiting these devices to the fullest. Many of them are covered in the chapters that follow. Although not every FPGA-based application will use each of the approaches, a true FPGA expert will make them all part of his or her repertoire.
Some of the most challenging questions in the design process come at the very beginning of a new project: Are FPGAs a good match for the application? If so, what problems must be considered and overcome? Will runtime reconfiguration be part of the solution? Will fixed- or floating-point computation be used? Chapter 21 focuses on this level of design, covering the important issues that arise when we first consider an application and the problems that must be avoided or solved. It also offers a quick overview of application development. Chapters 22 through 26 delve into individual concerns in more detail.
FPGAs are unique in their potential to be more efficient than even ASICs for some types of problems: Because the circuit design is completely programmable, we can create a custom circuit not just for a given problem but for a specific problem instance. Imagine, for example, that we are creating an engine for solving Boolean equations (e.g., a SAT solver, discussed in Chapter 29 in Part V). In an ASIC design, we
would create a generic engine capable of handling any possible Boolean equation because each use of the chip would be for a different equation. In an FPGA-based system, the equation can be folded into the circuit mapping itself, creating a custom FPGA mapping optimized to solving that Boolean equation and no other. As long as there is a CPU available to dynamically create a new FPGA bitstream each time a new Boolean equation must be solved, a much more aggressively optimized design can be created. However, because this means that the time to create the new mapping is part of system execution, fast mapping algorithms are often the key (Chapter 20). This concept of instance-specific circuits is covered in Chapter 22. In most cases, the time to create a completely new mapping in response to a specific problem instance is too long. Indeed, if it takes longer to create the custom circuit than for a generic circuit to solve the problem, the generic circuit is the better choice. However, more restricted versions of this style of optimization are still valuable. Consider a simple FIR filter, which involves multiplication of an incoming datastream with a set of constant coefficients. We could use a completely generic multiplier to handle the constant * variable computation. However, the bits of the constant are known in advance, so many parts of this multiplication can be simplified out. Multipliers, for example, generally compute a set of partial products—the result of multiplying one input with a single bit of the other input. These partial products are then added together. If the constant coefficient provided that single bit for a partial product, we can know at mapping creation time whether that partial product will be 0 or equal to the variable input—no hardware is necessary to create it. Also, in cases where the partial product is a 0, we no longer need to add it into the final result. 
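The partial-product argument above can be made concrete with a minimal sketch. It models, in software, which partial products survive when one multiplier input is a known constant; the function names and the coefficient value 20 are illustrative assumptions, not taken from the text.

```python
def constant_multiplier_terms(constant, width):
    """Return the bit positions of the constant whose partial products survive.

    The partial product for bit i is either 0 (bit clear) or the variable
    input shifted left by i (bit set). Only the set bits need an adder input.
    """
    return [i for i in range(width) if (constant >> i) & 1]


def multiply_by_constant(x, constant, width=8):
    """Multiply x by a known constant using only shifts and additions."""
    return sum(x << i for i in constant_multiplier_terms(constant, width))


# Hypothetical 8-bit coefficient: 20 = 0b00010100, so only bits 2 and 4 matter.
terms = constant_multiplier_terms(20, 8)
product = multiply_by_constant(7, 20)  # (7 << 2) + (7 << 4)
```

In this sketch an 8-bit generic multiplier would sum eight partial products, while the constant version sums two, which is the kind of saving the paragraph describes.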
In general, the use of constant inputs to a computation can significantly improve most metrics in FPGA mapping quality. These techniques, called constant propagation and partial evaluation, are covered in Chapter 22. Number formats in FPGAs are another significant concern. For microprocessor-based systems we are used to treating everything as a 64-bit integer or an IEEE-format floating-point value. Because the underlying hardware is hardcoded to efficiently support these specific number formats, any other format is unlikely to be useful. However, in an FPGA we custom create the datapath. Thus, using a 64-bit adder on values that are at most 18 bits in length is wasteful because each bit position consumes one or more lookup tables (LUTs) in the device. For this reason, an FPGA designer will carefully consider the required wordlength of the numbers in the system, hoping to shave off some bits of precision and thus reduce the hardware requirements of the design.
Fractional values, such as π or fractions of a second, are more problematic. In many cases, we can use a fixed-point format. We might use numbers in the range of 0 to 31 to represent the values from 0 to 31/32 in steps of 1/32 by just remembering that the number is actually scaled by a factor of 32. Techniques for addressing each of the concerns just mentioned are treated in Chapter 23. Sometimes these optimizations simply are not possible, particularly for signals that require a high dynamic range (i.e., they must represent both very large and very small values simultaneously), so we need to use a floating-point format. This means that each operation will consume significantly more resources than its integer or fixed-point alternatives will. Chapter 31 in Part V covers floating-point operations on FPGAs in detail.
Once the number format is decided, it is important to determine how best to perform the actual computation. For many applications, particularly those from signal processing, the computation will involve a large number of constant coefficient multiplications and subsequent addition operations, such as in finite impulse response (FIR) filters. While these can be carried out in the normal, parallel adders and multipliers from standard hardware design, the LUT-based logic of an FPGA allows an even more efficient implementation. By converting to a bit-serial dataflow and storing the appropriate combination of constants into the LUTs in the FPGA, the multiply–accumulate operation can be compressed to a small table lookup and an addition. This technique, called distributed arithmetic, is covered in Chapter 24. It is capable of providing very efficient FPGA-based implementations of important classes of digital signal processing (DSP) and similar operations.
Complex mathematical operations such as sine, cosine, division, and square root, though less common than multiply–add, are still important in many applications.
In some cases they can be handled by table lookup, with a table of precomputed results stored in memories inside the FPGA or in attached chips. However, as the size of the operand(s) for these functions grows, the size of the memory explodes, limiting this technique’s effectiveness. A particularly efficient alternative in FPGA logic is the CORDIC algorithm. By the careful creation of an iterative circuit, FPGAs can efficiently compute many of these complex functions. The full details of the CORDIC algorithm, and its implementation in FPGAs, are covered in Chapter 25. A final concern is the coupling of both FPGAs and central processing units (CPUs). In early systems, FPGAs were often deployed together with microprocessors or microcontrollers, either by placing an FPGA card in a host PC or by placing both resources on a single circuit board. With modern FPGAs, which can contain complete microprocessors
(either by mapping their logic into LUTs or embedding a complete microprocessor into the chip’s silicon layout), the coupling of CPUs and FPGAs is even more attractive. The key driver is the relative advantages of each technology. FPGAs can provide very high performance for streaming applications with a lot of data parallelism—if we have to apply the same repetitive transformation to a large amount of data, an FPGA’s performance is generally very high. However, for more sequential operations FPGAs are a poor choice. Sometimes long sequences of operations, with little or no opportunity for parallelism, come up in the control of the overall system. Also, exceptional cases do occur and must be handled—for example, the failure of a component, using denormal numbers in floating point, or interfacing to command-based peripherals. In each case a CPU is a much better choice for those portions of a computation. As a result, for many computations the best answer is to use the FPGA for the data-parallel kernels and a CPU for all the other operations. This process of segmenting a complete computation into software/CPU portions and hardware/FPGA portions is the focus of Chapter 26.
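As a concrete illustration of the distributed arithmetic idea introduced earlier in this overview (and treated fully in Chapter 24), the following is a minimal software sketch. It assumes unsigned input samples and a small number of constant coefficients; the names and the sample values are hypothetical, and a real implementation would hold the table in FPGA LUTs rather than a Python list.

```python
def build_da_table(coeffs):
    """Precompute every possible sum of coefficients (2**K entries).

    Entry `addr` holds the sum of coeffs[k] for each k whose bit is set in
    addr; on an FPGA this table would sit in LUTs or a small memory.
    """
    table = []
    for addr in range(1 << len(coeffs)):
        table.append(sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1))
    return table


def da_dot_product(coeffs, samples, bits=8):
    """Bit-serial dot product of constant coeffs with unsigned samples."""
    table = build_da_table(coeffs)
    acc = 0
    for b in range(bits):                    # one bit of every sample per step
        addr = 0
        for k, x in enumerate(samples):
            addr |= ((x >> b) & 1) << k      # gather bit b of each sample
        acc += table[addr] << b              # table lookup, shift, and add
    return acc


coeffs = [3, -1, 4, 2]        # illustrative constants, not from the text
samples = [10, 200, 7, 99]    # unsigned 8-bit inputs
result = da_dot_product(coeffs, samples)
```

Each loop iteration performs one lookup and one addition in place of four multiplications, which is the compression of the multiply–accumulate operation that the overview refers to.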
CHAPTER
21
IMPLEMENTING APPLICATIONS WITH FPGAs
Brad L. Hutchings, Brent E. Nelson
Department of Electrical and Computer Engineering
Brigham Young University
Developers can choose various devices when implementing electronic systems: field-programmable gate arrays (FPGAs), microprocessors, and other standard products such as ASSPs, and custom chips or application-specific integrated circuits (ASICs). This chapter discusses how FPGAs compare to other digital devices, outlines the considerations that will help designers to determine when FPGAs are appropriate for a specific application, and presents implementation strategies that exploit features specific to FPGAs. The chapter is divided into four major sections. Section 21.1 discusses the strengths and weaknesses of FPGAs, relative to other available devices. Section 21.2 suggests when FPGA devices are suitable choices for specific applications/ algorithms, based upon their I/O and computation requirements. Section 21.3 discusses general implementation strategies appropriate for FPGA devices. Then Section 21.4 discusses FPGA-specific arithmetic design techniques.
21.1
STRENGTHS AND WEAKNESSES OF FPGAs
Developers can choose from three general classes of devices when implementing an algorithm or application: microprocessor, FPGA, or ASIC (for simplicity, ASSPs are not considered here). This section provides a brief summary of the advantages and disadvantages of these devices in terms of time to market, cost, development time, power consumption, and debug and verification.
21.1.1
Time to Market
Time to market is often touted as one of the FPGA’s biggest strengths, at least relative to ASICs. With an ASIC, from specification to product requires (at least): (1) design, (2) verification, (3) fabrication, (4) packaging, and (5) device test. In addition, software development requires access to the ASIC device (or an emulation of such) before it can be verified and completed. As immediately available standard devices, FPGAs have already been fabricated, packaged, and tested by the vendor, thereby eliminating at least four months from time to market.
More difficult to quantify but perhaps more important are: (1) refabrications (respins) caused by either errors in the design or late changes to the specification, due to a change in an evolving standard, for example, and (2) software development schedules that depend on access to the ASIC. Both of these items impact product production schedules; a respin can easily consume an additional four months, and early access to hardware can greatly accelerate software development and debug, particularly for the embedded software that communicates directly with the device. In light of these considerations, a conservative estimate of the time-to-market advantage of FPGAs relative to ASICs is 6 to 12 months. Such a reduction is significant; in consumer electronics markets, many products have only a 24-month lifecycle.
21.1.2
Cost
Per device, FPGAs can be much less expensive than ASICs, especially in lower volumes, because the nonrecurring costs of FPGA fabrication are borne by many users. However, because of their reprogrammability, FPGAs require much more silicon area to implement equivalent functionality. Thus, at the highest volumes possible in consumer electronics, FPGA device cost will eventually exceed ASIC device cost.
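The volume argument above can be sketched as a simple break-even calculation. This is only a rough model under stated assumptions: the ASIC buyer pays a one-time nonrecurring cost plus a per-unit cost, while the FPGA buyer pays only a (higher) per-unit cost. All dollar figures below are hypothetical and chosen purely for illustration.

```python
def break_even_volume(asic_nre, asic_unit_cost, fpga_unit_cost):
    """Volume above which total ASIC cost drops below total FPGA cost.

    Model (an assumption of this sketch):
      ASIC total = asic_nre + asic_unit_cost * volume
      FPGA total = fpga_unit_cost * volume
    """
    if fpga_unit_cost <= asic_unit_cost:
        raise ValueError("FPGA unit cost must exceed ASIC unit cost")
    return asic_nre / (fpga_unit_cost - asic_unit_cost)


# Hypothetical numbers: $1M ASIC nonrecurring cost, $5 ASIC part, $25 FPGA part.
volume = break_even_volume(asic_nre=1_000_000, asic_unit_cost=5.0, fpga_unit_cost=25.0)
```

Below the computed volume the FPGA is the cheaper choice in this model; above it, the larger silicon area of the FPGA dominates and the ASIC wins.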
21.1.3
Development Time
FPGA application development is most often approached as hardware design: applications are described in Verilog or VHDL, simulated to determine correctness, and synthesized using commercial logic synthesis tools. Commercial tools are available that synthesize behavioral programs written in sequential languages such as C to FPGAs. However, in most cases, much better performance and higher densities are achieved using HDLs, because they allow the user to directly describe and exploit the intrinsic parallelism available in an application. Exploiting application parallelism is the single best way to achieve high FPGA performance. However, designing highly parallel implementations of applications in HDLs requires significantly more development effort than software development with conventional sequential programming languages such as Java or C++.
21.1.4
Power Consumption
FPGAs consume more power than ASICs simply because programmability requires many more transistors, relative to a customized integrated circuit (IC). FPGAs may consume more or less power than a microprocessor or digital signal processor (DSP), depending on the application.
21.1.5
Debug and Verification
FPGAs are developed with standard hardware design techniques and tools. Coded in VHDL or Verilog and synthesized, FPGA designs can be debugged
in simulators just as typical ASIC designs are. However, many designers verify their designs directly, by downloading them into an FPGA and testing them in a system. With this approach the application can be tested at speed (a million times faster than simulation) in the actual operating environment, where it is exposed to real-world conditions. If thorough, this testing provides a stronger form of functional verification than simulation. However, debugging applications in an FPGA can be difficult because vendor tools provide much less observability and controllability than, for example, a hardware description language (HDL) simulator.
21.1.6
FPGAs and Microprocessors
As discussed previously, FPGAs are most often contrasted with custom ASICs. However, if a programmable solution is dictated because of changing application requirements or other factors, it is important to study the application carefully to determine if it is possible to meet performance requirements with a programmable processor—microprocessor or DSP. Code development for programmable processors requires much less effort than that required for FPGAs or ASICs, because developing software with sequential languages such as C or Java is much less taxing than writing parallel descriptions with Verilog or VHDL. Moreover, the coding and debugging environments for programmable processors are far richer than their HDL counterparts. Microprocessors are also generally much less expensive than FPGAs. If the microprocessor can meet application requirements (performance, power, etc.), it is almost always the best choice. In general, FPGAs are well suited to applications that demand extremely high performance and reprogrammability, for interfacing components that communicate with many other devices (so-called glue-logic) and for implementing hardware systems at volumes that make their economies of scale feasible. They are less well suited to products that will be produced at the highest possible volumes or for systems that must run at the lowest possible power.
21.2
APPLICATION CHARACTERISTICS AND PERFORMANCE
Application performance is largely determined by the computational and I/O requirements of the system. Computational requirements dictate how much hardware parallelism can be used to increase performance. I/O system limitations and requirements determine how much performance can actually be exploited from the parallel hardware.
21.2.1
Computational Characteristics and Performance
FPGAs can outperform today’s processors only by exploiting massive amounts of parallelism. Their technology has always suffered from a significant clock-rate disadvantage; FPGA clock rates have always been slower than CPU clock rates by about a factor of 10. This remains true today, with clock rates for FPGAs
limited to about 300 to 350 MHz and CPUs operating at approximately 3 GHz. As a result, FPGAs must perform at least 10 times the computational work per cycle to perform on par with processors. To be a compelling alternative, an FPGA-based solution should exceed the performance of a processor-based solution by 5 to 10 times and hence must actually perform 50 to 100 times the computational work per clock cycle. This kind of performance is feasible only if the target application exhibits a corresponding amount of exploitable parallelism. The guideline of 5 to 10 times is suggested for two main reasons. First of all, prior to actual implementation, it is difficult or impossible to foresee the impact of various system and I/O issues on eventual performance. In our experience, 5 times can quickly become 2 times or less as various system and algorithmic issues arise during implementation. Second, application development for FPGAs is much more difficult than conventional software development. For that reason, the additional development effort must be carefully weighed against the potential performance advantages. A guideline of 5 to 10 times provides some insurance that any FPGA-specific performance advantages will not completely vanish during the implementation phase. Ultimately, the intrinsic characteristics of the application place an upper bound on FPGA performance. They determine how much raw parallelism exists, how exploitable it is, and how fast the clock can operate. A review of the literature [3–6, 11, 16, 19–21, 23, 26, 28] shows that the application characteristics that have the most impact on application performance are: data parallelism, amenability to pipelining, data element size and arithmetic complexity, and simple control requirements. 
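The arithmetic behind the 50 to 100 times figure can be written out as a short sketch. The clock rates are the ones quoted above; the function name is a hypothetical helper introduced only for this illustration.

```python
def required_work_per_cycle(cpu_clock_hz, fpga_clock_hz, target_speedup):
    """Work per FPGA clock cycle, relative to the CPU's work per cycle,
    that is needed to reach the target overall speedup."""
    return (cpu_clock_hz / fpga_clock_hz) * target_speedup


# Clock rates quoted in the text: roughly 3 GHz CPU versus 300 MHz FPGA.
parity = required_work_per_cycle(3e9, 300e6, 1)     # break even with the CPU
low_end = required_work_per_cycle(3e9, 300e6, 5)    # 5x guideline
high_end = required_work_per_cycle(3e9, 300e6, 10)  # 10x guideline
```

With these inputs the sketch gives factors of 10, 50, and 100, matching the guideline stated in the text.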
Data parallelism
Large datasets with few or no data dependencies are ideal for FPGA implementation for two reasons: (1) They enable high performance because many computations can occur concurrently, and (2) they allow operations to be extensively rescheduled. As previously mentioned, concurrency is extremely important because FPGA applications must be able to achieve 50 to 100 times the operations per clock cycle of a microprocessor to be competitive. The ability to reschedule computations is also important because it makes it feasible to tailor the circuit design to FPGA hardware and achieve higher performance. For example, computations can be scheduled to maximize data reuse to increase performance and reduce memory bandwidth requirements. Image-processing algorithms with their attendant data parallelism have been among the highest-performing algorithms mapped to FPGA devices.
Data element size and arithmetic complexity
Data element size and arithmetic complexity are important because they strongly influence circuit size and speed. For applications with large amounts of exploitable parallelism, the upper limit on this parallelism is often determined by how many operations can be performed concurrently on the FPGA device. Larger data elements and greater arithmetic complexity lead to larger and fewer computational elements and less parallelism. Moreover, larger and more complex circuits exhibit more delay that slows clock rate and impacts performance. Not surprisingly, representing data with the fewest possible bits and performing computation with the simplest operators generally lead to the highest performance. Designing high-performance applications in FPGAs almost always involves a precision/performance trade-off.
Pipelining
Pipelining is essential to achieving high performance in FPGAs. Because FPGA performance is limited primarily by interconnect delay, pipelining (inserting registers on long circuit pathways) is an essential way to improve clock rate (and therefore throughput) at the cost of latency. In addition, pipelining allows computational operations to be overlapped in time and leads to more parallelism in the implementation. Generally speaking, because pipelining is used extensively throughout FPGA-based designs, applications must be able to tolerate some latency (via pipelining) to be suitable candidates for FPGA implementation.
Simple control requirements
FPGAs achieve the highest performance if all operations can be statically scheduled as much as possible (this is true of many technologies). Put simply, it takes time to make decisions and decision-making circuitry is often on the critical path for many algorithms. Replacing runtime decision circuitry with static control eliminates circuitry and speeds up execution. It makes it much easier to construct circuit pipelines that are heavily utilized with few or no pipeline bubbles. In addition, statically scheduled controllers require less circuitry, making room for more datapath operators, for example. In general, datasets with few or no dependencies often have simple control requirements.
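The throughput-for-latency trade described under pipelining can be illustrated with a simple timing model. This is a sketch under simplifying assumptions: register setup and clock-to-output overheads are ignored, the slowest stage sets the clock period, and the stage delays are hypothetical values rather than measurements.

```python
def pipeline_metrics(stage_delays_ns):
    """Return (clock period in ns, throughput in MHz, latency in ns).

    Assumes one result per clock once the pipeline is full and ignores
    register overhead, so the slowest stage alone sets the clock period.
    """
    period = max(stage_delays_ns)
    throughput_mhz = 1000.0 / period
    latency = period * len(stage_delays_ns)
    return period, throughput_mhz, latency


# Hypothetical 20 ns combinational path, first unpipelined and then cut
# into four slightly unbalanced stages.
unpipelined = pipeline_metrics([20.0])
pipelined = pipeline_metrics([6.0, 5.0, 5.0, 4.0])
```

In this model the pipelined version clocks at a 6 ns period instead of 20 ns, so throughput rises by more than a factor of three, while latency grows from 20 ns to 24 ns because every stage is stretched to the slowest stage's delay.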
21.2.2
I/O and Performance
As mentioned previously, FPGA clock rates are at least one order of magnitude slower than those of CPUs. Thus, significant parallelism (either data parallelism or pipelining) is required for an FPGA to be an attractive alternative to a CPU. However, I/O performance is just as important: Data must be transmitted at rates that can keep all of the parallel hardware busy. Algorithms can be loosely grouped into two categories: I/O bound and compute bound [17, 18]. At the simplest level, if the number of I/O operations is equal to or greater than the number of calculations in the computation, the computation is said to be I/O bound. To increase its performance requires an increase in memory bandwidth; doing more computation in parallel will have no effect. Conversely, if the number of computations is greater than the number of I/O operations, computational parallelism may provide a speedup. A simple example of this, provided by Kung [18], is matrix–matrix multiplication. The total number of I/Os in the computation, for n-by-n matrices, is 3n²: each matrix must be read and the product written back. The total number of computations to be done, however, is n³. Thus, this computation is
compute bound. In contrast, matrix–matrix addition requires 3n² I/Os and n² calculations and is thus I/O bound. Another way to see this is to note that each source element read from memory in a matrix–matrix multiplication is used n times and each result is produced using n multiply–accumulate operations. In matrix–matrix addition, each element fetched from memory is used only once and each result is produced from only a single addition.
Carefully coordinating data transfer, I/O movement, and computation order is crucial to achieving enough parallelism to provide effective speedup. The entire field of systolic array design is based on the concepts of (1) arranging the I/O and computation in a compute-bound application so that each data element fetched from memory is reused multiple times, and (2) keeping many processing elements busy operating in parallel on that data.
FPGAs offer a wide variety of memory elements that can be used to coordinate I/O and computation: flip-flops to provide single-bit storage (10,000s of bits); LUT-based RAM to provide many small blocks of randomly distributed memory (100,000s of bits); and larger RAM or ROM memories (1,000,000s of bits). Some vendors’ FPGAs contain multiple sizes of random access memories, and these memories are often easily configured into special-purpose structures such as dynamic-length shift registers, content-addressable memories (CAMs), and so forth. In addition to these types of on-chip memory, most FPGA platforms provide off-chip memory as well.
Increasing the I/O bandwidth to memory is usually critical in harnessing the parallelism inherent in a computation. That is, after some point, further multiplying the number of processing elements (PEs) in a design (to increase parallelism) usually requires a corresponding increase in I/O. This additional I/O can often be provided by the many on-chip memories in a typical modern FPGA.
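The I/O-bound versus compute-bound rule stated above can be sketched directly. The operation counts follow the matrix examples: 3n² I/Os for either operation, n³ multiply–accumulates for multiplication, and one addition per result element for addition. The function names are hypothetical.

```python
def classify(io_ops, compute_ops):
    """Apply the simple rule: I/O bound when I/O operations are at least
    as numerous as calculations, otherwise compute bound."""
    return "I/O bound" if io_ops >= compute_ops else "compute bound"


def matmul_counts(n):
    """(I/O operations, multiply-accumulates) for n-by-n matrix multiply."""
    return 3 * n * n, n ** 3


def matadd_counts(n):
    """(I/O operations, additions) for n-by-n matrix addition,
    counting one addition per result element."""
    return 3 * n * n, n * n


multiply_kind = classify(*matmul_counts(100))
add_kind = classify(*matadd_counts(100))
```

Note that under this rule multiplication only becomes compute bound once n exceeds 3, since 3n² is at least n³ for smaller matrices; addition remains I/O bound at every size.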
The work of Graham and Nelson [8] describes a series of early experiments to map time-delay SONAR beam forming to an FPGA platform where memory bandwidth was the limiting factor in design speedup. While the data to be processed were an infinite stream of large data blocks, many of the other data structures in the computation were not large (e.g., coefficients, delay values). In this computation, it was not the total amount of memory that limited the speedup but rather the number of memory ports available. Thus, the use of multiple small memories in parallel was able to provide the needed bandwidth.
The availability of many small memories in today’s FPGAs further supports the idea of trading off computation for table lookup. Conventional FPGA fabrics are based on a foundation of 4-input LUTs; in addition, larger on-chip memories can be used to support larger lookup structures. Because the memories already exist on chip, unlike in ASIC technology, using them adds no additional cost to the system. A common approach in FPGA-based design, therefore, is to evaluate which parts of the system’s computations might lend themselves to table lookup and use the available RAM blocks for these lookups.
In summary, the performance of FPGA-based applications is largely determined by how much exploitable parallelism is available, and by the ability of the system to provide data to keep the parallel hardware operational.
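The idea of trading computation for table lookup can be illustrated with a small software sketch. It precomputes a 256-entry table, the sort of structure that might occupy a small on-chip RAM block, and uses four lookups in place of a bit-by-bit calculation. The choice of population count as the function is an assumption made purely for illustration.

```python
# 256-entry table of bit counts for every 8-bit value, computed once.
POPCOUNT_TABLE = [bin(i).count("1") for i in range(256)]


def popcount32(word):
    """Count the set bits in a 32-bit word using four byte-wide lookups."""
    word &= 0xFFFFFFFF
    return (POPCOUNT_TABLE[word & 0xFF]
            + POPCOUNT_TABLE[(word >> 8) & 0xFF]
            + POPCOUNT_TABLE[(word >> 16) & 0xFF]
            + POPCOUNT_TABLE[(word >> 24) & 0xFF])


all_ones = popcount32(0xFFFFFFFF)
two_bits = popcount32(0x80000001)
```

In hardware the four lookups could proceed in parallel from four small memories, which also reflects the point above that the number of memory ports, rather than total capacity, can be the limiting resource.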
21.3
GENERAL IMPLEMENTATION STRATEGIES FOR FPGA-BASED SYSTEMS
In contrast with other programmable technologies such as microprocessors or DSPs, FPGAs provide an extremely rich and complex set of implementation alternatives. Designers have complete control over arithmetic schemes and number representation and can, for example, trade precision for performance. In addition, reprogrammable, SRAM-based FPGAs can be configured any number of times to provide additional implementation flexibility for further tailoring the implementation to lower cost and make better use of the device. There are two general configuration strategies for FPGAs: configure-once, where the application consists of a single configuration that is downloaded for the duration of the application’s operation, and runtime reconfiguration (RTR), where the application consists of multiple configurations that are “swapped” in and out as the application operates [14].
21.3.1
Configure-once
Configure-once (during operation) is the simplest and most common way to implement applications with reconfigurable logic. The distinctive feature of configure-once applications is that they consist of a single system-wide configuration. Prior to operation, the FPGAs comprising the reconfigurable resource are loaded with their respective configurations. Once operation commences, they remain in this configuration until the application completes. This approach is very similar to using an ASIC for application acceleration. From the application point of view, it matters little whether the hardware used to accelerate the application is an FPGA or a custom ASIC because it remains constant throughout its operation.
The configure-once approach can also be applied to reconfigurable applications to achieve significant acceleration. There are classes of applications, for example, where the input data varies but remains constant for hours, days, or longer. In some cases, data-specific optimizations can be applied to the application circuitry and lead to dramatic speedup. Of course, when the data changes, the circuit-specific optimizations need to be reapplied and the bitstream regenerated. Applications of this sort consist of two elements: (1) the FPGA and system hardware, and (2) an application-specific compiler that regenerates the bitstream whenever the application-specific data changes. This approach has been used, for example, to accelerate SNORT, a popular packet filter used to improve network security [13]. SNORT data consists of regular expressions that detect malicious packets by their content. It is relatively static, and new regular expressions are occasionally added as new attacks are detected. The application-specific compiler translates these regular expressions into FPGA hardware that matches packets many times faster than software SNORT.
When new regular expressions are added to the SNORT database, the compiler is rerun and a new configuration is created and downloaded to the FPGA.
Chapter 21 Implementing Applications with FPGAs
21.3.2
Runtime Reconfiguration
Whereas configure-once applications statically allocate logic for the duration of an application, RTR applications use a dynamic allocation scheme that re-allocates hardware at runtime. Each application consists of multiple configurations per FPGA, with each one implementing some fraction of it. Whereas a configure-once application configures the FPGA once before execution, an RTR application typically reconfigures it many times during normal operation. There are two basic approaches that can be used to implement RTR applications: global and local (the latter is sometimes referred to as partial configuration in the literature). Both techniques use multiple configurations for a single application, and both reconfigure the FPGA during application execution. The principal difference between the two is the way the dynamic hardware is allocated.
Global RTR
Global RTR allocates all (FPGA) hardware resources in each configuration step. More specifically, global RTR applications are divided into distinct temporal phases, with each phase implemented as a single system-wide configuration that occupies all system FPGA resources. At runtime, the application steps through each phase by loading all of the system FPGAs with the appropriate configuration data associated with a given phase.
Local RTR
Local RTR takes an even more flexible approach to reconfiguration than does global RTR. As the name implies, these applications locally (or selectively) reconfigure subsets of the logic as they execute. Local RTR applications may configure any percentage of the reconfigurable resources at any time: individual FPGAs may be configured, or even single FPGA devices may themselves be partially reconfigured on demand. This flexibility allows hardware resources to be tailored to the runtime profile of the application with finer granularity than is possible with global RTR.
Whereas global RTR approaches implement the execution process by loading relatively large, global application partitions, local RTR applications need load only the necessary functionality at each point in time. This can reduce the amount of time spent downloading configurations and can lead to a more efficient runtime hardware allocation. The organization of local RTR applications is based more on a functional division of labor than the phased partitioning used by global RTR applications. Typically, local RTR applications are implemented by functionally partitioning an application into a set of fine-grained operations. These operations need not be temporally exclusive—many of them may be active at one time. This is in direct contrast to global RTR, where only one configuration (per FPGA) may be active at any given time. Still, with local RTR it is important to organize the operations such that idle circuitry is eliminated or greatly reduced. Each operation is implemented as a distinct circuit module, and these circuit modules are then downloaded to the FPGAs as necessary during operation. Note that, unlike global RTR, several of these operations may be loaded simultaneously, and each may consume any portion of the system FPGA resources.
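The difference in configuration traffic between the two approaches can be illustrated with a small software model. This is only a sketch under stated assumptions: the function names, module names, and size units below are hypothetical, and the model ignores device-specific details such as frame granularity and placement constraints.

```python
def global_rtr_cost(phases, full_bitstream_size):
    """Global RTR: every phase change rewrites the whole device."""
    return len(phases) * full_bitstream_size


def local_rtr_cost(phases, module_size):
    """Local RTR: download only the modules that are not already resident."""
    resident = set()
    total = 0
    for needed in phases:
        for name in sorted(needed - resident):
            total += module_size[name]
        # Modules that are no longer needed are dropped, so no idle
        # circuitry stays on the device.
        resident = set(needed)
    return total


# Hypothetical three-phase application that shares one large common block.
module_size = {"common": 60, "forward": 20, "backward": 20, "update": 20}
phases = [{"common", "forward"}, {"common", "backward"}, {"common", "update"}]

print(global_rtr_cost(phases, full_bitstream_size=100))  # 300
print(local_rtr_cost(phases, module_size))               # 120
```

In this made-up case the common block is downloaded once and only the phase-specific modules are swapped, which is the source of the reduced download time described above. The numbers are illustrative and are not measurements of any real system.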
RTR applications
Runtime Reconfigured Artificial Neural Network (RRANN) is an early example of a global RTR application [7]. RRANN divided the back-propagation algorithm (used to train neural networks) into three temporally exclusive configurations that were loaded into the FPGA in rapid succession during operation. It demonstrated a 500 percent increase in density by eliminating idle circuitry in individual algorithm phases.
RRANN was followed up with RRANN-2 [9], an application using local RTR. Like RRANN, the algorithm was still divided into three distinct phases. However, unlike the earlier version, the phases were carefully designed so that they shared common circuitry, which was placed and routed into identical physical locations for each phase. Initially, only the first configuration was loaded; thereafter, the common circuitry remained resident and only circuit differences were loaded during operation. This reduced configuration overhead by 25 percent over the global RTR approach.
The Dynamic Instruction Set Computer (DISC) [29] used local RTR to create a sequential control processor with a very small fixed core that remained resident at all times. This resident core was augmented by circuit modules that were dynamically loaded as required by the application. DISC was used to implement an image-processing application that consisted of various filtering operations. At runtime, the circuit modules were loaded as necessary. Although the application used all of the filtering circuit modules, it did not require all of them to be loaded simultaneously. Thus, DISC loaded circuit modules on demand as required. Only a few active circuit modules were ever resident at any time, allowing the application to fit in a much smaller device than possible with global RTR.
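The demand-loading idea behind DISC can be sketched as a small cache of circuit modules. The sketch below is an illustration only: the class name, the module names, the unit sizes, and in particular the least-recently-used eviction policy are assumptions made for the example, not a description of how DISC itself chose which modules to replace.

```python
from collections import OrderedDict


class ModuleCache:
    """Toy model of on-demand circuit-module loading into a small device."""

    def __init__(self, capacity):
        self.capacity = capacity          # free area outside the fixed core
        self.resident = OrderedDict()     # module name -> size, oldest first
        self.loads = 0                    # number of downloads performed

    def request(self, name, size):
        """Make a module resident; return True if it had to be downloaded."""
        if name in self.resident:
            self.resident.move_to_end(name)   # mark as recently used
            return False
        if size > self.capacity:
            raise ValueError(f"module {name} can never fit")
        # Assumed policy: evict least recently used modules until it fits.
        while sum(self.resident.values()) + size > self.capacity:
            self.resident.popitem(last=False)
        self.resident[name] = size
        self.loads += 1
        return True


# Hypothetical filter sequence; four distinct modules share room for three.
cache = ModuleCache(capacity=3)
for op in ["blur", "edge", "blur", "threshold", "median", "blur"]:
    cache.request(op, size=1)

print(cache.loads)           # 4
print(list(cache.resident))  # ['threshold', 'median', 'blur']
```

The application uses four modules in total, but never more than three are resident at once, so a device with room for three suffices in this toy trace.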
21.3.3
Summary of Implementation Issues
Of the two general implementation techniques, configure-once is the simplest and is best supported by commercially available tool flows. This is not surprising, as all FPGA CAD tools are derivatives of conventional ASIC CAD flows. While the two RTR implementation approaches (local and global) can provide significant performance and capacity advantages, they are much more challenging to employ, primarily because of a lack of specific tool support.
The designer’s primary task when implementing global RTR applications is to temporally divide the application into roughly equal-size partitions to efficiently use reconfigurable resources. This is largely a manual process; although the academic community has produced some partitioning tools, no commercial offerings are currently available. The main disadvantage of global RTR is the need for equal-size partitions. If it is not possible to evenly partition the application, inefficient use of FPGA resources will result.
The main advantage of local RTR over global RTR is that it uses fine-grained functional operators that may make more efficient use of FPGA resources. This is important for applications that are not easily divided into equal-size temporally exclusive circuit partitions. However, partitioning a local RTR design may require an inordinate amount of designer effort. For example, unlike global
RTR, where circuit interfaces typically remain fixed between configurations, local RTR allows these interfaces to change with each configuration. When circuit configurations become small enough for multiple configurations to fit into a single device, the designer needs to ensure that all configurations will interface correctly one with another. Moreover, the designer may have to ensure not only structural compliance but physical compliance as well. That is, when the designer creates circuit configurations that do not occupy an entire FPGA, he or she will have to ensure that the physical footprint of each is compatible with that of others that may be loaded concurrently.
21.4
IMPLEMENTING ARITHMETIC IN FPGAs
Almost since their invention, FPGAs have employed dedicated circuitry to accelerate arithmetic computation. In earlier devices, dedicated circuitry sped up the propagation of carry signals for ripple-carry, full-adder blocks. Later devices added dedicated multipliers, DSP function blocks, and more complex fixed-function circuitry. The presence of such dedicated circuitry can dramatically improve arithmetic performance, but also restricts designers to a very small subset of choices when implementing arithmetic. Well-known approaches such as carry-look-ahead, carry-save, signed-digit, and so on, generally do not apply to FPGAs. Though these techniques are commonly used to create very high-performance arithmetic blocks in custom ICs, they are not competitive when applied to FPGAs simply because they cannot access the faster, dedicated circuitry and must be constructed using slower, general-purpose user logic.
Instead, FPGA designers accelerate arithmetic in one of two ways: (1) using dedicated blocks if they fit the needs of the application, and (2) avoiding the computation entirely, if possible. Designers apply the second option by, for example, replacing full-blown floating-point computation with simpler, though not equivalent, fixed-point or block floating-point computations. In some cases, they can eliminate multiplication entirely with constant propagation. Of course, the feasibility of replacing slower, complex functions with simpler, faster ones is application dependent.
21.4.1
Fixed-point Number Representation and Arithmetic
A fixed-point number representation is simply an integer representation with an implied binary point, usually in 2’s complement format to enable the representation of both positive and negative values. A common way of describing the structure of a fixed-point number is to use a tuple, n.m, where n is the number of bits to the left of the binary point and m is the number of bits to the right. A 16.0 format would thus be a standard 16-bit integer; a 3.2 format fixed-point number would have a total of 5 bits with 3 to the left of the implied binary point and 2 to the right. A range of numbers from +1 to −1 is common in digital signal-processing applications. Such a representation might be of the
form 1.9, where the largest number is 0.111111111₂ ≈ 0.998₁₀ and the smallest is 1.000000000₂ = −1₁₀. As can be seen, fixed-point arithmetic exactly follows the rules learned in grade school, where lining up the implied binary point is required for performing addition or subtraction. When designing with fixed-point values, one must keep track of the number format on each wire; such bookkeeping is one of the design costs associated with fixed-point design. At any point in a computation, either truncation or rounding can be used to reduce the number of bits to the right of the binary point, the effect being to simply reduce the precision with which the number is represented.
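The n.m bookkeeping described above can be made concrete with a short software model. This is a minimal sketch, not a hardware description: the function names are hypothetical, values are held as ordinary Python integers, and rounding is taken to mean round-to-nearest with ties upward, which is only one of several possible choices.

```python
import math


def to_fixed(value, n, m, mode="round"):
    """Quantize a real value to the integer code of a signed n.m number."""
    scaled = value * (1 << m)              # move the binary point m places right
    if mode == "round":
        code = math.floor(scaled + 0.5)    # round to nearest, ties upward
    else:
        code = math.floor(scaled)          # truncation drops the low-order bits
    lo, hi = -(1 << (n + m - 1)), (1 << (n + m - 1)) - 1
    if not lo <= code <= hi:
        raise OverflowError(f"{value} does not fit in {n}.{m} format")
    return code


def from_fixed(code, m):
    """Interpret an integer code with m bits to the right of the binary point."""
    return code / (1 << m)


def fixed_range(n, m):
    """Smallest and largest values representable in 2's complement n.m format."""
    top = 1 << (n + m - 1)
    return from_fixed(-top, m), from_fixed(top - 1, m)


def add_fixed(a, ma, b, mb):
    """Add two codes after lining up their binary points; returns (code, m)."""
    m = max(ma, mb)
    return (a << (m - ma)) + (b << (m - mb)), m


print(fixed_range(1, 9))                    # (-1.0, 0.998046875)
print(to_fixed(0.3, 1, 9, mode="truncate")) # 153
print(to_fixed(0.3, 1, 9, mode="round"))    # 154
```

Adding a 3.2 value to a 1.4 value with `add_fixed` yields a result with four fractional bits, which is the format tracking that the designer must otherwise do by hand on every wire.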
21.4.2
Floating-point Arithmetic
Floating-point arithmetic overcomes many of the challenges of fixed-point arithmetic but at increased circuit cost and possibly reduced precision. The most common format for a floating-point number is of the form seeeeeffffff, where s is a sign bit, eeeee is an exponent, and ffffff is the mantissa. In the IEEE standard for single-precision floating point, the number of exponent bits is 8 and the number of mantissa bits is 23, but nonstandard sizes and formats have also been used in FPGA work [2, 24]. IEEE reserves various combinations of exponent and mantissa to represent special values: zero, not a number (NaN), infinity (+∞ and −∞), and so on. It supports denormalized numbers (no leading implied 1 in the mantissa) and flags them using a special exponent value. Finally, the IEEE specification describes four rounding modes. Because supporting all special case number representations and rounding modes in hardware can be very expensive, FPGA-based floating-point support often omits some of them in the interest of reducing complexity and increasing performance.
For a given number of bits, floating point provides extended range to a computation at the expense of accuracy. An IEEE single-precision floating-point number allocates 23 bits to the mantissa, giving an effective mantissa of only 24 bits when the implied 1 is considered. The advantage of floating point is that its exponent allows for the representation of numbers across a broad range (IEEE normalized single-precision values range from ≈ ±3 × 10^38 to ≈ ±1 × 10^−38). Conversely, while a 32-bit fixed-point representation (1.31 format) has a range of only −1 to ≈ +1, it can represent some values within that range much more accurately than a floating-point format can, for example, numbers close to +1 such as 0.11111111111111111111111111111111.
However, for numbers very close to +0, the fixed-point representation would have many leading zeroes, and thus would have less precision than the competing floating-point representation. An important characteristic of floating point is its auto-scaling behavior. After every floating-point operation, the result is normalized and the exponent adjusted accordingly. No work on the part of the designer is required in this respect (although significant hardware resources are used). Thus, it is useful in cases where the range of intermediate values cannot be bounded by the designer and therefore where fixed point is unsuitable.
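The precision trade-off between the two 32-bit formats can be checked numerically. The sketch below is an illustration under stated assumptions: it uses Python's standard `struct` module to step to the next IEEE single-precision value, the helper name `float32_spacing` is hypothetical, and the helper is only meant for positive, finite, normalized inputs.

```python
import struct


def float32_spacing(x):
    """Gap between x (as a float32) and the next larger float32 value."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    current = struct.unpack("<f", struct.pack("<I", bits))[0]
    following = struct.unpack("<f", struct.pack("<I", bits + 1))[0]
    return following - current


FIXED_1_31_STEP = 2.0 ** -31   # uniform spacing of a 1.31 fixed-point number

for x in (0.75, 2.0 ** -20):
    gap = float32_spacing(x)
    finer = "fixed point" if FIXED_1_31_STEP < gap else "floating point"
    print(f"x = {x:.3e}: float32 gap = {gap:.3e}, "
          f"1.31 step = {FIXED_1_31_STEP:.3e}, finer: {finer}")
```

Near +1 the single-precision gap is 2^−24, which is 128 times coarser than the fixed-point step of 2^−31. Near 2^−20 the floating-point gap shrinks to 2^−43 while the fixed-point step stays the same, so floating point is the more precise format there.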
The use of floating point in FPGA-based design has been the topic of much research over the past decade. Early papers, such as Ligon and colleagues [15] and Shirazi et al. [24], focused on the cost of floating point and demonstrated that small floating-point formats as well as single-precision formats could be eventually implemented using FPGA technology. Later work, such as that by Bellows and Hutchings [1] and Roesler and Nelson [22], demonstrated novel ways of leveraging FPGA-specific features to more efficiently implement floating-point modules. Finally, Underwood [27] argued that the capabilities of FPGA-based platforms for performing floating point would eventually surpass those of standard computing systems. All of the research just mentioned contains size and performance estimates for floating-point modules on FPGAs at the time they were published. Clever design techniques and growing FPGA densities and clock rates continually combine to produce smaller, faster floating-point circuits on FPGAs. At the time of this writing, floating-point module libraries are available from a number of sources, both commercial and academic.
21.4.3
Block Floating Point
Block floating point (BFP) is an alternative to fixed-point and floating-point arithmetic that allows entire blocks of data to share a single exponent. Fixed-point arithmetic is then performed on a block of data with periodic rescaling of its data values. A typical use of block floating point is as follows:
1. The largest value in a block of data is located, a corresponding exponent is chosen, and that value’s fractional part is normalized to that exponent.
2. The mantissas of all other values in the block are adjusted to use the same exponent as that largest value.
3. The exponent is dropped and fixed-point arithmetic proceeds on the resulting values in the data block.
4. As the computation proceeds, renormalization of the entire block of data occurs at one of several points: after every individual computation, only when a value overflows, or after a succession of computations.
The key is that BFP allows for growth in the range of values in the data block while retaining the low cost of fixed-point computations. Block floating point has found extensive use in fast Fourier transform (FFT) computations where an input block (such as from an A/D converter) may have a limited range of values, the data is processed in stages, and stage boundaries provide natural renormalization locations.
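The four steps above can be sketched in software as follows. This is a minimal model under stated assumptions: the function names and the default mantissa width are hypothetical, mantissas are signed integers, and renormalization is triggered only when a value overflows, which is one of the options listed in step 4.

```python
import math


def bfp_encode(block, mant_bits=16):
    """Steps 1 and 2: pick one exponent from the largest value, scale all."""
    peak = max(abs(v) for v in block)
    if peak == 0:
        return [0] * len(block), 0
    _, e = math.frexp(peak)              # peak = f * 2**e with 0.5 <= f < 1
    exponent = e - (mant_bits - 1)       # one bit is reserved for the sign
    limit = (1 << (mant_bits - 1)) - 1
    mantissas = []
    for v in block:
        q = round(math.ldexp(v, -exponent))          # round to nearest
        mantissas.append(max(-limit - 1, min(limit, q)))
    return mantissas, exponent


def bfp_decode(mantissas, exponent):
    """Recover real values: mantissa * 2**exponent."""
    return [math.ldexp(m, exponent) for m in mantissas]


def bfp_renormalize(mantissas, exponent, mant_bits=16):
    """Step 4: after growth, shift the whole block until every value fits."""
    limit = (1 << (mant_bits - 1)) - 1
    while any(m > limit or m < -limit - 1 for m in mantissas):
        mantissas = [m >> 1 for m in mantissas]      # arithmetic right shift
        exponent += 1
    return mantissas, exponent


mant, exp = bfp_encode([0.5, -0.25, 0.125], mant_bits=8)
print(mant, exp)                                     # [64, -32, 16] -7

# Step 3: plain integer arithmetic on the block (here, doubling each value).
doubled = [2 * m for m in mant]
mant2, exp2 = bfp_renormalize(doubled, exp, mant_bits=8)
print(bfp_decode(mant2, exp2))                       # [1.0, -0.5, 0.25]
```

Doubling pushes the largest mantissa to 128, which no longer fits in 8 signed bits, so the block is shifted right once and the shared exponent is incremented.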
21.4.4
Constant Folding and Data-oriented Specialization
As mentioned in Section 21.3.2, when the data for a computation changes, an FPGA can be readily reconfigured to take advantage of that change. As a simple example of constant folding, consider the operation a =? b, where a and b are 4-bit numbers. Figure 21.1 shows two implementations of a comparator. On the left (a) is a conventional comparator; on the right (b) is a comparator that may be used when b is known (b = 1011). Implementation (a) requires three 4-LUTs, while implementation (b) requires just one. Such logic-level constant folding is usually performed by synthesis tools.
FIGURE 21.1 Two comparator implementations: (a) without and (b) with constant folding. [Figure not reproduced: (a) compares inputs a0 to a3 with b0 to b3; (b) takes only a0 to a3 and tests a =? 1011.]
A more complex example is given by Wirthlin [30], who proposed a method for creating constant coefficient multipliers. When one input to a multiplier is a known constant, a custom multiplier consuming far fewer resources than a general multiplier can usually be created. Wirthlin’s manipulations [30], which go far beyond what logic optimization performs, created a custom structure for a given multiplier instance based on specific characteristics of the constant.
Hemmert et al. [10] offer an even more complex example in which a pipeline of image morphology processing stages was created, each of which could perform one image morphology step (e.g., one iteration in an erosion operation). The LUT contents in each pipeline stage controlled the stage’s operation; thus, reconfiguring a stage required modifying only LUT programming. A compiler was then created to convert programs, written in a special image morphology language, into the data required to customize each pipeline stage’s operation. When a new image morphology program was compiled, a new bitstream for the FPGA could be created in a second or two (by directly modifying the original bitstream) and reconfigured onto the platform. This provided a way to create a custom computing solution on a per-program basis with turnarounds on the order of a few seconds. In each case, the original morphology program that was compiled provided the constant data that was folded into the design.
Additional examples in the literature show the power of constant folding. However, its use typically requires specialized CAD support.
Slade and Nelson [25] argue that a fundamentally different approach to CAD for FPGAs is the solution to providing generalized support for such data-specific specialization. They advocate the use of JHDL [1, 12] to provide deployment time support for data-specific modifications to an operating FPGA-based system. In summary, FPGAs provide architectural features that can accelerate simple arithmetic operations such as fixed-point addition and multiplication.
Floating-point operations can be accelerated using block floating point or by reducing the number of bits to represent floating-point values. Finally, constants can be propagated into arithmetic circuits to reduce circuit area and accelerate arithmetic performance.
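The 4-bit comparator example from this section can be modeled in a few lines. This is a sketch only: the function names are hypothetical, a 4-LUT is modeled as a 16-entry truth table, and the three-LUT decomposition of the general comparator is one plausible mapping rather than the output of any particular synthesis tool.

```python
def generic_equals(a, b):
    """General 4-bit comparator: 8 inputs, so it needs three 4-input LUTs."""
    diff = a ^ b
    low = (diff & 0b0011) == 0     # LUT 1: inputs a0, b0, a1, b1
    high = (diff & 0b1100) == 0    # LUT 2: inputs a2, b2, a3, b3
    return int(low and high)       # LUT 3: combines the two partial results


def lut4_init_for_equals(constant):
    """Truth table of the single LUT left once b is folded to a constant."""
    return [int(a == constant) for a in range(16)]


def lut4_eval(init, a3, a2, a1, a0):
    """Look up the output of a 4-LUT for the given input bits."""
    return init[(a3 << 3) | (a2 << 2) | (a1 << 1) | a0]


init = lut4_init_for_equals(0b1011)
print(init.index(1))                  # 11
print(lut4_eval(init, 1, 0, 1, 1))    # 1
print(lut4_eval(init, 1, 0, 1, 0))    # 0
```

With b fixed at 1011 the function depends on only the four bits of a, so its whole behavior fits in one 16-entry table, and the specialized LUT agrees with the general comparator for every value of a.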
21.5
SUMMARY
FPGAs provide a flexible, high-performance, and reprogrammable means for implementing a variety of electronic applications. Because of their reprogrammability, they are well suited to applications that require some form of direct reprogrammability, and to situations where reprogrammability can be used indirectly to increase reuse and thereby reduce device cost or count. FPGAs achieve the highest performance when the application can be implemented as many hardware units operating in parallel, and where the aggregate I/O requirements for these parallel units can be reasonably met by the overall system. Most FPGA applications are described using HDLs because HDL tools and synthesis software are mature and well developed, and because, for now, they provide the best means for describing applications in a highly parallel manner.
Once FPGAs are determined to be a suitable choice, there are several ways to tailor the system design to exploit their reprogrammability by reconfiguring them at runtime or by compiling specific, temporary application-specific data into the FPGA circuitry. Performance can be further enhanced by crafting arithmetic circuitry to work around FPGA limitations and to exploit the FPGA’s special arithmetic features. Finally, FPGAs provide additional debug and verification methods that are not available in ASICs and that enable debug and verification to occur in a system and at speed.
In summary, FPGAs combine the advantages and disadvantages of microprocessors and ASICs. On the positive side, they can provide high performance that is achievable only with custom hardware, they are reprogrammable, and they can be purchased in volume as a fully tested, standard product. On the negative side, they remain largely inaccessible to the software community; moreover, high-performance application development requires hardware design and the use of standard synthesis tools and Verilog or VHDL.
References
[1] P. Bellows, B. L. Hutchings. JHDL - An HDL for reconfigurable systems. Proceedings of IEEE Workshop on FPGAs for Custom Computing Machines, April 1998.
[2] B. Catanzaro, B. Nelson. Higher radix floating-point representations for FPGA-based arithmetic. Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines, April 2005.
[3] W. Culbertson, R. Amerson, R. Carter, P. Kuekes, G. Snider. Exploring architectures for volume visualization on the Teramac custom computer. Proceedings of IEEE Workshop on FPGAs for Custom Computing Machines, April 1996.
[4] A. Dandalis, V. K. Prasanna. Fast parallel implementation of DFT using configurable devices. Field-programmable logic: Smart applications, new paradigms, and compilers. Proceedings 6th International Workshop on Field-Programmable Logic and Applications, Springer-Verlag, 1997.
[5] C. H. Dick, F. Harris. FIR filtering with FPGAs using quadrature sigma-delta modulation encoding. Field-programmable logic: Smart applications, new paradigms, and compilers. Proceedings 6th International Workshop on Field-Programmable Logic and Applications, Springer-Verlag, 1996.
[6] C. Dick. Computing the discrete Fourier transform on FPGA-based systolic arrays. ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, February 1996.
[7] J. G. Eldredge, B. L. Hutchings. Density enhancement of a neural network using FPGAs and runtime reconfiguration. Proceedings of the IEEE Workshop on FPGAs for Custom Computing Machines, April 1994.
[8] P. Graham, B. Nelson. FPGA-based sonar processing. ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, February 1998.
[9] J. D. Hadley, B. L. Hutchings. Design methodologies for partially reconfigured systems. Proceedings of the IEEE Workshop on FPGAs for Custom Computing Machines, April 1995.
[10] S. Hemmert, B. Hutchings, A. Malvi. An application-specific compiler for high-speed binary image morphology. Proceedings of the 9th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, 2001.
[11] R. Hudson, D. Lehn, P. Athanas. A runtime reconfigurable engine for image interpolation. Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines, IEEE, April 1998.
[12] B. L. Hutchings, P. Bellows, J. Hawkins, S. Hemmert, B. Nelson, M. Rytting. A CAD suite for high-performance FPGA design. Proceedings of the IEEE Workshop on FPGAs for Custom Computing Machines, April 1999.
[13] B. L. Hutchings, R. Franklin, D. Carver. Assisting network intrusion detection with reconfigurable hardware. Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines, IEEE, April 2002.
[14] B. L. Hutchings, M. J. Wirthlin. Implementation approaches for reconfigurable logic applications. Field-Programmable Logic and Applications, August 1995.
[15] W. B. Ligon III, S. McMillan, G. Monn, K. Schoonover, F. Stivers, K. D. Underwood. A re-evaluation of the practicality of floating-point operations on FPGAs. Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines, 1998.
[16] W. E. King, T. H. Drayer, R. W. Conners, P. Araman. Using MORPH in an industrial machine vision system. Proceedings of the IEEE Workshop on FPGAs for Custom Computing Machines, April 1996.
[17] H. T. Kung. Why Systolic Architectures? IEEE Computer 15(1), 1982.
[18] S. Y. Kung. VLSI Array Processors, Prentice-Hall, 1988.
[19] T. Moeller, D. R. Martinez. Field-programmable gate array based radar front-end digital signal processing. Proceedings of the IEEE Workshop on FPGAs for Custom Computing Machines, April 1999.
[20] G. Panneerselvam, P. J. W. Graumann, L. E. Turner. Implementation of fast Fourier transforms and discrete cosine transforms in FPGAs. Fifth International Workshop on Field-Programmable Logic and Applications, September 1995.
[21] R. J. Petersen. An Assessment of the Suitability of Reconfigurable Systems for Digital Signal Processing, Master’s thesis, Brigham Young University, 1995.
[22] E. Roesler, B. Nelson. Novel optimizations for hardware floating-point units in a modern FPGA architecture. Proceedings of the 12th International Workshop on Field-Programmable Logic and Applications, August 2002.
[23] N. Shirazi, P. M. Athanas, A. L. Abbott. Implementation of a 2D fast Fourier transform on an FPGA-based custom computing machine. Fifth International Workshop on Field-Programmable Logic and Applications, September 1995.
[24] N. Shirazi, A. Walters, P. Athanas. Quantitative analysis of floating point arithmetic on FPGA-based custom computing machines. Proceedings of the IEEE Workshop on FPGAs for Custom Computing Machines, April 1995.
[25] A. Slade, B. Nelson. Reconfigurable computing application frameworks. Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines, April 2003.
[26] L. E. Turner, P. J. W. Graumann, S. G. Gibb. Bit-serial FIR filters with CSD coefficients for FPGAs. Fifth International Workshop on Field-Programmable Logic and Applications, September 1995.
[27] K. Underwood. FPGAs vs. CPUs: Trends in peak floating-point performance. Proceedings of the ACM/SIGDA 12th International Symposium on Field-Programmable Gate Arrays, 2004.
[28] J. E. Vuillemin. On computing power. Programming languages and system architectures. Lecture Notes in Computer Science, vol. 781, Springer-Verlag, 1994.
[29] M. J. Wirthlin, B. L. Hutchings (eds). A dynamic instruction set computer. Proceedings of the IEEE Workshop on FPGAs for Custom Computing Machines, April 1995.
[30] M. J. Wirthlin. Constant coefficient multiplication using look-up tables. Journal of VLSI Signal Processing 36, 2004.
CHAPTER 22
INSTANCE-SPECIFIC DESIGN
Oliver Pell, Wayne Luk
Department of Computing
Imperial College, London
This chapter covers instance-specific design, an optimization technique that exploits information specific to one instance of a generic design description. Here we introduce different types of instance-specific designs with examples. We then describe partial evaluation, a systematic method for producing instance-specific designs that can be automated. Our treatment covers the application of partial evaluation to hardware design in general, and to field-programmable gate arrays (FPGAs) in particular.
22.1
INSTANCE-SPECIFIC DESIGN
FPGAs are an effective way to implement designs in computationally intensive datapath-orientated applications such as cryptography, digital signal processing, and network processing. The main alternative implementation technologies in these application areas are general-purpose processors, digital signal processors, and application-specific integrated circuits (ASICs).
ASICs are integrated circuits designed to implement a single application directly in fixed hardware. Because they are specialized to a single application, they can be very efficient, with reduced resource usage and power consumption over processor-based software implementations. Reconfigurable logic offers similar advantages over general-purpose processors. However, the overhead of providing general-purpose logic and routing resources means that FPGA-based systems typically provide lower density and performance than ASICs. Still, reconfigurable logic can provide a level of specialization beyond what is possible for an ASIC: optimizing circuits not just for a particular problem but for a particular instance of it. For example, an encryption application can create custom FPGA mappings every time a new password is given, allowing any password to be supported yet providing very highly optimized circuitry.
The basic concept of instance-specific design is to optimize a circuit for a particular computation. This can allow a reduction in area and/or an increase in processing speed by sacrificing the flexibility of the circuit. It is important to distinguish between the FPGA itself, which is inherently flexible and can be reconfigured to suit any application by loading a new bitstream, and the current configuration of the chip, which may have a certain level of flexibility in processing its inputs.
One common way of achieving instance-specific designs automatically is constant folding (Section 22.2.3), which involves propagating static input values through a circuit to eliminate unnecessary logic. Thus, in our encryption example, an exclusive-or (XOR) gate with one input driven by a password bit can be replaced with a wire or an inverter because the value of that bit is known for each specific password. To produce an instance-specific design, one first needs a means of providing a particular instance for a given design. In the previous encryption example, if all the passwords are known at design time, an instance-specific design specialized for each password can be produced, say by constant propagation followed by the usual tools such as placement (Chapter 14), routing (Chapter 17), and bitstream generation (Chapter 19). At runtime, a processor is often used to control the configuration of the FPGA by the appropriate bitstream at the right moment to support a particular password. However, if the passwords are known only at runtime, then the designer has to decide whether the benefits of having instance-specific designs outweigh the time to produce them, since, for instance, current place and route tools often take a long time to complete and their use is usually not recommended at runtime. Fortunately for some applications, differences between instances are so small that they can be generated realistically using runtime partial evaluation (Section 22.2). The ability to implement specialized designs, while at the same time providing flexibility by allowing different specialized designs to be loaded onto a device, can make reconfigurable logic more effective at implementing some applications than what is possible with ASICs. For other applications, performance improvements from optimizing designs to a particular problem instance can help shift the price/performance ratio away from ASICs and toward FPGAs. 
Specializing a Data Encryption Standard (DES) crypto-processor, for example, can save 60 percent in area, while replacing general multipliers with constant coefficient versions can save area and lead to speedups of two to four times. Instance-specific designs can also consume less power. Bit-width optimization of digital filters, for example, has been shown to reduce power consumption by up to 98 percent [2]. Changing an instance-specific design at runtime is generally much slower than changing the inputs of a general circuit, because a new (or partial) configuration must be loaded. Because this may take many tens or hundreds of milliseconds, it is important to carefully choose how a design is specialized.
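The XOR example from the encryption discussion above can be sketched in software. This is an illustration only: the function names and the 4-bit key are hypothetical, and real key-specific designs fold constants through far larger circuits than a single row of gates.

```python
def run_generic(key_bits, data_bits):
    """General circuit: one XOR gate per bit, key supplied as an input."""
    return [d ^ k for d, k in zip(data_bits, key_bits)]


def specialize_xor_stage(key_bits):
    """Fold a known key into the stage: each XOR becomes a wire or inverter."""
    return ["inverter" if k else "wire" for k in key_bits]


def run_specialized(stage, data_bits):
    """Evaluate the folded stage; no key input and no XOR gates remain."""
    return [d ^ 1 if kind == "inverter" else d
            for kind, d in zip(stage, data_bits)]


key = [1, 0, 1, 1]
stage = specialize_xor_stage(key)
print(stage)                                 # ['inverter', 'wire', 'inverter', 'inverter']
print(run_specialized(stage, [0, 1, 1, 0]))  # [1, 1, 0, 1]
print(run_generic(key, [0, 1, 1, 0]))        # [1, 1, 0, 1]
```

The specialized stage gives the same outputs as the general one for this key, but a new key requires calling `specialize_xor_stage` again, which mirrors the need to generate and load a new configuration.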
22.1.1
Taxonomy
Types of instance-specific optimizations
We can divide the different approaches to optimizing a design for a particular problem instance into three main categories. Table 22.1 lists some examples of the different categories used.
TABLE 22.1 Examples of the uses of instance-specific designs
Category | Purpose | Example use | Impact
Constant folding | Optimize logic for static inputs | Key-specific DES | 60% area reduction
Function adaptation | Optimize for desired quality of result | Accuracy-guaranteed bitwidth optimization [4] | 26% area reduction, 12% latency reduction
Architecture adaptation | Achieve a specified performance, area, or power target | Custom instruction processors [3] | 72% decrease in runtime for 3% more area
Constant folding
Constant folding is the process of eliminating unnecessary logic that computes functions with some inputs that never change or that
change only rarely. This logic can be specialized to increase performance and reduce area. Examples of circuits that can benefit from constant folding will be seen later, and a more detailed description of the technique can be found in Section 22.2.3. Function adaptation Function adaptation is the process of altering a circuit’s function to achieve a specific quality of result. Typically this involves varying the number of bits used to represent data values or switching between floating-point and fixed-point arithmetic functions. It can also involve adding or removing parts of processing units that affect accuracy—for example, adding or removing stages from a CORDIC circuit. Word-length optimization can be treated automatically (Chapter 23), modifying a circuit’s area to meet particular accuracy constraints. Architecture adaptation Architecture adaptation alters the way in which a circuit computes a result while keeping the overall function the same. This can entail introducing additional parallelism to increase speed, serializing existing parallel processing units to save area, or refining processing capabilities to exploit some expected characteristics of the input data. Custom instruction processors (see Figure 22.4 later) are one example of the latter type of architecture adaptation.
22.1.2 Approaches
Instance-specific circuits can be produced either by specializing a general-purpose circuit or by starting directly from a “template” that must be instantiated for a particular problem instance before use, as shown in Figure 22.1. Specialization has the advantage that it can often be performed automatically, using techniques such as partial evaluation (Section 22.2). The template approach usually requires the manual design of a template circuit substantially different from the general-purpose architecture, but it can provide a greater level of optimization than is possible through specializing a general-purpose circuit. It can also offer the advantage that the hardware compilation process may need to be executed only once, with instance-specific information being annotated directly into the bitstream.

In both cases, one or more instance-specific designs will be produced that can be converted into bitstreams through the FPGA design flow (see chapters in Part III). The appropriate bitstream can then be used to configure an FPGA, usually under the control of a general-purpose processor. During the reconfiguration process the FPGA will usually not be able to process data, although some partially reconfigurable devices can reconfigure some of their resources while the remaining resources stay operational.

FIGURE 22.1 General-purpose hardware (a) can be implemented using FPGAs or ASICs. Instance information (b) can be incorporated at hardware generation to produce a specialized circuit. “Template” hardware (c) can be generated and then instantiated for particular problem instances. The difference between (b) and (c) is that in (b) the time-consuming process of hardware compilation must be executed for each instance, while in (c) hardware compilation may need to be run only once, after which the final circuit bitstream can be amended.
22.1.3 Examples of Instance-specific Designs
The benefits of instance-specific design can be illustrated by considering a few examples of its use. In this section we present three examples of specialization by constant folding into an existing design, and two examples of architecture adaptation.

Constant coefficient multipliers. If using standard logic cells, multipliers are relatively expensive to implement on FPGAs. A standard combinational multiplier ANDs each bit of input B with all bits of input A (to perform the multiply by 0/1); an adder is then used to sum together the partial products. When one coefficient of the multiplication is constant, however, the required area can be reduced dramatically. The AND functions are unnecessary because multiplying by a fixed 0 or 1 is trivial, and the adders can be eliminated for bits of B that are 0 (and thus have a partial product of 0).

Constant coefficient multiplication is a useful operation in many signal-processing applications. Finite impulse response (FIR) filters contain a set of multiply–add cells that multiply the value of the input signal across a number of cycles with filter coefficients and then sum these values. The multiplier coefficients are properties of the filter and do not change with the input data; they need adjusting only when different filter properties are required. Thus, the generic multipliers in a FIR filter circuit can often be replaced by smaller constant coefficient multipliers (see Figure 22.2).

Another application that requires multipliers with constant coefficients is conversion from RGB to YUV video signals. This is a matrix multiplication operation where one matrix is constant, allowing specialized multipliers to be used.

Key-specific crypto-processors. Cryptographic algorithms are often designed for efficient implementation in both hardware and software.
Block ciphers, such as DES and its successor Advanced Encryption Standard (AES), have regular algorithmic structures consisting of simple operations, such as XOR and bit permutation, that are efficiently implemented in hardware. The DES algorithm consists of 16 “rounds,” or processing stages, that can be pipelined for parallel operation. Blocks of 64-bit data are input to the array along with a 56-bit key and processed through each round, with the same key required to decrypt the data at the other end of the communication channel. A single DES round is illustrated in Figure 22.3. In typical operation it is likely that a crypto-processor is used to process large blocks of data with the same key—for example, when transferring data between a single sender and receiver in a network or encrypting a large file to be saved to disk. It is therefore expected that, in contrast to the data input, the key value will change very slowly. The shaded area of Figure 22.3 is key generator circuitry that generates the round key from the master key and then uses it as an input to a set of 2-input XOR functions across the data bits.
FIGURE 22.2 FIR filters utilizing (a) general multipliers with variable filter coefficients and (b) instance-specific multipliers specialized to filter coefficients.
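The specialization shown in Figure 22.2 can be imitated in software. The following Python sketch is illustrative only: the names constant_multiplier and fir are hypothetical and not taken from the chapter, and the model assumes non-negative integer coefficients. It builds a multiplier that keeps one shifted partial product per coefficient bit that is 1, and reports how many adders remain.

```python
def constant_multiplier(coefficient: int):
    """Build a multiplier specialized to a fixed non-negative coefficient.

    Returns (multiply, adder_count). Only bit positions where the
    coefficient is 1 contribute a shifted partial product; positions that
    are 0 need neither AND gates nor an adder.
    """
    if coefficient < 0:
        raise ValueError("this sketch handles non-negative coefficients only")
    shifts = [i for i in range(coefficient.bit_length()) if (coefficient >> i) & 1]

    def multiply(a: int) -> int:
        # Sum of shifted copies of the input, one per set coefficient bit.
        return sum(a << s for s in shifts)

    # k partial products need k - 1 adders (none when k is 0 or 1).
    adder_count = max(len(shifts) - 1, 0)
    return multiply, adder_count


def fir(samples, coefficients):
    """Direct-form FIR filter built from specialized multipliers."""
    taps = [constant_multiplier(c)[0] for c in coefficients]
    out = []
    for n in range(len(samples)):
        acc = 0
        for k, tap in enumerate(taps):
            if n - k >= 0:
                acc += tap(samples[n - k])
        out.append(acc)
    return out
```

For a coefficient of 10 (binary 1010) the model keeps two partial products and a single adder, whereas a general 4-bit multiplier would sum four partial products.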
When the key value is known, the key generation circuitry can be eliminated and the XOR functions replaced with either wires or inverters [5]. In fact, these inverters can be merged into the substitution stage, eliminating the inverter logic as well [11]. Key-specific crypto-processors can exhibit much higher throughput than general versions, even outperforming ASIC implementations. Area savings are also significant: a relatively simple specialization of a placed DES description can yield area savings of 60 percent when implemented on a Xilinx Virtex FPGA [9].

Network intrusion detection. Network Intrusion Detection Systems (NIDS) perform deep packet inspection on network packets to identify malicious attacks. Normally, these systems are implemented in software, but on high-speed networks software alone is often unable to process all traffic at the full data rate.
FIGURE 22.3 A single round of a DES circuit. The shaded area contains key expansion circuitry that can be eliminated in a key-specific DES circuit, allowing the XOR function to be optimized.
The SNORT open source NIDS (see http://www.snort.org) uses a rule-based language to detect abnormal network activities. It contains thousands of rules, more than 80 percent of which contain signatures that must be matched against packet contents. Eighty percent of the CPU time for SNORT is consumed by this string-matching task [6]. String matching can be done efficiently in hardware and in particular can be easily optimized for particular search strings. While network data might be expected to arrive at high speed, the rule set changes much more slowly, so string-matching circuitry on FPGAs can be customized to match particular signatures. Section 22.2.5 illustrates in more detail how an instance-specific pattern matcher can be constructed. Further information about instance-specific designs for SAT solving applications can be found in Chapter 29.

Customizable instruction processors. General-purpose instruction processors are very flexible computational devices. Application-specific instruction processors, in contrast, have been customized to perform particularly well in a particular application area. This is a form of architecture adaptation that can improve performance for particular problem instances while maintaining the flexibility of the overall system.
FIGURE 22.4 A simplified architecture of a custom instruction processor. The standard arithmetic and logic operations are augmented by custom execution units that can accelerate particular applications.
Figure 22.4 illustrates the architecture of a simple custom instruction processor that has standard arithmetic and logic functions implemented by a standard ALU. These functions can be supported by additional custom execution units to accelerate particular applications. The automatic identification of instructions that can benefit from the custom execution units is a topic of active research [1]. Further information about partitioning sequential and parallel programs for software and hardware execution can be found in Chapter 26.
22.2 PARTIAL EVALUATION

Partial evaluation is a process that automates specialization in software or hardware. In both cases the motivation is the same: to produce a design that runs faster than the original. In software, partial evaluation can be thought of as a combination of constant folding, loop unrolling, function inlining, and interprocedural analyses; in hardware, constant folding is mainly used as an optimization method. Partial evaluation is accomplished by detecting fragments of hardware that depend exclusively on variables with fixed values and then optimizing the hardware logic to reduce its area or even eliminate it totally from the design by precomputing the result.
22.2.1 Motivation
Partial evaluation can simplify logic, and thus reduce area and increase performance. Figure 22.5 illustrates its impact on a 2-input XOR function. When both inputs are dynamic, the logical function must be implemented; however, when one input is known, a partial evaluator can simplify the circuit. If one input is fixed high, the XOR functions as an inverter and so can be replaced by a 1-input NOT gate; if the input is fixed low, the XOR serves as a wire and the logic can be completely eliminated. Constant folding propagates constants through a circuit and can substantially simplify logic functions. This can both reduce area (by allowing functions to be implemented using fewer LUTs) and increase performance (by reducing the number of logic levels between registers). In this chapter we highlight two related uses of partial evaluation for circuits. The first, at the beginning of Section 22.2.4, optimizes generic circuit descriptions for improved performance. That is, circuits are described using clear and easily maintainable but nonoptimal design patterns, which are then automatically optimized during synthesis. The second, in the middle of Section 22.2.4, specializes general circuits when some inputs are static, such as constant coefficient arithmetic.
FIGURE 22.5 Partial evaluation of an XOR gate. (a) A 2-input XOR function with inputs A and B and output C can be specialized, when input A becomes static, to (b) an inverter when A is true or (c) a wire when A is false.

(a)  A  B | C        (b)  B | C        (c)  B | C
     0  0 | 0             0 | 1             0 | 0
     0  1 | 1             1 | 0             1 | 1
     1  0 | 1
     1  1 | 0
22.2.2 Process of Specialization
Consider a general circuit C producing output R, whose inputs are partitioned into two sets S and D, where S is the set of static inputs that are known at compile time and D is the set of dynamic inputs:

$$R = C(S, D)$$

This circuit can be specialized for a particular set of values X of the S inputs such that it computes the same result for all possible inputs D:

$$R = C_{S=X}(D)$$

A partial evaluator is an algorithm P that, when supplied with the circuit C, the static inputs S, and their values X, produces a specialized circuit $C_{S=X}$:

$$C_{S=X} = P(C, S, X)$$

The importance of partial evaluation is that the specialized circuit computes precisely the same result as the original circuit, though it may require less hardware to do so. Relating this framework to the XOR gate example, R = XOR(A, B), with S = {A} and D = {B}, the two possible simplified functions can be described as $R = \mathrm{XOR}_{A=X}(B)$ for the two possible values of A. Consistent with Figure 22.5, the gate becomes a wire when A is false and an inverter when A is true:

$$\mathrm{XOR}_{A=0} = P(\mathrm{XOR}, A, 0) = B$$

$$\mathrm{XOR}_{A=1} = P(\mathrm{XOR}, A, 1) = \mathrm{NOT}(B)$$
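The contract that a partial evaluator must satisfy can be illustrated with a small Python model. This is a sketch under assumptions, not the chapter's algorithm: circuits are modeled as Python functions with Boolean keyword arguments, and the names specialize and truth_table are hypothetical. The model fixes the static inputs but performs no logic simplification; it only shows that the specialized circuit agrees with the original for every dynamic input.

```python
from itertools import product


def specialize(circuit, static_values):
    """Return C_{S=X}: `circuit` with the named static inputs fixed.

    `circuit` takes Boolean keyword arguments; `static_values` maps the
    static input names S to their known values X.
    """
    def specialized(**dynamic):
        return circuit(**static_values, **dynamic)
    return specialized


def truth_table(circuit, names):
    """Tabulate a circuit over all assignments to the inputs in `names`."""
    return {bits: circuit(**dict(zip(names, bits)))
            for bits in product([False, True], repeat=len(names))}


def xor(A, B):
    return A != B


xor_a0 = specialize(xor, {"A": False})   # behaves as a wire: C = B
xor_a1 = specialize(xor, {"A": True})    # behaves as an inverter: C = NOT(B)
```

Tabulating xor_a0 and xor_a1 over B reproduces the two single-input truth tables of the XOR example.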
22.2.3 Partial Evaluation in Practice
Constant folding in logical expressions. Partial evaluation of logic is well understood and has been used to simplify circuit logic for many years. Figure 22.6 gives a simple partial evaluation function, P(S)[[X]], for optimizing Boolean logic expressions expressed using not, and, and or connectives. The function is parameterized by a set S of pairs mapping static variables to their values and a Boolean expression X represented as a tree.

The function is defined recursively on the structure of Boolean expressions. Cases (1), (2), and (3) are base conditions, indicating that partial evaluation of the Boolean constants True and False always has no effect, and partial evaluation of a variable a returns either the constant value of that variable (if it is contained within the static inputs) or the variable name if it is not static (i.e., remains dynamic). Case (4) defines partial evaluation of a single-input not function. If the subexpression evaluates to logical truth or falsity, this is inverted by the conditional check. Otherwise, the partially evaluated subexpression is returned with the not operation. Cases (5) and (6) define partial evaluation of 2-input and and or functions. The process is the same: Simplify the subexpressions, precompute the function result if possible, and, if not, return the function with simplified arguments.

(1) P(S)[[ True ]]   =  True
(2) P(S)[[ False ]]  =  False
(3) P(S)[[ a ]]      =  if a ∈ dom(S) then P(S)[[ S(a) ]] else a
(4) P(S)[[ ¬x ]]     =  let y = P(S)[[ x ]]
                        if y == True then False
                        else if y == False then True
                        else ¬y
(5) P(S)[[ x & y ]]  =  let x’ = P(S)[[ x ]]
                        let y’ = P(S)[[ y ]]
                        if (x’ == False || y’ == False) then False
                        else if x’ == True then y’
                        else if y’ == True then x’
                        else x’ & y’
(6) P(S)[[ x + y ]]  =  let x’ = P(S)[[ x ]]
                        let y’ = P(S)[[ y ]]
                        if (x’ == True || y’ == True) then True
                        else if x’ == False then y’
                        else if y’ == False then x’
                        else x’ + y’

FIGURE 22.6 A partial evaluation algorithm for simplifying Boolean logic expressions.

As an example, consider the application of this algorithm to the simplification of the XOR function in Figure 22.5. XOR can be described in terms of basic Boolean operators as

a xor b = (a & ¬b) + (¬a & b)

Partially evaluating when a is asserted, the function is executed:

(i) P({a → True}) [[(a & ¬b) + (¬a & b)]]
Case (6) for simplifying logical-or is used, and the two subexpressions are partially evaluated separately:

(ii) P({a → True}) [[a & ¬b]]
(iii) P({a → True}) [[¬a & b]]

Both (ii) and (iii) are partially evaluated by the case for logical-and. For (ii) the two subexpressions are first evaluated as

(iv) P({a → True}) [[a]] = True
(v) P({a → True}) [[¬b]] = ¬b

In (iv), the variable a is within the static inputs S and thus is simplified to True, while ¬b is unchanged because it does not contain a. The results from partially evaluating (iii) are similar:

(vi) P({a → True}) [[¬a]] = P({a → True}) [[¬True]] = False
(vii) P({a → True}) [[b]] = b

Equipped with the simplified subexpressions, the expression a & ¬b is simplified to ¬b and the expression ¬a & b is simplified to False. At the top level this gives a logical-or, ¬b + False:

(viii) P({a → True}) [[¬b + False]] = ¬b
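Steps (i) to (viii) can be reproduced mechanically. The Python sketch below follows the six cases of Figure 22.6; the tuple encoding of expression trees and the names pe and XOR are assumptions made for illustration. Because static values are taken to be Boolean constants, case (3) returns the stored value directly instead of recursing on it.

```python
def pe(static, expr):
    """Partially evaluate a Boolean expression tree (after Figure 22.6).

    expr is True, False, a variable name (str), or a tuple
    ("not", x), ("and", x, y), ("or", x, y).
    static maps variable names to known Boolean values.
    """
    if expr is True or expr is False:          # cases (1) and (2)
        return expr
    if isinstance(expr, str):                  # case (3)
        return static[expr] if expr in static else expr
    op = expr[0]
    if op == "not":                            # case (4)
        y = pe(static, expr[1])
        if y is True:
            return False
        if y is False:
            return True
        return ("not", y)
    x, y = pe(static, expr[1]), pe(static, expr[2])
    if op == "and":                            # case (5)
        if x is False or y is False:
            return False
        if x is True:
            return y
        if y is True:
            return x
        return ("and", x, y)
    if op == "or":                             # case (6)
        if x is True or y is True:
            return True
        if x is False:
            return y
        if y is False:
            return x
        return ("or", x, y)
    raise ValueError(f"unknown operator: {op!r}")


# a xor b = (a & ¬b) + (¬a & b)
XOR = ("or", ("and", "a", ("not", "b")), ("and", ("not", "a"), "b"))
```

Calling pe({"a": True}, XOR) returns ("not", "b"), the same result as step (viii).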
The XOR function reduces to a single inverter; if supplied with {a → False} the partial evaluation function instead returns just b, indicating the simple wire. This is consistent with the truth tables in Figure 22.5.

The partial evaluation function just given is quite simple and does not capture all possible optimizations. For example, the logic function a + ¬a always evaluates to True, regardless of the value of a; however, this expression will not be simplified by this function.

Unnecessary logic removal. Another optimization that can be carried out during partial evaluation is removal of dead logic in a design, which does not affect any output and thus is unnecessary. This is a very important optimization because it allows generic hardware blocks computing many functions to be used in designs, with unused functions pruned during synthesis. As an algorithmic process, logic removal is quite simple and can be formulated in a number of different ways. One of the simplest is to identify each gate whose output is unconnected and eliminate it. By recursively applying this rule we can eliminate acyclic dead logic.
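The rule of repeatedly deleting gates with unconnected outputs can be sketched in a few lines of Python. The netlist encoding (a dictionary from each gate's output net to the nets it reads) and the name remove_dead_logic are assumptions made for this illustration; real netlists carry far more information.

```python
def remove_dead_logic(gates, outputs):
    """Repeatedly delete gates whose output drives nothing.

    gates maps each gate's output net to the list of nets it reads.
    outputs is the set of nets that are primary outputs of the design.
    Returns a new dictionary containing only the surviving gates.
    """
    live = dict(gates)
    while True:
        # A net is used if it is a primary output or feeds a surviving gate.
        used = set(outputs)
        for inputs in live.values():
            used.update(inputs)
        dead = [net for net in live if net not in used]
        if not dead:
            return live
        for net in dead:
            del live[net]


netlist = {
    "n1": ["a", "b"],
    "n2": ["n1", "c"],
    "n3": ["a", "c"],   # feeds only n4
    "n4": ["n3"],       # output unconnected
    "out": ["n2"],
}
pruned = remove_dead_logic(netlist, {"out"})
```

Here n4 is removed in the first pass, which leaves n3 unconnected so that it is removed in the second. A pair of dead gates that feed only each other would survive, which is why the rule handles only acyclic dead logic.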
22.2.4 Partial Evaluation of a Multiplier
Optimizing a simple description. Figure 22.7 shows a shift–add circuit designed for a Xilinx architecture to compute the 3-bit multiplication of two 3-bit inputs. This circuit appears semiregular, with x and y inputs propagating horizontally and vertically through a triangular array of processing cells. Each processing cell has common features; however, it contains slightly different logic depending on its position in the array. Creating and maintaining a circuit description that contains and correctly connects the different types of cell is quite complicated. A simpler approach is to exploit the regularity to describe the circuit as an array of a single type of cell that is then partially evaluated during synthesis to produce the circuit in Figure 22.7.

The general cell of the multiplier can be described as shown in Figure 22.8. This cell implements a multiplication operation for 1 bit of x and 1 bit of y, producing sum and carry-out bits, and can be arranged in a grid to generate a multiplication circuit identical in function to that shown in Figure 22.7. These cells can be implemented densely on Xilinx architectures by using the specialized mult_and, xorcy, and muxcy components in each slice.

FIGURE 22.7 A shift–add multiplier circuit that takes two 3-bit inputs and produces a 3-bit output.

FIGURE 22.8 This cell design can be replicated in a grid arrangement to create a multiplier. (Signals shown: xin, yin, qin, pin, Sumin, xout, yout, qout, pout, Sumout. Components shown: 3-LUT, mult_and, xorcy, muxcy.)
Partial evaluation can automatically produce the optimized multiplication circuitry from the initial regular description. The four components within each cell each have their own logical formula. In the case of mult_and, xorcy, and muxcy, no simplification is possible unless we can totally eliminate these functions, because these are fixed resources on the device, in contrast to the LUT, which can flexibly implement any 4-input function. The logic of the standard cell can be represented as

LUTout = (Yin & Xin) xor Qin = (¬(Yin & Xin) & Qin) + ((Yin & Xin) & ¬Qin)
ANDout = (Yin & Xin)
Pout = (LUTout & Pin) + (¬LUTout & ANDout)
SUMout = (¬LUTout & Pin) + (LUTout & ¬Pin)

This logic can be simplified by two operations: removing unconnected logic and constant folding to optimize the logic that remains. Removal of disconnected logic transforms the grid into the triangular array, while constant folding can be performed by the partial evaluation function introduced in Figure 22.6. For example, for the cells along the bottom in Figure 22.7, inputs Qin and Pin are all zero. This allows the LUT contents to be optimized by

LUTout = P({Qin → False, SUMin → False, Pin → False}) [[(¬(Yin & Xin) & Qin) + ((Yin & Xin) & ¬Qin)]] = (Yin & Xin)

The function attempts to partially evaluate both branches of the OR expression. On the left branch, ¬(Yin & Xin) cannot be further optimized and so is left intact; however, Qin is known to be false, so the entire left branch must be false and thus is eliminated. On the right branch, ¬Qin is evaluated to true and eliminated from the expression, leaving (Yin & Xin) as the simplified function for the LUT contents. ANDout cannot be simplified because both Yin and Xin are unknown. Neither can Pout because, although it can be partially optimized (because Pin is false), it is a fixed component available on the FPGA that cannot be simplified.

Partial evaluation of SUMout does succeed in eliminating logic:

SUMout = P({Qin → False, SUMin → False, Pin → False}) [[(¬LUTout & Pin) + (LUTout & ¬Pin)]] = LUTout

The result of this partial evaluation is that the bottom cells of the multiplier are optimized to remove the unnecessary xorcy component and to simplify the 3-input LUT function into a basic 2-input AND function.

Functional specialization for constant inputs. If some of the input values to the multiplication circuit are known statically, we can apply constant folding to eliminate further logic. For example, assume that x1 is static and always zero. Partially evaluating the cell logic under the new assumption that {Xin → False} we find that the entire cell can be eliminated and replaced with pure routing. The simplified cell is shown in Figure 22.9.
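The bottom-row simplifications can be checked exhaustively with a short Python model of the four cell equations. This is a sketch: the function names cell and bottom_row_simplifies are hypothetical, and the model captures only the Boolean formulas given in the text, not the fixed mult_and, xorcy, and muxcy resources that implement them.

```python
from itertools import product


def cell(yin, xin, qin, pin):
    """Logic of the general multiplier cell, following the text's formulas.

    Returns (lut_out, and_out, p_out, sum_out).
    """
    and_out = yin and xin
    lut_out = (not and_out and qin) or (and_out and not qin)
    p_out = (lut_out and pin) or (not lut_out and and_out)
    sum_out = (not lut_out and pin) or (lut_out and not pin)
    return lut_out, and_out, p_out, sum_out


def bottom_row_simplifies() -> bool:
    """With Qin = Pin = False, check LUTout == Yin & Xin and SUMout == LUTout."""
    for yin, xin in product([False, True], repeat=2):
        lut_out, _, _, sum_out = cell(yin, xin, qin=False, pin=False)
        if lut_out != (yin and xin) or sum_out != lut_out:
            return False
    return True
```

Running bottom_row_simplifies() confirms, for all four combinations of Yin and Xin, that the LUT reduces to a 2-input AND and that SUMout equals LUTout.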
Because a single bit of the x input is shared with an entire column of the multiplier, this specialized cell can be used for the full column, replacing all the logic with routing, as shown in Figure 22.10; this arrangement in turn allows optimizations to be applied to the second LUT in the final column to eliminate the XOR function (not shown in the figure so that the routing can be seen).

FIGURE 22.9 The impact of partial evaluation on multiplier cell logic when Xin = False.

FIGURE 22.10 Multiplier circuit specialized by eliminating the center column when x1 is always zero.
When an x value is known to be true, partial evaluation can still carry out some optimizations. However, it does not offer the significant advantages that result when x is false. The LUT can again be optimized to a 2-input function and the mult_and component can be eliminated. This is not very significant, however: the mult_and component is already present on the device, so no area is saved, and it is utilized in parallel with the (slower) LUT so there is also no performance gain.

Geometric specialization. High-performance FPGA designs often include layout information to produce good placements with low routing delays (see Chapter 17). Specialization of placed designs may lead to nonoptimal results if the placement is not updated to reflect eliminated logic. Automatic placement is not affected, since partial evaluation is usually carried out at the synthesis stage prior to placement and routing. However, when hand-placed designs are specialized, the effect can be to introduce unnecessary delays by failing to compact components. These gaps can also prevent effective use of freed logic because it is fragmented among other components. To ensure a good placement of specialized designs it is necessary to optimize placement information, compacting the circuit. This can be achieved in a framework that allows partial evaluation prior to placement position generation [8] or by describing circuit layouts in a way that adapts when the circuit is specialized [12].
22.2.5 Partial Evaluation at Runtime
Pattern matching is a relatively simple operation that can be performed efficiently in hardware. It is useful in a range of fields but is of particular interest in networking for inspecting the contents of data packets. Figure 22.11 illustrates a simple general pattern matcher made up of a repeating bit-level matcher cell. Each cell contains a pattern and a mask value, which can be loaded separately from the data to be matched. Input data is streamed in 1 bit per cycle; if the mask value for a particular bit position is set, the cell for that position checks the current data value against the bit pattern. The pattern matcher requires one LUT and three registers for each bit in the data pattern. However, it is likely that the pattern and mask values will change much more slowly than the data input, so it is reasonable to investigate the potential for partial evaluation to optimize this circuit for fixed patterns. When the pattern and mask are fixed, the registers storing their values can be eliminated and the logic in the LUTs can be optimized. Figure 22.12 shows how the pattern matcher can be optimized for a pattern of “10X1” (the third pattern bit is a “don’t care,” as specified by the mask of “1101”). This circuit uses fewer registers and three LUTs rather than four. The significance of this particular way of optimizing is that the pattern matcher’s structure has mostly been maintained and thus this specialization can be carried out at runtime. Changes to the mask require routing changes—complex, though far from impossible at runtime; however, the pattern to be matched can be changed merely by updating the LUT contents.
FIGURE 22.11 A general bit-level pattern matcher, shown for 4-bit patterns. The pattern matcher circuit is controlled by a pattern and a mask, which can be loaded by asserting the load signal. If the mask bit is set for a particular position, the matcher will attempt to detect a match between the pattern bit and the data bit.

FIGURE 22.12 An instance-specific pattern matcher optimized for a mask of 1101 and pattern of 10X1 requires only three LUTs and four registers.
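The relationship between the general matcher of Figure 22.11 and the specialized matcher of Figure 22.12 can be modeled behaviorally. The Python sketch below is an assumption-laden illustration: windows, patterns, and masks are strings of characters, the leftmost character is taken to be the oldest bit in the stream, and the function names are hypothetical. It does not model registers or LUT contents.

```python
def general_match(window, pattern, mask):
    """General matcher: compare only the positions whose mask bit is 1."""
    return all(d == p for d, p, m in zip(window, pattern, mask) if m == "1")


def specialize_matcher(pattern, mask):
    """Fold a fixed pattern and mask into a matcher with no stored state.

    Masked-out positions are dropped entirely, so the specialized matcher
    performs one comparison per cared-for bit (three for mask 1101).
    Returns (match, number_of_comparisons).
    """
    checks = [(i, p) for i, (p, m) in enumerate(zip(pattern, mask)) if m == "1"]

    def match(window):
        return all(window[i] == p for i, p in checks)

    return match, len(checks)


def stream_matches(bits, match, width=4):
    """Slide a window over a bit string; return the end index of each match."""
    return [i for i in range(width, len(bits) + 1) if match(bits[i - width:i])]


match_10x1, comparisons = specialize_matcher("10X1", "1101")
```

For the pattern 10X1 with mask 1101, comparisons is 3, mirroring the three LUTs of Figure 22.12, and the specialized matcher agrees with general_match on all sixteen 4-bit windows.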
22.2.6 FPGA-specific Concerns
LUT mapping. Recall the pattern matcher example from the previous section, where we showed one partial evaluation of the circuit for a particular pattern. In this case partial evaluation significantly simplified the contents of each LUT, from a 4-input function to a much simpler 2-input function. It is important to note that, in contrast to ASICs, there is often no performance advantage to be gained by reducing the complexity of logic functions in an FPGA unless the number of LUTs required to implement those functions is reduced. The propagation delay of a LUT is independent of the function it implements; thus, there is no gain in reducing a 4-input function to a 2-input function within the same LUT (although it does allow routing resources to be freed for other uses).

For runtime specialization, it may be desirable to maintain much of the original circuit structure. However, when partial evaluation is carried out at compile time it should be performed before logic is mapped to LUTs, giving more scope for improvements in circuit area and performance. Figure 22.13 shows that the specialized pattern matcher can indeed be implemented using one 4-LUT rather than three 2-LUTs, with higher performance and lower area requirements than the version partially evaluated at runtime. In fact, the static 1-input can also be eliminated from this LUT; however, it has been left to indicate that this LUT structure can be used as part of a chain in a larger pattern matcher.

FIGURE 22.13 The instance-specific pattern matcher from Figure 22.12 can be implemented using a single 4-LUT rather than three 2-LUTs.

Static resources. As alluded to in the multiplier example, the existence of specific resources on an FPGA in addition to LUTs, such as carry chain logic, poses a problem for automatic partial evaluation algorithms. Not only can this logic not be simplified (for example, the xorcy gate cannot be replaced with an inverter), in some cases it cannot be eliminated at all because of routing constraints (carry signals must propagate through muxcy multiplexers, for example, regardless of necessity). Furthermore, it is often important to maintain use of the dedicated carry chain, even though significantly simpler logic could perhaps be generated after partial evaluation, because the carry chain is designed to propagate carry signals very quickly, much faster than the general routing fabric.

Verification of runtime specialization. Dynamic specialization at runtime poses additional verification problems over and above verification of an original design. While a circuit may have been verified through extensive simulation or formal methods prior to synthesis, when it is specialized at runtime it is possible for new errors to be introduced. To avoid this it is necessary to ensure that the algorithms that apply partial evaluation at runtime have themselves been verified. Formal proof is an appropriate methodology for this problem, since it is necessary to check a generic property of the algorithm applied to all circuits rather than any particular specialization operation.
Although formal verification has been applied to partial evaluation algorithms for specialization of FPGA circuits [7, 14], it remains a relatively unexplored area.
22.3 SUMMARY

This chapter described instance-specific design, which offers the opportunity to exploit the reconfigurable nature of FPGAs to improve performance by tailoring circuits to particular problem instances. It can be broadly categorized into three techniques: constant folding, which can be applied when some inputs are static; function adaptation, which alters the function of circuitry to produce a certain quality of result; and architecture adaptation, in which the circuit architecture is adapted without affecting its functional behavior.

The level of automation that can be applied varies among these approaches. Constant folding can often be carried out automatically using partial evaluation techniques. Function adaptation can be performed by varying bit widths and arithmetic methods in parameterized IP cores. Tools, such as Quartz (for low-level design) [12] or ASC (for stream architectures) [10], can produce highly parameterized circuit cores where design parameters can be traded off against each other to achieve the desired requirements in area, speed, and power consumption. Architecture adaptation, such as adding processing units to instruction processors, is typically much less automated. The designer must create separate implementations of the different architectures, optimizing each of them somewhat independently.
References

[1] K. Atasu, R. Dimond, O. Mencer, W. Luk, C. Özturan, G. Dündar. Optimizing instruction-set extensible processors under data bandwidth constraints. Proceedings of Design, Automation and Test in Europe Conference, 2007.
[2] G. A. Constantinides. Perturbation analysis for word-length optimization. Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines, 2003.
[3] R. Dimond, O. Mencer, W. Luk. Application-specific customisation of multithreaded soft processors. IEE Proceedings on Computers and Digital Techniques, May 2006.
[4] D. Lee, A. Abdul Gaffar, R.C.C. Cheung, O. Mencer, W. Luk, G. A. Constantinides. Accuracy guaranteed bit-width optimization. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, October 2006.
[5] J. Leonard, W. Mangione-Smith. A case study of partially evaluated hardware circuits: Key-specific DES. Proceedings of the International Workshop on Field-Programmable Logic and Applications, 1997.
[6] E. P. Markatos, S. Antonatos, M. Polychronakis, K. G. Anagnostakis. Exclusion-based signature matching for intrusion detection. Proceedings of IASTED International Conference on Communication and Computer Networks, 2002.
[7] S. McKeever, W. Luk. Provably-correct hardware compilation tools based on pass separation techniques. Formal Aspects of Computing, June 2006.
[8] S. McKeever, W. Luk, A. Derbyshire. Towards verifying parametrised hardware libraries with relative placement information. Proceedings of the 36th IEEE Hawaii International Conference on System Sciences, 2003.
[9] S. McKeever, W. Luk, A. Derbyshire. Compiling hardware descriptions with relative placement information for parameterised libraries. Proceedings of International Conference on Formal Methods in Computer-Aided Design, LNCS 2517, 2002.
[10] O. Mencer. ASC: A stream compiler for computing with FPGAs. IEEE Transactions on Computer-Aided Design, August 2006.
[11] C. Patterson. High performance DES encryption in Virtex FPGAs using JBits. Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines, 2000.
[12] O. Pell, W. Luk. Compiling higher-order polymorphic hardware descriptions into parametrised VHDL libraries with flexible placement information. Proceedings of the International Workshop on Field-Programmable Logic and Applications, 2006.
[13] O. Pell, W. Luk. Quartz: A framework for correct and efficient reconfigurable design. Proceedings of the International Conference on Reconfigurable Computing and FPGAs, 2005.
[14] K. W. Susanto, T. Melham. Formally analyzed dynamic synthesis of hardware. Journal of Supercomputing 19(1), 2001.
CHAPTER 23

PRECISION ANALYSIS FOR FIXED-POINT COMPUTATION

George A. Constantinides
Department of Electrical and Electronic Engineering
Imperial College, London
Many values in a computation are naturally represented by integers, which have very efficient hardware implementations; basic operations are relatively cheap, and they map well to an FPGA’s underlying hardware. However, some computations naturally result in fractional values, that is, numbers where part or all of the value is less than 1 (for example, 0.25, 3.25, and π), or in values that are so large that representation as integers is too costly (for example, $10^{120}$). Handling these values is a significant concern because the hardware necessary to compute on scaled values can be significant in speed, power consumption, and area. In arithmetic for reconfigurable computing designs, it is common to employ fixed point instead of floating point to represent scaled values. This chapter explores the reason for this design decision and the associated analysis that must be performed in order to choose an appropriate fixed-point representation for a particular design. Since designs for reconfigurable logic can be customized for particular applications, it is appropriate to fit the number system to the underlying application properties.
23.1 FIXED-POINT NUMBER SYSTEM

In general-purpose computing, floating-point representations are most commonly used for the representation of numbers containing fractional components. The floating-point representations standardized by the IEEE [22] have several advantages, the foremost being portability across different computational platforms. In general, we may consider a floating-point number X[t] at time t as made up of two components: a signed mantissa M[t] and a signed exponent E[t] (see equation 23.1). Within this representation, the ratio of the largest positive value of X to the smallest positive value of X varies exponentially with the exponent E[t] and hence doubly exponentially with the number of bits used to store the exponent. As a result, it is possible to store a wide dynamic range with only a few bits of exponent, while the mantissa maintains the precision of the
representation across that range by dividing the corresponding interval for each exponent into equally spaced representable values.

$$X[t] = M[t] \cdot 2^{E[t]} \tag{23.1}$$
However, the flexibility of the floating-point number system comes at a price. Addition or subtraction of two floating-point numbers requires the alignment of radix (“decimal”) points, typically resulting in a large, slow, and power-hungry barrel shifter. In a general-purpose computer, this is a minor concern compared to the need to easily support a wide range of applications. This is why processors designed for general-purpose computing typically have a built-in floating-point unit. In embedded applications, where power consumption and silicon area are of significant concern, the fixed-point alternative is more often used [24].

We can consider fixed point as a degenerate case of floating point, where the exponent is fixed and cannot vary with time (i.e., E[t] = E). The fixing of the exponent eliminates the need for a variable alignment and thus the need for a barrel shifter in addition and subtraction. In fact, basic mathematical operations on fixed-point values are essentially identical to those on integer values. However, compared to floating point, the dynamic range of the representation is reduced because the range of representable values varies only singly exponentially with the number of bits used to represent the mantissa.

When implementing arithmetic in reconfigurable logic, the fixed-point number system becomes even more attractive. If a low-area fixed-point implementation can be achieved, space on the device can be freed for other logic. Moreover, the absence of hardware support for barrel shifters in current-generation reconfigurable logic devices results in an even higher area and power overhead compared to that in fully custom or ASIC technologies.
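The contrast between the two addition schemes can be made concrete with a small software model. The sketch below is only illustrative: the helper names (`to_fixed`, `fixed_add`, `float_add`) are hypothetical, mantissas are modeled as Python integers, and normalization and rounding of the floating-point result are deliberately omitted.

```python
def to_fixed(x, frac_bits):
    """Encode x as an integer mantissa with a fixed exponent of -frac_bits."""
    return round(x * (1 << frac_bits))


def from_fixed(m, frac_bits):
    """Decode an integer mantissa back to a real value."""
    return m / (1 << frac_bits)


def fixed_add(m1, m2):
    # Both operands share the same exponent, so this is plain integer
    # addition: no alignment step and hence no barrel shifter.
    return m1 + m2


def float_add(m1, e1, m2, e2):
    """Add M1 * 2**E1 and M2 * 2**E2 (integer mantissas, no normalization)."""
    if e1 < e2:
        m1, e1, m2, e2 = m2, e2, m1, e1
    # The shift distance depends on the data. In hardware this variable
    # shift is the barrel shifter; low-order bits of m2 may be lost.
    return m1 + (m2 >> (e1 - e2)), e1


total = fixed_add(to_fixed(0.25, 8), to_fixed(3.25, 8))
print(from_fixed(total, 8))
print(float_add(3, 0, 4, -2))
```

The fixed-point path never inspects an exponent, which is the property that makes it cheap in reconfigurable logic.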
23.1.1 Multiple-wordlength Paradigm
For simplicity we will restrict ourselves to 2’s complement representations, although the techniques presented in this chapter apply similarly to most other common representations. Also, we will use dataflow graphs, also known as signal flow graphs in the digital signal processing (DSP) community, as a simple underlying model of computation [12]. In a dataflow graph, each atomic computation is represented by a vertex v ∈ V, and dataflow between these nodes is represented by a set of directed edges S ⊆ V × V. To be consistent with the terminology used in the signal-processing community, we will refer to an element of S as a signal; the terms signal and variable are used interchangeably.

The multiple-wordlength paradigm is a design approach that tries to fit the precision of each part of a datapath to the precision requirements of the algorithm [8]. It can be best introduced by comparison to more traditional fixed-point and floating-point implementations. Each 2’s complement signal j ∈ S in a multiple-wordlength implementation of a dataflow graph (V, S) has two parameters $n_j$ and $p_j$, as illustrated in Figure 23.1(a). The parameter $n_j$ represents the
number of bits in the representation of the signal (excluding the sign bit, by convention), and the parameter $p_j$ represents the displacement of the binary point from the least significant bit (LSB) side of the sign bit toward the LSB. Note that there are no restrictions on $p_j$; the binary point could lie outside the number representation (i.e., $p_j < 0$ or $p_j > n_j$).

FIGURE 23.1 The multiple-wordlength paradigm: (a) signal parameters (“s” indicates a sign bit); (b) fixed point; (c) floating point; (d) multiple wordlength. The triangle represents a constant coefficient multiplication, or “gain”; the rectangle represents a register, or unit sample delay.

A simple fixed-point implementation is illustrated in Figure 23.1(b). Each signal j in this dataflow graph representing a recursive DSP algorithm is annotated with a tuple $(n_j, p_j)$ representing the wordlength and scaling of the signal. In this implementation, all signals have the same wordlength and scaling, although shift operations are often incorporated in fixed-point designs in order to provide an element of scaling control [25]. Figure 23.1(c) shows a standard floating-point implementation, where the scaling of each signal is a function of time.

A single systemwide wordlength is common to both fixed and floating point. This is a result of historical implementation on single, or multiple, predesigned arithmetic units. In FPGAs the situation is quite different. Different operations are generally computed in different hardware resources, and each of these computations can be built to any size desired. Such freedom points to an alternative implementation style, shown in Figure 23.1(d). This multiple-wordlength implementation style inherits the speed, area, and power advantages of traditional fixed-point implementations, since the computation is fixed point with respect to each individual computational unit. However, by potentially allowing each signal in the original specification to be encoded by binary words with different scaling and wordlength, the degrees of freedom in design are significantly increased.
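To make the (n, p) annotation concrete, the following sketch models a single 2’s complement signal in software. It assumes our reading of the convention above: with n bits excluding the sign and scaling p, the LSB has weight 2^(p − n) and the representable range is [−2^p, 2^p − 2^(p − n)]. The function names, and the choice of truncation with saturation on overflow, are illustrative assumptions rather than part of the paradigm.

```python
import math


def value_range(n, p):
    """Smallest and largest values representable by an (n, p) signal."""
    return (-(2.0 ** p), 2.0 ** p - 2.0 ** (p - n))


def quantize(x, n, p):
    """Truncate x onto the (n, p) grid, saturating on overflow (assumed)."""
    lsb = 2.0 ** (p - n)                  # weight of the least significant bit
    code = math.floor(x / lsb)            # truncation toward minus infinity
    code = max(-(1 << n), min((1 << n) - 1, code))
    return code * lsb


print(value_range(3, 0))      # binary point directly after the sign bit
print(quantize(0.3, 3, 0))    # fractional signal, p = 0
print(quantize(5.0, 4, 3))    # p = 3 leaves room for magnitudes below 8
print(value_range(3, -1))     # binary point outside the word (p < 0)
```

Because n and p are per-signal parameters, two signals in the same datapath can use different grids; a uniform fixed-point design corresponds to calling these helpers with one (n, p) pair everywhere.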
23.1.2 Optimization for Multiple Wordlength
Now that we have established the possibility of using multiple scalings and wordlengths for different variables, two questions arise: How can we optimize the scalings and wordlengths in a design to match the computation being performed, and what are the potential benefits from doing so? For FPGA-based implementation, the benefits have been shown to be significant: Area savings of up to 45 percent [8] and 80 percent [15] have been reported compared to the use of a single wordlength across the entire circuit. The main substance of this chapter is to describe suitable scaling and wordlength optimization procedures to achieve such savings.

Section 23.2 shows that we can determine the appropriate scaling for a signal from an estimation of its peak value over time. Two main techniques, analytical and simulation based, are then introduced to perform this peak estimation. While an analytical approach provides a tight bound on the peak signal value, it is limited to computations exhibiting certain mathematical properties. For computations outside this class, an analytical technique tends to be pessimistic, and so simulation-based methods are commonly used.

Section 23.3 focuses on determining the wordlength for each signal in the computation. The fundamental issue is that, because of roundoff or truncation, the wordlength of different signals in the system can have different impacts on both the implementation area and the error observed at the computation output. Thus, any wordlength optimization system needs to perform a balancing act between these two factors when allocating wordlength to signals. The goal of the work presented in this section is to allocate wordlength so as to minimize the area of the resulting circuit while maintaining an acceptable computational accuracy at the output of the circuit.
23.2 PEAK VALUE ESTIMATION

The physical representation of an intermediate result in a bit-parallel implementation of an algorithm consists of a finite set of bits, usually encoded using 2’s complement representation. To make efficient use of the resources, it is essential to select an appropriate scaling for each signal. Such a scaling should ensure that the representation is not overly wasteful in catering to rare or impossibly large values and that overflow errors, which lead to low arithmetic quality, do not occur often. To determine an appropriate scaling, it is necessary to determine the peak value that each signal can reach. Given a peak value P, a power-of-two scaling p is selected with $p = \lfloor \log_2 P \rfloor + 1$, since power-of-two multiplication is free in a hardware implementation. For some algorithms, it is possible to estimate the peak value that each signal could reach using analytic means. In Section 23.2.1, such techniques for two different classes of system are discussed. The alternative, to use simulation to determine the peak signal value, is described in Section 23.2.2.
Also discussed are some hybrid techniques that aim to combine the advantages of both approaches.
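As a small illustration of the scaling rule, the sketch below computes p from a peak value. It assumes that the logarithm in the rule is rounded down (floor), so that 2^p strictly exceeds the peak; the function name is hypothetical.

```python
import math


def scaling_from_peak(peak):
    """Power-of-two scaling p = floor(log2(peak)) + 1 for a positive peak."""
    if peak <= 0:
        raise ValueError("peak value must be positive")
    return math.floor(math.log2(peak)) + 1


for peak in (0.3, 1.26, 3.42, 5.0):
    p = scaling_from_peak(peak)
    # 2**p is the smallest power of two strictly greater than the peak.
    print(peak, p, 2.0 ** p)
```

Note that a peak below 0.5 yields a negative p, which corresponds to the binary point lying outside the word, as permitted in Section 23.1.1.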
23.2.1 Analytic Peak Estimation
If the DSP algorithm under consideration is a linear time-invariant system, it is possible to find a tight analytic bound on the peak value reachable by every signal in it. This is the problem addressed in the section immediately following. If, on the other hand, the system is nonlinear or time varying, such an approach cannot be used. If the algorithm is nonrecursive (that is, the dataflow graph does not contain any feedback loops), data range propagation may be used to determine an analytic bound on the peak value of each signal. However, this approach, described in the later subsection on data range propagation, cannot be guaranteed to produce a tight bound.

Linear time-invariant systems

A linear time-invariant (LTI) system is one that obeys the distinct properties of linearity and time invariance. A linear system is one that obeys superposition; that is, if its output is the sequence $y_1[t]$ in response to input $x_1[t]$, and is $y_2[t]$ in response to input $x_2[t]$, then it will be $\alpha y_1[t] + \beta y_2[t]$ in response to input $\alpha x_1[t] + \beta x_2[t]$. A time-invariant system is one that, given the input $x[t]$ and the corresponding output $y[t]$, will provide output $y[t - t_0]$ given input $x[t - t_0]$. In other words, shifting the input sequence in time merely shifts the output sequence by the same amount.

From a practical perspective, any computation made entirely of addition, constant coefficient multiplication, and delay operations is guaranteed to be LTI. This class of algorithms, while restricted, is extremely important; it contains all the fundamental building blocks of DSP, such as finite impulse response (FIR) and infinite impulse response (IIR) filters, together with transformations such as the discrete cosine transform (DCT), the fast Fourier transform (FFT), and many color–space conversions.

The remainder of this section assumes a basic knowledge of digital signal processing, in particular the z-transform and transfer functions. For the unfamiliar reader, Mitra [32] provides an excellent introduction. Readers unconcerned with the mechanics of peak estimation for LTI systems may simply take it as read that for such systems it is possible to obtain tight analytic bounds on peak signal values.

Transfer function calculation

The analytical scaling rules derived in this section rely on a knowledge of system transfer functions. A transfer function of a discrete-time LTI system between any given I/O pair is defined to be the z-transform of the sequence produced at that output, in response to a unit impulse at that input [32]; these transfer functions may be expressed as the ratio of two polynomials in $z^{-1}$. The transfer function from each primary input to each signal must be calculated for signal-scaling purposes. This section considers the practical problem of transfer function calculation from a dataflow graph.
Given a dataflow graph G(V, S), let $V_I \subseteq V$ be the set of input nodes, $V_O \subseteq V$ be the set of output nodes, and $V_D \subseteq V$ be the set of unit sample delay nodes. For signal scaling, a matrix of transfer functions H(z) is required, with elements $h_{iv}(z)$ for $i \in V_I$ and $v \in V$ representing the transfer function from the primary input i to the output of node v. Calculation of transfer functions for nonrecursive systems is a simple task, leading to a matrix of polynomials in $z^{-1}$; a straightforward algorithm is presented by Constantinides et al. [12].

For recursive systems, it is necessary to identify a subset $V_c \subseteq V$ of nodes whose outputs correspond to a system state. In this context, a state set consists of a set of nodes that, if removed from the dataflow graph, would break all feedback loops. Once such a state set has been identified, transfer functions can easily be expressed in terms of the outputs of these nodes using algorithms suitable for nonrecursive computations.

Let S(z) be a z-domain matrix representing the transfer function from each input signal to the output of each of these state nodes. The transfer functions from each input to each state node output may be expressed as in equation 23.2, where A and B are matrices of polynomials in $z^{-1}$. Each of these matrices represents a z-domain relationship once the feedback has been broken at the outputs of state nodes. A(z) represents the transfer functions between state nodes and state nodes, and B(z) represents the transfer functions between primary inputs and state nodes.

$$S(z) = A(z)\,S(z) + B(z) \tag{23.2}$$

$$H(z) = C(z)\,S(z) + D(z) \tag{23.3}$$
The matrices C(z) and D(z) are also matrices of polynomials in $z^{-1}$. C(z) represents the z-domain relationship between state node outputs and the outputs of all nodes. D(z) represents the z-domain relationship between primary inputs and the outputs of all nodes. It is clear that S(z) may be expressed as a matrix of rational functions (equation 23.4), where I is the identity matrix of appropriate size. This allows the transfer function matrix H(z) to be calculated directly from equation 23.3.

$$S(z) = \left(I - A(z)\right)^{-1} B(z) \tag{23.4}$$
Example

Consider the simple dataflow graph from Section 23.1.1, shown in Figure 23.1. Clearly, removal of any one of the four internal nodes (adder, gain, delay, or the signal branch) from it will break the feedback loop. Let us arbitrarily choose the adder node as a state node and choose the gain coefficient to be 0.1. The polynomial matrices A(z) to D(z) may then be calculated (equation 23.5).

$$\begin{aligned}
A(z) &= 0.1 z^{-1} \\
B(z) &= 1 \\
C(z) &= \begin{bmatrix} 0 & 1 & 0.1 & 0.1 & 0.1 & 0.1 z^{-1} \end{bmatrix}^T \\
D(z) &= \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 \end{bmatrix}^T
\end{aligned} \tag{23.5}$$

Calculation of S(z) may then proceed following equation 23.4, yielding equation 23.6. Finally, the matrix H(z) can be constructed following equation 23.3, giving equation 23.7.

$$S(z) = \frac{1}{1 - 0.1 z^{-1}} \tag{23.6}$$

$$H(z) = \begin{bmatrix} 1 & \dfrac{1}{1 - 0.1 z^{-1}} & \dfrac{0.1}{1 - 0.1 z^{-1}} & \dfrac{0.1}{1 - 0.1 z^{-1}} & \dfrac{0.1}{1 - 0.1 z^{-1}} & \dfrac{0.1 z^{-1}}{1 - 0.1 z^{-1}} \end{bmatrix}^T \tag{23.7}$$
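The result for the adder output can be checked numerically. The transfer function 1/(1 − 0.1z^{-1}) expands to the power series 1 + 0.1z^{-1} + 0.01z^{-2} + ..., so the impulse response at the adder output should be h[t] = 0.1^t. The sketch below simulates the loop directly; it assumes the adder computes y[t] = x[t] + 0.1·y[t − 1], which is our reading of the example given A(z) = 0.1z^{-1} and B(z) = 1, and the function name is hypothetical.

```python
def impulse_response(gain, length):
    """Adder output of y[t] = x[t] + gain * y[t-1] driven by a unit impulse."""
    y_prev = 0.0
    h = []
    for t in range(length):
        x = 1.0 if t == 0 else 0.0     # unit impulse at t = 0
        y = x + gain * y_prev          # adder: input plus delayed, scaled feedback
        h.append(y)
        y_prev = y                     # unit sample delay
    return h


h = impulse_response(0.1, 6)
expected = [0.1 ** t for t in range(6)]   # series expansion of 1 / (1 - 0.1 z^-1)
for t, (simulated, analytic) in enumerate(zip(h, expected)):
    print(t, simulated, analytic)
```

The same simulation, tapped at the gain or delay outputs instead of the adder, would give the remaining entries of H(z) up to the corresponding factor of 0.1 or 0.1z^{-1}.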
The runtime of this algorithm grows significantly with the number of state signals $|V_c|$, and so selecting a small set of state signals is important. A simple approach is to select all of the delay elements in a circuit, assuming that it has no combinational cycles. Alternatively, techniques such as Levy and Low’s [30] can be employed.

Scaling with transfer functions

To produce the smallest fixed-point implementation, it is desirable to utilize as much as possible of the full dynamic range provided by each internal signal representation. The first step of the optimization process is therefore to choose the smallest possible value of $p_j$ for each signal $j \in S$ in order to guarantee no overflow.

Consider a dataflow graph G(V, S), annotated with wordlengths n and scalings p. Recall that $V_I \subseteq V$ denotes the set of input nodes, and let us say that each such node reaches peak signal values of $\pm M_i$ ($M_i > 0$) for $i \in V_I$. Let H(z) be the scaling transfer function matrix defined before, with the associated impulse response matrix h[t] related to the transfer function matrix through the component-wise inverse z-transform. Then the worst-case peak value $P_j$ reached by any signal $j \in S$ is given by maximizing the well-known convolution sum (equation 23.8) [32], where $x_i[t]$ is the value of the input $i \in V_I$ at time index t. Solving this maximization problem provides the input sequence given in equation 23.9, and allowing $N_{ij} \to \infty$ leads to the peak response at signal j given in equation 23.10. Here sgn( ) is the signum function (equation 23.11).

$$P_j = \pm \sum_{i \in V_I} \max_{x_i[t']} \left( \sum_{t=0}^{N_{ij}-1} x_i[t' - t]\, h_{ij}[t] \right) \tag{23.8}$$

$$x_i[t] = M_i \,\mathrm{sgn}\!\left( h_{ij}[N_{ij} - t - 1] \right) \tag{23.9}$$

$$P_j = \sum_{i \in V_I} M_i \sum_{t=0}^{\infty} \left| h_{ij}[t] \right| \tag{23.10}$$

$$\mathrm{sgn}(x) = \begin{cases} 1, & x \ge 0 \\ -1, & \text{otherwise} \end{cases} \tag{23.11}$$
This worst-case approach leads to the concept of l1 scaling, defined in the following paragraphs.
The $l_1$-norm of a transfer function H(z) is given by equation 23.12, where $Z^{-1}\{\,\}$ denotes the inverse z-transform.

$$l_1\{H(z)\} = \sum_{t=0}^{\infty} \left| Z^{-1}\{H(z)\}[t] \right| \tag{23.12}$$

A dataflow graph G(V, S) annotated with wordlengths n and scalings p is said to be $l_1$-scaled if equation 23.13 holds for all signals $j \in S$.

$$p_j = \left\lfloor \log_2 \left( \sum_{i \in V_I} M_i \, l_1\{h_{ij}(z)\} \right) \right\rfloor + 1 \tag{23.13}$$
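A numerical sketch of l1 scaling for a single signal follows. It approximates the l1-norm by summing the magnitudes of a truncated impulse response and then applies the scaling rule, taking the logarithm rounded down. The impulse response h[t] = 0.1^t of the adder output from the earlier example is used, and the input peak M = 1 is an assumption made purely for illustration; the function names are hypothetical.

```python
import math


def l1_norm(h):
    """Approximate l1-norm: sum of magnitudes of a truncated impulse response."""
    return sum(abs(v) for v in h)


def l1_scaling(input_peaks, impulse_responses):
    """Worst-case peak bound and scaling p for one signal.

    input_peaks[i] is M_i and impulse_responses[i] is h_ij[t] for input i.
    """
    bound = sum(m * l1_norm(h) for m, h in zip(input_peaks, impulse_responses))
    return math.floor(math.log2(bound)) + 1, bound


# Adder output of the example loop: h[t] = 0.1**t, truncated after 50 terms.
h_adder = [0.1 ** t for t in range(50)]
p, bound = l1_scaling([1.0], [h_adder])
print(bound)   # close to 1 / (1 - 0.1)
print(p)
```

Truncating the sum is only safe when the impulse response decays quickly, as it does here; for slowly decaying responses a closed-form sum or a much longer window would be needed.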
The important point about an $l_1$-scaled algorithm is that the scalings used are optimal in the following sense. If any scaling is reduced below its value from equation 23.13, it is possible for overflow to result on that variable. If any scaling is increased beyond its value from equation 23.13, the area of the resulting implementation increases or stays the same without any matching improvement in arithmetic quality observable at the algorithm outputs.

Data range propagation

If the algorithm under consideration is not linear or time invariant, one mechanism for estimating the peak value reached by each signal is to consider the propagation of data ranges through the computation graph. This is generally possible only for nonrecursive algorithms.

Forward propagation

A naive way of approaching this problem is to examine the binary-point position that “naturally” results from each hardware operator. Such an approach, illustrated here, is an option in the Xilinx System Generator tool [20]. In the dataflow graph shown in Figure 23.2, if we consider that each input has a range (−1, 1), then we require a binary-point location of p = 0 at each input. Let us consider each of the adders in turn. Adder a1 adds two inputs with p = 0 and therefore produces an output with p = max(0, 0) + 1 = 1. Adder a2 adds one input with p = 0 and one with p = 1, and therefore produces an output with p = max(0, 1) + 1 = 2. Similarly, the output of a3 has p = 3, and the output of a4 has p = 4. While we have successfully determined a binary-point location for each signal that will not lead to overflow, the disadvantage of this approach
should be clear. The range of values reachable by the system output is actually 5 × (−1, 1) = (−5, 5), so p = 3 is sufficient; p = 4 is an overkill of one MSB.

FIGURE 23.2 A dataflow graph representing a string of additions.

A solution to this problem that has been used in practice is to propagate data ranges rather than binary-point locations [4, 40]. To understand this approach in practice, let us apply the technique to the example of Figure 23.2. The output of adder a1 is a subset of (−2, 2) and thus is assigned p = 1; the output of adder a2 is a subset of (−3, 3) and is thus assigned p = 2; the output of adder a3 is a subset of (−4, 4) and is thus assigned p = 3; and the output of adder a4 is a subset of (−5, 5) and is thus also assigned p = 3. For this simple example, the problem of peak value detection has been solved to optimality.

However, such a tight solution is not always possible with data range propagation. Under circumstances where the dataflow graph contains one or more branches (fork nodes), which later reconverge, such a “local” approach to range propagation can be overly pessimistic. As an example, consider the computation graph representing a constant coefficient multiplication on complex numbers shown in Figure 23.3. In the figure, each signal has been labeled with a propagated range, assuming that the primary inputs have range (−0.6, 0.6). Under this approach, both outputs require p = 2. However, such ranges are overly pessimistic. The upper output in Figure 23.3 has the value $y_1 = 2.1x_1 - 1.8(x_1 + x_2) = 0.3x_1 - 1.8x_2$. Thus, its range can also be calculated as 0.3(−0.6, 0.6) − 1.8(−0.6, 0.6) = (−1.26, 1.26). A similar calculation for the lower output provides a range of (−1.2, 1.2). By examining the global system behavior, we can therefore see that in reality p = 1 is sufficient for both outputs.
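The two examples above can be reproduced with a few lines of interval arithmetic. The sketch below is illustrative only: ranges are modeled as closed (lo, hi) pairs, the scaling is taken as floor(log2(peak)) + 1, and the helper names are hypothetical. It compares binary-point propagation with range propagation for the adder chain, and local with global range propagation for the upper output y1 of the complex multiplier.

```python
import math


def add(a, b):
    """Range of the sum of two signals with ranges a and b."""
    return (a[0] + b[0], a[1] + b[1])


def gain(c, a):
    """Range of c * x for a signal x with range a."""
    lo, hi = c * a[0], c * a[1]
    return (min(lo, hi), max(lo, hi))


def scaling(r):
    """Scaling p for a range, from its peak magnitude."""
    return math.floor(math.log2(max(abs(r[0]), abs(r[1])))) + 1


# String of four adders, every primary input in (-1, 1) with p = 0.
x = (-1.0, 1.0)
p_forward, r = 0, x
for _ in range(4):
    p_forward = max(p_forward, 0) + 1   # binary-point propagation
    r = add(r, x)                       # data range propagation
print(p_forward, r, scaling(r))

# Upper output of the complex multiplier, inputs in (-0.6, 0.6).
xin = (-0.6, 0.6)
y1_local = add(gain(2.1, xin), gain(-1.8, add(xin, xin)))   # follows the graph
y1_global = add(gain(0.3, xin), gain(-1.8, xin))            # simplified expression
print(y1_local, scaling(y1_local))
print(y1_global, scaling(y1_global))
```

The local computation treats the two occurrences of x1 as independent, which is exactly why it overestimates the range once the paths reconverge.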
FIGURE 23.3 Range propagation through a complex constant coefficient multiplier. Triangles represent (real) constant coefficient multiplication.
Note that the analytic scheme described previously for linear time-invariant systems would calculate the tighter bound in this case. In summary, range propagation techniques may provide larger bounds on signal values than are absolutely necessary. This problem is seen in extremis with recursive computation graphs. In these cases, it is generally impossible to use range propagation to place a finite bound on signal values, even in cases when such a finite bound can analytically be shown to exist. Under these circumstances, it is standard practice to use some form of simulation to estimate the peak value of signals.
23.2.2 Simulation-based Peak Estimation
A completely different approach to peak estimation is to use simulation, that is, to actually run the algorithm with one or more provided input datasets and measure the peak values reached by each signal. In its simplest form, the simulation approach consists of measuring the peak signal value $P_j$ reached by a signal $j \in S$ and then setting $p = \lfloor \log_2 kP_j \rfloor + 1$, where k > 1 is a user-supplied “safety factor” (typically 2 to 4). Thus, it is ensured that no overflow will occur so long as the signal value does not exceed $kP_j$ when excited by a different input sequence. Particular care must therefore be taken to select an appropriate test sequence.

Kim et al. [25] extend the simulation approach by considering more complex forms of the safety factor. In particular, it is possible to extract information from the simulation relating to the class of probability density function followed by each signal. A histogram of the data values for each signal is built, and from it the distribution is classified as unimodal or multimodal, symmetric or nonsymmetric, and zero mean or nonzero mean. Different forms of safety factor are applied in each case.

Simulation approaches are appropriate for nonlinear or time-varying systems, for which data range propagation, described in Section 23.2.1, provides overly pessimistic results (such as for recursive systems). The main drawback of simulation-based approaches is the significant dependence on the input dataset used for simulation; moreover, usually no general guidelines can be given for how to select an appropriate input. These approaches can, of course, be combined with the analytical techniques of Section 23.2.1 [13].

There has been some recent work [34] aiming to put the derivation of safety factors on a sound theoretical footing by using the statistical theory of extreme value distributions [26]. It is known that the distribution of the sum of a large number of statistically independent identically distributed (i.i.d.) random variables approaches the Gaussian distribution (the Central Limit Theorem). What is less well known is that the (scaled) maximum value of a large number of i.i.d. variables also approaches one of three possible distributions, no matter the distribution of the variables themselves. These are the Gumbel, Fréchet, and Weibull distributions [26]. Using this property, and making an assumption on the type of distribution converged to (Özer and colleagues [34] assume Gumbel), provides a statistically sound way of estimating the safety factor required for a given arbitrarily small probability of overflow.
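The simplest form of the simulation approach can be sketched in a few lines. The example below measures the peak at the adder output of the recursive loop y[t] = x[t] + 0.1·y[t − 1] used earlier and applies a safety factor k. The input dataset (uniform random samples in (−1, 1)), the seed, the choice k = 2, and the function names are all assumptions made for illustration; the logarithm is rounded down.

```python
import math
import random


def simulate_peak(inputs, gain=0.1):
    """Peak magnitude at the adder output of y[t] = x[t] + gain * y[t-1]."""
    y_prev, peak = 0.0, 0.0
    for x in inputs:
        y = x + gain * y_prev
        peak = max(peak, abs(y))
        y_prev = y
    return peak


def scaling_with_safety(peak, k=2.0):
    """Scaling from a measured peak and a safety factor k > 1."""
    return math.floor(math.log2(k * peak)) + 1


random.seed(0)                                  # fixed seed for repeatability
inputs = [random.uniform(-1.0, 1.0) for _ in range(10000)]
peak = simulate_peak(inputs)
p = scaling_with_safety(peak, k=2.0)
print(peak, p)
```

For this LTI example the measured peak can never exceed the analytic bound of 1/(1 − 0.1), so the safety factor may cost a bit of headroom that l1 scaling would not; for non-LTI systems no such analytic reference is available, which is where simulation is most useful.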
23.2.3 Summary of Peak Estimation
The optimization of a bit-parallel fixed-point datapath can be split into the two problems of determining an appropriate scaling and determining an appropriate wordlength for each signal. We have discussed the first of these two problems in detail. It has been shown that in the case of LTI systems, tight analytic bounds can be placed on the scaling required. Analytic scaling is also possible for non-LTI systems, at the cost of tightness in the bound, disastrously so in the case of recursive systems. The alternative to the analytical approach is the use of simulation on trusted input datasets; some progress has recently been made on the issue of statistically sound simulation-based peak determination.
23.3 WORDLENGTH OPTIMIZATION

Once a scaling has been determined, it is necessary to find an appropriate wordlength for each signal. While optimizing the scaling usually improves circuit quality without changing circuit functionality (assuming no overflows occur), wordlength optimization trades circuit quality (area, delay, power) for result accuracy. The major problem in wordlength optimization is to determine the error at system outputs for a given set of wordlengths and scalings of all internal variables. We will call this problem error estimation. Once a technique for error estimation has been selected, the wordlength selection problem reduces to utilizing the known area and error models within a constrained optimization setting: Find the minimum area implementation satisfying certain constraints on arithmetic error at each system output.

The majority of this section is taken up with the problem of error estimation (Section 23.3.1). Following on from this discussion, the problem of area modeling is addressed. Optimization techniques suitable for solving the wordlength determination problem are introduced (Section 23.3.2), with some discussion of the problem’s inherent computational complexity.
23.3.1 Error Estimation and Area Models
Traditionally, much of the research on estimating the effects of truncation and roundoff noise in fixed-point systems has focused on DSP uniprocessors. This leads to certain constraints and assumptions on quantization errors—for example, that the wordlength of all signals is the same, that quantization is performed after multiplication, and that the wordlength before quantization is much greater than that following it [36]. The multiple-wordlength paradigm allows a more general design space to be explored, free from these constraints. The effect of using finite register length in fixed-point systems has been studied for some time. Oppenheim and Weinstein [36] and Liu [29] lay down standard models for quantization errors and error propagation through LTI systems based on a linearization of signal truncation or rounding. Error signals, assumed to be uniformly distributed, uncorrelated with each other and
with themselves over time, are added whenever a truncation occurs. This approximate model has served very well because quantization error power is dramatically affected by wordlength in a uniform wordlength structure, decreasing at approximately 6 dB per bit. This means that it is not necessary to have highly accurate models of quantization error power in order to predict the required signal width [35]. In a multiple-wordlength circuit, the implementation error power may be adjusted much more finely, and so the resulting implementation tends to be more sensitive to errors in estimation. This has led to a simple refinement of the model, which will be discussed soon.

The most generally applicable method for error estimation is simulation: Simulate the system with a given “representative” input and measure the deviation at the system outputs when compared to an accurate simulation (usually “accurate” means IEEE double-precision floating point [22]). Indeed, this is the approach taken by several systems [6, 27]. Unfortunately, simulation suffers from several drawbacks, some of which correspond to the equivalent simulation drawbacks discussed in Section 23.2, and some of which are peculiar to the error estimation problem. First, there is the problem of dependence on the chosen “representative” input dataset. Second, there is the problem of speed: Simulation runs can take a significant amount of time, and during an optimization procedure a large number of simulation runs may be needed. Third, even the “accurate” simulation will have errors induced by finite wordlength effects that, depending on the system, may not be negligible.

We will be using signal-to-noise ratio (SNR), sometimes referred to as signal-to-quantization-noise ratio (SQNR), as a generally accepted metric for measuring the quality of a fixed-point algorithm implementation [32] (although other measures, such as maximum instantaneous error, exist). 
Conceptually, the output sequence at each system output resulting from a particular finite-precision implementation can be subtracted from the equivalent sequence resulting from an infinite-precision implementation. The difference is known as the fixed-point error. The ratio of the output power (i.e., the sum of squared signal values) resulting from an infinite precision implementation to the fixed-point error power of a specific implementation defines the SNR. For the purposes of this chapter, the signal power at each output is fixed because it is determined by a combination of the input signal statistics and the dataflow graph G(V, S). To explore different implementations of the dataflow graph, it is therefore sufficient to concentrate on noise estimation, which is the subject of this section. The approach taken to wordlength optimization should depend on the mathematical properties of the system under investigation. After briefly considering simulation-based estimation, we will examine analytic or semi-analytic techniques that may be applied to certain classes of system. Next we will describe one such method, which may be used to obtain high-quality results for linear time-invariant algorithms. Then we will generalize this approach to nonlinear systems containing only differentiable nonlinear components.
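The SNR definition above can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: a double-precision sequence stands in for the infinite-precision reference, and the helper names `snr_db` and `truncate` are hypothetical, not taken from any tool discussed in this chapter.

```python
import math


def snr_db(reference, fixed_point):
    """SNR in dB: reference signal power over fixed-point error power."""
    if len(reference) != len(fixed_point):
        raise ValueError("sequences must have equal length")
    signal_power = sum(r * r for r in reference)
    error_power = sum((r - f) ** 2 for r, f in zip(reference, fixed_point))
    if error_power == 0.0:
        return math.inf
    return 10.0 * math.log10(signal_power / error_power)


def truncate(value, frac_bits):
    """Truncate toward minus infinity to a multiple of 2**-frac_bits.

    This mimics 2's complement truncation, whose error is never positive.
    """
    scale = 2 ** frac_bits
    return math.floor(value * scale) / scale


# Assumed test signal: a sampled sinusoid standing in for a system output.
reference = [math.sin(0.1 * t) for t in range(1000)]
for bits in (8, 12, 16):
    quantized = [truncate(x, bits) for x in reference]
    print(bits, "fractional bits:", round(snr_db(reference, quantized), 1), "dB")
```

Because the signal power is fixed by the reference sequence, comparing implementations reduces to comparing their error powers, which is why the remainder of the section concentrates on noise estimation.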
23.3 Wordlength Optimization
Simulation-based methods

Simulation-based methods for wordlength optimization were first established at Seoul National University, and some of them have been integrated into the Signal Processing Worksystem of Cadence. In Kim et al. [25] and Kum and Sung [27], the search space is reduced by grouping together all variables involved in a multiply–add operation and optimizing them as a single-wordlength “block.” Within each block, the Oppenheim model of quantization noise is applied [35]. Although simulation is almost certainly the most widespread mechanism for estimating the impact of a given choice of wordlength, it suffers from the drawbacks discussed earlier. Indeed, the dependence of the result on the input dataset, while widely acknowledged, is rarely considered in depth. The class of algorithm for which simulation forms a suitable mechanism has also remained unclear.

Recently, Alippi [1] proposed an analytical framework within which the question of simulation input dependence can be addressed. A mechanism for understanding the perturbation of Lebesgue-measurable functions, an extremely wide class of algorithmic behavior, has been proposed that uses the theory of randomized algorithms. The essential contribution of this work, for the purposes of fixed-point analysis, has been to demonstrate that simulation is an appropriate mechanism for analyzing fixed-point error. Moreover, Alippi [1] provides a theoretically sound guideline on the number of simulations required in order to be confident, to within a certain probability, that the SNR is within a given limit (alternative signal quality metrics are also Lebesgue measurable and hence can be used as well).

An analytic technique for linear time-invariant systems

We will first address error estimation for LTI systems. An appropriate noise model for truncation of LSBs is described in the subsection that follows.
It is then shown that the noise injected through truncation can be analytically propagated through the system in order to measure the effect of such noise on system outputs.

Noise model

A common assumption in DSP design is that signal quantization (rounding or truncation) occurs only after a multiplication or multiply–accumulate operation. This corresponds to a uniprocessor viewpoint, where the result of an $n_2$-bit signal multiplied by an $n_2$-bit coefficient needs to be stored in an $n_2$-bit register. The result of such a multiplication is an $n_1 = 2n_2$-bit word, which must therefore be quantized down to $n_2$ bits. Considering signal truncation, the least area-expensive method of quantization [18], the lowest value of the truncation error in 2’s complement with $p = 0$ is $2^{-n_1} - 2^{-n_2} \approx -2^{-n_2}$, and the highest value is 0 (2’s complement truncation error is always nonpositive). It has been observed that values between these values tend to be equally likely to occur in practice, so long as the $2n_2$-bit signal has sufficient dynamic range [29, 36]. This observation leads to the formulation of a uniform distribution model [36] for the noise of variance $\sigma^2 = 2^{-2n_2}/12$ for the standard normalization of $p = 0$. It has also been observed that, under the same conditions, the
spectrum of such errors tends to be white because there is little correlation between low-order bits over time even if there is a correlation between high-order bits. Similarly, different truncations occurring at different points within the implementation structure tend to be uncorrelated.

When considering a multiple-wordlength implementation, or truncation at different points within the datapath, some researchers have opted to carry the uniform distribution model over to the new implementation style [25]. However, there are associated inaccuracies involved in such an approach [7]. First, quantizations from $n_1$ bits to $n_2$ bits, where $n_1 \approx n_2$, will suffer in accuracy because of the discretization of the error probability density function; for example, if $p = 0$, $n_1 = 2$, $n_2 = 1$, then the only possible error values are 0 and $-1/4$. Second, in such cases the lower bound on error can no longer be simplified in the preceding manner because $2^{-n_1} - 2^{-n_2} \approx -2^{-n_2}$ no longer holds. These two issues may be resolved by considering a discrete probability distribution for the injected error signal. For 2’s complement arithmetic, the truncation error injection signal $e[t]$ caused by truncation from $(n_1, p)$ to $(n_2, p)$ is bounded by equation 23.14.

$$-2^{p}\left(2^{-n_2} - 2^{-n_1}\right) \le e[t] \le 0 \tag{23.14}$$

It is assumed that each possible value of $e[t]$ has equal probability, as discussed earlier. For 2’s complement truncation, there is nonzero mean $E\{e[t]\}$ (equation 23.15) and variance $\sigma_e^2$ (equation 23.16).

$$E\{e[t]\} = -\frac{1}{2^{n_1-n_2}} \sum_{i=0}^{2^{n_1-n_2}-1} i \cdot 2^{p-n_1} = -2^{p-1}\left(2^{-n_2} - 2^{-n_1}\right) \tag{23.15}$$

$$\sigma_e^2 = \frac{1}{2^{n_1-n_2}} \sum_{i=0}^{2^{n_1-n_2}-1} \left(i \cdot 2^{p-n_1}\right)^2 - E^2\{e[t]\} = \frac{1}{12}\, 2^{2p}\left(2^{-2n_2} - 2^{-2n_1}\right) \tag{23.16}$$
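The closed forms for the mean and variance of the discrete truncation error can be checked by direct enumeration. The following Python sketch does so with exact rational arithmetic; the function names are hypothetical, and the only modeling assumption is the one stated in the text, namely that the $2^{n_1-n_2}$ error values $-i \cdot 2^{p-n_1}$ are equally likely.

```python
from fractions import Fraction


def enumerated_stats(n1, n2, p=0):
    """Mean and variance of the truncation error by listing every error value."""
    if n1 < n2:
        raise ValueError("truncation requires n1 >= n2")
    count = 2 ** (n1 - n2)              # number of equally likely error values
    step = Fraction(2) ** (p - n1)      # weight of the least significant input bit
    errors = [-i * step for i in range(count)]
    mean = sum(errors) / count
    variance = sum(e * e for e in errors) / count - mean * mean
    return mean, variance


def closed_form_stats(n1, n2, p=0):
    """Mean and variance from the closed forms of equations 23.15 and 23.16."""
    two = Fraction(2)
    mean = -(two ** (p - 1)) * (two ** (-n2) - two ** (-n1))
    variance = (two ** (2 * p)) * (two ** (-2 * n2) - two ** (-2 * n1)) / 12
    return mean, variance


# The example from the text: p = 0, n1 = 2, n2 = 1 gives errors 0 and -1/4.
print(enumerated_stats(2, 1))
print(closed_form_stats(2, 1))
```

Using `Fraction` avoids floating-point rounding, so the two functions can be compared for exact equality over any range of wordlengths.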
Note that for $n_1 \gg n_2$ and $p = 0$, equation 23.16 simplifies to $\sigma_e^2 \approx \frac{1}{12}\, 2^{-2n_2}$, which is the well-known predicted error variance of Oppenheim and Schafer [35] for a model with continuous probability density function.

Noise propagation and power estimation

If it is our aim to optimize the wordlengths used in a design, then it is important to be able to predict the arithmetic quality observable at the design outputs. Given a set of wordlengths and scalings, it is possible to use the truncation model described in the previous section to predict the variance of each injection input. For each signal $j \in S$, a straightforward application of equation 23.16 may be used, with $n_1$ equal to the “natural” full-precision wordlength produced by the source component, $n_2 = n_j$, and $p = p_j$. By constructing noise sources in this manner for the entire dataflow graph, a set $F = \{(\sigma_p^2, R_p)\}$ of injection input variances $\sigma_p^2$, and their associated transfer function to each primary output $R_p(z)$, can be constructed. From this set it is possible to predict the nature of the noise appearing at the system primary
outputs, which is the quality metric of importance to the user. Since the noise sources have a white spectrum and are uncorrelated with each other, it is possible to use $L_2$ scaling to predict the noise power at the system outputs. The $L_2$ norm of a transfer function $H(z)$ is defined in equation 23.17, where $Z^{-1}$ denotes the inverse z-transform. It can be shown that the noise variance $E_k$ at output $k$ is given by equation 23.18.

$$L_2\{H(z)\} = \left( \sum_{n=0}^{\infty} \left| Z^{-1}\{H(z)\}[n] \right|^2 \right)^{1/2} \tag{23.17}$$

$$E_k = \sum_{(\sigma^2, R) \in F} \sigma^2 \, L_2^2\{R_k\} \tag{23.18}$$
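As a rough illustration of equations 23.17 and 23.18, the Python sketch below estimates the squared $L_2$ norm of a rational transfer function by summing a truncated impulse response, then accumulates the output noise variance over a list of injection points. The function names, the truncation length, and the first-order recursive filter used as the example are all assumptions made for illustration.

```python
def impulse_response(b, a, length):
    """Impulse response of H(z) = B(z)/A(z) by direct recursion (a[0] must be 1)."""
    h = []
    for n in range(length):
        acc = b[n] if n < len(b) else 0.0
        for k in range(1, len(a)):
            if n - k >= 0:
                acc -= a[k] * h[n - k]
        h.append(acc)
    return h


def l2_norm_squared(b, a, length=2000):
    """Squared L2 norm (equation 23.17), approximated by a finite sum.

    The finite length is an assumption that suits stable filters whose
    impulse response decays well within `length` samples.
    """
    return sum(v * v for v in impulse_response(b, a, length))


def truncation_variance(n1, n2, p=0):
    """Variance of the discrete truncation model (equation 23.16)."""
    return 2.0 ** (2 * p) * (2.0 ** (-2 * n2) - 2.0 ** (-2 * n1)) / 12.0


def output_noise_variance(sources):
    """Equation 23.18 for one output: sources is a list of (variance, (b, a))."""
    return sum(var * l2_norm_squared(b, a) for var, (b, a) in sources)


# Hypothetical example: y[t] = 0.5*y[t-1] + x[t], with a single truncation
# from 24 to 12 bits whose noise reaches the output through 1/(1 - 0.5 z^-1).
sigma2 = truncation_variance(n1=24, n2=12)
noise = output_noise_variance([(sigma2, ([1.0], [1.0, -0.5]))])
print(noise)
```

For this example the impulse response is $0.5^n$, so the squared norm is $1/(1 - 0.25) = 4/3$ and the output variance is $4/3$ times the injected variance. Note that equation 23.18 concerns variance only; the nonzero mean of truncation noise would need to be propagated separately.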
A hybrid approach for nonlinear differentiable systems

With some modification, some of the results from the preceding section can be carried over to the more general class of nonlinear time-varying systems containing only differentiable nonlinearities. In this section we address one possible approach to this problem, deriving from the type of small-signal analysis typically used in analogue electronics [12, 38].

Perturbation analysis

To make some of the analytical results on error sensitivity for LTI systems applicable to nonlinear systems, the first step is to linearize these systems. The assumption is made that the quantization errors induced by rounding or truncation are sufficiently small not to affect the system’s macroscopic behavior. Under such circumstances, each system component can be locally linearized or replaced by its “small-signal equivalent” [38] in order to determine the output behavior under a given rounding scheme. We will consider one such n-input component, the differentiable function $Y[t] = f(X_1[t], X_2[t], \ldots, X_n[t])$, where $t$ is a time index. If we denote by $x_i[t]$ a small perturbation on variable $X_i[t]$, then a first-order Taylor approximation for the induced perturbation $y[t]$ on $Y[t]$ is given by equation 23.19.

$$y[t] \approx x_1[t] \left.\frac{\partial f}{\partial X_1}\right|_t + \ldots + x_n[t] \left.\frac{\partial f}{\partial X_n}\right|_t \tag{23.19}$$

Note that this approximation is linear in each $x_i$ but that the coefficients may vary with time index $t$ because, in general, $\partial f/\partial X_1$ is a function of $X_1, X_2, \ldots, X_n$. Thus, by applying such an approximation, we have produced a linear time-varying small-signal model for a nonlinear time-invariant component. Such an analysis is readily extended to a time-varying component by expressing $Y[t] = f(t, X_1[t], X_2[t], \ldots, X_n[t])$.
The linearity of the resulting model allows us to predict the error at system outputs due to any linear scaling of a small perturbation of signal j ∈ S analytically, given the simulation-obtained error from a single such perturbation instance at j, which can be obtained by a single simulation run. Thus, this method can be considered to be a hybrid analytic/simulation error analysis [15].
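The two properties used here, that the first-order model tracks the true perturbation closely and that its response scales linearly, can be seen in a small Python sketch for a two-input multiplier $c = ab$. The signal and perturbation sequences, the seed, and the function names are illustrative assumptions; this is not the implementation used by any of the cited tools.

```python
import random


def multiplier_small_signal(a, b, x_a, x_b):
    """First-order Taylor model of c = a*b: y ~= x_a * dc/da + x_b * dc/db.

    For a multiplier, dc/da = b and dc/db = a, evaluated at each time index.
    """
    return [xa * bt + xb * at for at, bt, xa, xb in zip(a, b, x_a, x_b)]


def variance(seq):
    """Population variance of a sequence."""
    mean = sum(seq) / len(seq)
    return sum((v - mean) ** 2 for v in seq) / len(seq)


rng = random.Random(0)  # fixed seed so the run is repeatable
length = 2000
a = [rng.uniform(-1.0, 1.0) for _ in range(length)]
b = [rng.uniform(-1.0, 1.0) for _ in range(length)]
# Assumed small, nonpositive perturbations, loosely mimicking truncation error.
x_a = [rng.uniform(-1e-3, 0.0) for _ in range(length)]
x_b = [rng.uniform(-1e-3, 0.0) for _ in range(length)]

y = multiplier_small_signal(a, b, x_a, x_b)
exact = [(at + xa) * (bt + xb) - at * bt for at, bt, xa, xb in zip(a, b, x_a, x_b)]
model_error = max(abs(u - v) for u, v in zip(y, exact))
print("largest deviation from the exact perturbation:", model_error)

# Scaling the injected perturbation variance by alpha scales the output
# variance of the linear model by the same factor.
alpha = 4.0
scale = alpha ** 0.5
y_scaled = multiplier_small_signal(a, b, [scale * v for v in x_a], [scale * v for v in x_b])
variance_ratio = variance(y_scaled) / variance(y)
print("output variance ratio:", variance_ratio)
```

The neglected term is the product of the two perturbations, which is second order in their size; this is the sense in which the quantization errors must be small for the linearization to be valid.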
FIGURE 23.4 A local graph transformation to insert derivative monitors: (a) multiplier node; (b) with derivative monitors.
Derivative monitors

To construct the small-signal model, we must first evaluate the differential coefficients of the Taylor series model for nonlinear components. In general, methods must be introduced to calculate the differential of each nonlinear node type. This is performed by applying a graph transformation to the dataflow graph, introducing the necessary extra nodes and outputs to do this calculation. The general multiplier is the only nonlinear component considered explicitly in this section, although the approach is general; the graph transformation for multipliers is illustrated in Figure 23.4. Since $f(X_1, X_2) = X_1 X_2$, $\partial f/\partial X_1 = X_2$ and $\partial f/\partial X_2 = X_1$. After insertion of the monitors (dc_da and dc_db, which capture the derivatives of c with respect to a and b, respectively), a simulation may be performed to write the derivatives to appropriate data files to be used by the linearization process, which is described next.

Linearization

Our aim is to construct a small-signal model, which can be simulated to determine the sensitivity to rounding errors. Once we have obtained the derivative monitors, the construction of the small-signal model may proceed, again through graph transformation. All linear components (adder, constant coefficient multiplier, fork, delay, primary input, primary output) remain unchanged as a result of the linearization process. Each nonlinear component is replaced by its first-order Taylor model. Additional primary inputs are added to the dataflow graph to read the Taylor coefficients from the derivative monitor files created by the previous large-signal simulation. As an example, the Taylor expansion transformation for the multiplier node is illustrated in Figure 23.5. The inputs dc_da and dc_db are themselves time-varying sequences, derived from the previous step of the procedure.
Note that the graph portion of Figure 23.5(b) still contains multiplier “nonlinear” components, although one input of each multiplier node is now external to the model. This absence of feedback ensures linearity, although not time invariance.

FIGURE 23.5 A local graph transformation to produce a small-signal model: (a) multiplier node; (b) first-order Taylor model.

FIGURE 23.6 A local graph transformation to inject perturbations: (a) original signal; (b) with noise injection.

Noise injection

In Section 23.3.1, $L_2$ scaling was used to analytically estimate the noise variance at a system output through scaling of the (analytically derived) noise variance injected at each point of quantization. Such a purely analytic technique can be used only for LTI systems. In this section we discuss an extension of the approach for nonlinear systems. Because the small-signal model is linear, if an output exhibits variance $V$ when excited by an error of variance $\sigma^2$ injected into a given signal, then the output will exhibit variance $\alpha V$ when excited by a signal of variance $\alpha\sigma^2$ injected into the same signal ($\alpha \ge 0$). Herein lies the strength of the proposed linearization procedure: If the output response to a noise of known variance can be determined once only through simulation, this response can be scaled with analytically derived coefficients in order to estimate the response to any rounding or truncation scheme. Thus, the next step of the procedure is to transform the graph through the introduction of an additional adder node, and associated signals, and then simulate the graph with a known noise. In our case, to simulate truncation of a 2’s complement signal, the noise is independent and identically distributed with a uniform distribution over the range $[-2\sqrt{3}, 0]$, chosen to have unit variance ($\frac{1}{12}(2\sqrt{3})^2 = 1$), in this way making the measured output response an unscaled “sensitivity” measure. The graph transformation of inserting a noise injection is shown in Figure 23.6. One of these transformations is applied to a distinct copy of the linearized graph for each signal in the dataflow graph,
after which zeros are propagated from the original primary inputs, to finalize the small-signal model. This is a special case of constant propagation [2] that leads to significantly faster simulation results for nontrivial dataflow graphs.

The entire process is illustrated for a simple dataflow graph in Figure 23.7. The original graph is shown in (a). The perturbation analysis will be performed for the signals marked (∗) and (∗∗). After inserting derivative monitors for nonlinear components, the transformed DFG is shown in (b). The linearized DFG is shown in (c), and its two variants for the signals (∗) and (∗∗) are illustrated in (d) and (e), respectively. Finally, the corresponding simplified DFGs after zero propagation are shown in (f) and (g), respectively.

FIGURE 23.7 An example of perturbation analysis: (a) original dataflow graph; (b) transformed dataflow graph; (c) linearized dataflow graph; (d) variant for (∗) signal; (e) variant for (∗∗) signal; (f) simplified graph for (∗) signal; (g) simplified graph for (∗∗) signal.

High-level area models

To implement a multiple-wordlength system, component libraries must be available to support multiple-wordlength arithmetic. These libraries can then be instantiated by the synthesis system and must be modeled in terms of area consumption to provide the wordlength optimization procedure with a cost metric. Integer arithmetic libraries are available from FPGA vendors (e.g., Xilinx Coregen or Altera LPM macros). Parameterizable macros for standard arithmetic functions operating on integer arithmetic form the basis of the multiple-wordlength libraries synthesized to by wordlength optimization tools such as Right-Size [15] and Synoptix [8]. Blocks from each of these vendors may have slightly different cost parameters, but the general approach described in this section is applicable across all of them. Example external interfaces of multiple-wordlength library blocks for constant coefficient multipliers (gain) and adders (add) written in VHDL are shown in Listing 23.1 [23].

Listing 23.1
Constant coefficient multipliers (gain) and adders (add) written in VHDL.

ENTITY gain IS
  GENERIC( INWIDTH, OUTWIDTH, NULLMSBS, COEFWIDTH : INTEGER;
           COEF : std_logic_vector( COEFWIDTH downto 0 ) );
  PORT( data   : IN  std_logic_vector( INWIDTH downto 0 );
        result : OUT std_logic_vector( OUTWIDTH downto 0 ) );
END gain;

ENTITY add IS
  GENERIC( AWIDTH, BWIDTH, BSHL, OUTWIDTH, NULLMSBS : INTEGER );
  PORT( dataa  : IN  std_logic_vector( AWIDTH downto 0 );
        datab  : IN  std_logic_vector( BWIDTH downto 0 );
        result : OUT std_logic_vector( OUTWIDTH downto 0 ) );
END add;
As well as an individually parameterizable wordlength for each input and output port, each library block has a NULLMSBS parameter that indicates how many most significant bits (MSBs) of the operation result are to be ignored (the converse of sign extension). Thus, each operation result can be considered to be made up of zero or more MSBs that are ignored, followed by one or more data bits, followed by zero or more LSBs that may be truncated depending on the OUTWIDTH parameter. For the adder library block, there is an additional BSHL generic that accounts for the alignment necessary for addition operands. BSHL represents the number of bits by which the datab input must be conceptually shifted left to align it with the dataa input. Note that, because this is fixed-point arithmetic, there is no physical shifting involved; the data is simply aligned in a
skewed manner, as shown in Figure 23.8. Note, too, that dataa and datab are permuted as necessary to ensure that BSHL is always nonnegative. In the figure, (a) shows that the MSB of input b protrudes beyond that of input a and that all the output bits are drawn from the core integer addition of the overlap. Figure 23.8(b) shows that the MSB of input a protrudes beyond that of input b and that all output bits are drawn from the core integer addition of the overlap. Figure 23.8(c) shows that the MSB of input b protrudes beyond that of input a but that some of the output bits are drawn from the LSB overhang of input a and are thus produced “free.” Figure 23.8(d) shows that the MSB of input a protrudes beyond that of input b but that some of the output bits are drawn from the LSB overhang of input a and are thus produced “free.” In each case, the upper result shows the “error-free” wordlength $n_o^q$ without further truncation, whereas the lower result shows the wordlength $n_o$ after potential further truncation.

FIGURE 23.8 Four multiple-wordlength adder formats arising in practice: (a) MSB of input b protruding beyond MSB of input a; (b) MSB of input a protruding beyond MSB of input b; (c) MSB of input b protruding beyond MSB of input a, with “free” output bits; (d) MSB of input a protruding beyond MSB of input b, with “free” output bits. (s denotes the value of the BSHL generic; m denotes the value of the NULLMSBS generic.)

Each of the library block parameters has an impact on the area resources consumed by the overall system implementation. It is generally assumed when constructing a cost model that each operator in the dataflow graph will map to a separate hardware resource and that the area cost of wiring is negligible [17]. These assumptions (relaxed by Constantinides et al. [12]) simplify the construction of an area cost model. It is sufficient to estimate separately the area consumed by each computation node and then sum the resulting estimates. In reality, of course, logic synthesis, performed after wordlength optimization, is likely to result in some logic optimization between the boundaries of two connected library elements. This may result in lower area than estimated, but experience shows that these deviations from the area model are small.

The area model for a multiple-wordlength adder is reasonably straightforward. A ripple-carry architecture is used [21] since FPGAs provide good support for fast ripple-carry implementations. The only area-consuming component is the core (integer) adder constructed from the vendor library. This adder has a width of max(AWIDTH – BSHL, BWIDTH) – NULLMSBS + 2 bits. Depending on the FPGA architecture in question, each bit may not consume the same area, however: some bits are required for the result port, whereas others may be needed only for carry propagation, so their sum outputs remain unconnected and the sum circuitry is optimized away by logic synthesis. The cost model thus has two parameters $k_1$ and $k_2$, corresponding to the area cost of a sum-and-carry full adder and to the area cost of a carry-only full adder, respectively.
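The area of an adder is expressed in equation 23.20.

$$A_{\text{add}}(\text{AWIDTH}, \text{BWIDTH}, \text{BSHL}, \text{NULLMSBS}, \text{OUTWIDTH}) = k_1(\text{OUTWIDTH} + 1) + k_2\left(\max(\text{AWIDTH} - \text{BSHL}, \text{BWIDTH}) - \text{NULLMSBS} - \text{OUTWIDTH} + 1\right) \tag{23.20}$$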
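Equation 23.20 translates directly into code. In the Python sketch below the function names are hypothetical and the values used for $k_1$ and $k_2$ are placeholders chosen only for illustration; real values depend on the target device and would have to be measured.

```python
def core_adder_width(awidth, bwidth, bshl, nullmsbs):
    """Width in bits of the core integer adder, as stated in the text."""
    return max(awidth - bshl, bwidth) - nullmsbs + 2


def adder_area(awidth, bwidth, bshl, nullmsbs, outwidth, k1, k2):
    """Area of a multiple-wordlength adder (equation 23.20).

    k1 is the cost of a sum-and-carry full adder and k2 the cost of a
    carry-only full adder.
    """
    sum_and_carry_bits = outwidth + 1
    carry_only_bits = max(awidth - bshl, bwidth) - nullmsbs - outwidth + 1
    return k1 * sum_and_carry_bits + k2 * carry_only_bits


# Placeholder cost parameters (assumed, not taken from the chapter).
K1, K2 = 1.0, 0.5
print(core_adder_width(awidth=12, bwidth=8, bshl=2, nullmsbs=1))
print(adder_area(awidth=12, bwidth=8, bshl=2, nullmsbs=1, outwidth=6, k1=K1, k2=K2))
```

The two bit counts inside `adder_area` add up to the core adder width, which is a quick consistency check between equation 23.20 and the width expression given in the preceding paragraph.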
Area estimation for general multipliers can proceed in a similarly straightforward way. However, the equivalent problem for constant coefficient multipliers is significantly more problematic. A constant coefficient multiplier is typically implemented as a series of additions through a recoding scheme such as the classic Booth technique [3]. This implementation style causes the area consumption to be highly dependent on the coefficient value. In addition, the exact implementation scheme used by the vendor integer arithmetic libraries is known only to the vendor. A simple area model has been proposed (equation 23.21) and the coefficient values k3 and k4 have been determined through the synthesis of several hundred multipliers of different coefficient values and widths [12]. The model has then been fitted to this data using a least-squares approach. Note that the model does not account for NULLMSBS because, for a properly scaled coefficient,
NULLMSBS ≤ 1 for a constant coefficient multiplier and therefore has little impact on area consumption.

$$A_{\text{gain}}(\text{INWIDTH}, \text{OUTWIDTH}, \text{COEFWIDTH}) = k_3\, \text{COEFWIDTH}\,(\text{INWIDTH} + 1) + k_4\,(\text{INWIDTH} + \text{COEFWIDTH} - \text{OUTWIDTH}) \tag{23.21}$$

More detailed area models for components are discussed by Chang and Hauck [14].
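The fitting step described for equation 23.21 can be sketched as an ordinary two-parameter least-squares problem. The Python code below is an illustration only: the sample widths are invented, and the "areas" are generated from assumed coefficient values rather than from real synthesis runs, so the fit simply recovers the values that were put in.

```python
def gain_area(inwidth, outwidth, coefwidth, k3, k4):
    """Area of a constant coefficient multiplier (equation 23.21)."""
    return k3 * coefwidth * (inwidth + 1) + k4 * (inwidth + coefwidth - outwidth)


def fit_gain_model(samples):
    """Least-squares estimate of (k3, k4) via the 2x2 normal equations.

    samples: iterable of (inwidth, outwidth, coefwidth, measured_area).
    """
    suu = suv = svv = sua = sva = 0.0
    for inwidth, outwidth, coefwidth, area in samples:
        u = coefwidth * (inwidth + 1)           # regressor multiplying k3
        v = inwidth + coefwidth - outwidth      # regressor multiplying k4
        suu += u * u
        suv += u * v
        svv += v * v
        sua += u * area
        sva += v * area
    det = suu * svv - suv * suv
    if det == 0.0:
        raise ValueError("samples do not determine k3 and k4 uniquely")
    k3 = (sua * svv - sva * suv) / det
    k4 = (sva * suu - sua * suv) / det
    return k3, k4


# Synthetic data: ASSUMED_K3 and ASSUMED_K4 are made-up values, not
# coefficients reported in the literature.
ASSUMED_K3, ASSUMED_K4 = 0.6, 0.85
widths = [(8, 8, 4), (12, 10, 6), (16, 16, 8), (10, 12, 5)]
samples = [(i, o, c, gain_area(i, o, c, ASSUMED_K3, ASSUMED_K4)) for i, o, c in widths]
print(fit_gain_model(samples))
```

With measured areas from several hundred synthesized multipliers in place of the synthetic samples, the same routine would yield fitted coefficients together with a residual that indicates how well the simple model explains the data.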
23.3.2 Search Techniques
A heuristic search procedure

Because the wordlength optimization problem is NP-hard [16], several heuristic approaches have been developed to find feasible wordlength vectors having small, though not necessarily optimal, area consumption. An example heuristic is shown in Listing 23.2. After performing binary-point estimation using the techniques of Section 23.2, the algorithm determines the minimum uniform wordlength satisfying all error constraints. The design at this stage corresponds to a standard uniform wordlength design with implicit power-of-two scaling, such as may be used for an optimized uniprocessor-based implementation. Each wordlength is then scaled up by a factor k > 1, which represents a bound on the largest value that any wordlength in the final design may reach (in the Synoptix implementation of this algorithm [8], k = 2 has been used). The resulting structure forms a starting point from which one signal wordlength is reduced by one bit on each iteration. The signal wordlength to reduce is decided in each iteration by reducing each wordlength in turn until it violates an output noise constraint (Listing 23.2). At this point there is likely to have been some pay-off in reduced area, and the signal whose wordlength reduction provided the largest pay-off is chosen. Each signal’s wordlength is explored using a binary search.

Listing 23.2
Algorithm wordlength falling.

Input: A dataflow graph G(V, S) and binary-point vector p.
Output: An optimized wordlength vector n.
begin
  Let the elements of S be denoted as S = { j_1, j_2, ..., j_|S| }
  Determine u, the minimum uniform wordlength satisfying error criteria
  Set n ← k·u·1 (where 1 denotes the all-ones vector)
  do
    currentcost ← AREA(n)
    bestmin ← currentcost
    foreach j_i ∈ S do
      Set w to the smallest positive value where the error criteria are
        satisfied for wordlength [n_1 ... n_(i-1) w n_(i+1) ... n_|S|]
      Set minval ← AREA([n_1 ... n_(i-1) w n_(i+1) ... n_|S|])
      if minval < bestmin, set bestsig ← i and bestmin ← minval
    end foreach
    if bestmin < currentcost
      n_bestsig ← n_bestsig − 1
  while bestmin < currentcost
end
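Listing 23.2 can be turned into a short executable sketch. The Python version below follows the listing's structure but makes several assumptions that are not part of the original: the area and error checks are passed in as callables, the per-signal search is a binary search that relies on feasibility being monotonic in each wordlength, and the toy error and area models at the end are invented purely to exercise the code.

```python
def smallest_feasible(n, i, error_ok):
    """Smallest w in [1, n[i]] for which the design stays feasible.

    Binary search; assumes feasibility is monotonic in n[i] and that the
    current vector n is itself feasible.
    """
    lo, hi = 1, n[i]
    while lo < hi:
        mid = (lo + hi) // 2
        if error_ok(n[:i] + [mid] + n[i + 1:]):
            hi = mid
        else:
            lo = mid + 1
    return lo


def wordlength_falling(num_signals, area, error_ok, k=2, max_uniform=64):
    """Sketch of the wordlength-falling heuristic of Listing 23.2."""
    u = None
    for w in range(1, max_uniform + 1):
        if error_ok([w] * num_signals):
            u = w                      # minimum feasible uniform wordlength
            break
    if u is None:
        raise ValueError("no feasible uniform wordlength up to max_uniform")
    n = [k * u] * num_signals          # scaled-up starting point
    while True:
        current_cost = area(n)
        best_min, best_sig = current_cost, None
        for i in range(num_signals):
            w = smallest_feasible(n, i, error_ok)
            min_val = area(n[:i] + [w] + n[i + 1:])
            if min_val < best_min:
                best_sig, best_min = i, min_val
        if best_sig is None:
            return n                   # no signal offers any further pay-off
        n[best_sig] -= 1               # reduce the most promising signal by one bit


# Toy models (assumed): each signal contributes noise gain * 4**(-wordlength),
# and area is simply the sum of the wordlengths.
TOY_GAINS = [1.0, 4.0, 0.25]


def toy_error_ok(n):
    return sum(g * 4.0 ** (-w) for g, w in zip(TOY_GAINS, n)) <= 1e-4


def toy_area(n):
    return sum(n)


print(wordlength_falling(len(TOY_GAINS), toy_area, toy_error_ok))
```

Each pass of the outer loop removes exactly one bit, so the number of passes is bounded by the total number of bits in the starting vector; the cost of the procedure is dominated by the error evaluations inside the binary searches.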
Alternative search procedures

The algorithm of Listing 23.2 is a heuristic; it does not guarantee to produce the optimum area cost for a given set of error constraints. A technique to discover the true optimum-wordlength vectors has also been proposed [10] that uses integer linear programming (ILP) to model the constraint space and objective functions. This technique was able to demonstrate that the heuristic of Listing 23.2 provides good-quality results for the small benchmark problems addressed by both approaches. Like all NP-hard problems [16], however, finding the optimum solution becomes computationally infeasible for large problem sizes. The methodology of Constantinides et al. [10] is applicable only for very small practical problems and is thus of more theoretical than practical interest.

Several other heuristic search procedures have been proposed in the literature, and we will review some of the more interesting ones (further comparisons are made in the brief survey by Cantin et al. [6]). An approach used by Kum and Sung [27] is based on the intuition that the error observable at a system output reduces monotonically with each wordlength in that system. This is a plausible conjecture, but is not always the case. Indeed, it was shown independently by Constantinides [9] and Lehtinen and Renfors [31] that this conjecture may be violated in practical situations. Nevertheless, if we accept it for the moment, a natural search procedure becomes apparent. We may divide the search into two phases. In the first phase, the system is simulated with all but one variable having a very large precision (e.g., double precision floating point). In this way, we can find the point at which the output constraints are violated because of quantization on this variable alone. Repeating this for all variables provides, under the conjecture, a lower bound on each element of the wordlength vector.
The second phase of the algorithm is invoked if the constraints are violated when these lower bounds are used as the wordlength vector. In this case, the precision of all variables is increased by an equal number of bits until the constraints are satisfied. A variation on the second phase is to exhaustively explore all possibilities above this lower bound, until the constraints are satisfied [27].

The common meta-heuristics of simulated annealing and genetic algorithms have been used for this problem, for example by Chang and Hauck [14], using a linear combination of area and error as an objective function [28, 40]. While there are practical advantages to using tried-and-tested meta-heuristics for combinatorial problems, the smooth nature of the constraints and objectives, as outlined previously, means that it is likely that better results can be obtained within a fixed computation time budget by using application-specific heuristic techniques.
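The two-phase procedure attributed to Kum and Sung above can be sketched in a few lines of Python. This is a loose illustration under stated assumptions: a large integer wordlength (`high`) stands in for "very large precision", the error check is an abstract callable, and the toy error model is invented for the example.

```python
def two_phase_search(num_signals, error_ok, high=53, max_extra=64):
    """Sketch of a two-phase wordlength search.

    Phase 1 finds, for each signal alone, the smallest wordlength that meets
    the constraints while every other signal is kept at `high` bits.
    Phase 2 adds the same number of bits to every signal until the
    constraints are met jointly.
    """
    lower = []
    for i in range(num_signals):
        n = [high] * num_signals
        for w in range(1, high + 1):
            n[i] = w
            if error_ok(n):
                lower.append(w)
                break
        else:
            raise ValueError("constraints cannot be met even at high precision")
    for extra in range(max_extra + 1):
        candidate = [w + extra for w in lower]
        if error_ok(candidate):
            return lower, candidate
    raise ValueError("no feasible vector found within max_extra bits")


# Toy error model (assumed): each signal contributes gain * 4**(-wordlength).
TOY_GAINS = [1.0, 4.0, 0.25]


def toy_error_ok(n):
    return sum(g * 4.0 ** (-w) for g, w in zip(TOY_GAINS, n)) <= 1e-4


lower_bounds, result = two_phase_search(len(TOY_GAINS), toy_error_ok)
print("per-signal lower bounds:", lower_bounds)
print("final wordlengths:", result)
```

The linear scan in phase 1 stops at the first feasible wordlength, so it only yields a true lower bound when the monotonicity conjecture holds; as the text notes, that conjecture can fail in practical situations.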
23.4 SUMMARY

This chapter introduced the fundamental problems of designing optimized fixed-point arithmetic circuits in custom hardware, including FPGA devices. The fixed-point number system is of widespread interest in the FPGA community because of the highly efficient arithmetic implementations possible when compared to what can be achieved with floating-point arithmetic. However, much more than with floating point, working with fixed point requires designers to have a good grasp of the numerical robustness issues involved with their designs. Performing such design by hand is tedious and error prone, which has motivated the development of automatic procedures, some of which have been described in this chapter.

The freedom in custom hardware to use multiple wordlengths in a design creates the possibility of shaping the circuit datapath to the requirements of the algorithm, leading to low-area, high-speed, and low-power implementations. This emerging paradigm throws up a new challenge, however: wordlength optimization. This chapter demonstrated that wordlength determination can be considered as a constrained optimization, and suitable models were presented for FPGA-based bit-parallel implementations, together with signal-to-noise ratio of linear time-invariant and differentiable nonlinear time-varying systems. In each case, we described at least one error estimation procedure in depth and discussed related procedures and their advantages and disadvantages.

We will now consider some fruitful avenues for further research in this field, broken down into MSB-side optimization, error modeling, and search procedures. The work discussed in Section 23.2 either avoids overflow completely (e.g., $\ell_1$-scaling) or reduces the probability of overflow to an arbitrary level (e.g., extreme value theory) without considering the effect of overflow on signal-to-noise ratio or other accuracy metrics.
In algorithms where the worst-case variable range is much larger than the average-case range, it may make sense to save area by allowing rare overflow and its consequent reduction in arithmetic accuracy. This problem was discussed by Constantinides et al. [11] using a simple model of the error induced by overflow, based on approximating all signals by Gaussian random variables. The results achieved were weakened, however, by an inability of the proposed method to accurately estimate the correlations between overflow errors at different points within the algorithm. Further work could provide much stronger bounds. The analytical error-modeling approaches discussed in Section 23.3.1 can adequately deal with linear time-invariant systems or with time-varying systems containing only differentiable nonlinearities. This still leaves open the problem of adequately modeling systems containing nondifferentiable nonlinearities. This is a serious omission, as it includes any algorithm containing conditionally executed statements, where the condition is a logical expression containing variables generated by the algorithm itself (in the case where the variables
are external inputs, this can be viewed as a time-varying differentiable system). Further work incorporating the results from the analysis of nonlinear dynamical systems is likely to shed new light here.

Both heuristic and optimal search procedures were discussed in Section 23.3.2. One of the limitations of the optimal approach from Constantinides et al. [10] is that it has relied on coercing inherently nonlinear constraints into a linear form, resulting in a large ILP problem. Branch-and-bound, or other combinatorial search procedures, on top of bounding procedures from the more general field of nonlinear mathematical programming may be able to provide optimal results for significantly larger problems. Further effort is also called for in the development of heuristic search procedures. None of the heuristics presented thus far can guarantee a bounded distance to optimality, although under certain error metrics the wordlength optimization problem is approximable in this sense. It would be useful to concentrate efforts on heuristics that do provide these guarantees.

It is my belief that, apart from being a practical design problem, the problem of wordlength optimization has much to offer in terms of understanding the numerical properties of algorithms. The earliest contributions to this subject can be traced back to two giants of computing, Alan Turing [39] and John von Neumann [33]. At the time, IEEE standard floating point was nonexistent, and it was necessary to carefully design the architecture around the algorithm. FPGA-based computing has reopened this method of design by giving an unprecedented degree of freedom in the implementation of numerical algorithms.
References [1] C. Alippi. Randomized algorithms: A system-level poly-time analysis of robust computation. IEEE Transactions on Computers 51(7), 2002. [2] A. V. Aho, R. Sethi, J. D. Ullman. Compilers: Principles, Techniques and Tools, Addison-Wesley, 1986. [3] A. D. Booth. A signed binary multiplication technique. Quarterly Journal Mechanical Applications of Mathematics 4(2), 1951. [4] A. Benedetti, P. Perona. Bit-width optimization for configurable DSPs by multiinterval analysis. Proceedings of the 34th Asilomar Conference on Signals, Systems and Computers, 2000. [5] M.-A. Cantin, Y. Savaria, P. Lavoie. An automatic word length determination method. Proceedings of the IEEE International Symposium on Circuits and Systems, 2001. [6] M.-A. Cantin, Y. Savaria, P. Lavoie. A comparison of automatic word length optimization procedures. Proceedings of the IEEE International Symposium on Circuits and Systems, 2002. [7] G. A. Constantinides, P. Y. K. Cheung, W. Luk. Truncation noise in fixed-point SFGs. IEE Electronics Letters 35(23), November 1999. [8] G. A. Constantinides, P. Y. K. Cheung, W. Luk. The multiple wordlength paradigm. Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines, April–May 2001.
[9] G. A. Constantinides. High-level Synthesis and Wordlength Optimization for Digital Signal Processing Systems, Ph.D. thesis, University of London, 2001. [10] G. A. Constantinides, P. Y. K. Cheung, W. Luk. Optimum wordlength allocation. Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines, April 2002. [11] G. A. Constantinides, P. Y. K. Cheung, W. Luk. Synthesis of saturation arithmetic architectures. ACM Transactions on Design Automation of Electronic Systems 8(3), 2003. [12] G. A. Constantinides, P. Y. K. Cheung, W. Luk. Synthesis and Optimization of DSP Algorithms, Kluwer Academic, 2004. [13] M. Chang, S. Hauck. Precis: A design-time precision analysis tool. Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines, 2002. [14] M. Chang, S. Hauck. Automated least-significant bit datapath optimization for FPGAs. Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines, 2004. [15] G. A. Constantinides. Wordlength optimization for differentiable nonlinear systems. ACM Transactions on Design Automation for Electronic Systems, January 2006. [16] G. A. Constantinides, G. J. Woeginger. The complexity of multiple wordlength assignment. Applied Mathematics Letters 15, 2002. [17] G. DeMicheli. Synthesis and Optimization of Digital Circuits, McGraw-Hill, 1994. [18] P. D. Fiore. Lazy rounding. Proceedings of the IEEE Workshop on Signal Processing Systems, 1998. [19] C. Fang, T. Chen, R. Rutenbar. Floating-point error analysis based on affine arithmetic. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. [20] J. Hwang, B. Milne, N. Shirazi, J. Stroomer. System level tools for DSP in FPGAs. In R. Woods and G. Brebner, eds., Processing Field Programmable Logic, Springer-Verlag, 2001. [21] K. Hwang. Computer Arithmetic: Principles, Architecture and Design, Wiley, 1979.
[22] IEEE Standard for Binary Floating-point Arithmetic (ANSI/IEEE Standard 991), 1986. [23] IEEE Standard for VHDL Register Transfer Level (RTL) Synthesis (IEEE Standard 1076.6), 1999. [24] C. Inacio, D. Ombres. The DSP decision: Fixed point or floating? IEEE Spectrum 33(9), September 1996. [25] S. Kim, K. Kum, W. Sung. Fixed-point optimization utility for C and C++ based digital signal processing programs. IEEE Transactions on Circuits and Systems II 45(11), November 1998. [26] S. Kotz, S. Nadarajah. Extreme Value Distributions: Theory and Applications, Imperial College Press, 2000. [27] K.-I. Kum, W. Sung. Combined wordlength optimization and high-level synthesis of digital signal processing systems. IEEE Transactions on Computer-Aided Design 20(8), August 2001. [28] D.-U. Lee, A. Gaffar, R. Cheung, O. Mencer, W. Luk, G. A. Constantinides. Accuracy guaranteed bit-width optimization. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2006. [29] B. Liu. Effect of finite word length on the accuracy of digital filters—A review. IEEE Transactions on Circuit Theory 18(6), 1971. [30] H. Levy, D. W. Low. A contraction algorithm for finding small cycle cutsets. Journal of Algorithms 9, 1988.
[31] V. Lehtinen, M. Renfors. Truncation noise analysis of noise shaping DSP systems with application to CIC decimators. Proceedings of the European Signal Processing Conference, 2002. [32] S. K. Mitra. Digital Signal Processing, McGraw-Hill, 1998. [33] J. von Neumann, H. H. Goldstine. Numerical inverting of matrices of high order. Bulletin of the American Mathematical Society 53, 1947. [34] E. Özer, A. Nisbet, D. Gregg. Stochastic bit-width approximation using extreme value theory for customizable processors. Proceedings of the International Conference on Compiler Construction, 2004. [35] A. V. Oppenheim, R. W. Schafer. Digital Signal Processing, Prentice-Hall, 1975. [36] A. V. Oppenheim, C. J. Weinstein. Effects of finite register length in digital filtering and the fast Fourier transform. IEEE Proceedings 60(8), 1972. [37] W. Sung, K. Kum. Simulation-based wordlength optimization method for fixed-point digital signal processing systems. IEEE Transactions on Signal Processing 43(12), December 1995. [38] A. S. Sedra, K. C. Smith. Microelectronic Circuits, Saunders, 1991. [39] A. Turing. Rounding-off errors in matrix processes. Quarterly Journal of Mechanics 1, 1948. [40] S. A. Wadekar, A. C. Parker. Accuracy sensitive wordlength selection for algorithm optimization. Proceedings of the International Conference on Computer Design, October 1998.
CHAPTER 24
DISTRIBUTED ARITHMETIC
Rajeevan Amirtharajah
Department of Electrical and Computer Engineering, University of California–Davis
Distributed arithmetic (DA) [1, 2] is a computation algorithm that performs multiplication using precomputed lookup tables (LUTs) instead of logic. It is well suited to implementation on homogeneous field-programmable gate arrays (FPGAs) because of its high utilization of the available LUTs. It may also have advantages for modern heterogeneous FPGAs that contain built-in multipliers because it is area efficient for implementing long digital filters. DA targets the sum-of-products (or vector dot product) operation, and many digital signal processing (DSP) tasks such as filter implementation, matrix multiplication, and frequency transformation can be reduced to one or more sum-of-products computations.
24.1 THEORY

The theory behind DA is based on reorganizing the vector dot product operation around the binary representation of the vector elements [2]. Suppose that X is the vector of input samples and A is a constant vector of filter coefficients, corresponding to the taps of a finite impulse response (FIR) filter. Vectors X and A each consist of M elements $X_k$ and $A_k$. The dot product y of X and A (corresponding to the convolution of X with the FIR impulse response) can be written as

$$y = \sum_{k=0}^{M-1} A_k X_k \tag{24.1}$$

We can represent each element of the input sample vector X in N-bit 2’s complement notation. Then equation 24.1 can be expressed as

$$y = \sum_{k=0}^{M-1} A_k \left[ -b_{k(N-1)} 2^{N-1} + \sum_{n=0}^{N-2} b_{kn} 2^n \right] \tag{24.2}$$

where $b_{k(N-1)}$ is the sign bit of the input sample $X_k$ in N-bit 2’s complement notation, and $b_{kn}$ is the nth bit of input sample $X_k$. The possible values of $b_{kn}$ are either 0 or 1. Equation 24.2 can be further rearranged into equation 24.3 by multiplying out the factors and changing the order of the summation:

$$y = -\sum_{k=0}^{M-1} A_k b_{k(N-1)} 2^{N-1} + \sum_{n=0}^{N-2} \left[ \sum_{k=0}^{M-1} A_k b_{kn} \right] 2^n = Z_{sign} + Z_{n1} \tag{24.3}$$

Consider each term in the brackets of the second summation in equation 24.3, labeled $Z_{n0}$ in the following:

$$Z_{n0} = \sum_{k=0}^{M-1} A_k b_{kn} \tag{24.4}$$
where term $Z_{n0}$ has $2^M$ possible values because $b_{kn}$ is either 1 or 0. Therefore, each summation term $A_k b_{kn}$ can have the value of either $A_k$ or 0. Instead of using a multiplier to compute any of these $2^M$ possible values whenever necessary, we can precompute them and store them in a LUT with depth $2^M$. The contents of the LUT are then addressed directly by the bit-serial input data, $[b_{0n}, b_{1n}, b_{2n}, \ldots, b_{(M-1)n}]$, corresponding to the nth bits of each element $X_k$ of input vector X. Multiplication by the factor $2^n$ in equation 24.3 can be realized by a shifter, and the addressed LUT contents are shifted and accumulated to form term $Z_{n1}$ in (N − 1) cycles. The sign term $Z_{sign}$ can be handled in the same way with additional circuitry to implement subtraction; it takes one additional clock cycle. The final result y is formed after N cycles. Note that, if the filter length is greater than the bit width of the input data (i.e., M > N), DA computes the final result in fewer cycles than an implementation using a single multiply–accumulate functional unit. However, because the size of the LUT grows exponentially in the number of vector elements ($2^M$), most practical implementations use multiple LUTs and adders to combine partial dot products into the final result.
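The bit-serial procedure of equations 24.3 and 24.4 can be sketched in software. The following Python model is only an illustration under stated assumptions: the function names `build_lut` and `da_dot`, the integer coefficients, and the sample values are hypothetical and do not come from the chapter. It processes the bits LSB first and subtracts the LUT output in the sign-bit cycle.

```python
def build_lut(coeffs):
    """Precompute all 2**M sums of coefficients (equation 24.4).

    Address bit k selects coeffs[k]; this addressing is an assumption
    made for the sketch.
    """
    m = len(coeffs)
    return [sum(a for k, a in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << m)]


def da_dot(coeffs, samples, n_bits):
    """Bit-serial DA dot product of integer coeffs with N-bit 2's complement samples."""
    lut = build_lut(coeffs)
    mask = (1 << n_bits) - 1
    words = [x & mask for x in samples]       # raw N-bit 2's complement patterns
    acc = 0
    for n in range(n_bits):                   # one LUT access per bit position, LSB first
        addr = 0
        for k, w in enumerate(words):
            addr |= ((w >> n) & 1) << k       # gather bit n of every sample
        term = lut[addr] << n                 # Z_n0 scaled by 2**n
        if n == n_bits - 1:
            acc -= term                       # sign-bit cycle: subtract (Z_sign)
        else:
            acc += term
    return acc


A = [3, -5, 7, 2]        # hypothetical filter taps
X = [-8, 7, -1, 4]       # hypothetical 4-bit samples
print(da_dot(A, X, 4))   # -58, the same as sum(a * x for a, x in zip(A, X))
```

No multiplier appears in the loop: each cycle costs one table read, one shift, and one add or subtract, which is the property that makes DA attractive on LUT-based fabrics.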
24.2 DA IMPLEMENTATION

A simple DA implementation is shown in Figure 24.1. It requires a 16-bit shift register for the input vector, a 16-entry LUT, an adder/subtractor, and an accumulator (Result) for the output. The ×2 operation is handled purely by wiring. This unit is a direct implementation of the DA algorithm described in the preceding section, and it is capable of computing the dot product of a 4-element vector X and a constant 4-element vector A. In the figure the four 4-bit-wide elements of X are fed into the address decoder in most significant bit (MSB) first order to select the appropriate LUT row contents. The selected content is added to the left-shifted version of the previous Result value to form the current Result value. $T_s$ is the sign bit timing signal that controls the add/subtract operation; when $T_s$ is high, the current LUT content is subtracted from the left-shifted version of the previous result. The final vector dot product is obtained in four cycles. Shifting in the bit vector least significant bit (LSB) first also produces the correct final value and has the advantage of eliminating long carry propagations when accumulating the intermediate results. The only modifications to Figure 24.1 required for this alternative are to reverse the bits of vector X in the shift register and to replace the left shift by 1 bit with a right shift by 1 bit.

[Figure 24.1: A simple implementation of distributed arithmetic. The bits of X0 through X3 feed a 4×16 address decoder that selects one of 16 LUT rows: 0, A0, A1, A0+A1, A2, A2+A0, A2+A1, A2+A1+A0, A3, A3+A0, A3+A1, A3+A1+A0, A3+A2, A3+A2+A0, A3+A2+A1, A3+A2+A1+A0. The remaining blocks are the timing signal Ts, the ×2 shift, the adder/subtractor (+/−), and the Result register that produces y.]

Various other modifications to this structure are possible. For example, the input sample shift register can be serial in/serial out or parallel in/serial out depending on the application. LUT size can be a determining factor in the total hardware cost of a DA implementation. It is possible to modify the structure in Figure 24.1 to reduce the table size by a factor of 2. To achieve this reduction, consider a different representation of the input data samples $X_k$:

$$X_k = \frac{1}{2}\left[ X_k - (-X_k) \right] \tag{24.5}$$

The 2’s complement representation of the negative of $X_k$ can be expressed as

$$-X_k = -\bar{b}_{k(N-1)} 2^{N-1} + \sum_{n=0}^{N-2} \bar{b}_{kn} 2^n + 1 \tag{24.6}$$
where each bit of $X_k$ has been complemented (written $\bar{b}_{kn}$) and a 1 has been added to the complemented bits. Plugging equation 24.6 into equation 24.5 yields

$$X_k = \frac{1}{2}\left[ -\left(b_{k(N-1)} - \bar{b}_{k(N-1)}\right) 2^{N-1} + \sum_{n=0}^{N-2} \left(b_{kn} - \bar{b}_{kn}\right) 2^n - 1 \right] \tag{24.7}$$

Each difference term $b_{kn} - \bar{b}_{kn}$ (for n = 0 to N − 1) in equation 24.7 can take on values of +1 or −1. This alternate representation for $X_k$ is convenient because, in the resulting summation for the dot product, each linear combination of $A_k$ has a corresponding negative linear combination. Only one of these combinations needs to be stored in the LUT, with the negative being applied during operation using the subtractor. Substituting equation 24.7 into equation 24.1 and rearranging terms yields the following new expression for the result of the dot product y:

$$y = \sum_{n=0}^{N-1} Q(b_n) + Q(0) \tag{24.8}$$

where

$$Q(b_n) = \frac{1}{2} \sum_{k=0}^{M-1} A_k \left(b_{kn} - \bar{b}_{kn}\right) 2^n, \quad n \neq N-1 \tag{24.9a}$$

$$Q(b_{N-1}) = -\frac{1}{2} \sum_{k=0}^{M-1} A_k \left(b_{k(N-1)} - \bar{b}_{k(N-1)}\right) 2^{N-1}, \quad n = N-1 \tag{24.9b}$$

$$Q(0) = -\frac{1}{2} \sum_{k=0}^{M-1} A_k \tag{24.9c}$$
Note that the expressions for $Q(b_n)$ and $Q(b_{N-1})$ have $2^{M-1}$ possible magnitudes, with signs determined by the input bits, and that the computation of y requires an additional register to hold the constant term Q(0). This leads to the reduced DA memory implementation shown in Figure 24.2, where the exclusive-or (XOR) gates are required to recode the addresses to access the appropriate LUT row and to control the timing of the sign bit into the adder/subtractor. The XOR gates, the initial condition register for Q(0), and a 2-input multiplexer are the only additional hardware required to reduce the memory size by a factor of 2. The implementations in both Figures 24.1 and 24.2 require N clock cycles to compute the final result, although additional cycles may be needed to match the throughput of the DA unit to other functional units in the system for a particular application. In Section 24.3 we will discuss mapping these basic structures onto FPGA fabrics. We will address the issue of performance improvement (by reducing the number of required clock cycles and increasing the clock frequency) in Section 24.4.
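The reduced-memory scheme can also be checked numerically. The sketch below is a software model only, under assumptions of its own: the names `build_half_lut` and `da_dot_reduced` are hypothetical, bit $b_{0n}$ is used as the reference for the XOR recoding, and all quantities are kept doubled so that the factor of 1/2 in the definitions of Q is applied once at the end. The addressing and sign conventions of the actual hardware in Figure 24.2 may differ.

```python
def build_half_lut(coeffs):
    """Store A0 +/- A1 +/- ... for 2**(M-1) addresses.

    Address bit (k-1) set means coeffs[k] enters with a minus sign.
    """
    m = len(coeffs)
    lut = []
    for addr in range(1 << (m - 1)):
        total = coeffs[0]
        for k in range(1, m):
            total += -coeffs[k] if (addr >> (k - 1)) & 1 else coeffs[k]
        lut.append(total)
    return lut


def da_dot_reduced(coeffs, samples, n_bits):
    """DA dot product using a half-size LUT (doubled arithmetic, halved at the end)."""
    lut = build_half_lut(coeffs)
    mask = (1 << n_bits) - 1
    words = [x & mask for x in samples]
    acc2 = -sum(coeffs)                        # 2*Q(0): the initial condition register
    for n in range(n_bits):
        bits = [(w >> n) & 1 for w in words]
        addr = 0
        for k in range(1, len(bits)):
            addr |= (bits[k] ^ bits[0]) << (k - 1)     # XOR recoding of the address
        term = lut[addr] << n                  # stored combination scaled by 2**n
        # Sign comes from the reference bit and is flipped in the sign-bit cycle.
        negate = (bits[0] == 0) != (n == n_bits - 1)
        acc2 += -term if negate else term
    return acc2 // 2                           # acc2 equals 2*y, so this is exact


A = [3, -5, 7, 2]        # hypothetical filter taps
X = [-8, 7, -1, 4]       # hypothetical 4-bit samples
print(len(build_half_lut(A)), da_dot_reduced(A, X, 4))
```

For four taps the table has 8 entries instead of 16, and the result matches the direct dot product for every input in the 4-bit range.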
[Figure 24.2: Reduced DA memory implementation. The bits of X0 through X3 feed a 3×8 address decoder that selects one of eight LUT rows: −1/2(A0+A1+A2+A3), −1/2(A0+A1+A2−A3), −1/2(A0+A1−A2+A3), −1/2(A0+A1−A2−A3), −1/2(A0−A1+A2+A3), −1/2(A0−A1+A2−A3), −1/2(A0−A1−A2+A3), −1/2(A0−A1−A2−A3). The remaining blocks are the initial condition register Q0, the timing signal Ts, the ×2 shift, the adder/subtractor (+/−), and the Result register that produces y.]

24.3 MAPPING DA ONTO FPGAs

Consider mapping a 16-tap FIR filter (M = 16) operating on 16-bit data (N = 16) onto an FPGA fabric based on 4-input LUTs. As discussed earlier, DA’s primary drawback is that the size of the LUTs grows exponentially in the number of filter coefficients (or filter taps). If we want to use 16-bit data to represent the precomputed values, we need $16 \times 2^{16}$ = 1 Mbit of memory. To limit this growth, long filters can be partitioned into several smaller DA units whose outputs are then combined using a tree of 2-input adders, as shown in Figure 24.3. This partitions the 16 filter taps A0 to A15 among four DA units, each of which incorporates N 1-bit-wide 4-input LUTs. The partitioning is chosen to correspond to the LUT size of the individual logic elements or CLBs. If the filter taps are symmetric (which they often are for typical signal-processing applications), the memory size can be reduced by a further factor of 2 by summing the appropriate elements of the input vector $X_k$ using serial addition and using the bits of the resulting sum to address the LUTs. In addition to the serial adder hardware, this memory reduction comes at the expense of an additional clock cycle of latency before the final result is valid.

[Figure 24.3: A 16-tap FIR filter mapped onto multiple DA units. DA Unit 0 through DA Unit 3 hold taps A0 to A3, A4 to A7, A8 to A11, and A12 to A15; their outputs are combined by a tree of adders to form y.]

As CMOS technology has scaled and the complexity of individual CLBs has increased with succeeding FPGA generations, the hardware cost of implementing our example filter has shrunk dramatically. Based on an early implementation of an 8-tap, 8-bit filter using DA on a Xilinx 3042 FPGA [3], our example would consume approximately 120 CLBs, including control logic, even using the symmetry of the filter coefficients to reduce the memory requirements. This would consume roughly the entire FPGA chip. Resource usage would be dominated by the input shift registers (60 CLBs) since this older FPGA architecture only allowed the local CLB flip-flops to be used in a shift configuration.

In contrast, a recent FPGA architecture encompasses four logic “slices” in each CLB, where two slices each roughly correspond to an entire CLB in the older architecture [6]. Because LUTs in Xilinx Spartan-3E FPGAs can be configured as 16 × 1 shift registers, the number of CLB resources to implement the data memory for DA is drastically reduced. Each logic slice also contains carry propagation logic for efficient implementation of adder circuits, which can be used to increase the speed of DA computation, as will be shown later. Implementing the example filter on a Spartan-3E FPGA requires approximately 113 slices, corresponding to 29 CLBs. This is under 12 percent of the total number of slices available in the 3S100E, the smallest member of the Spartan-3E FPGA family. Further enhancements to the architecture building blocks may allow for more efficient DA implementation in the future. For example, the potential of heterogeneous or coarse-grained FPGAs to support DA more efficiently by incorporating small adders and accumulators directly in the CLB is currently being explored [7].
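The memory argument for partitioning can be made concrete with a short Python sketch. It assumes 16-bit LUT entries as in the text; the function names `lut_bits`, `da_dot`, and `partitioned_da` and the test data are hypothetical, and the model ignores the growth in word width that a real adder tree would need.

```python
def da_dot(coeffs, samples, n_bits):
    """Bit-serial DA dot product for one DA unit (LUT plus shift-accumulate)."""
    lut = [sum(a for k, a in enumerate(coeffs) if (addr >> k) & 1)
           for addr in range(1 << len(coeffs))]
    acc = 0
    for n in range(n_bits):
        addr = 0
        for k, x in enumerate(samples):
            addr |= ((x >> n) & 1) << k       # Python's >> is arithmetic: bit n in 2's complement
        term = lut[addr] << n
        acc += -term if n == n_bits - 1 else term   # subtract in the sign-bit cycle
    return acc


def lut_bits(taps, width):
    """Memory bits for one LUT addressed by `taps` inputs with `width`-bit entries."""
    return width * (1 << taps)


def partitioned_da(coeffs, samples, n_bits, group=4):
    """Split the taps into groups, run one DA unit per group, then sum pairwise."""
    partial = [da_dot(coeffs[i:i + group], samples[i:i + group], n_bits)
               for i in range(0, len(coeffs), group)]
    while len(partial) > 1:                   # tree of 2-input adders
        partial = [sum(partial[i:i + 2]) for i in range(0, len(partial), 2)]
    return partial[0]


print(lut_bits(16, 16))       # 1048576 bits for one monolithic 16-input LUT
print(4 * lut_bits(4, 16))    # 1024 bits for four 4-input DA units
```

Splitting the 16 taps into four groups of four trades one exponentially large table for four small ones plus three adders, which is the structure of Figure 24.3.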
24.4 IMPROVING DA PERFORMANCE

Two approaches can be taken to improve DA performance on an FPGA platform. First, the design can be modified to reduce the number of cycles required to compute the final result. Second, the cycle time can be decreased by reducing the number of logic stages in the critical path of the computation. Examples of both approaches will be discussed in this section.

A simple approach to speeding up DA computation is to recognize that multiple bits from each input vector element $X_k$ can be used to address multiple LUTs in each clock cycle (because addition is associative, we can perform the sum in equation 24.3 using any combination of partial sums that is convenient). This leads to an architecture like the one shown in Figure 24.4, which uses 2 bits of the input data vector elements at a time. The LUTs are identical because they contain the same linear combinations of filter coefficients $A_k$. The LUT outputs must be scaled by the correct exponent of 2 to maintain the significance of the bits added to the accumulated result (the ×2 unit in Figure 24.4). Only two cycles are required to compute the result y for this implementation, instead of four cycles for the implementation in Figure 24.2. For longer bit-width input data, this idea can be extended to using more bits at a time.

[Figure 24.4: Two-bit-at-a-time reduced memory DA implementation. Two 3×8 address decoders, one fed by the odd-numbered bits and one by the even-numbered bits of X0 through X3, select rows from two identical LUTs holding −1/2(A0±A1±A2±A3). The remaining blocks are the ×2 and ×4 scaling units, the initial condition register Q0, the timing signal Ts, the adder/subtractor (+/−), and the Result register that produces y.]

The modification just described provides the benefit of a linear decrease in the number of clock cycles at the expense of a linear increase in LUT memory size. In addition, the number of inputs and the bit width of the adder/subtractor must increase. Mapping this approach onto an FPGA involves a trade-off between the routing resources consumed and the speed of the computation, as the input data bit vectors must be divided into subwords and distributed to multiple CLBs. In addition, multiple LUT outputs must be accumulated at a single destination to form the result, which consumes further routing.

Following a derivation similar to that presented by White [2], we can analyze this trade-off quantitatively. Suppose that we are implementing an M-tap filter using an N-bit number representation and that the computation is proceeding L bits at a time. Further suppose that the LUT data is W bits wide. Computing the result requires that, in each cycle, MN bits are shifted in and WL bits are read out, and N ⁄ L clock cycles must pass. The number of wires $N_W$ is therefore

$$N_W = \frac{MN}{N/L} + WL = (M + W)L \tag{24.10a}$$

If we define the relative importance of minimizing routing resources to minimizing latency as the ratio r, then

$$r = \frac{N/L}{N_W} = \frac{N}{(M + W)L^2} \tag{24.10b}$$

and we can find the L that satisfies our design criterion of relative importance r:

$$L = \sqrt{\frac{N}{r(M + W)}} \tag{24.10c}$$

Now suppose that an application demands low latency and that routing resources are not too tightly constrained. Then, for r = 2, 32-bit input data (N = 32), a 4-tap FIR filter (M = 4), and 4-bit LUT data (W = 4), equation 24.10c gives $L = \sqrt{32/(2 \cdot 8)} = \sqrt{2} \approx 1.4$, which rounds up to L = 2. The desired DA implementation takes the input data 2 bits at a time to address the LUTs, completing a dot product computation in 16 cycles.

In addition to exploiting parallelism to speed up the DA computation, it is possible to employ various levels of pipelining. As we saw in Figure 24.1, the critical path involves decoding the address presented by the data shift registers, accessing the row from the LUT, and propagating the carry through the adder/subtractor while meeting the setup time constraints for the accumulator. If the implementation spans multiple CLBs, there is a potentially significant interconnect delay in this critical path in addition to the combinational logic delay. An obvious way to pipeline the simple implementation is to make the LUT synchronous and latch the outputs before they are fed to the adder/subtractor.

An alternative approach is to use carry save addition to reduce the carry propagation chain in the critical path [8]. The key modification to Figure 24.1 is to use a different structure for the adder/subtractor and to perform the computation in LSB first order. Instead of using a carry propagate adder to accumulate the entire result in one clock cycle, the adder/subtractor is pipelined at the bit level and the sum and carry outputs are stored in flip-flops at each cycle. Each full adder takes one input bit from the LUT output and one from the sum output of the next most significant full adder, automatically accounting for the ×2 scaling required in Figure 24.1.
Assuming that the accumulator is wider than N bits, after N clock cycles the least significant N bits of the final result are stored in the LSBs of the accumulator while the remaining MSBs require one more carry propagating addition to produce the final result. This operation adds one extra clock cycle to the latency of the DA computation.
Most modern FPGA fabrics have dedicated paths for high-speed carry propagation. Given that most DA designs require accumulators with not too many more than N bits, the final carry propagation is typically not the critical path for the entire computation. The throughput is determined by the speed of the carry save addition in the accumulator. Although using carry save addition at the single-bit level results in the greatest speed improvement, it is also the most resource intensive in terms of logic slices and CLBs. A speed versus area trade-off can be achieved by partitioning the adder/subtractor into multiple subcircuits, each of which propagates a carry across p bits ( p = 1 in the example just described). Speedup factors of at least 1.5 have been observed over the traditional design shown in Figure 24.1 [8].
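The cycle-count and routing trade-off discussed in this section can be explored with a small Python model. This is a sketch under assumptions: the function names are hypothetical, the L LUT reads of one cycle are modeled as a sequential inner loop, and the rule of rounding L up to the next divisor of N is one plausible reading of the worked example rather than something the text prescribes.

```python
import math


def wires(m_taps, w_bits, l_bits):
    """Equation 24.10a: wires needed when l_bits of each sample are used per cycle."""
    return (m_taps + w_bits) * l_bits


def bits_per_cycle(n_bits, m_taps, w_bits, r):
    """Equation 24.10c, before rounding to a usable integer."""
    return math.sqrt(n_bits / (r * (m_taps + w_bits)))


def da_dot_multibit(coeffs, samples, n_bits, l_bits):
    """DA dot product consuming l_bits of every sample per cycle; returns (y, cycles)."""
    assert n_bits % l_bits == 0
    lut = [sum(a for k, a in enumerate(coeffs) if (addr >> k) & 1)
           for addr in range(1 << len(coeffs))]
    acc, cycles = 0, 0
    for base in range(0, n_bits, l_bits):     # one clock cycle per group of l_bits
        for j in range(l_bits):               # l_bits identical LUTs, read in parallel in hardware
            n = base + j
            addr = 0
            for k, x in enumerate(samples):
                addr |= ((x >> n) & 1) << k
            term = lut[addr] << n             # scale by the bit's significance
            acc += -term if n == n_bits - 1 else term
        cycles += 1
    return acc, cycles


l_exact = bits_per_cycle(32, 4, 4, 2)         # sqrt(2), about 1.41
l_bits = next(c for c in range(1, 33) if 32 % c == 0 and c >= l_exact)  # assumed rounding rule
print(l_bits, 32 // l_bits, wires(4, 4, l_bits))   # 2 16 16
print(da_dot_multibit([3, -5, 7, 2], [-8, 7, -1, 4], 4, 2))
```

With the parameters of the worked example the model selects 2 bits per cycle, 16 cycles, and 16 wires, and the two-bit-at-a-time dot product finishes a 4-bit computation in two cycles.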
24.5 AN APPLICATION OF DA ON AN FPGA

In addition to FIR filters, a common DA application on FPGAs is acceleration of frequency transformations such as the discrete cosine transform (DCT), which is a critical component of the MPEG video compression and JPEG image compression standards. The two-dimensional DCT can be implemented as two one-dimensional DCTs and a matrix transposition. Each DCT can be implemented as a matrix–vector multiplication, which is easy to implement on an FPGA using DA because it can be decomposed into a set of vector dot products. In one example, using DA instead of multiply–accumulate for the DCT resulted in a factor of 2.4 reduction in area for the FPGA implementation (on a Xilinx XC6200 FPGA) [9]. Using DA and pipelining of the routing to improve the algorithm performance, this implementation was fast enough to process VGA resolution images (640 × 480 pixels) at 25 frames per second, approximately four times faster than a full software implementation running on a microprocessor. The entire two-dimensional DCT consumed a 64 × 78 array of logic blocks on the chip (about 30 percent of the total FPGA area) and the DA portions of the DCT consumed 3648 logic blocks, or about 70 percent of the two-dimensional DCT total. The average utilization of each logic block for the DA components was 61 percent. This high level of utilization was a result of careful floorplanning in addition to DA’s inherent suitability to FPGA implementation.
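The decomposition of a one-dimensional DCT into dot products can be illustrated in Python. This sketch is not the implementation reported in [9]: the orthonormal DCT-II coefficient matrix, the 12-bit coefficient scaling `SCALE`, the 8-bit signed pixel range, and all function names are assumptions chosen for the example. Each matrix row gets its own LUT, and the DA result is compared with a direct integer matrix–vector product.

```python
import math

SCALE = 1 << 12   # assumed fixed-point scaling of the cosine coefficients


def da_dot(coeffs, samples, n_bits):
    """Bit-serial DA dot product (LUT plus shift-accumulate), as in equation 24.3."""
    lut = [sum(a for k, a in enumerate(coeffs) if (addr >> k) & 1)
           for addr in range(1 << len(coeffs))]
    acc = 0
    for n in range(n_bits):
        addr = 0
        for k, x in enumerate(samples):
            addr |= ((x >> n) & 1) << k       # bit n of each sample in 2's complement
        term = lut[addr] << n
        acc += -term if n == n_bits - 1 else term
    return acc


def dct_matrix(size=8):
    """Integer approximation of the orthonormal DCT-II matrix, scaled by SCALE."""
    rows = []
    for u in range(size):
        c = math.sqrt((1 if u == 0 else 2) / size)
        rows.append([round(SCALE * c * math.cos((2 * x + 1) * u * math.pi / (2 * size)))
                     for x in range(size)])
    return rows


def dct_1d_da(samples, n_bits=8):
    """One DA dot product per output coefficient (one LUT per matrix row)."""
    return [da_dot(row, samples, n_bits) for row in dct_matrix(len(samples))]


pixels = [-128, -64, 0, 17, 42, 99, 127, -5]   # hypothetical level-shifted 8-bit pixels
direct = [sum(a * x for a, x in zip(row, pixels)) for row in dct_matrix()]
print(dct_1d_da(pixels) == direct)
```

A two-dimensional DCT would apply the same routine to the rows, transpose, and apply it again, following the row-column decomposition described above.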
References [1] Xilinx, Inc. The Role of Distributed Arithmetic in FPGA-based Signal Processing, Xilinx, Inc. (http://www.xilinx.com/appnotes/theory1.pdf), January 2006. [2] S. A. White. Applications of distributed arithmetic to digital signal processing: A tutorial review. IEEE ASSP Magazine 6(3), July 1989. [3] L. Mintzer. FIR filters with field-programmable gate arrays. Journal of VLSI Signal Processing 6, 1993. [4] G. Roslin. A guide to using field-programmable gate arrays (FPGAs) for application-specific digital signal processing performance. Xilinx white paper, 1995.
[5] W. Wolf. FPGA-based System Design (Modern Semiconductor Design Series), Prentice-Hall, 2004. [6] Xilinx, Inc. Spartan-3E FPGA Family: Complete Data Sheet, DS312 (v2.0) (http://www.xilinx.com), November 2005. [7] B. Calhoun, F. Honore, A. Chandrakasan. A leakage reduction methodology for distributed MTCMOS. IEEE Journal of Solid-State Circuits 39(5), May 2004. [8] R. Grover, W. Shang, Q. Li. A faster distributed arithmetic architecture for FPGAs. Proceedings of the 10th ACM International Symposium on Field-Programmable Gate Arrays, February 2002. [9] R. Woods, D. Trainor, J.-P. Heron. Applying an XC6200 to real-time image processing. IEEE Design & Test of Computers 15(1), January/March 1998.
CHAPTER 25
CORDIC ARCHITECTURES FOR FPGA COMPUTING
Chris Dick
Advanced Systems Technology Group, DSP Division of Xilinx, Inc.
Because field-programmable gate arrays (FPGAs) are often used for realizing complex mathematical calculations, the FPGA designer is in need of a set of math libraries to support such implementations. The literature is rich with algorithmic options for evaluating the type of math functions (e.g., sine, cosine, sinh, cosh, arctangent, atan2, logarithms) that are typically found in a math library for general-purpose and DSP processors. The enormous flexibility of the FPGA coupled with the vast suite of algorithmic options for computing math functions can make the development of an FPGA math library a challenging task. Common approaches to evaluating math functions include polynomial approximation-based techniques [13] and Newton-style iterations [13], to name a couple. One of the most useful and flexible approaches available to the hardware designer for developing high-performance computing hardware is the CORDIC (COordinate Rotation DIgital Computer) algorithm. CORDIC is unparalleled in its ability to encapsulate a diversity of math functions in one basic set of iterations. It can be viewed as the Swiss Army Knife, so to speak, of arithmetic—that is, a single hardware architecture, with very minimal control overhead, having the ability to compute sine, cosine, cosh, sinh, atan2, square root, and polar-to-rectangular and rectangular-to-polar conversions, to name only a few functions. It is in coordinate transformations that the algorithm comes into its own. In both, multi-operand input and multi-element output vectors are involved. There are a plethora of alternatives for realizing, say, division in an FPGA, and most of the CORDIC alternatives provide good hardware efficiency. However, the algorithm remains unrivaled when it comes to processing multi-element I/O vectors, as is the case when converting from Cartesian to polar coordinates or vice versa. CORDIC falls into the class of shift-and-add algorithms—it is a multiplierless method dominated by additions. 
FPGAs are very efficient at realizing arbitrary precision adders, and so the CORDIC algorithm is in many ways a natural fit for coarse-grained FPGA architectures such as the Xilinx Virtex-4 family of devices [41]. This chapter begins with a brief tutorial overview of the CORDIC algorithm. Because most hardware realizations of CORDIC employ fixed-point arithmetic,
design considerations for quantizing the datapath and selecting a suitable number of iterations are provided. Approaches for architecting FPGA CORDIC processors are then presented. Various options are discussed that highlight the use of FPGA features such as embedded multipliers, embedded multiply– accumulator (MACC) tiles, and logic fabric to deliver hardware realizations that provide various trade-offs between throughput, latency, logic fabric utilization, and numerical accuracy. A brief overview of the System Generator [38] design flow used to produce our implementations is also provided. Design considerations for producing very high throughput (450–500 MHz) implementations in Virtex-4 [41] devices are presented as well.
25.1 CORDIC ALGORITHM

The CORDIC algorithm was first published by Volder [35] in 1959 as a technique for efficiently implementing the trigonometric functions required for real-time aircraft navigation. Since first being published, the method has been extensively analyzed and extended to the point where a very rich set of functions is accessible from the one basic set of equations. The algorithm is dominated by bit shifts and additions and so was an ideal match for early-generation computing technology in which multiplication and division were expensive in terms of computation time and physical resources. Volder essentially presented iterative techniques for performing translations between Cartesian and polar coordinate systems (vectoring mode), and a method for realizing a plane rotation (rotation mode) using a series of arithmetic shifts and adds. Since its publication, the CORDIC algorithm has been applied to many different applications and has been used as the cornerstone of the arithmetic engine in many VLSI signal-processing implementations [34]. It has been used extensively for computing various types of transforms, including the fast Fourier transform (FFT) [10,11], the discrete cosine transform [4], and the discrete Hartley transform [3]. And it has found widespread use in realizing various classes of digital filters, including Kalman filters [31], adaptive lattice structures [21], and adaptive nulling [30]. A large body of work has been published on CORDIC-based approaches for implementing various types of linear algebra operations, including singular value decomposition (SVD) [1], Givens rotations [30], and QRD-RLS (recursive least squares) filtering [14]. A brief tutorial style treatment of the basic algorithm is provided here; its FPGA implementation will be discussed in subsequent sections.
25.1.1 Rotation Mode

The CORDIC algorithm has two basic modes: vectoring and rotation. These can be applied in several coordinate systems, including circular, hyperbolic, and linear, to compute various functions such as atan2, sine, cosine, and even division. We begin our treatment by considering the problem of constructing an efficient method to realize a plane rotation of the vector $(x_s, y_s)$ through an angle θ to produce a vector $(x_f, y_f)$, as shown in Figure 25.1.
[Figure 25.1: Plane rotation of the vector $(x_s, y_s)$ through an angle θ. The input vector $(x_s, y_s)$ and the rotated vector $(x_f, y_f)$ are drawn in the x–y plane.]
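The rotation is formally captured in matrix form by equation 25.1.
\[
\begin{bmatrix} x_f \\ y_f \end{bmatrix}
= \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}
\begin{bmatrix} x_s \\ y_s \end{bmatrix}
= \mathrm{ROT}(\theta) \begin{bmatrix} x_s \\ y_s \end{bmatrix}
\tag{25.1}
\]
which can be expanded to the set of equations in equation 25.2.
\[
\begin{aligned}
x_f &= x_s \cos\theta - y_s \sin\theta \\
y_f &= x_s \sin\theta + y_s \cos\theta
\end{aligned}
\tag{25.2}
\]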
The development of a simplified approach for producing rotation through the angle $\theta$ begins by considering it not as one lumped operation but as the result of a series of smaller rotations, or micro-rotations, through the set of angles $\alpha_i$ where
\[
\theta = \sum_{i=0}^{\infty} \alpha_i
\tag{25.3}
\]
The rotation can now be cast as a product of smaller rotations, or
\[
\mathrm{ROT}(\theta) = \prod_{i=0}^{\infty} \mathrm{ROT}(\alpha_i)
\tag{25.4}
\]
If these values $\alpha_i$ are carefully chosen, we can provide a very efficient computation structure. Equation 25.2 can be modified to reflect a micro-rotation $\mathrm{ROT}(\alpha_i)$, leading to equation 25.5.
\[
\begin{aligned}
x_{i+1} &= x_i \cos\alpha_i - y_i \sin\alpha_i \\
y_{i+1} &= x_i \sin\alpha_i + y_i \cos\alpha_i
\end{aligned}
\tag{25.5}
\]
where $(x_0, y_0) = (x_s, y_s)$. Factoring permits the equations to be expressed as
\[
\begin{aligned}
x_{i+1} &= \cos\alpha_i \,(x_i - y_i \tan\alpha_i) \\
y_{i+1} &= \cos\alpha_i \,(y_i + x_i \tan\alpha_i)
\end{aligned}
\tag{25.6}
\]
which positions the iterative update as the product of two procedures: a scaling by the $\cos\alpha_i$ term and a similarity transformation, or scaled rotation. The next significant step that leads to an algorithm that lends itself to an efficient hardware realization is to place restrictions on the values that $\alpha_i$ can take. If
\[
\alpha_i = \tan^{-1}\!\left(\sigma_i 2^{-i}\right)
\tag{25.7}
\]
where $\sigma_i \in \{-1, +1\}$, then equation 25.6 can be written as
\[
\begin{aligned}
x_{i+1} &= \cos\alpha_i \,(x_i - \sigma_i y_i 2^{-i}) \\
y_{i+1} &= \cos\alpha_i \,(y_i + \sigma_i x_i 2^{-i})
\end{aligned}
\tag{25.8}
\]
The purpose of $\sigma_i$ will be explained shortly. With the exception of the scaling term, these equations can be implemented using only additions, subtractions, and shifts. In the set of equations that are typically presented as the CORDIC iterations, and following the lead of Volder [35], the scaling term is usually excluded from the defining equations to produce the modified set of equations
\[
\begin{aligned}
x_{i+1} &= x_i - \sigma_i y_i 2^{-i} \\
y_{i+1} &= y_i + \sigma_i x_i 2^{-i}
\end{aligned}
\tag{25.9}
\]
To determine the value of these $\sigma_i$ we introduce a new variable, $z$ (the angle variable). The recurrence on $z$ is defined by equation 25.10.
\[
z_{i+1} = z_i - \sigma_i \tan^{-1} 2^{-i}
\tag{25.10}
\]
If the $z$ variable is initialized with the desired angle of rotation $\theta$ (that is, $z_0 = \theta$), it can be driven to 0 by conditionally adding or subtracting terms of the form $\tan^{-1} 2^{-i}$ from the state variable $z$. The conditioning is captured by the term $\sigma_i$ as a test on the sign of the current state of the angle variable $z_i$; that is,
\[
\sigma_i =
\begin{cases}
1 & \text{if } z_i \ge 0 \\
-1 & \text{if } z_i < 0
\end{cases}
\tag{25.11}
\]
Driving $z$ to 0 is actually an iterative process for decomposing $\theta$ into a weighted linear combination of terms of the form $\tan^{-1} 2^{-i}$. As $z$ goes to 0, the vector $(x_0, y_0)$ experiences a sequence of micro-rotation extensions that in the limit $n \to \infty$ converge to the coordinates $(x_f, y_f)$.
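The decomposition described by equations 25.10 and 25.11 can be illustrated with a minimal Python sketch. The function name `decompose_angle`, the choice of 16 iterations, and the test angle π/6 are assumptions made for illustration only; floating-point arithmetic stands in for the fixed-point hardware discussed later.

```python
import math

def decompose_angle(theta, n):
    """Run the z recurrence (equations 25.10 and 25.11) for n iterations.

    Returns the list of directions sigma_i and the residual angle z_n.
    """
    z = theta
    sigmas = []
    for i in range(n):
        sigma = 1 if z >= 0 else -1          # equation 25.11
        z -= sigma * math.atan(2.0 ** -i)    # equation 25.10
        sigmas.append(sigma)
    return sigmas, z

sigmas, residual = decompose_angle(math.pi / 6, 16)
# Rebuild the angle from the signed elemental angles; what is missing
# from the requested angle is exactly the residual left in z.
approx = sum(s * math.atan(2.0 ** -i) for i, s in enumerate(sigmas))
print(sigmas[:4], approx, residual)
```

The residual left in `z` after n iterations is the part of θ that the finite sum of signed elemental angles has not yet accounted for.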
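The complete algorithm is summarized in equation 25.12.
\[
\begin{aligned}
i &= 0, \quad x_0 = x_s, \quad y_0 = y_s, \quad z_0 = \theta \\
x_{i+1} &= x_i - \sigma_i y_i 2^{-i} \\
y_{i+1} &= y_i + \sigma_i x_i 2^{-i} \\
z_{i+1} &= z_i - \sigma_i \tan^{-1} 2^{-i} \\
\sigma_i &=
\begin{cases}
1 & \text{if } z_i \ge 0 \\
-1 & \text{if } z_i < 0
\end{cases}
\end{aligned}
\tag{25.12}
\]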
which is easily realized in hardware because of the simple nature of the arithmetic required. The only complex function is the tan−1 , which can be precomputed and stored in a memory. Because of the manner in which the updates are directed, this mode of the CORDIC algorithm is sometimes referred to as the z-reduction mode. Figure 25.2 shows the signal flow graph for the algorithm. Observe the butterfly-style architecture in the cross-addition update.
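As a minimal sketch of equation 25.12, the rotation mode can be modeled in a few lines of Python. The name `cordic_rotate` and the default of 16 iterations are illustrative assumptions, and floating-point multiplications by powers of two stand in for the hardware shifts. The result is not yet compensated for the growth discussed in the next section.

```python
import math

def cordic_rotate(xs, ys, theta, n=16):
    """Rotation (z-reduction) mode of equation 25.12, without scale correction."""
    x, y, z = xs, ys, theta
    for i in range(n):
        sigma = 1 if z >= 0 else -1
        # Cross-addition update; the tuple assignment uses the old x and y.
        x, y = x - sigma * y * 2.0 ** -i, y + sigma * x * 2.0 ** -i
        # The arctangent values would come from a precomputed table in hardware.
        z -= sigma * math.atan(2.0 ** -i)
    return x, y, z

x, y, z = cordic_rotate(1.0, 0.0, math.pi / 4)
print(x, y, z)   # x and y are the rotated coordinates multiplied by the CORDIC gain
```

Dividing `x` and `y` by the accumulated gain recovers cos θ and sin θ for a unit input vector on the x axis.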
25.1.2
Scaling Considerations
Because the scaling term $\cos\alpha_i$ has not been carried over into equation 25.12, the input vector $(x_0, y_0)$ not only undergoes a rotation but also experiences scaling or growth by a factor $1/\cos\alpha_i$ at each iteration. That is,
\[
R_{i+1} = K_{c,i} R_i = \frac{1}{\cos\alpha_i} R_i = \left(1 + \sigma_i^2 2^{-2i}\right)^{1/2} R_i = \left(1 + 2^{-2i}\right)^{1/2} R_i
\tag{25.13}
\]
where $R_i = |x_i + jy_i|$ designates the modulus of the vector at iteration $i$, and the subscript $c$ associates the scaling constant with the circular coordinate system. Figure 25.3 illustrates the growth process at each of the intermediate CORDIC iterations as $(x_0, y_0)$ is translated to its final location $(x_f, y_f)$. For an infinite number of iterations the scaling factor is
\[
K_c = \prod_{i=0}^{\infty} \left(1 + 2^{-2i}\right)^{1/2} \approx 1.6468
\tag{25.14}
\]
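The product in equation 25.14 converges quickly, which a short Python sketch can make concrete. The function name `circular_gain` is an assumption for illustration; it evaluates the truncated product for a finite number of iterations.

```python
import math

def circular_gain(n):
    """Product of the per-iteration growth factors (1 + 2^(-2i))^(1/2) for i < n."""
    k = 1.0
    for i in range(n):
        k *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    return k

for n in (1, 2, 4, 8, 16):
    print(n, circular_gain(n))
```

After a handful of iterations the gain is already indistinguishable from its limiting value at typical datapath precisions, which is why it is usually treated as a constant.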
It should also be noted that, since $\sigma_i \in \{-1, +1\}$, the scaling term is a constant that is independent of the angle of rotation. As captured by equation 25.4, the angle of rotation $\theta$ is decomposed into an infinite number of elemental angles $\alpha_i$, which implies that an infinite number of iterations is theoretically required. In practice, a finite number of iterations, $n$, is selected to make the system realizable in software or hardware. Application of $n$ iterations translates $(x_0, y_0)$ to $(x_n, y_n)$ rather than to $(x_f, y_f)$, as shown in Figure 25.3. The rotation error $\arg(x_f + jy_f) - \arg(x_n + jy_n)$ has an upper bound of $\alpha_{n-1}$, which is the smallest term in the weighted linear expansion of $\theta$. For an infinite-precision arithmetic implementation of the system of equations, each iteration contributes one additional effective fractional bit to the result. Most hardware implementations of the CORDIC algorithm are realized using fixed-point arithmetic, and, as will be discussed soon, the relationship between the number of iterations and the number of effective output binary result digits is very different from that of a floating-point realization of the algorithm.

FIGURE 25.2 A signal flow graph for CORDIC vector rotation. The graph comprises adders, multipliers, unit delays (registers), barrel shifters that apply $2^{-i}$, a sign (SGN) block, and a ROM that stores the micro-rotation angles $\tan^{-1}(2^{-i})$.

FIGURE 25.3 Each iteration of a CORDIC rotation introduces vector growth by a factor of $1/\cos\alpha_i = \left(1 + \sigma_i^2 2^{-2i}\right)^{1/2}$. The figure shows the input vector $(x_0, y_0)$, the intermediate vectors $(x_1, y_1)$, $(x_2, y_2)$, and $(x_3, y_3)$, the final rotation $(x_n, y_n)$ after $n$ iterations, and the final rotation $(x_f, y_f)$ after an infinite number of iterations.
25.1.3
Vectoring Mode
The CORDIC vectoring mode is most commonly used for implementing a conversion from a rectangular to a polar coordinate system. In contrast to rotation mode, where $z$ is driven to 0, in the vectoring mode the initial vector $(x_0, y_0)$ is rotated until the $y$ component is driven to 0. The modification to the basic algorithm required to accomplish this goal is to direct the iterations using the sign of $y_i$. As the $y$ variable is reduced, the corresponding angle of rotation is accumulated in the $z$ register. The complete vectoring algorithm is captured by equation 25.15.
\[
\begin{aligned}
i &= 0, \quad x_0 = x_s, \quad y_0 = y_s, \quad z_0 = 0 \\
x_{i+1} &= x_i - \sigma_i y_i 2^{-i} \\
y_{i+1} &= y_i + \sigma_i x_i 2^{-i} \\
z_{i+1} &= z_i - \sigma_i \tan^{-1} 2^{-i} \\
\sigma_i &=
\begin{cases}
1 & \text{if } y_i < 0 \\
-1 & \text{if } y_i \ge 0
\end{cases}
\end{aligned}
\tag{25.15}
\]
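Equation 25.15 can be sketched in Python in the same style as the rotation mode. The name `cordic_vector` and the test vector (3, 4) are illustrative assumptions; the sketch is only valid for inputs inside the natural convergence range of the iterations (for example, any vector with a positive x component).

```python
import math

def cordic_vector(xs, ys, n=16):
    """Vectoring (y-reduction) mode of equation 25.15, without scale correction."""
    x, y, z = xs, ys, 0.0
    for i in range(n):
        sigma = 1 if y < 0 else -1           # direction chosen from the sign of y
        x, y = x - sigma * y * 2.0 ** -i, y + sigma * x * 2.0 ** -i
        z -= sigma * math.atan(2.0 ** -i)    # accumulate the applied rotation
    return x, y, z

x, y, z = cordic_vector(3.0, 4.0)
print(x, y, z)   # x holds the scaled magnitude, y is near 0, z holds the angle
```

Here `z` approximates the argument of the input vector, and `x` approximates its magnitude multiplied by the CORDIC gain.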
This CORDIC mode is commonly referred to as y-reduction mode. Figure 25.4 shows the results of a CORDIC vector mode simulation for arg (xs + jys ) = 7π/8 and |xs + jys | = 1. The top plot (a) shows the true angle of the input vector (solid line) overlaid with arg (xi + jyi ) , i = 1, . . ., 16. We note the oscillatory behavior of (xi , yi ) about the true value of the angle. Overdamped or underdamped behavior will be produced depending on the system initial conditions. The lower plot (b) shows, for this case of initial conditions, how rapidly the algorithm can converge toward the correct solution. In fact, for many practical applications, a short CORDIC (small number of iterations) produces acceptable performance. For example, in a 16-QAM (quadrature amplitude modulation) carrier recovery circuit [29] employing a Costas Loop [23], a 5-iteration CORDIC usually provides adequate performance [12].
FIGURE 25.4 Convergence of CORDIC vectoring. The top plot (a) shows the true angle of the input vector $\arg(x_s + jy_s)$ (solid line) overlaid with $\arg(x_i + jy_i)$, $i = 1, \ldots, 16$, as angle (radians) versus iteration number. The bottom plot (b) is the percentage angle error as a function of the iteration number.
25.1.4
Multiple Coordinate Systems and a Unified Description
Alternative versions of the CORDIC engine can be defined under the circular, hyperbolic, and linear coordinate systems [13]. These use a computation similar to that of the basic CORDIC algorithm, but can provide additional functions. It is possible to capture the vectoring and rotation modes of the CORDIC algorithm in all three coordinate systems using a single set of unified equations. To do this a new variable, $m$, is introduced to identify the coordinate system so that
\[
m =
\begin{cases}
+1 & \text{circular coordinates} \\
0 & \text{linear coordinates} \\
-1 & \text{hyperbolic coordinates}
\end{cases}
\tag{25.16}
\]
The unified micro-rotation is
\[
\begin{aligned}
x_{i+1} &= x_i - m \sigma_i y_i 2^{-i} \\
y_{i+1} &= y_i + \sigma_i x_i 2^{-i} \\
z_{i+1} &=
\begin{cases}
z_i - \sigma_i \tan^{-1} 2^{-i} & \text{if } m = 1 \\
z_i - \sigma_i \tanh^{-1} 2^{-i} & \text{if } m = -1 \\
z_i - \sigma_i 2^{-i} & \text{if } m = 0
\end{cases}
\end{aligned}
\tag{25.17}
\]
The scaling factor is $K_{m,i} = \left(1 + m 2^{-2i}\right)^{1/2}$.
TABLE 25.1 Functions computed by a CORDIC processor for the circular (m = 1), hyperbolic (m = −1), and linear (m = 0) coordinate systems

Coordinate system m | Rotation/vectoring | Initialization | Result vector
1 | Rotation | $x_0 = x_s$, $y_0 = y_s$, $z_0 = \theta$ | $x_n = K_{1,n} \cdot (x_s \cos\theta - y_s \sin\theta)$, $y_n = K_{1,n} \cdot (y_s \cos\theta + x_s \sin\theta)$, $z_n = 0$
1 | Rotation | $x_0 = 1/K_{1,n}$, $y_0 = 0$, $z_0 = \theta$ | $x_n = \cos\theta$, $y_n = \sin\theta$, $z_n = 0$
1 | Vectoring | $x_0 = x_s$, $y_0 = y_s$, $z_0 = \theta$ | $x_n = K_{1,n} \cdot \mathrm{sgn}(x_0) \cdot \sqrt{x^2 + y^2}$, $y_n = 0$, $z_n = \theta + \tan^{-1}(y_s/x_s)$
0 | Rotation | $x_0 = x_s$, $y_0 = y_s$, $z_0 = z_s$ | $x_n = x_s$, $y_n = y_s + x_s z_s$, $z_n = 0$
0 | Vectoring | $x_0 = x_s$, $y_0 = y_s$, $z_0 = z_s$ | $x_n = x_s$, $y_n = 0$, $z_n = z_s + y_s/x_s$
−1 | Rotation | $x_0 = x_s$, $y_0 = y_s$, $z_0 = \theta$ | $x_n = K_{-1,n} \cdot (x_s \cosh\theta + y_s \sinh\theta)$, $y_n = K_{-1,n} \cdot (y_s \cosh\theta + x_s \sinh\theta)$, $z_n = 0$
−1 | Rotation | $x_0 = 1/K_{-1,n}$, $y_0 = 0$, $z_0 = \theta$ | $x_n = \cosh\theta$, $y_n = \sinh\theta$, $z_n = 0$
−1 | Vectoring | $x_0 = x_s$, $y_0 = y_s$, $z_0 = \theta$ | $x_n = K_{-1,n} \cdot \mathrm{sgn}(x_0) \cdot \sqrt{x^2 - y^2}$, $y_n = 0$, $z_n = \theta + \tanh^{-1}(y_s/x_s)$

TABLE 25.2 CORDIC shift sequences, ranges of convergence, and scale factor bound for circular, linear, and hyperbolic coordinate systems

Coordinate system m | Shift sequence $s_{m,i}$ | Convergence $\theta_{MAX}$ | Scale factor $K_m$ ($n \to \infty$)
1 | 0, 1, 2, 3, 4, ..., i, ... | ≈1.74 | ≈1.64676
0 | 1, 2, 3, 4, 5, ..., i+1, ... | 1.0 | 1.0
−1 | 1, 2, 3, 4, 4, 5, ...* | ≈1.13 | ≈0.82816

* For m = −1, the following iterations are repeated: {4, 13, 40, 121, ..., k, 3k + 1, ...}.
Operating the two modes in the three coordinate systems, in combination with suitable initialization of the algorithm variables, generates a rich set of functions, shown in Table 25.1. Table 25.2 summarizes the shift sequences, maximum angle of convergence θMAX (elaborated on in a later section), and
scaling function for the three coordinate systems. Note that each system requires slightly different shift sequences (the sequence of i values).
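The unified iteration of equation 25.17, combined with the shift sequences of Table 25.2, can be sketched as follows. The function names `shift_sequence`, `elementary_angle`, and `unified_rotate` are hypothetical, only the rotation mode is modeled, and the gain is accumulated alongside the iterations so that the result can be normalized afterward rather than by pre-scaling the input.

```python
import math

def shift_sequence(m, n):
    """First n shift values s_{m,i} for m in {1, 0, -1}, following Table 25.2."""
    if m == 1:
        return list(range(n))
    if m == 0:
        return list(range(1, n + 1))
    seq, i, repeat = [], 1, 4
    while len(seq) < n:
        seq.append(i)
        if i == repeat and len(seq) < n:
            seq.append(i)              # iterations 4, 13, 40, ... are repeated
            repeat = 3 * repeat + 1
        i += 1
    return seq

def elementary_angle(m, s):
    """Elemental angle for shift s: atan, atanh, or the plain power of two."""
    if m == 1:
        return math.atan(2.0 ** -s)
    if m == -1:
        return math.atanh(2.0 ** -s)
    return 2.0 ** -s

def unified_rotate(m, x, y, z, n=16):
    """Rotation mode of the unified iteration; also returns the accumulated gain."""
    gain = 1.0
    for s in shift_sequence(m, n):
        sigma = 1 if z >= 0 else -1
        x, y = x - m * sigma * y * 2.0 ** -s, y + sigma * x * 2.0 ** -s
        z -= sigma * elementary_angle(m, s)
        gain *= math.sqrt(1.0 + m * 2.0 ** (-2 * s))
    return x, y, z, gain

# Hyperbolic rotation of (1, 0) by 0.5: normalized outputs approximate cosh and sinh.
xh, yh, zh, gh = unified_rotate(-1, 1.0, 0.0, 0.5)
print(xh / gh, yh / gh)
# Linear rotation: y accumulates the product of x and the initial z.
xl, yl, zl, gl = unified_rotate(0, 3.0, 0.0, 0.75)
print(yl)
```

The linear case shows why the same datapath can multiply: with m = 0 the x register is never modified and y accumulates shifted copies of it as z is driven to 0.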
25.1.5
Computational Accuracy
One of the first design requirements for the fixed-point arithmetic implementation of a CORDIC processor is to define the numerical precision requirements of the datapath. This includes defining the numeric representation for the input operands and the processing engine internal registers, in addition to the number of micro-rotations that will be required to achieve a specified numerical quality of result. To guide this process it is useful to have an appreciation for the sources of computation noise in CORDIC arithmetic. While CORDIC processing can be realized with floating-point arithmetic [2, 7], we will restrict our discussion to fixed-point arithmetic implementations, as they are the most commonly used numeric type employed in FPGA realizations. Two primary noise sources are to be considered. One is associated with the weighted and finite linear combination of elemental angles that are used to represent the desired angle of rotation $\theta$; the second source is associated with the rounding of the datapath variables $x$, $y$, and $z$. These noise sources are referred to as the angle approximation and the rounding error, respectively.

Angle approximation error
In this discussion we assume that all finite-precision quantities are represented using fixed-point 2's complement arithmetic, so the value $F$ of a normalized number $u$ represented using $m$ binary digits $(u_{m-1} u_{m-2} \ldots u_0)$ is
\[
F = -u_{m-1} + \sum_{j=0}^{m-2} u_j \cdot 2^{-m+j+1}
\tag{25.18}
\]
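Equation 25.18 can be checked with a small Python sketch. The function name `normalized_value` and the convention of passing the bits most significant first are assumptions made for illustration.

```python
def normalized_value(bits):
    """Value of an m-bit 2's complement fraction per equation 25.18.

    bits is given most significant bit first, so bits[0] is the sign bit u_{m-1}.
    """
    m = len(bits)
    value = -float(bits[0])                  # the sign bit carries weight -1
    for j in range(m - 1):
        u_j = bits[m - 1 - j]                # pick out bit u_j
        value += u_j * 2.0 ** (-m + j + 1)
    return value

print(normalized_value([0, 1, 0, 0]))   # sign bit clear, only the 2^-1 bit set
print(normalized_value([1, 0, 0, 0]))   # most negative representable value
```

The representable range is therefore [−1, 1 − 2^(−m+1)], which is the normalized input format assumed in the remainder of the analysis.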
As will be presented next, there is a requirement in the CORDIC algorithm to accommodate bit growth in both the integer and fractional fields of the $x$ and $y$ variables. To accommodate this, the data format is enhanced with an additional $G_I$ and $G_F$ integer and fractional guard bits, respectively, so that a number with $B_I + G_I$ and $B_F + G_F$ bits allocated to the integer and fractional fields $s$ and $r$, respectively, $\left(s_{B_I+G_I-1} s_{B_I+G_I-2} \ldots s_0 \; r_{B_F+G_F-1} r_{B_F+G_F-2} \ldots r_0\right)$, is expressed as
\[
F = -s_{B_I+G_I-1} \cdot 2^{B_I+G_I-1} + \sum_{j=0}^{B_I+G_I-2} s_j \cdot 2^{j} + \sum_{j=0}^{B_F+G_F-1} r_j \cdot 2^{-(B_F+G_F)+j}
\tag{25.19}
\]
Figure 25.5 illustrates the extended data format. The integer guard bits are necessary to accommodate the vector growth experienced when operating in circular coordinates. The fractional guard bits are required to support the word growth that occurs in the fractional field of the $x$ and $y$ registers due to the successive arithmetic shift-right operations employed in the iterative updates. It is assumed that the input samples are represented as normalized $(1.B_F)$ quantities.
FIGURE 25.5 The fractional fixed-point data format used for internal storage in the quantized CORDIC algorithm. From most to least significant, the word comprises $G_I$ MSB guard bits, the input data format of $B_I$ integer and $B_F$ fractional bits separated by the binary point, and $G_F$ fractional guard bits.
There are $n$ fixed rotation angles $\alpha_{m,i}$ employed to approximate a desired angle of rotation $\theta$. Neglecting all other error sources, the accuracy of the calculation is governed by the $n$th and final rotation, which limits the angle approximation error to $\alpha_{m,n-1}$. Because $\alpha_{m,n-1} = \frac{1}{\sqrt{m}} \tan^{-1}\!\left(\sqrt{m} \cdot 2^{-s_{m,n-1}}\right)$, the angle approximation error can be made arbitrarily small by increasing the number of micro-rotations $n$. Of course, the number of bits allocated to represent the elemental angles $\alpha_{m,i}$ needs to be sufficient to support the smallest angle $\alpha_{m,n-1}$. The number representation defined in equation 25.19 results in a least significant digit weighting of $2^{-(B_F+G_F)}$. Therefore, $\alpha_{m,n-1} \ge 2^{-(B_F+G_F)}$ must hold in order to represent $\alpha_{m,n-1}$. Approximately $B_F + 1$ iterations are required to generate $B_F$ significant fractional bits.

Datapath rounding error
As discussed earlier, most FPGA realizations of CORDIC processors employ fixed-point arithmetic. The update of the $x$, $y$, and $z$ state variables according to equation 25.12 produces a dynamic range expansion, which is ideally supported by precisions that accommodate the worst-case bit growth. The number of additional guard bits beyond the original precision of the input operands can be very large, and carrying these additional bits in the datapath is generally impractical. For example, in the circular mode of operation the number of additional fractional bits required to support a full-precision calculation is determined by the sum of the shift sequence $s_{m,i}$. If the input operands are presented as a 16.15 value (a 16-bit field width with 15 fractional bits) and 16 micro-rotations are performed, the bit growth for the fractional component of the datapath is $\sum_{i=0}^{15} i = 120$ bits. Thus, the total number of fractional bits required for a full-precision calculation is 120 + 15 = 135. While FPGAs certainly provide the capability to support arbitrary precision arithmetic, it would be highly unusual to construct a CORDIC processor with such a wide datapath. In fact, the error in the CORDIC result vector can be maintained to a desired value using far fewer fractional guard bits, as discussed next. Rather than accommodating the bit growth implied in the algorithm, the dynamic range expansion is better handled by rounding the newly computed state variables. Control over wordlength can be achieved using unbiased rounding, simple truncation, or other techniques [26]. True rounding, while the
preferred approach because of the smaller error introduced when compared to truncation, can be the most area consuming because a second addition is potentially required. In some cases, the cost of rounding can be significantly reduced by exploiting the carry-in port of the adders used in the implementation. Truncation is obviously the simplest approach, requiring only the extraction of a bit field from the full-precision value, but it introduces an undesirable positive bias in the final result and an error component that is twice the magnitude of unbiased rounding. Nevertheless, truncation arithmetic is the option most frequently employed in FPGA CORDIC datapath design.

A simple approach to understanding the quantization effects of the CORDIC algorithm was first presented by Walther [36]. A very complete analysis was later published by Hu [16], with further work reported by Park and Cho [28] and Hu and Bass [17]. For many practical applications Walther's method produces acceptable results, and this is the approach we will use to design the FPGA implementations. A brief summary of the method is presented here.

Analysis of the rounding error for the $z$ variable is straightforward because there are no data shifts involved in the state update, as there are with the $x$ and $y$ variables. The rounding error is simply due to the quantization of the rotation angles. The upper bound on the error is then the accumulation of the absolute values of the rounding errors for the quantized angles $\alpha_{m,i}$. Datapath pruning and its associated quantization effects for the $x$ and $y$ variables is certainly a more challenging analysis than that for the angle variable because of the scaling term involved in the cross-addition update. Nevertheless, several extensive treatments have been published. The effects of error propagation in the algorithm were reported by Hu in a Cray Research publication [5] and later extended by Hu and Bass [17].
Walther's treatment takes a slightly simplified approach and assumes that the maximum rounding error for $n$ iterations is the sum of the absolute value of the maximum rounding error associated with each micro-rotation and the subsequent quantization that is performed to control word growth. The format for the CORDIC variables was shown in Figure 25.5. $B = B_I + B_F + G_F + G_I$ bits are used for internal storage, with $B_F + G_F$ of these bits assigned to the fractional component of the representation. The maximum error for one iteration is therefore of magnitude $2^{-(B_F+G_F)}$. In the simplified analysis, the rounding error $e(n)$ in the final result, after all $n$ iterations, is simply $n$ times this quantity, which is $e(n) = n 2^{-(B_F+G_F)}$. If $B_F$ accurate fractional bits are required in the result word, the required resolution is $2^{-(B_F-1)}$. If $G_F$ is selected such that $e(n) \le 2^{-B_F}$, the datapath quantization can effectively be ignored. This implies that $n 2^{-(B_F+G_F)} \le 2^{-B_F}$, which requires $G_F \ge \log_2(n)$. Therefore, $G_F = \lceil \log_2(n) \rceil$ fractional guard bits are required to produce a result that has an accuracy of $B_F$ fractional bits. This simplified treatment of the computation noise is a reasonable approximation that can help guide the definition of the datapath width required to meet a specified numerical fidelity. Figure 25.6 shows the results of a simulation using different data representations for the $x$, $y$, and $z$ variables of a CORDIC vectoring algorithm in circular
FIGURE 25.6 The effective number of result bits for a CORDIC vector processor (circular coordinates). The number of fractional guard bits is $G_F = \lceil \log_2(n) \rceil$. Both panels plot effective result bits versus number of iterations, with one curve per value of $B_F$.
coordinates. Unit modulus complex vectors with random angles were generated and projected onto the CORDIC input sample $(x_0, y_0)$. Each sample point in the plot represents the maximum absolute error of the angle estimate resulting from 4000 trials. We note that in all of the simulations the effective number of fractional output bits is matched to the number of fractional bits in the input operand. The simplified treatment of the rounding noise generated in the update equations is certainly pessimistic and produces a requirement on the number of guard bits that is biased slightly higher than what might typically be required. Selecting $G_F = \lceil \log_2(n) \rceil$ is certainly a safe, if slightly overengineered, choice. In the context of an FPGA realization, an additional bit of precision carried by the variables has almost negligible impact on the area and maximum operating clock frequency of the design. An additional observation from the plots in Figure 25.6 is that the production of $B_F$ effective output digits requires more iterations than the $B_F + 1$ iterations required for a full floating-point implementation; an additional three iterations are, in general, necessary. The implication of this is that two additional bits must be allocated to represent the elemental angles to provide the angle resolution implied by the adjusted iteration count.

Defining the number of guard bits $G_I$ is very straightforward based on the number of integer bits $B_I$ in the input operands, the coordinate system to be employed (e.g., circular, hyperbolic, or linear), and the mode (vectoring or rotation). For example, if the input data is in standard 2's complement format and bounded by ±1, then $B_I = 1$. This means that the $l_2$ norm of the input $(x_0, y_0)$ is at most $\sqrt{2}$. For the CORDIC vectoring mode, the range extension introduced by the iterations is approximately $K_1 \approx 1.6468$ for any reasonable number of iterations. The maximum that the final value of the $x$ register can assume is approximately $\sqrt{2} \cdot 1.6468 \approx 2.3289$, which requires that $G_I = 2$.
TABLE 25.3 Number of micro-rotations and CORDIC processor datapath format required to achieve a desired number of effective output bits

Number of effective fractional result bits | Micro-rotations: n | Internal storage data format: x and y | Internal storage data format: z
8 | 10 | (15.12) | (15.14)
12 | 15 | (19.16) | (19.18)
16 | 19 | (24.21) | (24.23)
24 | 27 | (32.29) | (32.31)
Based on this approach, a reasonable procedure for selecting the number of CORDIC micro-rotations and a suitable quantization for the $x$, $y$, and $z$ variables, given the effective number of fractional bits required in the output, is the following:
1. Define the number of iterations as $n = B_F + 3$.
2. Select the field width for the $x$ and $y$ variables as $2 + B_I + B_F + \lceil \log_2(n) \rceil$ for the vectoring mode in circular coordinates; $B_F + \lceil \log_2(n) \rceil$ of these bits are allocated to the fractional component of the register.
3. Select the fractional precision of the angle register $z$ to be $B_F + \lceil \log_2(n) \rceil + 2$, while maintaining 1 bit for the integer portion of the register.
4. Apply similar reasoning to select $n$ and $G_I$ for the other coordinate systems and modes.
Based on this approach, Table 25.3 shows the number of micro-rotations $n$ and the internal data storage format corresponding to 8, 12, 16, and 24 effective fractional result bits. The notation (p.q) indicates a bit field width of p bits, with q of these bits allocated to the fractional component of the value.
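The selection procedure above can be sketched as a small Python helper. The function name `cordic_datapath_format` and its default arguments (one integer bit and two integer guard bits, as in the circular vectoring example) are assumptions for illustration; the guard-bit count uses the ceiling of log2(n).

```python
def cordic_datapath_format(bf, bi=1, gi=2):
    """Return n and the (width, fraction) formats for x/y and z.

    bf: required effective fractional result bits
    bi: integer bits of the input operand (assumed 1 for data bounded by +/-1)
    gi: integer guard bits (assumed 2, circular vectoring mode)
    """
    n = bf + 3                               # step 1
    gf = (n - 1).bit_length()                # ceil(log2(n)) for n >= 1
    xy_fraction = bf + gf                    # step 2
    xy_width = gi + bi + xy_fraction
    z_fraction = bf + gf + 2                 # step 3
    z_width = 1 + z_fraction                 # one integer bit for the angle
    return n, (xy_width, xy_fraction), (z_width, z_fraction)

for bf in (8, 12, 16, 24):
    print(bf, cordic_datapath_format(bf))
```

For these four values of BF the storage formats produced by the sketch agree with the (p.q) entries listed in Table 25.3.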
25.2
ARCHITECTURAL DESIGN
There are many hardware architecture options to evaluate when considering FPGA CORDIC datapath implementation. A particular choice is determined by the design specifications of numerical accuracy, throughput, and latency. At the highest level are key architectural decisions on whether a folded [27] or fully parallel [27] pipelined (or nonpipelined) architecture is to be used. At a lower, technology-specific level, FPGA features associated with a particular FPGA family are also a factor in the decision process. For example, later-generation FPGAs such as the Virtex-4 family [41] include an array of arithmetic units called the XtremeDSP Slice [43] (referred to as the DSP48 in the remainder of the chapter). As discussed later, a CORDIC implementation can be realized that is mostly based on the DSP48 embedded tile. Thus, with this particular family of devices
the designer has a choice of producing an implementation that is completely logic slice based [40] or biased toward the use of DSP48 elements. The process that guides such decisions is elaborated in the next section.
25.3
FPGA IMPLEMENTATION OF CORDIC PROCESSORS
One of the elegant properties of FPGA computing is the ability to construct a compute engine closely tailored to the problem specifications, including processing throughput, latency, and numerical accuracy. Consider, for example, the throughput requirement. At one end of the architecture spectrum, and when modest processing rates are involved, a fully folded [27] implementation, where the same logic is used for all iterations (folding factor = n), is one option. In this case, new operands are delivered, and a new result vector is produced, every n clock cycles. This choice of implementation results in the smallest FPGA footprint at the expense of processing rate. If a high-throughput unit is required, a fully parallel, or completely unfolded implementation (folding factor = 1) that allocates a complete hardware PE to each iteration is appropriate. This will of course result in the largest area, but provides the highest compute rate.
25.3.1
Convergence
One of the design considerations for the CORDIC engine is the region of convergence that needs to be supported by the implementation, as the basic form of the algorithm does not converge for all input coordinates. For the rotation mode, the CORDIC algorithm converges provided that the absolute value of the rotation angle is no larger than $\theta_{MAX} \approx 1.7433$ radians, or approximately 99.88°. In many applications we need to support input arguments that span all four quadrants of the complex plane, a so-called full-range CORDIC. Much published work addresses this requirement [8, 19, 25], and many elegant extensions to the basic set of CORDIC iterations have been produced. Some of them introduce additional iterations and, while maintaining the basic shift-and-add property of the algorithm, result in a significant time or area penalty.

The most straightforward approach for handling the convergence issue in FPGA hardware is to first note that the natural range of convergence extends beyond the angle π/2. That is, the basic set of equations converges over the interval [−π/2, π/2]. To extend the implementation to converge over [−π, π], we can simply detect when the input lies outside the first and fourth quadrants, map it into either the first or fourth quadrant, and make a post-micro-rotation correction to account for the input mapping. This architecture is illustrated in Figure 25.7. The input mapping is particularly simple. Referring to Figure 25.7, if $x_0$ is negative, the quadrants must be changed by applying a ±π/2 (±90°) rotation. Whether it is a positive or negative rotation is determined by the sign of $y_0$. To compensate for the input mapping, an angle rotation is conditionally applied to the micro-rotation engine result $z'_n$ to produce the final output value $z_n$. Details of the coarse angle rotator and matching quadrant correction circuit are shown in Figure 25.8. The area cost for an FPGA implementation of the circuits is modest [40].

FIGURE 25.7 A full-range CORDIC processor showing input quadrant mapping, micro-rotation engine, and quadrant correction. The quadrant mapping operator produces $(x'_0, y'_0) = (x_0, y_0)$ when no mapping is needed, and $(x'_0, y'_0) = (y_0, -x_0)$ or $(x'_0, y'_0) = (-y_0, x_0)$ otherwise.
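The quadrant mapping and correction can be sketched in Python around a vectoring engine. The names `map_quadrant` and `full_range_angle` are hypothetical, and the assignment of the two 90° rotations to the sign of y0 follows from the geometry (a second-quadrant input is rotated by −90° and a third-quadrant input by +90°); it is an assumption about how the mapping in Figure 25.7 is wired, not a transcription of the circuit.

```python
import math

def map_quadrant(x0, y0):
    """Map the input into the right half plane; return the angle to add back."""
    if x0 >= 0:
        return x0, y0, 0.0                   # already in quadrant 1 or 4
    if y0 >= 0:
        return y0, -x0, math.pi / 2          # rotated by -90 degrees
    return -y0, x0, -math.pi / 2             # rotated by +90 degrees

def full_range_angle(x0, y0, n=16):
    """Vectoring-mode angle estimate valid over all four quadrants."""
    x, y, correction = map_quadrant(x0, y0)
    z = 0.0
    for i in range(n):                       # micro-rotation engine (eq. 25.15)
        sigma = 1 if y < 0 else -1
        x, y = x - sigma * y * 2.0 ** -i, y + sigma * x * 2.0 ** -i
        z -= sigma * math.atan(2.0 ** -i)
    return z + correction                    # quadrant demapping

angle = full_range_angle(math.cos(7 * math.pi / 8), math.sin(7 * math.pi / 8))
print(angle)
```

The mapping costs only a conditional swap and negation of the inputs, and the correction is a single conditional addition of a constant to the angle output.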
25.3.2
Folded CORDIC
The folded CORDIC architecture allocates a single PE to service all of the required micro-rotations. At one architectural extreme a bit-serial implementation employing a single 3-2 full adder, with appropriate control circuitry and state storage, can address all of the required updates for x, y, and z. However, our treatment employs a word-oriented architecture that associates unique functional units (FU) with each of the x, y, and z processing engines, as shown in Figure 25.9. Multiple mapping options are available when projecting the dependency graph onto an FPGA architecture. In the Xilinx Virtex-4 family [41], one option for supporting the adder/subtractor FUs is to utilize the logic fabric and realize these modules at the cost of one lookup table (LUT) per result digit. So for example, the addition of two 16-bit operands to generate a 17-bit sum requires 17 LUTs. An alternative is to use the 48-bit adder in the DSP48 tile.
FIGURE 25.8 A coarse angle rotator preceding a micro-rotation engine for a full-range CORDIC processor (a). A post-micro-rotation quadrant correction circuit (b).

FIGURE 25.9 A folded CORDIC architecture with separate functional units for each of the x, y, and z updates. Only the micro-rotation engine is shown. The components are directed adder/subtractors, registers, wire-based barrel shifters, multiplexers, and a finite-state machine that generates the direction for the adder/subtractors.
There are also several mapping options for the barrel shifter: It can be realized in the logic fabric, with the multiplier in the DSP48 tile, or, for that matter, using an embedded multiplier in any FPGA family that supports this architectural component (e.g., Virtex-II Pro [39] or Spartan-3E [37]). Consider a fabric-only implementation of a vectoring CORDIC algorithm in circular coordinates. In this case all of the FUs are implemented directly in the logic fabric. The FPGA area, $A_F$, can be expressed as
\[
A_F = 3 \cdot a_{add} + 2 \cdot a_{barrel} + 3 \cdot a_{mux} + a_{LUT} + a_Q + a_{Q^{-1}}
\tag{25.20}
\]
where $a_{add}$, $a_{barrel}$, $a_{mux}$, $a_{LUT}$, $a_Q$, and $a_{Q^{-1}}$ correspond to the area of an adder, barrel shifter, input multiplexer, elementary angle LUT, quadrant input mapper, and output mapper circuits, respectively. The FPGA logic fabric is designed to efficiently support the implementation of arbitrary-precision high-speed adder/subtractors. Each configurable logic block (CLB) [41] includes dedicated circuitry that provides fast carry resolution, with the LUT itself producing the half-sum. The component that can be costly in terms of area is the barrel shifter. The barrel shifter area cost can be much more significant than the aggregate cost of the adder/subtractors used for updating the $x$, $y$, and $z$ variables. For example, in a design that supplies 16 effective result digits, the 2 barrel shifters occupy an aggregate area of 226 LUTs while the adders occupy 74 LUTs in total. Here, the barrel shifters have a footprint approximately three times that of the adders. The barrel shifter area can be reduced if a multiplier-based barrel shifter is used rather than a purely logic fabric–based implementation. FPGA families such as Spartan-3E [37], Virtex-II Pro [39], and Virtex-4 [40] include an array of embedded multipliers, which are useful for realizing arithmetic shifts. The multiplier accepts 18-bit precision operands and produces a 36-bit result. When used as a barrel shifter, one port of the multiplier is supplied with the input operand that is to experience the arithmetic shift, while the second port accepts the shift value $2^i$, where $i$ is the iteration index. In a typical hardware implementation the iteration index rather than the exponentiated value is usually available in the control plane that coordinates the operation of the circuit. The exponentiation can be done via a small LUT implemented using distributed memory [40]. Multiple multiplier primitives can be combined with an adder to form a barrel shifter that can support a wider datapath.
For the previous example, multiplier realization of the barrel shifter results in an FPGA footprint that is less than half that of an entirely fabric-based implementation. The folded CORDIC architecture is a recursive graph, which means that deep pipelining cannot be employed to reduce the critical path. The structure can accept a new set of operands, and produces a result every n clock cycles.
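The multiplier-based shift described above can be illustrated with a minimal Python sketch. It is not taken from the chapter: the function name, the choice of multiplying by a power of two and keeping the upper part of the product to obtain a right shift, and the limit of 16 shift positions (so that the power of two stays a positive 18-bit signed value) are all assumptions made for illustration.

```python
def mult_barrel_shift_right(x: int, i: int, width: int = 18) -> int:
    """Arithmetic right shift of a signed `width`-bit value by i places,
    written as a multiplication by a power of two (hypothetical sketch)."""
    max_shift = width - 2              # keeps the coefficient positive in `width` signed bits
    if not 0 <= i <= max_shift:
        raise ValueError("shift amount out of range for this sketch")
    if not -(1 << (width - 1)) <= x < (1 << (width - 1)):
        raise ValueError("operand does not fit in the multiplier port")
    coeff = 1 << (max_shift - i)       # stands in for the small LUT indexed by i
    product = x * coeff                # fits in the 2*width-bit multiplier result
    return product >> max_shift        # keep the upper part of the product
```

For any in-range operand the result equals the ordinary arithmetic shift `x >> i`, which is the operation the cross-addition update needs.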
25.3.3 Parallel Linear Array
FIGURE 25.10 A programmable parallel pipelined CORDIC array. In a completely unfolded implementation, the barrel shifters are realized as FPGA routing and so consume no resources other than interconnect. (Figure legend: the Mode control selects vectoring or rotation; each stage i comprises directed adder/subtractors, a finite-state machine that generates the adder/subtractor direction, a wire-based barrel shifter, and registers.)

When throughput is the overriding design consideration, a fully parallel pipelined CORDIC realization is the preferred architecture. With this approach
the CORDIC algorithm is completely unrolled and each operation is projected onto a unique hardware resource, as shown in Figure 25.10. One interesting effect of the unrolling is that the data shifts required in the cross-addition update can be realized as wiring between successive CORDIC processing elements (PEs). Unlike the folded architecture, where either LUTs or embedded multipliers are consumed to realize the barrel shifter, no resources other than interconnect are required to implement the shift in the linear array architecture. The only functional units required for each PE with this approach are three adder/subtractors and a small amount of logic to implement the control circuit that steers the add/subtract FUs. The micro-rotation angle for each PE is encoded as a constant supplied on one arm of the adder/subtractor that performs the angle update—no LUT resources are required for this. Note in Figure 25.10 that the sign bit of the y and z variables is supplied to the control circuit that is local to each processing engine. This permits the architecture to operate in the y- or z-reduction configuration under the control of the Mode input control signal, and thus support vectoring or rotation, respectively. Figure 25.11(a) shows a comparison of the area functions for the parallel and folded architectures. The folded implementation is entirely fabric based. As expected, the area of the parallel design exhibits modest exponential growth and, for an effective number of result digits greater than 15, occupies more than three times the area of the folded architecture. For the case of 24 effective result digits, the parallel design is larger by a factor of approximately 5. Figure 25.11(b) contrasts the throughput of the two architectures. Naturally, the parallel design has a constant throughput of one CORDIC operation per second for a normalized clock rate of 1, while the throughput for the folded design falls off as the inverse of the number of iterations.
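The unrolled structure can be modeled behaviorally. The following Python sketch is an illustration under stated assumptions, not the chapter's implementation: it covers circular coordinates only, uses plain integers as fixed-point values with `frac_bits` fractional bits, and the names `cordic_pe` and `cordic_array` are hypothetical. Each call to `cordic_pe` stands for one processing element with three add/subtract operations, a fixed shift by i, and a constant micro-rotation angle.

```python
import math

def cordic_pe(x, y, z, i, alpha_i, mode):
    """One processing element: the shift by i is fixed, as if done by wiring."""
    if mode == "rotation":             # z-reduction: steer on the sign of z
        sigma = 1 if z >= 0 else -1
    else:                              # vectoring, y-reduction: steer on the sign of y
        sigma = 1 if y < 0 else -1
    x_next = x - sigma * (y >> i)      # cross-addition update
    y_next = y + sigma * (x >> i)
    z_next = z - sigma * alpha_i       # angle update with a per-stage constant
    return x_next, y_next, z_next

def cordic_array(x0, y0, z0, n, frac_bits, mode="rotation"):
    """Chain n processing elements; the result is still scaled by the CORDIC gain."""
    scale = 1 << frac_bits
    alphas = [round(math.atan(2.0 ** -i) * scale) for i in range(n)]
    x, y, z = x0, y0, z0
    for i in range(n):
        x, y, z = cordic_pe(x, y, z, i, alphas[i], mode)
    return x, y, z
```

In hardware every stage would be separated by a pipeline register, so a new operand set could enter on each clock cycle; the loop here only models the data dependencies between stages.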
FIGURE 25.11 (a) Comparison of the FPGA resource requirements (area in LUTs versus number of bits in the datapath) for folded and linear array CORDIC architectures, circular coordinates. (b) Throughput in rotations/vectoring operations per second (versus effective number of result bits) for the two architectures. A normalized clock rate of 1 is assumed.
The parallel design has a performance advantage of approximately an order of magnitude when the number of effective result bits is greater than 10. In an FPGA implementation the advantage is significantly more than this because of the higher clock frequency that can be supported by the linear array compared to the folded processor. With its heavy pipelining, the linear array typically achieves an operating frequency approximately twice that of the folded architecture, so for high-precision calculations (for example, on the order of 24 effective fractional bits or greater) the parallel implementation has a throughput advantage of approximately a factor of 50, which is delivered in a footprint that is only five times that of the folded design.

The add/subtract FUs can be realized using the logic fabric or the 48-bit adder that is resident in each DSP48 tile in the Virtex-4 class of FPGAs. The DSP48 [42] is a dynamically configurable embedded processing block that supports over 40 different op-codes, optimized for signal-processing tasks. The logic fabric approach tends to result in an implementation that operates at a lower clock frequency than a fully pipelined version based on the DSP48. The DSP48-based implementations can operate at very high clock frequencies, in the region of 500 MHz in the fastest “–12” speed-grade parts [40]. However, for a datapath precision of up to 36 bits, three DSP48 tiles are required for each CORDIC iteration (see Figures 25.12 and 25.13). For scenarios where throughput is the overarching requirement, these resource requirements are acceptable.

A potential downside to the use of the DSP48 in this application is that the multiplier colocated with the high-precision adder is not available for use by another function if the adder is used by the CORDIC PE. This is because the input and output ports of the block are occupied supporting the addition/subtraction and there is no I/O available to access other functions (such as the multiplier) in the tile.
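The throughput figure quoted above can be reproduced with simple arithmetic. The sketch below assumes that the folded design needs one clock cycle per iteration and that the number of iterations equals the number of effective result bits; the function name and both assumptions are illustrative rather than taken from the chapter.

```python
def throughput_advantage(n_iterations: int, clock_ratio: float) -> float:
    """Ratio of parallel to folded throughput.

    n_iterations: iterations per CORDIC operation in the folded design
                  (assumed equal to the number of effective result bits).
    clock_ratio:  clock frequency of the linear array divided by that of
                  the folded processor.
    """
    folded = 1.0 / n_iterations        # one result every n clock cycles
    parallel = 1.0 * clock_ratio       # one result per (faster) clock cycle
    return parallel / folded
```

With 24 iterations and a clock ratio of 2, the ratio is 48, in line with the approximate factor of 50 stated in the text.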
FIGURE 25.12 Processing element i of a Virtex-4 DSP48-based CORDIC processor. (Figure legend: the x, y, and z engines are each built from a pipelined DSP48 adder/subtractor with DSP48 internal registers, fabric-based registers, and a wire-based barrel shifter. A finite-state machine generates the adder/subtractor direction $\sigma_i$ under the Mode control: in rotation mode $\sigma_i = 1$ if $z_i \ge 0$ and $-1$ if $z_i < 0$; in vectoring mode $\sigma_i = 1$ if $y_i < 0$ and $-1$ if $y_i \ge 0$.)

FIGURE 25.13 A programmable parallel pipelined CORDIC array based almost entirely on the Virtex-4 DSP48 embedded tile. Each DSP48 has three levels of pipelining. Additional fabric-based registers are included to pipeline the routing between DSP48 tiles.
25.3.4 Scaling Compensation
As highlighted earlier, the rotation mode of the CORDIC algorithm produces a rotation extension (i.e., it increases or decreases the distance of the point from the origin) rather than a pure rotation. The growth associated with circular and hyperbolic coordinate systems is approximately $K_{1,n} \approx 1.6468$ and $K_{-1,n} \approx 0.8282$, respectively. In some applications this growth can be tolerated, and there is no need to perform any compensation. For example, if the vectoring mode is used to map the output vector of a discrete Fourier transform (DFT) from Cartesian to polar coordinates in order to compute a magnitude spectrum, the CORDIC scaling may not be an issue because all terms are similarly scaled. If the CORDIC output is to be further processed, there might be an opportunity to absorb the CORDIC scale factor in the postprocessing circuit. Continuing with the DFT example, if the magnitude spectrum is to be compared with a threshold in order to make a decision about a particular spectral bin, the CORDIC scaling can be absorbed into the threshold value.

If the scaling cannot be tolerated, several scaling compensation techniques are possible. Some approaches employ modified iterations [20, 32, 33] while others exploit alternatives such as online arithmetic [6]. Some methods merge scaling iterations with the basic CORDIC iterations [15], which results in either an area penalty or a time penalty if the basic CORDIC hardware is to be used for both the fundamental updates and the scaling iterations. It is also possible to employ a modified set of elemental angles [9]. The problem of scaling compensation has been examined by many researchers, and many creative and elegant results have been produced; however, the most direct way to accommodate the problem in an FPGA is to employ its embedded multipliers. The architecture of a programmable and scale-compensated CORDIC engine is shown in Figure 25.14.
The Mode control signal defines if a vectoring or rotation operation is to be performed. It essentially controls if the iteration update is guided by the sign of the y or z variable for vectoring or rotation, respectively. The Coordinate_System signal selects the coordinate system for the processor: circular, hyperbolic, or linear. This control line selects the page in memory where the elemental angles are stored: $\tan^{-1} 2^{-i}$, $i = 0, \ldots, n-1$ for circular; $\tanh^{-1} 2^{-i}$, $i = 1, \ldots, n$ for hyperbolic; and $2^{-i}$, $i = 0, \ldots, n-1$ for linear. Coordinate_System also indexes a small memory located in FPGA distributed memory that stores the values $1/K_{m,n}$ for use by the scaling compensation multiplier M1. Naturally, the precision of these constants should be commensurate with the number of effective result bits.
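The constants held in the scaling factor memory can be computed offline. The Python sketch below is an illustration only: the function names are hypothetical, and the repetition of hyperbolic iterations 4, 13, 40, ... is the usual convergence convention for hyperbolic CORDIC, assumed here rather than taken from this section.

```python
import math

def circular_gain(n: int) -> float:
    """K for circular coordinates: product of sqrt(1 + 2^(-2i)), i = 0..n-1."""
    return math.prod(math.sqrt(1.0 + 4.0 ** -i) for i in range(n))

def hyperbolic_gain(n: int) -> float:
    """K for hyperbolic coordinates: product of sqrt(1 - 2^(-2i)), i = 1..n,
    with iterations 4, 13, 40, ... executed twice (assumed convention)."""
    gain, next_repeat = 1.0, 4
    for i in range(1, n + 1):
        term = math.sqrt(1.0 - 4.0 ** -i)
        gain *= term
        if i == next_repeat:           # repeated iteration contributes again
            gain *= term
            next_repeat = 3 * next_repeat + 1
    return gain

def compensate(value: float, gain: float) -> float:
    """Models the multiplier M1: one multiplication by the stored 1/K."""
    return value * (1.0 / gain)
```

For 24 iterations `circular_gain` gives about 1.6468 and `hyperbolic_gain` about 0.8282; multiplying a CORDIC output by the stored reciprocal removes the growth in a single operation.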
FIGURE 25.14 A programmable CORDIC processor with multiplier-based scaling compensation.

25.4 SUMMARY

This chapter provided an overview of the CORDIC algorithm and its implementation in current-generation FPGAs such as the Xilinx Virtex-4 family. The basic set of CORDIC equations was first reviewed, and the utility of this simple shift-and-add-type algorithm was highlighted by the many functions that can be accessed through it. We also highlighted the fact that, while there are many options for architecting math functions in hardware, the CORDIC approach
comes into its own when multi-element input and output vectors are involved. The functional requirements of the angle and cross-addition updates make it an excellent match for FPGAs because of the utility and efficiency with which these devices realize addition and subtraction. Most hardware realizations of the CORDIC algorithm employ fixed-point arithmetic, and this is certainly true of nearly all FPGA implementations. We showed that it is therefore important to understand the effects of quantizing the datapath. While this analysis can be complex [16], the simplified approach first described by Walther [36] is suitable for most applications and provides excellent results.

The FPGA implementation of a CORDIC processor would appear to be straightforward. However, FPGA-embedded functions such as multipliers and the DSP48 provide opportunities for architectural innovation and for design trade-offs that satisfy design requirements. For example, in the implementation of the barrel shifter, logic fabric can be exchanged for embedded multipliers. The wide 48-bit adder in the DSP48 can be used almost as the sole arithmetic building block of a complete fully parallel CORDIC array.
References

[1] J. R. Cavallaro, F. T. Luk. CORDIC arithmetic for an SVD processor. Journal of Parallel and Distributed Computing 5, 1988.
[2] J. R. Cavallaro, F. T. Luk. Floating-point CORDIC for matrix computations. Proceedings of the IEEE International Conference on Computer Design: VLSI in Computers and Processors, October 1988.
[3] L. W. Chang, S. W. Lee. Systolic arrays for the discrete Hartley transform. IEEE Transactions on Signal Processing 29(11), November 1991.
[4] W. H. Chen, C. H. Smith, S. C. Fralick. A fast computational algorithm for the discrete cosine transform. IEEE Transactions on Communications C-25, September 1977.
[5] Cray Research. Cray XD1 Supercomputer, http://www.cray.com/products/xd1/index.html.
[6] H. Dawid, H. Meyer. The differential CORDIC algorithm: Constant scale factor redundant implementation without correcting iterations. IEEE Transactions on Computers 45(3), March 1996.
[7] A. A. J. de Lange, A. J. van der Hoeven, E. F. Deprettere, J. Bu. An optimal floating-point pipeline CMOS CORDIC processor. IEEE Symposium on Circuits and Systems, June 1988.
[8] J. M. Delsme. VLSI implementation of rotations in pseudo-Euclidean spaces. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing 2, 1983.
[9] E. Deprettere, P. Dewilde, R. Udo. Pipelined CORDIC architectures for fast VLSI filtering and array processing. Proceedings of the ICASSP’84, 1984.
[10] A. M. Despain. Very fast Fourier transform algorithms for hardware implementation. IEEE Transactions on Computers C-28, May 1979.
[11] A. M. Despain. Fourier transform computers using CORDIC iterations. IEEE Transactions on Computers 23, October 1974.
[12] C. Dick, F. Harris, M. Rice. FPGA implementation of carrier phase synchronization for QAM demodulators. Journal of VLSI Signal Processing, Special Issue on Field-Programmable Logic (R. Woods, R. Tessier, eds.), Kluwer Academic, January 2004.
[13] D. Ercegovac, T. Lang. Digital Arithmetic, Morgan Kaufmann, 2004.
[14] B. Haller, J. Gotze, J. Cavallaro. Efficient implementation of rotation operations for high-performance QRD-RLS filtering. Proceedings of the International Conference on Application-Specific Systems, Architectures and Processors, July 1997.
[15] G. H. Haviland, A. A. Tuszinsky. A CORDIC arithmetic processor chip. IEEE Transactions on Computers C-29(2), February 1980.
[16] Y. H. Hu. The quantization effects of the CORDIC algorithm. IEEE Transactions on Signal Processing 40, July 1992.
[17] X. Hu, S. C. Bass. A neglected error source in the CORDIC algorithm. IEEE International Symposium on Circuits and Systems 1, May 1993.
[18] X. Hu, S. C. Bass. A neglected error source in the CORDIC algorithm. Proceedings of the IEEE ISCAS, 1993.
[19] X. Hu, R. G. Garber, S. C. Bass. Expanding the range of convergence of the CORDIC algorithm. IEEE Transactions on Computers 40(1), January 1991.
[20] J. Lee. Constant-factor redundant CORDIC for angle calculation and rotation. IEEE Transactions on Computers 41(8), August 1992.
[21] Y. H. Liao, H. E. Liao. CALF: A CORDIC adaptive lattice filter. IEEE Transactions on Signal Processing 40(4), April 1992.
[22] Mathworks, The, http://www.mathworks.com/.
[23] U. Mengali, A. N. D’Andrea. Synchronization Techniques for Digital Receivers, Plenum Press, 1997.
[24] J. Mia, K. K. Parhi, E. F. Deprettere. Pipelined implementation of CORDIC-based QRD-MVDR adaptive beamforming. IEEE Fourth International Conference on Signal Processing, October 1998.
[25] J. M. Muller. Discrete basis and computation of elementary functions. IEEE Transactions on Computers C-34(9), September 1985.
[26] B. Parhami. Computer Arithmetic: Algorithms and Hardware Designs, Oxford University Press, 2000.
[27] K. K. Parhi. VLSI Digital Signal Processing Systems Design and Implementation, John Wiley, 1999.
[28] S. Y. Park, N. I. Cho. Fixed-point error analysis of CORDIC processor based on the Variance Propagation Formula. IEEE Transactions on Circuits and Systems 51(3), March 2004.
[29] J. G. Proakis, M. Salehi. Communication Systems Engineering, Prentice-Hall, 1994.
[30] C. M. Rader. VLSI systolic arrays for adaptive nulling. IEEE Signal Processing Magazine 13(4), July 1996.
[31] T. Y. Sung, Y. H. Hu. Parallel VLSI implementation of Kalman filter. IEEE Transactions on Aerospace and Electronic Systems AES 23(2), March 1987.
[32] N. Takagi. Redundant CORDIC methods with a constant scale factor for sine and cosine computation. IEEE Transactions on Computers 40(9), September 1991.
[33] D. H. Timmerman, B. J. Hosticka, B. Rix. A new addition scheme and fast scaling factor compensation methods for CORDIC algorithms. Integration, the VLSI Journal (11), 1991.
[34] D. H. Timmerman, B. J. Hosticka, G. Schmidt. A programmable CORDIC chip for digital signal processing applications. IEEE Journal of Solid-State Circuits 26(9), September 1991.
[35] J. E. Volder. The CORDIC trigonometric computing technique. IRE Transactions on Electronic Computers 3, September 1959.
[36] J. S. Walther. A unified algorithm for the elementary functions. AFIPS Spring Joint Computer Conference 38, 1971.
[37] Xilinx Inc. Spartan-3E Datasheet, http://www.xilinx.com/xlnx/xweb/xil_publications_display.jsp?iLanguageID=1&category=/Data+Sheets/FPGA+Device+Families/Spartan-3E.
[38] Xilinx Inc. System Generator for DSP, http://www.xilinx.com/ise/optional_prod/system_generator.htm.
[39] Xilinx Inc. Virtex-II Pro Datasheet, http://www.xilinx.com/xlnx/xweb/xil_publications_display.jsp?category=Publications/FPGA+Device+Families/VirtexII+Pro&iLanguageID=1.
[40] Xilinx Inc. Virtex-4 Datasheet, http://www.xilinx.com/xlnx/xweb/xil_publications_display.jsp?sGlobalNavPick=&sSecondaryNavPick=&category=-1210771&iLanguageID=1.
[41] Xilinx Inc. Virtex-4 Multi-Platform FPGA, http://www.xilinx.com/products/silicon_solutions/fpgas/virtex/virtex4/index.htm.
[42] Xilinx Inc. XtremeDSP Design Considerations Guide, http://www.xilinx.com/products/silicon_solutions/fpgas/virtex/virtex4/capabilities/xtremedsp.htm.
[43] Xilinx Inc. XtremeDSP Slice, http://www.xilinx.com/products/silicon_solutions/fpgas/virtex/virtex4/capabilities/xtremedsp.htm.
CHAPTER 26
Hardware/Software Partitioning
Frank Vahid, Greg Stitt
Department of Computer Science and Engineering
University of California–Riverside
Field-programmable gate arrays (FPGAs) excel at implementing applications as highly parallel custom circuits, thus yielding fast performance. However, large applications implemented on a microprocessor may be more size efficient and require less designer effort, at the expense of slower performance. In some cases, mapping an entire application to a microprocessor satisfies performance requirements and so is preferred. In other cases, mapping an application entirely to custom circuits on FPGAs may be necessary to meet performance requirements. In many cases, though, the best implementation lies somewhere between these two extremes. Hardware/software partitioning, illustrated in Figure 26.1, is the process of dividing an application between a microprocessor component (“software”) and one or more custom coprocessor components (“hardware”) to achieve an implementation that best satisfies requirements of performance, size, designer effort, and other metrics.1 A custom coprocessor is a processing circuit that is tailor-made to execute critical application computations far faster than if those computations had been executed on a microprocessor. FPGA technology encourages hardware/software (HW/SW) partitioning by simplifying the job of implementing custom coprocessors, which can be done just by downloading bits onto an FPGA rather than by manufacturing a new integrated circuit or by wiring a printed-circuit board. In fact, new FPGAs even support integration of microprocessors within an FPGA itself, either as separate physical components alongside the FPGA fabric (“hard-core microprocessors”) or as circuits mapped onto the FPGA fabric just like any other circuit (“soft-core microprocessors”). High-end computers have also begun integrating microprocessors and FPGAs on boards, allowing application designers to make use of both resources when implementing applications. Hardware/software partitioning is a hard problem in part because of the large number of possible partitions. 
1 The terms software, to represent microprocessor implementation, and hardware, to represent coprocessor implementation, are common and so appear in this chapter. However, when implemented on FPGAs, coprocessors are actually just as “soft” as programs implemented on a microprocessor, with both consisting merely of a sequence of bits downloaded into a physical device, leading to a broader concept of “software.”

FIGURE 26.1 A diagram of hardware/software partitioning, which divides an application between a microprocessor component (“software”) and custom processor components (“hardware”).

In its simplest form, hardware/software partitioning considers an application as comprising a set of regions and maps
each region to either software or hardware such that some cost criterion (e.g., performance) is optimized while some constraints (e.g., size) are satisfied. A partition is a complete mapping of every region to either hardware or software. Even in this simple formulation, the number of possible partitions can be enormous. If there are n regions and there are two choices (software or hardware) for each one, then there are $2^n$ possible partitions. A mere 32 regions yield over 4 billion possibilities. Finding the optimal partition of this simple form is known to be NP-hard in general. Many other factors contribute to making the problem even harder, as will be discussed.

This chapter discusses issues involved in partitioning an application among microprocessor and coprocessor components. It considers two application categories: sequential programs, where an application is a program written in a sequential programming language such as C, C++, or Java and where partitioning maps critical functions and/or loops to coprocessors; and parallel programs, where an application is a set of concurrently executing tasks and where partitioning maps some of those tasks to coprocessors. While designers today do mostly manual partitioning, automating the process has been an area of active study since the early 1990s (e.g., [10, 15, 26]) and continues to be intensively researched and developed. For that reason, we will begin the chapter with a discussion of the trend toward automatic partitioning.
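The simple formulation above can be made concrete with an exhaustive search, which is only practical for a handful of regions. The Python sketch below is an illustration under strong assumptions: regions execute one after another, times and areas simply add, communication cost is ignored, and the region data and the name `best_partition` are hypothetical.

```python
from itertools import product

def best_partition(regions, max_area):
    """Enumerate all 2^n mappings and return (time, area, choice) for the
    fastest one whose hardware area fits within max_area.

    regions: list of (name, sw_time, hw_time, hw_area) tuples (hypothetical data).
    """
    best = None
    for choice in product(("SW", "HW"), repeat=len(regions)):
        time = sum(r[2] if c == "HW" else r[1] for r, c in zip(regions, choice))
        area = sum(r[3] for r, c in zip(regions, choice) if c == "HW")
        if area <= max_area and (best is None or time < best[0]):
            best = (time, area, choice)
    return best

# Hypothetical example: three regions and an area budget of 1000 gates.
example = [("f1", 10, 2, 500), ("f2", 8, 1, 900), ("f3", 3, 2, 400)]
result = best_partition(example, max_area=1000)
```

The loop body runs $2^n$ times, which is why the 32-region case mentioned in the text (more than 4 billion candidates) already calls for heuristics rather than enumeration.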
26.1 THE TREND TOWARD AUTOMATIC PARTITIONING

Traditionally, designers have manually partitioned applications between microprocessors and custom coprocessors. Manual partitioning was in part necessitated by radically different design flows for microprocessors versus coprocessors. A microprocessor design flow typically involved developing code
in programming languages such as C, C++, or Java. In sharp contrast, a coprocessor design flow may have involved developing cleverly parallelized and/or pipelined datapath circuits, control circuits to sequence data through the datapath, memory circuits to enable rapid data access by the datapath, and then mapping those circuits to a particular ASIC technology. Thus, manual partitioning was necessary because partitioning was done early in the design process, well before a machine-readable or executable description of an application’s desired behavior existed. It resulted in specifications for both the software design and the hardware design teams, both of which might then have worked for many months developing their respective implementations. However, the evolution of synthesis and FPGA technologies is leading toward automated partitioning because the starting point of FPGA design has been elevated to the same level as that for microprocessors, as shown in Figure 26.2. Current technology enables coprocessors to be realized merely by downloading bits onto an FPGA. Downloading takes just seconds and eliminates the months-long and expensive design step of mapping circuits to an ASIC. Furthermore, synthesis tools have evolved to automatically design coprocessors from high-level descriptions in hardware description languages (HDLs), such as VHDL or Verilog, or even in languages traditionally used to program microprocessors, such as C, C++, or Java. Thus, designers may develop a single machine-readable high-level executable description of an application’s desired behavior and then partition that description between microprocessor and coprocessor parts, in a process sometimes called hardware/software codesign. New
approaches, such as SystemC [14], which supports HDL concepts using C++, have evolved specifically to support it. With a single behavior description of an application, and automated tools to convert partitioned applications to coprocessors, automating partitioning is a logical next step in tool evolution. Some commercial automated hardware/software partitioning products are just beginning to appear [4, 7, 21, 27]. In the remainder of the chapter, many of the issues discussed relate to both manual and automatic partitioning, while some relate to automatic partitioning alone.

FIGURE 26.2 The codesign ladder: evolution toward automated hardware/software partitioning due to synthesis tools and FPGA technologies enabling a similar design starting point, and similar implementation manner of downloading bits into a prefabricated device. (Figure content: an application in C/C++/Java/SystemC is partitioned automatically. The microprocessor side proceeds through compilation to assembly code, then assembling and linking to machine code. The coprocessor side proceeds through behavioral synthesis (1990s) to register transfers, RT synthesis (1980s, 1990s) to logic equations/FSMs, and logic synthesis and physical design (1970s, 1980s) to an FPGA bitstream. Both are implemented by downloading.)
26.2 PARTITIONING OF SEQUENTIAL PROGRAMS

In a sequential program, the regions comprising an application’s behavior are defined to execute sequentially rather than concurrently. For example, the semantics of the C programming language are such that its functions execute sequentially (though parallel execution is allowed as long as the results of the computation stay the same). Hardware/software partitioning of a sequential program involves speeding up certain regions by moving them to faster-executing FPGA coprocessors, yielding overall application speedup.

Hardware/software partitioning of sequential programs is governed to a large extent by the well-known Amdahl’s Law [1] (described in 1967 by Gene Amdahl of IBM in the context of discussing the limits of parallel architectures for speeding up sequential programs). Informally, Amdahl’s Law states that application speedup is limited by the part of the program not being parallelized. For example, if 75 percent of a program can be parallelized, the remaining nonparallelized 25 percent of the program limits the speedup to 100/25 = 4 times speedup (usually written as 4x) in the best possible case, even in the ideal situation of zero-time execution of the other 75 percent. Amdahl’s Law has been described more formally using the equation max_speedup = 1/(s + p/n), where p is the fraction of the program execution that can be parallelized; s is the fraction that remains sequential, s + p = 1; n is the number of parallel processors being used to speed up the parallelizable fraction; and max_speedup is the ideal speedup. In the 75 percent example, assuming that n is very large, we obtain max_speedup = 1/(0.25 + 0.75/n) ≈ 1/(0.25 + 0) = 4x.

Amdahl’s Law applies to hardware/software partitioning by providing speedup limits based on the regions not mapped to hardware. For example, if a region accounts for 25 percent of execution but is not mapped to hardware, then the maximum possible speedup obtainable by partitioning is 4x.
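The equation above translates directly into code. In this minimal Python sketch the function name follows the symbol used in the text, and treating n as infinite by default to model the ideal case is an assumption made for convenience.

```python
def max_speedup(p: float, n: float = float("inf")) -> float:
    """Amdahl's Law: ideal speedup when a fraction p of execution is
    parallelized across n processors and s = 1 - p stays sequential."""
    s = 1.0 - p
    denom = s + p / n
    return float("inf") if denom == 0 else 1.0 / denom
```

`max_speedup(0.75)` returns 4.0, matching the 75 percent example, and the result reaches 10 only when p is at least 0.9.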
FIGURE 26.3 Hardware/software partitioning speedup following Amdahl’s Law. (Axes: ideal speedup versus execution mapped to hardware, in percent.)

FIGURE 26.4 Ideal speedups achievable by moving regions (loops) to hardware, averaged for a variety of embedded system benchmark suites (MediaBench, Powerstone, and Netbench). (Axes: cumulative application execution, in percent, versus the 10 most frequent regions.)

Figure 26.3 illustrates that only when regions accounting for a large percentage of execution are mapped to hardware might partitioning yield substantial results. For example, to obtain 10x speedup, partitioning must map to hardware those regions accounting for at least 90 percent of an application’s execution time. Fortunately, most of the execution time for many applications comes from just a few regions. For example, Figure 26.4 shows the average execution time
contribution for the first n regions (in this case loops) for several dozen standard embedded system application benchmarks, all sequential programs. Note that the first few regions account for 75 to 80 percent of the execution time. This distribution roughly follows the well-known informal “90–10” rule, which states that 90 percent of a program’s execution time is spent in 10 percent of its code. Thus, hardware/software partitioning of sequential programs generally must sort regions by their execution percentage and then consider moving the highest contributing regions to hardware. A corollary to Amdahl’s Law is that if a region is moved to hardware, its actual speedup limits the remaining possible speedup. For example, consider a region accounting for 80 percent of execution time that, when moved to hardware, runs only 2x faster than in software. Such a situation is equivalent to half of the region (40 percent of execution) being sped up ideally and the other half (the other 40 percent) not being sped up at all. With 40 percent not sped up, the ideal speedup obtainable by partitioning of the remaining regions (the other 20 percent) is limited
to a mere 100 percent/40 percent = 2.5x. For this reason, hardware/software partitioning of sequential programs generally must focus on obtaining very large speedups of the highest-contributing regions. Amdahl’s Law therefore greatly prunes the solution space that partitioning of sequential programs must consider: good solutions must move the biggest-contributing regions to hardware and greatly speed them up to yield good overall application speedups. Even with this relatively simple view, several issues make the problem of hardware/software partitioning of sequential programs quite challenging. Those issues, illustrated in Figure 26.5(a–e), include determining critical region
granularity (a), evaluating partitions (b), considering multiple alternative implementations of a region (c), determining implementation models (d), and exploring the partitioning solution space (e).

FIGURE 26.5 Hardware/software partitioning: (a) granularity; (b) partition evaluation; (c) alternative region implementations; (d) implementation models; (e) exploration.
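The corollary to Amdahl’s Law discussed earlier in this section (a region’s actual hardware speedup limits what remains to be gained) can be checked numerically. The Python sketch below is illustrative; the function name and the representation of a partition as (fraction, speedup) pairs are assumptions.

```python
def app_speedup(hw_regions):
    """Overall speedup when each listed region, given as a pair
    (fraction_of_execution, speedup_in_hardware), moves to hardware
    and everything else stays in software at its original speed."""
    covered = sum(f for f, _ in hw_regions)
    new_time = (1.0 - covered) + sum(f / s for f, s in hw_regions)
    return 1.0 / new_time

only_big_region = app_speedup([(0.8, 2.0)])                    # 80 percent region at 2x
ideal_rest = app_speedup([(0.8, 2.0), (0.2, float("inf"))])    # remaining 20 percent made free
```

Moving only the 80 percent region at 2x gives about 1.67x overall, and even an infinitely fast implementation of the remaining 20 percent caps the result at 2.5x, as stated in the text.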
26.2.1 Granularity
Partitioning moves some code regions from a microprocessor to coprocessors. A first issue in defining a partitioning approach is thus to determine the granularity of the regions to be considered. Granularity is a measure of the amount of functionality encapsulated by a region, which is illustrated in Figure 26.6. A key trade-off involves coarse versus fine region granularity [11]. Coarser granularity simplifies partitioning by reducing the number of possible partitions, enables more accurate estimates during partitioning by considering more computations when creating those estimates (and thus reducing inaccuracy when combining multiple estimates for different regions into one), and reduces interregion communication. On the other hand, finer granularity may expose better
FIGURE 26.6 The region granularities of an application (top): (a) functions; (b) loops; (c) blocks; (d) heterogeneous combination. Finer granularities may expose better solutions, at the expense of a more complex partitioning problem and more difficult estimation challenges. [Figure not reproduced. The example application consists of two functions: f1(), listing Loop1, Block1, Loop2, and Block2 through Block20, and f2(), listing Loop3, Loop4, Block21 through Block40, and Loop5.]
partitions that would not otherwise be possible. Early automated partitioning research considered fine granularities of arithmetic operations or statements, while more recent work typically considers coarser granularities involving basic blocks, loops, or entire functions.
Coarse granularity simplifies the partitioning problem by reducing the number of possible partitions. Take, for example, an application with two 1000-line C functions, like the one shown in Figure 26.6 (top), and consider partitioning at the granularity of functions, loops, or basic blocks. The granularity of functions involves only two regions, as shown in Figure 26.6(a), and the granularity of loops involves five regions, as shown in Figure 26.6(b). However, the granularity of basic blocks may involve many tens or hundreds of regions, as shown in Figure 26.6(c). If partitioning simply chooses between hardware and software, then two regions would yield $2 \times 2 = 4$ possible partitions, while just 32 regions would yield $2^{32}$ possible partitions, or over four billion.
Coarse granularity also enables more accurate early estimations of a region’s performance, size, power, and so forth. For example, an approach using function granularity could individually presynthesize the two previously mentioned functions to FPGAs before partitioning, gathering performance and size data. During partitioning, it could simply estimate that, for the case of partitioning both functions to the FPGA, the two functions’ performances would stay the same and their sizes would add. This estimate is not entirely accurate because synthesizing both functions could involve interactions between the functions’ implementations that would impact performance and size, but it is likely reasonably accurate.
In contrast, similar presynthesis and performance/size estimates for basic blocks would yield grossly inaccurate values because multiple basic blocks would actually be synthesized into a combined circuit having extensive sharing among the blocks, bearing little resemblance to the individual circuits presynthesized for each block. However, finer granularity may expose better partitions that otherwise would not be possible. In the two-function example just described, perhaps the best partition would move only half of one function to hardware—an option not possible at the coarse granularity of functions but possible at finer granularities of loops or basic blocks. Manual partitioning often involves initially considering a “natural” granularity for an application. An application may consist of dozens of functions, but a designer may naturally categorize them into just a few key high-level functions. A data-processing application, for example, may naturally consist of several key high-level functions: acquire, decompress, transform, compress, and transmit. The designer may first try to partition at that natural granularity before considering finer granularities. Granularity may be restricted to one region type, but can instead be heterogeneous, as shown in Figure 26.6(d). For example, in the previous two-function example from Figure 26.6 (top), one function may be treated as a region while the other may be broken down so that its loops are each considered as a region. A particular loop may even be broken down so that its basic blocks are individually considered as regions. Thus, for a single application, regions considered
for movement to hardware may include functions, loops, and basic blocks. With heterogeneous granularity, preanalysis of the code may select regions based on execution time and size, breaking down a region with very high execution time or large size. Furthermore, while granularity can be predetermined statically, it can also be determined dynamically during partitioning [16]. Thus, an approach might start with coarse-grained regions and then decompose specific regions deemed to be critical during partitioning. Granularity need not be restricted to regions defined by the language constructs, such as functions or loops, used in the original application description. Transformations, some of them well-known compiler transformations, may be applied to significantly change the original description. They include function inlining (replacing a function call with that function’s statements), function “exlining” (replacing statements with a function call), function cloning (making multiple copies of a function for use in different places), function specialization (creating versions of a function with constant parameters), loop unrolling (expanding a loop’s body to incorporate multiple iterations), loop fusion (merging two loops into one), loop splitting (splitting one loop into two), code hoisting and sinking (moving code out of and into loops), and so on.
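As an illustration of heterogeneous granularity, the sketch below decomposes only those regions whose share of execution time exceeds a threshold, leaving the rest coarse. It is a simplified assumption about how such a preanalysis might work, not a method from the text; the `Region` class, the `select_regions` function, the 0.5 threshold, and the time fractions are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Region:
    name: str
    time_fraction: float                 # profiled share of total execution time
    children: list = field(default_factory=list)   # finer-grained subregions


def select_regions(region, threshold):
    """Return candidate regions, breaking down any region whose time share
    exceeds `threshold` and that has finer-grained subregions."""
    if region.time_fraction > threshold and region.children:
        selected = []
        for child in region.children:
            selected.extend(select_regions(child, threshold))
        return selected
    return [region]


# Hypothetical profile: f2 dominates, and Loop3 dominates within f2.
app = Region("app", 1.0, [
    Region("f1", 0.15),
    Region("f2", 0.85, [
        Region("Loop3", 0.70, [Region("Block21", 0.40), Region("Block22", 0.30)]),
        Region("Loop4", 0.10),
        Region("Loop5", 0.05),
    ]),
])
names = [r.name for r in select_regions(app, threshold=0.5)]
```

With these made-up numbers, the candidates are one function (`f1`), two loops (`Loop4`, `Loop5`), and two basic blocks (`Block21`, `Block22`), so a single application yields regions of three different granularities.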
26.2.2 Partition Evaluation
The process of finding a good partition is typically iterative, involving consideration and evaluation of certain partitions and then decisions as to which partitions to consider next. Evaluation determines a partition’s design metric values. A design metric is a measure of a partition. Common metrics include performance, size, and power/energy. Other metrics include implementation cost, engineering cost, reliability, maintainability, and so on. Some design metrics may need to be optimized, meaning that partitioning should seek the best possible value of a metric. Other design metrics may be constrained, meaning that partitioning must meet some threshold value for a metric. An objective function is one that combines multiple metric values into a single number, known as cost, which the partitioning may seek to minimize. A partitioning approach must define the metrics and constraints that can be considered, and define or allow a user to define an objective function. Evaluation can be a complex problem because it must consider several implementation factors in order to obtain accurate design metric values. These factors include, among others, the communication time between regions that transfer data (which requires knowledge of the communication structure) and the clock cycle lengthening caused by multiple application regions sharing hardware resources (which may introduce multiplexers or longer wires). The key trade-off in evaluation involves estimation versus implementation. Estimating design metric values is faster and so enables consideration of more possible partitions. Obtaining the values through implementation is more accurate and thus ensures that partitioning decisions are based on sound evaluations.
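One common way to combine optimized and constrained metrics into a single cost is a weighted sum plus a penalty for each violated constraint. The sketch below shows that idea under stated assumptions: the `cost` function, the penalty form, and the weights are hypothetical choices, and the two metric sets reuse performance and area values that appear in Figure 26.5(e).

```python
def cost(metrics, weights, constraints, penalty=1e6):
    """Weighted sum of the metrics to be minimized, plus a penalty that grows
    with the relative amount by which each constrained metric exceeds its limit."""
    total = sum(weights[name] * metrics[name] for name in weights)
    for name, limit in constraints.items():
        if metrics[name] > limit:
            total += penalty * (metrics[name] - limit) / limit
    return total


# Two candidate partitions (seconds, gates).
fast = {"time": 11.1, "area": 12380}
small = {"time": 16.2, "area": 3418}

weights = {"time": 1.0}              # optimize performance only
loose = {"area": 15000}              # hypothetical FPGA capacities in gates
tight = {"area": 5000}

best_loose = min((fast, small), key=lambda m: cost(m, weights, loose))
best_tight = min((fast, small), key=lambda m: cost(m, weights, tight))
```

With the looser area constraint the faster partition has the lower cost; with the tighter one it is penalized heavily and the smaller partition wins. A penalty term, rather than outright rejection, lets a search heuristic pass through slightly infeasible partitions on its way to feasible ones.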
Estimation involves some characterization of an application’s regions before partitioning and then, during partitioning, quickly combining the characterizations into design metric values. The previous section on granularity discussed how two C function regions could be characterized for hardware by synthesizing each region individually to an FPGA, resulting in a characterization of each region consisting of performance and size data. Then a partition with multiple regions in hardware could be evaluated simply by assuming that each region’s performance is the same as the predetermined performance and by adding any hardware-mapped region sizes together to obtain total hardware size. Estimation for software can be done similarly, using compilation rather than synthesis for characterization. Nevertheless, while estimation typically works well for software [24], the nature of hardware may introduce significant inaccuracy into an estimation approach because multiple regions may actually share hardware resources, thus intertwining their performance and size values [9,18]. Alternatively, implementation as a means of evaluation involves synthesizing actual hardware circuits for a given partition’s hardware regions. Such synthesis thus accounts for hardware sharing and other interdependencies among the regions. However, synthesis is time consuming, requiring perhaps tens of seconds, minutes, or even hours, restricting the number of partitions that can be evaluated. Many approaches exist between the two extremes just described. Estimation can be improved with more extensive characterization, incorporating much more detail than just performance and size. Characterization may, for example, describe what hardware resources a region utilizes, such as two multipliers or 2 Kbytes of RAM. 
Then estimation can use more complex algorithms to combine region characterizations into actual design metric values, for example by accounting for regions that share resources such as multipliers (possibly introducing multiplexers to carry out such sharing) or RAM. These algorithms yield higher accuracy but are still much faster than synthesis. Alternatively, synthesis approaches can be improved by performing a “rough” rather than a complete synthesis, for example by using faster heuristics in place of slower but higher-optimizing ones. Evaluation need not be done in a single exploration loop of partitioning, but can be heterogeneous. An outer exploration loop may be added to partitioning that is traversed less frequently, with the inner exploration loop considering thousands of partitions (if automated) and using estimation for evaluation, while the outer exploration loop considers only tens of partitions that are evaluated more extensively using synthesis. The inner/outer loop concept can of course be extended to even more loops, with the inner loops examining more partitions evaluated quickly and the outer loops performing increasingly in-depth synthesis on fewer partitions. Furthermore, evaluation methods can change dynamically during partitioning. Early stages in the partitioning process may use fast estimation techniques to map out the solution space and narrow in on particular sections of it, while later stages may utilize more accurate synthesis techniques to fine-tune the solution.
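The difference between simple additive estimation and estimation based on richer characterizations can be sketched as follows. This is an illustrative model only: the `Characterization` fields, the per-resource areas, and the multiplexer overhead are hypothetical, and the resource-aware estimator ignores the clock cycle lengthening that sharing may cause.

```python
from dataclasses import dataclass, field


@dataclass
class Characterization:
    sw_time: float
    hw_time: float
    hw_area: int                                   # area excluding shareable units
    resources: dict = field(default_factory=dict)  # e.g. {"multiplier": 2}


def total_time(regions, in_hw):
    """Sequential execution: each region contributes its SW or HW time."""
    return sum(r.hw_time if name in in_hw else r.sw_time
               for name, r in regions.items())


def estimate_additive(regions, in_hw, unit_area):
    """Simple estimate: region values are independent, so sizes just add."""
    area = sum(regions[n].hw_area
               + sum(unit_area[k] * c for k, c in regions[n].resources.items())
               for n in in_hw)
    return total_time(regions, in_hw), area


def estimate_shared(regions, in_hw, unit_area, mux_area=0):
    """Resource-aware estimate: regions that never run concurrently share each
    kind of unit, so only the largest count is needed, plus multiplexers."""
    needed, users = {}, {}
    for n in in_hw:
        for kind, count in regions[n].resources.items():
            needed[kind] = max(needed.get(kind, 0), count)
            users[kind] = users.get(kind, 0) + 1
    area = sum(regions[n].hw_area for n in in_hw)
    for kind, count in needed.items():
        area += unit_area[kind] * count
        if users[kind] > 1:
            area += mux_area * count
    return total_time(regions, in_hw), area


regions = {
    "A": Characterization(10.0, 2.0, 1000, {"multiplier": 2}),
    "B": Characterization(8.0, 1.0, 800, {"multiplier": 1}),
    "C": Characterization(5.0, 4.0, 500),
}
unit_area = {"multiplier": 600}
additive = estimate_additive(regions, {"A", "B"}, unit_area)
shared = estimate_shared(regions, {"A", "B"}, unit_area, mux_area=50)
```

For this made-up example, both estimators report the same runtime, but the additive estimate charges for three multipliers while the resource-aware one charges for two plus multiplexers, giving a smaller area. Either estimate is a few dictionary traversals, which is what makes an inner exploration loop over thousands of partitions practical.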
26.2.3 Alternative Region Implementations
Further adding to the partitioning challenge is the fact that a given region may have alternative region implementations in hardware rather than just one implementation, as assumed in the previous sections. For example, Figure 26.7 (top) shows a particular function that performs 100 multiplications. A fast but large hardware implementation may use 100 multipliers, as shown in Figure 26.7(a). The much smaller but much slower hardware implementation in Figure 26.7(b) uses only 1 multiplier. Numerous implementation alternatives exist between those two extremes, such as having 2 multipliers as in Figure 26.7(c), 10 multipliers, and so on. Furthermore, the function may be implemented in a pipelined or nonpipelined manner. Utilized components may be fast and large (e.g., array-style multipliers or carry-lookahead adders) or small and slow (e.g., shift-and-add multipliers or carry-ripple adders). Many other alternatives exist. A key trade-off involves deciding how many alternative implementations to consider during partitioning. More alternatives expand the number of possible partitions and thus may lead to improved results, but they also enlarge the solution space to be searched tremendously. For example, 8 regions each with one hardware implementation yield $2^8 = 256$ possible partitions. If each
FIGURE 26.7 Alternative region implementations for an original application (top) requiring 100 multiplications: (a) 100 multipliers; (b) 1 multiplier; (c) 2 multipliers. Alternative region implementations may have hugely different performances and sizes. [Figure not reproduced. The application (top) is f() { … for (i = 0; i < 100; i++) c[i] = a[i] * b[i]; … }; panels (b) and (c) each include a controller (Ctrl) alongside the multipliers.]
region instead has 4 possible hardware implementations, then it has 5 possible implementations (1 software and 4 hardware implementations), yielding $5^8$, or more than 390,000, possible partitions. Most automated hardware/software partitioning approaches consider one possible hardware implementation per region. Even then, a question exists as to which one to consider for that region: the fastest, the smallest, or some alternative in the middle? Some approaches do consider multiple alternative implementations, perhaps selecting a small number that span the possible space, such as small, medium, and large [5]. As we saw with granularity and evaluation, the number of alternative implementations considered can also be heterogeneous. Partitioning may consider only one alternative for particular regions and multiple alternatives for other regions deemed more critical. Furthermore, as we saw with granularity and evaluation, the number of alternative implementations can change dynamically as well. Partitioning may start by considering only a few alternatives per region and then consider more for particular regions as partitioning narrows in on a solution. Sometimes obtaining alternative implementations of an application region may require the designer to write several versions of it, each leading to one or more alternatives. In fact, a designer may have to write different region versions for software and hardware because a version that executes quickly in software may execute slowly in hardware, and vice versa. That difference is due to software’s fundamentally sequential execution model, which demands clever sequential algorithms, while hardware’s inherently parallel model demands parallelizable algorithms.
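The growth of the solution space with the number of regions and the number of hardware alternatives per region follows directly from counting independent choices. The short sketch below, with illustrative function names that are not from the text, computes the counts quoted in this chapter and enumerates the assignments for a small case.

```python
from itertools import product


def count_partitions(num_regions, hw_alternatives=1):
    """Each region independently picks software or one of its hardware
    alternatives, giving (1 + hw_alternatives) ** num_regions partitions."""
    return (1 + hw_alternatives) ** num_regions


def enumerate_partitions(region_names, hw_alternatives=1):
    """Yield every assignment of regions to "SW" or "HW1", "HW2", ..."""
    choices = ["SW"] + [f"HW{i}" for i in range(1, hw_alternatives + 1)]
    for assignment in product(choices, repeat=len(region_names)):
        yield dict(zip(region_names, assignment))


two_functions = list(enumerate_partitions(["f1", "f2"]))
counts = {
    "2 regions": count_partitions(2),
    "32 regions": count_partitions(32),
    "8 regions, 1 HW alternative": count_partitions(8),
    "8 regions, 4 HW alternatives": count_partitions(8, hw_alternatives=4),
}
```

Exhaustive enumeration is practical only for the smallest of these cases, which is why the number of alternatives per region is usually kept small or varied selectively for critical regions.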
26.2.4 Implementation Models
Partitioning moves critical microprocessor software regions to hardware coprocessors. Different implementation models define how the coprocessors are integrated with the microprocessor and with one another [6], enlarging the possible solution space for partitioning and greatly impacting performance and size. One implementation model parameter is whether coprocessor execution and microprocessor execution overlap or are mutually exclusive. In the overlapping model, the microprocessor activates a coprocessor and may then continue to execute concurrently with it (if the data dependencies of the application allow). In the mutually exclusive model, the microprocessor waits idly until the coprocessor finishes, at which time the microprocessor resumes execution. Figure 26.8(a) illustrates the execution of both models. Overlapping may improve overall performance, but mutual exclusivity simplifies implementation by eliminating issues related to memory contention, cache coherency, and synchronization—the coprocessor may even access cache directly. In many partitioned implementations, the coprocessor executes for only a small fraction of the total application cycles, meaning that overlapping gains little performance improvement. When the microprocessor and coprocessor cycles are closer to
FIGURE 26.8 Implementation models: (a) mutually exclusive and overlapping; (b) implementation model parameters. [Figure not reproduced. Panel (a) plots microprocessor and FPGA activity over time for the mutually exclusive and overlapping models. Panel (b) shows a microprocessor with cache, memory, bridge, and DMA, annotated with the parameters direct communication, fused, tightly coupled, loosely coupled, and dynamically reconfigurable.]
being equal, overlapping may improve performance, up to a limit of 2 times, of course. Similarly, the execution of coprocessors relative to one another may be overlapped or mutually exclusive. A second implementation model parameter involves communication methods. The microprocessor and coprocessors may communicate through memory and share the same data cache, or the microprocessor may communicate directly with the FPGA through memory-mapped registers, queues, fast serial links, or some combination of those mechanisms. Another implementation model parameter is whether multiple coprocessors are implemented separately or are fused. In a separate coprocessor model, each critical region is synthesized to its own controller and datapath. In a fused model, the critical regions are synthesized into a single controller and datapath. The fused model may reduce size because the hardware resources are shared, but it may result in performance overhead because of a longer critical path as well as the need to run at the slowest clock frequency of all the regions.
Certain coprocessors can be fused and others left separate. Furthermore, fusing need not be complete: two coprocessors can share key components, such as a floating-point unit, but otherwise be implemented separately. Yet another model parameter is whether coprocessors and the microprocessor are tightly or loosely coupled. Tightly coupled coprocessors may coexist on the microprocessor memory bus or may even have direct access to microprocessor registers. Loosely coupled coprocessors may access microprocessor memory through a bridge, adding several cycles to data accesses. Both couplings can coexist in a single implementation. FPGAs add a particularly interesting model parameter to partitioning, dynamic reconfiguration, which replaces an FPGA circuit with another circuit during runtime by swapping in a new FPGA configuration bitstream [2]. In this way, not all of an application’s coprocessors need to simultaneously coexist in the FPGA. Instead, one subset of the application’s required coprocessors may initially be loaded into the FPGA, but, as the application continues to execute, that subset may be replaced by another subset needed later in the application’s execution. Reconfiguration increases the effective size of an FPGA, thus enabling better performance when more application regions are partitioned to it or, alternatively, enabling use of a smaller and hence cheaper FPGA with a runtime overhead required to swap in new bitstreams. In some cases, this overhead may limit the benefits of reconfiguration and should therefore be considered during partition evaluation. Figure 26.8(b) illustrates some of the different implementation model parameters, including communication methods, fused regions, and tightly/loosely coupled coprocessors. Often these parameters are fixed prior to partitioning, but they can also be explored dynamically during partitioning to determine the best implementation model for a given application and given constraints.
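The 2x limit on the benefit of overlapping, noted earlier in this section, can be seen with a very simple timing model. The sketch below assumes the best case, in which data dependencies allow the microprocessor and coprocessor to overlap completely; the function names and the cycle counts are hypothetical.

```python
def exclusive_time(cpu_time, coproc_time):
    """Mutually exclusive model: the microprocessor idles while the
    coprocessor runs, so the two times add."""
    return cpu_time + coproc_time


def overlapped_time(cpu_time, coproc_time):
    """Overlapping model, best case: the shorter activity is fully hidden
    behind the longer one."""
    return max(cpu_time, coproc_time)


def overlap_gain(cpu_time, coproc_time):
    """Speedup of overlapping relative to mutual exclusion (at most 2x)."""
    return exclusive_time(cpu_time, coproc_time) / overlapped_time(cpu_time, coproc_time)


balanced = overlap_gain(50, 50)      # equal cycle counts: the 2x limit
lopsided = overlap_gain(95, 5)       # coprocessor runs 5 percent of the cycles
```

Because the maximum of two nonnegative times is at least half their sum, the ratio can never exceed 2, and it approaches 1 when the coprocessor accounts for only a small fraction of the cycles. That is the case in which the simpler mutually exclusive model costs little performance.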
26.2.5 Exploration
Exploration is the searching of the partition solution space for a good partition. As mentioned before, it is at present mostly a manual task, but automated techniques are beginning to mature. This section discusses automated exploration techniques for various formulations of the partitioning problem.
Simple formulation
A simple and common form of the hardware/software partitioning problem consists of n regions, each having a software runtime value, a hardware runtime value, and a hardware size. It assumes that all values are independent of one another (so if two regions are mapped to hardware, their hardware runtime and size values are unchanged); it assumes that communication times are constant regardless of whether a region is implemented as software or hardware (such as when all regions use the same interface to a shared memory); and it seeks to minimize total application runtime subject to a hardware size constraint (assuming no dynamic reconfiguration).
Although this problem is known to be NP-hard, it can be solved by first mapping it to the well-known 0-1 knapsack problem [20]. The 0-1 knapsack problem involves a knapsack with a specified weight capacity and a set of items, each with a weight and a profit. The goal is to select which items to place in the knapsack such that the total profit is maximized without violating the weight capacity. For hardware/software partitioning, regions correspond to items, the FPGA size constraint corresponds to the knapsack capacity, an implementation’s size corresponds to an item’s weight, and the speedup obtained by implementing a region in hardware instead of software corresponds to an item’s profit. Thus, algorithms that solve the 0-1 knapsack problem solve the simple form of the hardware/software partitioning problem. The 0-1 knapsack problem is NP-hard, but efficient optimal algorithms exist for relatively large problem sizes. One of these is a well-known dynamic programming algorithm [12] having runtime complexity of $O(A \cdot n)$, where A is the capacity and n is the number of items. Alternatively, integer linear programming (ILP) [22] may be used. ILP solvers perform extensive solution space pruning to reduce exploration time. For problems too big for either such optimal technique, heuristics may be utilized. A heuristic finds a good, but not necessarily the optimal, solution, while an algorithm finds the optimal solution. A common heuristic for the 0-1 knapsack problem is a greedy one. A greedy heuristic starts with an initial solution and then makes changes only if they seem to improve the solution. It sorts the items by their ratio of profit to weight and then traverses the sorted list, placing an item in the knapsack if it fits and skipping it otherwise, terminating when reaching the knapsack capacity or when all items have been considered.
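Both the dynamic programming algorithm and the greedy heuristic just described can be sketched in a few lines. The code below is an illustration under stated assumptions rather than the formulation of any cited work: each region is a hypothetical (name, hardware size, time saved) tuple, sizes are integers, and "profit" is taken to be the runtime saved by moving the region to hardware so that profits add.

```python
def knapsack_dp(regions, capacity):
    """Optimal selection by dynamic programming in O(capacity * n) time.
    best[i][cap] is the most time saved using the first i regions
    within `cap` units of hardware area."""
    n = len(regions)
    best = [[0.0] * (capacity + 1) for _ in range(n + 1)]
    for i, (_, size, saved) in enumerate(regions, start=1):
        for cap in range(capacity + 1):
            best[i][cap] = best[i - 1][cap]                  # leave in software
            if size <= cap and best[i - 1][cap - size] + saved > best[i][cap]:
                best[i][cap] = best[i - 1][cap - size] + saved   # move to hardware
    chosen, cap = [], capacity
    for i in range(n, 0, -1):                                # trace back choices
        if best[i][cap] != best[i - 1][cap]:
            name, size, _ = regions[i - 1]
            chosen.append(name)
            cap -= size
    return best[n][capacity], chosen[::-1]


def knapsack_greedy(regions, capacity):
    """Greedy heuristic: sort by profit-to-weight ratio, take whatever fits."""
    ranked = sorted(regions, key=lambda r: r[2] / r[1], reverse=True)
    chosen, used, total_saved = [], 0, 0.0
    for name, size, saved in ranked:
        if used + size <= capacity:
            chosen.append(name)
            used += size
            total_saved += saved
    return total_saved, chosen


# Hypothetical regions: (name, hardware size, runtime saved in hardware)
regions = [("loop1", 4, 40), ("loop2", 3, 25), ("loop3", 3, 24), ("loop4", 1, 2)]
optimal = knapsack_dp(regions, capacity=6)
greedy = knapsack_greedy(regions, capacity=6)
```

On this deliberately awkward example the greedy heuristic commits to `loop1` and then can only add `loop4`, saving 42 time units, while the dynamic programming algorithm finds that `loop2` and `loop3` together save 49. The greedy heuristic's running time is dominated by the sort.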
This heuristic has $O(n \lg n)$ time complexity, allowing for fast automated partitioning of thousands of regions or feasible manual partitioning of tens of regions. Furthermore, the heuristic has been shown to commonly obtain near-optimal results when a few items have a high profit-to-weight ratio. In hardware/software partitioning terms, that situation corresponds to the existence of regions that are responsible for the majority of execution time and require little hardware area, which is often the case.
Formulation with asymmetric communication and greedy/nongreedy automated heuristics
A slightly more complex form of the hardware/software partitioning problem considers cases where communication times between regions change depending on the partitioning, with different required times for communication depending on whether the regions are both in software or both in hardware, or are separated, with one in software and one in hardware. This form of the problem can be mapped to the well-known graph bipartitioning problem. Graph bipartitioning divides a graph into two sets in order to minimize an objective function. Each graph node has two weights, one for each set. Edges may have three different weights: two weights associated with nodes connected in the same set (one weight for each set) and one for nodes connected between sets. Typically, the objective function is to minimize the sum of all node and edge
weights using the appropriate weights for a given partition. Graph bipartitioning is NP-hard. ILP approaches may be used for automatically obtaining optimal solutions to the graph bipartitioning problem. Heuristics may be used when ILP is too time consuming. A simple greedy heuristic for graph bipartitioning starts with some initial partition, perhaps random or all software. It then determines the cost improvement of moving each node from its present set to the opposite set and then moves the node yielding the best improvement. The heuristic repeats these steps until no move yielding an improvement is found. Given n nodes, a basic form of such a heuristic has $O(n^2)$ runtime complexity. Techniques to update the existing cost improvement values can reduce the complexity to $O(n)$ in practice [25]. More advanced heuristics seek to overcome what are known as “local minima,” accepting solution-worsening moves in the hope that they will eventually lead to an even better solution. For example, Figure 26.9 illustrates a heuristic that accepts some solution-worsening changes to escape a local minimum and eventually reach a better solution. A common situation causing a local minimum involves two items such that moving only one item worsens the solution but moving both improves it. A well-known category of nongreedy heuristic used in partitioning is known as group migration [11], which evolved from an initial heuristic by Kernighan and Lin. Like the previous greedy heuristic, group migration starts with an initial partition and determines the cost improvement of moving each node from its present set to the opposite set. The group migration heuristic then moves the node yielding the best improvement (like the greedy heuristic) or yielding the least worsening (including zero cost change) if no improving move exists. Accepting such worsening moves enables local minima to be overcome.
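The greedy and group migration heuristics can be contrasted on the two-node local minimum just described. The sketch below is a simplified illustration, not the published algorithms: the data layout and function names are assumptions, the cost is recomputed from scratch for clarity (so it does not achieve the complexities quoted in the text), and the group migration pass uses the node-locking rule that the text describes next, keeping the best partition seen during each iteration.

```python
def flip(side):
    return "HW" if side == "SW" else "SW"


def partition_cost(nodes, edges, side):
    """Sum of node weights for their assigned set plus edge weights, where an
    edge uses its "SW" or "HW" weight if both ends share a set, else "cross"."""
    total = sum(nodes[n][side[n]] for n in nodes)
    for (u, v), w in edges.items():
        total += w[side[u]] if side[u] == side[v] else w["cross"]
    return total


def best_move(nodes, edges, side, candidates):
    """Return (node, gain) for the single move with the largest cost reduction."""
    base = partition_cost(nodes, edges, side)
    best_node, best_gain = None, None
    for n in candidates:
        trial = dict(side)
        trial[n] = flip(side[n])
        gain = base - partition_cost(nodes, edges, trial)
        if best_gain is None or gain > best_gain:
            best_node, best_gain = n, gain
    return best_node, best_gain


def greedy_bipartition(nodes, edges, side):
    """Repeatedly make the best improving move; stop when none improves."""
    side = dict(side)
    while True:
        node, gain = best_move(nodes, edges, side, sorted(nodes))
        if node is None or gain <= 0:
            return side
        side[node] = flip(side[node])


def group_migration(nodes, edges, side):
    """Each iteration moves every node exactly once (best or least-worsening
    move first, then lock the node); repeat while an iteration improves."""
    best, best_cost = dict(side), partition_cost(nodes, edges, side)
    improved = True
    while improved:
        improved = False
        current, unlocked = dict(best), set(nodes)
        while unlocked:
            node, _ = best_move(nodes, edges, current, sorted(unlocked))
            current[node] = flip(current[node])
            unlocked.remove(node)                  # lock the moved node
            cost_now = partition_cost(nodes, edges, current)
            if cost_now < best_cost:
                best, best_cost, improved = dict(current), cost_now, True
    return best


# Two regions that communicate heavily: moving only one to hardware is worse.
nodes = {"a": {"SW": 10, "HW": 4}, "b": {"SW": 10, "HW": 4}}
edges = {("a", "b"): {"SW": 0, "HW": 0, "cross": 8}}
start = {"a": "SW", "b": "SW"}
greedy_result = greedy_bipartition(nodes, edges, start)
migration_result = group_migration(nodes, edges, start)
```

Starting from all software (cost 20), every single move raises the cost to 22, so `greedy_bipartition` stops immediately. `group_migration` accepts the worsening first move, and the second move then brings the cost down to 8 with both regions in hardware.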
Of course, a heuristic that always accepts the least-worsening move would never terminate, so group migration ensures termination by locking a node after it is moved. Group migration moves each node exactly once in what is referred to as an iteration, and an iteration has complexity of $O(n^2)$ (or $O(n)$ if clever techniques are used to update cost improvements after each
FIGURE 26.9 Solution-worsening moves accepted by a nongreedy heuristic to escape local minima and find better solutions. [Figure not reproduced. The plot shows the objective function over the considered sequence of changes, marking a local minimum and a later, better solution.]
move). If an iteration ultimately leads to an improvement, then group migration runs another iteration. In practice, only a few iterations, typically fewer than five, can be run before no further improvement can be found. The previous discussions of heuristics ignore the time required by partition evaluation. The heuristics therefore may have even higher runtime complexity unless care is taken to incorporate fast incremental evaluation updates during exploration.
Complex formulations and powerful automated heuristics
Increasingly complex forms of the hardware/software partitioning problem integrate more parameters related to the earlier mentioned issues: granularity, evaluation, alternative region implementations, and implementation models. For example, the earlier mentioned dynamic granularity modifications, such as decomposing a given region into smaller regions, or even applying transformations to an application such as function inlining, can be applied during partitioning. The partitioning problem can consider different couplings of coprocessors, may also consider coprocessor fusing, and can support dynamic reconfiguration. When one considers the multitude of possible parameters that can be integrated with partitioning, the size of the solution space is mind-boggling. Searching that space for the best solution becomes a tremendous combinatorial optimization challenge, likely requiring long-running search heuristics. At this point, it may be interesting to note that hardware/software partitioning brings together two previously separate research fields: compilers and CAD (computer-aided design). Compilation techniques tend to emphasize a quick series of transformations applied to an application’s description. In contrast, CAD techniques tend to emphasize a long-running iterative search of enormous solution spaces.
One possible reason for these different perspectives is that compilers were generally expected to run quickly, in seconds or at most minutes, because they were part of a design loop in which compilation was applied perhaps dozens or hundreds of times a day as programs were developed. In contrast, CAD optimization techniques were part of a much longer design loop. Running CAD optimization tools for hours or even days was perfectly acceptable because that time was still small compared to the weeks or months required to manufacture chips. Furthermore, the very nature of coprocessor design meant that a designer was extremely interested in high performance, so longer tool runtimes were acceptable if they optimized an implementation. Hardware/software partitioning merges compilation and synthesis into a single framework. In some cases, compiler-like runtimes of seconds must be achieved. In other cases, CAD-like runtimes of hours may be acceptable. Approaches to partitioning may span that range. Highly complex partitioning formulations will likely require moving away from the fast linear time algorithms and heuristics described earlier and toward longer-running powerful search heuristics. A popular powerful and general search heuristic is simulated annealing [17]. The simulated annealing heuristic starts with a random solution and then randomly makes some change to it, perhaps moving a region between software
and hardware, choosing an alternative implementation for a particular region, decomposing a particular region into finer-grained regions, performing a transformation on the original regions, and so forth, and evaluates the cost (as determined by an objective function) of the new partition obtained from that change. If the change improves the cost, it is accepted (i.e., the change is made). If the change worsens the cost, the seemingly “bad” change is accepted with some probability. The key feature of simulated annealing is that the probability of accepting a seemingly bad move decreases as the approach proceeds, with the pattern of decrease determined by parameters provided to the annealing process, so that the search eventually narrows in on a good solution. Simulated annealing typically must evaluate many thousands or millions of solutions in order to arrive at a good one and thus requires very fast evaluation methods. The complexity of simulated annealing is generally dependent on the problem instance. With properly set parameters, it can achieve near-optimal solutions on very large problems in long but acceptable runtimes. Faster machines have made simulated annealing an increasingly acceptable search heuristic for a wider variety of problems; it can complete in just seconds for many problem instances. The simulated annealing heuristic is known as a neighborhood search heuristic because it makes local changes to an existing solution. Tabu search [13] is an effective method for improving neighborhood search. Tabu, meaning “forbidden,” refers to a list of recently seen solutions that the search maintains. When considering a change to an existing solution, it disregards any change that would yield a solution on the Tabu list. This prevents cycling among the same solutions and has been shown to yield improved results in less time.
The Tabu list concept can also be applied on a broader scale, maintaining a long-term history of considered solutions in order to increase solution diversity. Tabu search can improve neighborhood search heuristic runtimes during hardware/software partitioning by a factor of 20 [8].
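The simulated annealing loop described above can be sketched compactly. The code below is a minimal illustration under several assumptions that are not from the text: a geometric cooling schedule, the exponential acceptance rule, a neighborhood consisting only of moving one region between software and hardware, an all-software starting point rather than a random one, and hypothetical region data and penalty. It omits the Tabu list.

```python
import math
import random


def simulated_annealing(initial, cost, neighbor, t_start=10.0, t_end=0.01,
                        alpha=0.95, moves_per_temp=50, seed=0):
    """Accept improving changes always and worsening changes with probability
    exp(-delta / temperature), which shrinks as the temperature is lowered."""
    rng = random.Random(seed)
    current, current_cost = initial, cost(initial)
    best, best_cost = current, current_cost
    temperature = t_start
    while temperature > t_end:
        for _ in range(moves_per_temp):
            candidate = neighbor(current, rng)
            candidate_cost = cost(candidate)
            delta = candidate_cost - current_cost
            if delta <= 0 or rng.random() < math.exp(-delta / temperature):
                current, current_cost = candidate, candidate_cost
                if current_cost < best_cost:
                    best, best_cost = current, current_cost
        temperature *= alpha                      # geometric cooling (assumed)
    return best, best_cost


# Hypothetical regions: (software time, hardware time, hardware area)
REGIONS = [(40.0, 4.0, 5000), (25.0, 5.0, 3000), (20.0, 6.0, 4000), (5.0, 4.0, 2500)]
AREA_LIMIT = 9000


def partition_cost(in_hw):
    """Total runtime plus a penalty proportional to any area overshoot."""
    time = sum(hw if flag else sw for flag, (sw, hw, _) in zip(in_hw, REGIONS))
    area = sum(a for flag, (_, _, a) in zip(in_hw, REGIONS) if flag)
    return time + 1000.0 * max(0, area - AREA_LIMIT) / AREA_LIMIT


def flip_one(in_hw, rng):
    """Neighbor: move one randomly chosen region between software and hardware."""
    i = rng.randrange(len(in_hw))
    return in_hw[:i] + (not in_hw[i],) + in_hw[i + 1:]


all_software = (False,) * len(REGIONS)
best, best_cost = simulated_annealing(all_software, partition_cost, flip_one)
```

The loop calls `partition_cost` once per candidate, several thousand times for these parameter values, which shows why the text stresses very fast evaluation. A richer `neighbor` function could also switch between alternative implementations or decompose a region, at the price of a larger neighborhood to explore.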
Other issues
Because implementing an application as software generally requires a smaller size and less designer effort, most approaches to exploration start with an all-software implementation and then explore the mapping of critical application regions to hardware. However, in some cases, such as when the application is written specifically for hardware, an approach may start with an all-hardware implementation and then move noncritical application regions to software to reduce hardware size. Furthermore, when an application is originally written for software implementation, some of its regions may not be suitable for hardware implementation. For example, application regions that utilize recursive function calls, pointer-based data structures, or dynamic memory allocation may not be easy to implement as a hardware circuit. Some research efforts are beginning to address these problems by developing new synthesis techniques that support a wider range of program constructs and behavior. Alternatively, designers sometimes
write (or rewrite) critical regions such that those regions are well suited for circuit implementation.
26.3
PARTITIONING OF PARALLEL PROGRAMS
In parallel programs, the regions that make up an application are defined to execute concurrently, as opposed to sequentially. Such regions are often called tasks or processes. For some applications, expressing behavior using tasks may result in a more parallel implementation and hence in faster application performance. For example, an MPEG2 decoder may be described as several tasks, such as motion compensation, dequantization, and inverse discrete cosine transform, that can be implemented in a pipelined manner. Numerous parallel programming models have been considered for hardware/software partitioning, including synchronous dataflow, dynamic dataflow, Kahn process networks, and communicating sequential processes, among others.
26.3.1
Differences among Parallel Programming Models
While hardware/software partitioning of parallel programs has many similarities to partitioning for sequential programs, several key differences exist.
Granularity
Partitioning of parallel programs typically treats each task as a region, meaning that the granularity is quite coarse. In some cases, decomposing a task into finer granularity may be considered.
Evaluation
Parallel programs often involve multiple performance constraints, with particular tasks or sets of tasks having unique performance constraints of their own. Furthermore, estimations of performance must consider the scheduling of tasks on processors, which is not an issue for sequential programs because regions in these programs are not concurrent.
Alternative region implementations
Given the coarse granularity of tasks, considering alternative implementations becomes even more important, as the variations among the alternatives can be huge.
Implementation models
Because tasks are inherently concurrent, partitioning of parallel programs typically uses parallel execution models in their implementations, meaning that microprocessors and coprocessors run concurrently rather than mutually exclusively and meaning that coprocessors may be arranged to form high-level pipelines. Partitioning of parallel programs is less likely to consider fusing multiple coprocessors into one because fusing eliminates concurrency.
Parallel program partitioning introduces a new aspect to exploration: scheduling. When mapping multiple tasks to a single microprocessor, partitioning must carry out the additional step of scheduling to determine when each task will execute. Scheduling tasks to meet performance constraints is known as real-time scheduling and is a heavily studied problem [3]. Including partitioning during scheduling results in a more complex problem. Such partitioning often considers more than just one microprocessor as well and even different types of microprocessors. It may even consider different numbers and types of memories and different bus structures connecting memories to processors. Parallel partitioning must also pay more attention to the data storage requirements between processors. Queues may be introduced between processors, the sizes of those queues must be determined, and their implementation (e.g., in shared memory or in separate hardware components) must be decided.
Exploration
More complex issues in the hardware/software partitioning problem (such as scheduling, different granularities, different evaluation methods, alternative region implementations, and different numbers and connections of microprocessors/memories/buses) require more complex solution approaches. Most modern automatic partitioning research considers one or a few extensions to basic hardware/software partitioning and develops custom heuristics to solve the new formulations in fast compiler-like runtimes. However, as more complex forms of partitioning are considered, more powerful search heuristics with longer runtimes, such as simulated annealing or search algorithms tuned to the problem formulation, may be necessary.
26.4
SUMMARY AND DIRECTIONS
Developing an approach for hardware/software partitioning requires the consideration of granularity, evaluation, alternative region implementations, implementation models, exploration, and so forth, and each such issue involves numerous options. The result is a tremendously large partition solution space and a huge variety of approaches to finding good partitions. While much research into automated hardware/software partitioning has occurred over the past decades, most of the problem’s more complex formulations have yet to be considered. A key future challenge will be the development of effective partitioning approaches for these increasingly complex formulations. As FPGAs continue to enter mainstream embedded, desktop, and server computing, incorporating automated hardware/software partitioning into standard software design flows becomes increasingly important. One approach to minimizing the disruption of standard software design flows is to incorporate partitioning as a backend tool that operates on a final binary, allowing continued use of existing programming languages and compilers and supporting the use
of assembly and even object code. Such binary-level partitioning [23] requires powerful decompilation methods to recover high-level regions such as functions and loops. Binary-level partitioning even opens the door for dynamic partitioning, wherein on-chip tools transparently move software regions to FPGA coprocessors, making use of new lean, just-in-time compilers for FPGAs [19].
References
[1] G. Amdahl. Validity of the single processor approach to achieving large-scale computing capabilities. Proceedings of the AFIPS Spring Joint Computer Conference, 1967.
[2] J. Burns, A. Donlin, J. Hogg, S. Singh, M. De Wit. A dynamic reconfiguration runtime system. Proceedings of the Symposium on FPGA-Based Custom Computing Machines, 1997.
[3] G. Buttazzo. Hard Real-time Computing Systems: Predictable Scheduling Algorithms and Applications, Kluwer Academic, 1997.
[4] S. Chappell, C. Sullivan. Handel-C for co-processing and co-design of field programmable system on chip. Proceedings of Workshop on Reconfigurable Computing and Applications, 2002.
[5] K. Chatha, R. Vemuri. An iterative algorithm for partitioning, hardware design space exploration and scheduling of hardware-software systems. Design Automation for Embedded Systems 5(3–4), 2000.
[6] K. Compton, S. Hauck. Reconfigurable computing: A survey of systems and software. ACM Computing Surveys 34(2), 2002.
[7] CriticalBlue. http://www.criticalblue.com.
[8] P. Eles, Z. Peng, K. Kuchcinski, A. Doboli. System level hardware/software partitioning based on simulated annealing and tabu search. Design Automation for Embedded Systems 2(1), 1997.
[9] R. Enzler, T. Jeger, D. Cottet, G. Tröster. High-level area and performance estimation of hardware building blocks on FPGAs. Lecture Notes in Computer Science 1896, 2000.
[10] R. Ernst, J. Henkel. Hardware-software codesign of embedded controllers based on hardware extraction. Proceedings of the International Workshop on Hardware/Software Codesign, 1992.
[11] D. Gajski, F. Vahid, S. Narayan, J. Gong. Specification and Design of Embedded Systems, Prentice-Hall, 1994.
[12] P. C. Gilmore, R. E. Gomory. The theory and computation of knapsack functions. Operations Research 14, 1966.
[13] F. Glover. Tabu search, part I. Operations Research Society of America Journal on Computing 1, 1989.
[14] T. Grötker, S. Liao, G. Martin, S. Swan. System Design with SystemC. Springer-Verlag, 2002.
[15] R. Gupta, G. De Micheli. System-level synthesis using re-programmable components. Proceedings of the European Design Automation Conference, 1992.
[16] J. Henkel, R. Ernst. A hardware/software partitioner using a dynamically determined granularity. Design Automation Conference, 1997.
[17] S. Kirkpatrick, C. Gelatt, M. Vecchi. Optimization by simulated annealing. Science 220(4598), May 1983.
[18] Y. Li, J. Henkel. A framework for estimation and minimizing energy dissipation of embedded HW/SW systems. Design Automation Conference, 1998.
[19] R. Lysecky, G. Stitt, F. Vahid. Warp processors. Transactions on Design Automation of Electronic Systems 11(3), 2006.
[20] S. Martello, P. Toth. Knapsack Problems: Algorithms and Computer Implementations, Wiley, 1990.
[21] Poseidon Design Systems, Inc. http://www.poseidon-systems.com/index.htm.
[22] A. Schrijver. Theory of Linear and Integer Programming, Wiley, 1998.
[23] G. Stitt, F. Vahid. New decompilation techniques for binary-level co-processor generation. Proceedings of the International Conference on Computer-Aided Design, 2005.
[24] K. Suzuki, A. Sangiovanni-Vincentelli. Efficient software performance estimation methods for hardware/software codesign. Design Automation Conference, 1996.
[25] F. Vahid, D. Gajski. Incremental hardware estimation during hardware/software functional partitioning. IEEE Transactions on VLSI Systems 3(3), 1995.
[26] F. Vahid, D. Gajski. Specification partitioning for system design. Design Automation Conference, 1992.
[27] XPRES Compiler. http://www.tensilica.com/products/xpres.htm.
PART
V
CASE STUDIES OF FPGA APPLICATIONS
Parts I through IV covered technologies and techniques for creating efficient FPGA-based solutions to important problems. Part V focuses on specific, important field-programmable gate array (FPGA) applications, presenting case studies of interesting uses of reconfigurable technology. While this is by no means an exhaustive survey of all applications done on FPGAs, these chapters do contain several very interesting representative points in this space. They can be read in any order, and can even be interspersed with other chapters of this book. This introduction should help readers identify the concepts the case studies cover and the chapters each help to illustrate. To understand the case studies, a basic knowledge of FPGAs (Chapter 1), CAD tools (Chapters 6, 13, 14, and 17), and application development (Chapter 21) is required. Chapter 27 presents a high-performance image compression engine optimized for satellite imagery. This is a streaming signal-processing application (Chapters 5, 8, and 9), a type of computation that typically maps well to reconfigurable devices. In this case, the system saw speedups of approximately 400 times, for which the authors had to optimize the algorithm carefully, considering memory bandwidth (Chapter 21), conversion to fixed point (Chapter 23), and alteration of the algorithm to eliminate sequential dependencies. Chapter 28 focuses on automatic target recognition, which is the detection of regions of interest in military synthetic aperture radar (SAR) images. Like the compression engine in Chapter 27, this represents a very complex, streaming signal-processing application. It also is one of the most influential applications of runtime-reconfiguration (Chapters 4 and 21), where a large circuit is time-multiplexed onto a single FPGA, enabling it to reuse the same silicon multiple times. 
This was necessary because the possible targets to be detected were represented by individual custom, instance-specific circuits (Chapter 22), the huge number of which was too large for the available FPGAs. Chapter 29 discusses Boolean satisfiability (SAT) solving—the determination of whether there is an assignment of values to variables that
makes a given Boolean equation true (satisfied). SAT is a fairly general optimization technique that is useful in, for example, chip testing, formal verification, and even FPGA CAD flows. This work on solving Boolean equations via FPGAs is an interesting application of instance-specific circuitry (Chapter 3) because each equation to be solved was compiled directly into FPGA logic. However, this meant that the runtime of the CAD tools was part of the time needed to solve a given Boolean equation, creating a strong push toward faster CAD algorithms for FPGAs (Chapter 20). Chapter 30 covers logic emulation, the prototyping of complex integrated circuits on huge boxes filled with FPGAs and programmable interconnect chips. This is one of the most successful applications of multi-FPGA systems (Chapter 3) because the translation of a single ASIC into FPGA logic necessitates hundreds to thousands of FPGAs to provide adequate logic capacity. Fast mapping tools for such systems are also important (Chapter 20). In Chapter 23 we discussed methods for eliminating (or at least minimizing) the amount of floating-point computation in FPGA designs by converting floating-point operations to fixed point. However, there are situations where floating point is unavoidable. Scientific computing codes often depend on floating-point values, and many users require that the FPGA-based implementation provide exactly the same results as those of a processor-based solution. These situations require full floating-point support. In other cases, the high dynamic range of values might make fixed-point computations untenable. Chapter 31 considers the development of a library of floating-point units and their use in applications such as FFTs.
Chapter 32 covers a complex physical simulation application—the finite difference time domain (FDTD) method, which is a way of modeling electromagnetic signals in complex situations that can be very useful in applications such as antenna design and breast cancer detection. The solution involves a large-scale cellular automata (Chapter 5) representation of the space to be modeled and an iterative solver. The key to achieving a high-performance implementation on FPGAs, however, involves conversion to fixed-point arithmetic (Chapter 23), simplification of complex mathematical equations, and careful consideration of the memory bottlenecks in the system (Chapter 21). Chapter 33 discusses an alternative to traditional design flow for creating FPGA mappings in which the FPGA is allowed to evolve its own configuration. Because the FPGA is reprogrammable, a genetic optimization system can simply load into it random configurations and see how well they function. Those that show promise are retained; those that do
not are removed. Through mutation and breeding, new configurations are created and evaluated in the same way, slowly evolving better and better computations. The hope is that such a system can support important classes of computation with circuits significantly more efficient than standard design flows. This design strategy exploits special features of the FPGA’s reprogrammability and flexibility (Chapter 4). Some of the chapters in this section focus on streaming digital signal processing (DSP) applications. Such applications often benefit from FPGA logic because of their amenability to pipelining and because of the large amount of data parallelism inherent in the computation. Network processing and routing is another such application domain. Chapter 34 considers packet processing, the application of FPGA logic to network filtering and related tasks. Heavy pipelining of circuits onto the reconfigurable fabric and optimization of custom boards to network processing (Chapter 3) support very high-bandwidth networking. However, because the system retains the flexibility of FPGA logic, new computations and new filtering techniques can be easily accommodated within the system. This ability to incrementally adjust, tune, and invent new circuits provides a valuable capability even in a field as rapidly evolving as network security. For many applications, memory access to a large set of state, rather than computational throughput, can be the bottleneck. Chapter 35 explores an object-oriented, data-centric model (Chapter 5) based on adding programmable or reprogrammable logic into DRAM memories. The chapter emphasizes custom-reprogrammable chips (Chapter 2) and explores both FPGA and VLIW implementation for the programmable logic. Nevertheless, much of the analysis and techniques employed can also be applied to modern FPGAs with large, on-chip memories.
CHAPTER
27
SPIHT IMAGE COMPRESSION Thomas W. Fry Samsung, Global Strategy Group
Scott Hauck Department of Electrical Engineering University of Washington
This chapter describes the process of mapping the image compression algorithm SPIHT onto a reconfigurable logic architecture. A discussion of why adaptive logic is required, as opposed to an ASIC, is provided, along with background material on SPIHT. Several discrete wavelet transform hardware architectures are analyzed and evaluated. In addition, two major modifications to the original image compression algorithm, which are required in order to build a reconfigurable hardware implementation, are presented: (1) the storage elements necessary for each wavelet coefficient, and (2) a modification to the original SPIHT algorithm created to parallelize the computation. Also discussed are the effects these modifications have on the final compression results and the trade-offs involved. The chapter then describes how the updated SPIHT algorithm is mapped onto the Annapolis Microsystems WildStar reconfigurable hardware system. This system is populated with three Virtex-E field-programmable gate array (FPGA) parts and several memory ports. The issues of how the modified algorithm is divided between individual FPGA parts and how data flows through the memories are discussed. Lastly, final results and speedups are presented and evaluated against a comparable microprocessor solution from the time the Annapolis Microsystems WildStar was released.
27.1
BACKGROUND
As NASA deploys each new generation of satellites with more sensors, capturing an ever-larger number of spectral bands, the volume of data being collected begins to outstrip a satellite’s ability to transmit data back to Earth. For example, the Terra satellite contains five separate sensors, each collecting up to 36 individual spectral bands. The Tracking and Data Relay Satellite System (TDRSS) ground terminal in White Sands, New Mexico, captures data from these sensors at a limited rate of 150 Mbps [19]. As the number of sensors on a satellite grew and transmission rates increased, this bandwidth limitation became a driving force for NASA to study methods of compressing images prior to downlinking.
FPGAs are an attractive implementation medium for such a system. Software solutions suffer from performance limitations and power requirements. At the same time, traditional hardware platforms lack the flexibility needed for postlaunch modifications. After launch, such fixed hardware systems cannot be modified to use newer compression schemes or even to implement bug fixes. In the past, modification of fixed systems in satellites proved to be very expensive [4]. By implementing an image compression kernel in a reconfigurable system, we overcame these shortcomings. Because such a system may be reprogrammed after launch, it does not suffer from conventional hardware’s inherent inflexibility. At the same time, the algorithm is computing in custom hardware and can perform at the required processing rates while consuming less power than a traditional software implementation. This chapter describes the work performed as part of a NASA-sponsored investigation into the design and implementation of a space-bound FPGA-based hyperspectral image compression machine. For this work, the Set Partitioning in Hierarchical Trees (SPIHT) routine was selected as the image compression algorithm. First, we describe the algorithm and discuss the reasons for its selection. Then we describe how the algorithm was optimized for implementation in a specific hardware platform and we present the results.
27.2
SPIHT ALGORITHM
SPIHT is a wavelet-based image compression coder. It first converts an image into its wavelet transform and then transmits information about the wavelet coefficients. The decoder uses the received signal to reconstruct the wavelet and then performs an inverse transform to recover the image. SPIHT was selected because both it and its predecessor, the embedded zerotree wavelet coder, were significant breakthroughs in still-image compression. Both offered significantly improved quality over other image compression techniques such as vector quantization, JPEG, and wavelets combined with quantization, while not requiring training that would have been more difficult to implement in hardware. In short, SPIHT displays exceptional characteristics over several properties all at once [15]:
• Good image quality with a high peak-signal-to-noise ratio (PSNR).
• Fast coding and decoding.
• A fully progressive bitstream.
• Can be used for lossless compression.
• May be combined with error protection (useful in satellite transmissions).
• Ability to code for an exact bitrate or PSNR.
In addition, since the SPIHT algorithm processes an image in two distinct steps—the discrete wavelet transform phase and the coding phase—it provides a natural point at which a hardware implementation may be divided. (The advantage of this property will be seen in Section 27.4.) The rest of this section
describes the basics of wavelets, the discrete wavelet transform, and the SPIHT coding engine.
27.2.1
Wavelets and the Discrete Wavelet Transform
The wavelet transform is a reversible transform on spatial data. The discrete wavelet transform (DWT) is a form appropriate to discrete data, such as the individual points or pixels in an image. DWT runs a high-pass and low-pass filter over the signal in one dimension. This produces a low-pass (“average”) version of the data and a high-pass (rapid changes within the average) version. Every other result from each pass is then sampled, yielding two subbands, each of which is one-half the size of the input stream. The result is a new image comprising a high-pass and a low-pass subband. These two subbands can be used to fully recover the original image. In the case of a multidimensional signal such as an image, this procedure is repeated in each dimension (Figure 27.1). The vertical and horizontal transformations break up the image into four distinct subbands. The wavelet coefficients that correspond to the fine details are the LH, HL, and HH subbands. Lower frequencies are represented by the LL subband, which is a low-pass filtered version of the original image [17]. The next wavelet level is calculated by repeating the horizontal and vertical transformations on the LL subband from the previous level. Four new subbands are created from the transformations. The LH, HL, and HH subbands in the next level represent coarser-scale coefficients and the new LL subband is an even smoother version of the original image. It is possible to obtain coarser and coarser scales of the LH, HL, and HH subbands by iteratively repeating the wavelet transformation on the LL subband of each level. Figure 27.2 displays the subband components of an image with three scales of wavelet transformation. The reverse transformation uses an inverse filter on the final LL subband and the LH, HL, and HH subbands at the same level to recreate the LL subband of the previous level. By iteratively processing each level, the original image may be restored.
Figure 27.3 displays a satellite image of San Francisco and its corresponding 3-level DWT. By processing either the wavelet transform or the inverse wavelet transform, these two images may be converted from one into the other and thus may be viewed as equivalent.
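The filter, downsample, and repeat-per-dimension procedure described above can be sketched in Python. The sketch assumes the simple unnormalized Haar filter pair (pairwise average and pairwise half-difference) purely for illustration; the text here does not say which filters the SPIHT implementation uses, and the function names are hypothetical.

```python
def haar_1d(signal):
    """One DWT level on an even-length sequence: low-pass half, then high-pass half."""
    low = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    high = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return low + high

def inverse_haar_1d(coeffs):
    """Undo haar_1d: rebuild each sample pair from its average and difference."""
    half = len(coeffs) // 2
    out = []
    for avg, diff in zip(coeffs[:half], coeffs[half:]):
        out.extend([avg + diff, avg - diff])
    return out

def transpose(matrix):
    return [list(row) for row in zip(*matrix)]

def dwt2_level(image):
    """One 2-D level: horizontal pass over rows, then vertical pass over columns."""
    rows = [haar_1d(row) for row in image]
    cols = [haar_1d(col) for col in transpose(rows)]
    return transpose(cols)        # LL subband ends up in the top-left quadrant

def idwt2_level(coeffs):
    """Inverse of dwt2_level: undo the vertical pass, then the horizontal pass."""
    cols = [inverse_haar_1d(col) for col in transpose(coeffs)]
    return [inverse_haar_1d(row) for row in transpose(cols)]

if __name__ == "__main__":
    image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
    w = dwt2_level(image)
    print(w[0][0])                      # average of the top-left 2 x 2 block
    print(idwt2_level(w) == image)      # the transform is reversible
```

A further wavelet level would apply `dwt2_level` again to the top-left (LL) quadrant only, which is the iteration the text describes.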
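FIGURE 27.1. A 1-level wavelet built by two one-dimensional passes: (a) original image, (b) horizontal pass, and (c) vertical pass. (Diagram: low-pass and high-pass filters, each followed by downsampling by 2, produce the L and H halves in the horizontal pass and the LL, LH, HL, and HH subbands in the vertical pass.)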
FIGURE 27.2. A 3-level wavelet transform. (Diagram: subbands LL3, LH3, HL3, and HH3 at the coarsest scale; LH2, HL2, and HH2 at the middle scale; LH1, HL1, and HH1 at the finest scale.)
FIGURE 27.3. An image of San Francisco (a) and the resulting 3-level DWT (b).
27.2.2
SPIHT Coding Engine
SPIHT is a method of coding and decoding the wavelet transform of an image. As discussed in the previous section, by coding and transmitting information about the wavelet coefficients, it is possible for a decoder to perform an inverse transformation on the wavelet and reconstruct the original image. A useful property of SPIHT is that the entire wavelet does not need to be transmitted in order to recover the image. Instead, as the decoder receives more information about the original wavelet transform, the inverse transformation yields a better-quality reconstruction (i.e., a higher PSNR) of the original image. SPIHT generates excellent image quality and performance due to three properties of the coding algorithm: partial ordering by coefficient value, taking advantage of the redundancies between different wavelet scales, and transmitting data in bit-plane order [14].
FIGURE 27.4. Spatial orientation trees.
Following a wavelet transformation, SPIHT divides the wavelet into spatial orientation trees (Figure 27.4). Each node in a tree corresponds to an individual pixel. The offspring of a pixel are the four pixels in the same spatial location of the same subband at the next finer scale of the wavelet. Pixels at the finest scale of the wavelet are the leaves of the tree and have no children. Every pixel is part of a 2 × 2 block with its adjacent pixels. Blocks are a natural result of the hierarchical trees because every pixel in a block shares the same parent pixel. Also, the upper-left pixel of each 2 × 2 block at the root of the tree has no children since there are only three subbands at each scale and not four. Figure 27.4 shows how the pyramid is defined. Arrows point to the offspring of an individual pixel and the grayed blocks show all of the descendants for a specific pixel at every scale. SPIHT codes a wavelet by transmitting information about the significance of a pixel. By stating whether or not a pixel is above some threshold, information about that pixel’s value is implied. Furthermore, SPIHT transmits information stating whether a pixel or any of its descendants are above a threshold. If the statement proves false, all of the pixel’s descendants are known to be below that threshold level and they do not need to be considered during the rest of the current pass. At the end of each pass, the threshold is divided by two and the algorithm continues. In this manner, information about the most significant bits of the wavelet coefficients will always precede information on lower-order significant bits, which is referred to as bit-plane ordering.
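The tree structure and the significance test described above can be sketched as follows. The sketch makes several simplifying assumptions that are not taken from the text: the coefficient array is square with a power-of-two side, the offspring of pixel (i, j) are taken to be the 2 × 2 block at (2i, 2j), pixel (0, 0) is treated as the only childless root, and a set counts as significant at bit plane n when some magnitude is at least 2^n (the convention of the original SPIHT paper). All function names are hypothetical.

```python
def children(i, j, size):
    """Offspring of (i, j) at the next finer scale, or [] if it has none."""
    if (i == 0 and j == 0) or 2 * i >= size or 2 * j >= size:
        return []                 # simplified root case, or finest-scale leaf
    return [(2 * i, 2 * j), (2 * i, 2 * j + 1),
            (2 * i + 1, 2 * j), (2 * i + 1, 2 * j + 1)]

def descendants(i, j, size):
    """All pixels below (i, j) in its spatial orientation tree."""
    out = []
    for k, l in children(i, j, size):
        out.append((k, l))
        out.extend(descendants(k, l, size))
    return out

def significant(coeffs, pixels, n):
    """True if any listed pixel has magnitude >= 2**n (assumed threshold rule)."""
    return any(abs(coeffs[i][j]) >= 2 ** n for i, j in pixels)

if __name__ == "__main__":
    coeffs = [[20, 3, 1, 0],
              [2, 1, 0, -9],
              [1, 0, 0, 0],
              [0, 1, 0, 0]]
    tree = descendants(0, 1, 4)
    print(tree)
    print(significant(coeffs, tree, 4), significant(coeffs, tree, 3))
```

When the set test fails, as it does here at n = 4, a single output bit tells the decoder that every pixel in the set is below the threshold, which is how the redundancy between wavelet scales is exploited.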
Information stating whether or not a pixel is above the current threshold or is being processed at the current threshold is contained in three lists: the list of insignificant pixels (LIP), the list of insignificant sets (LIS) and the list of significant pixels (LSP). The LIP are pixels that are currently being processed
but are not yet above the threshold. The LIS contains pixels that are currently being processed but none of whose descendants has yet been found to be above the current threshold, so those descendants are not being processed. Lastly, the LSP contains pixels that were already stated to be above a previous threshold level and whose value at each bit plane is now transmitted. Figure 27.5 is the algorithm from the original SPIHT paper [14], modified to reflect changes (discussed later in the chapter) referring to 2 × 2 block information. Sn(i, j) indicates whether the pixel (i, j) is greater than the current threshold, and Sn(D(i, j)) states whether any of the descendants of pixel (i, j) are greater than the current threshold. There are three important concepts to take from the SPIHT algorithm. First, as the encoder sequentially steps through the image, it inserts or deletes pixels from the three lists. All of the information required to keep track of the lists is output to the decoder, allowing the decoder to generate and maintain the same list order as the encoder. For the decoder to reproduce the steps taken by the encoder, we merely need to replace the output statements in the encoder’s algorithm with input statements in the decoder’s algorithm. Second, the bitstream produced is naturally progressive. A progressive bitstream is one that can be cut off at any point and still be valid. As the decoder steps through the coding algorithm, it gathers finer and finer detail about the original wavelet transform. The decoder can stop at any point and perform an inverse transform with the wavelet coefficients it has currently reconstructed. Progressive bitstreams can also be reduced to an arbitrary size or be cut off during transmission and still produce a valid image. Such a property is very useful in satellite transmissions.
1. Initialization:
   output n = floor[log2(max(i,j){|c(i,j)|})];
   clear the LSP list, add the root pixels to the LIP list and root pixels with descendants to LIS.
2. Sorting Pass:
   2.1 for each entry (i,j) in the LIP:
       2.1.1 output Sn(i,j);
       2.1.2 if Sn(i,j) = 1, move (i,j) to the LSP list and output its sign
   2.2 for each entry (i,j) in the LIS:
       2.2.1 if one of the pixels in (i,j)'s block is not in LIP but all are in LIS:
             output Sn(all descendants of the current block);
             if none are significant, skip 2.2.2.
       2.2.2 output Sn(D(i,j));
             if Sn(D(i,j)) = 1, then for each of (i,j)'s immediate children (k,l):
                 output Sn(k,l);
                 add (k,l) to the LIS for the current pass;
                 if Sn(k,l) = 1, add (k,l) to the LSP and output its sign
                 else add (k,l) to the LIP
3. Refinement Pass:
   for each entry (i,j) in LSP, except ones inserted in the current pass, output the nth most significant bit of (i,j).
4. Quantization-step Update:
   decrement n by 1 and go to Step 2.
FIGURE 27.5. SPIHT coding algorithm.
Third, and the concept that has the largest impact on building a hardware platform, the SPIHT algorithm develops an individual list order to transmit information within each bit plane. This ordering is implicitly created from the threshold information discussed before—the order in which each pixel enters each list determines the transmission order for each image. As a result, each image will transmit wavelet coefficients in an entirely different order. Slightly better PSNRs are achieved with this dynamic ordering of the wavelet coefficients. The SPIHT algorithm in Figure 27.5, which creates the individual list ordering, is inherently sequential. As a result, SPIHT cannot be significantly parallelized in hardware. This drawback greatly limits the performance of any SPIHT implementation in hardware. To get around this limitation and improve performance, it was necessary to parallelize the SPIHT algorithm and essentially create a new image compression algorithm. These changes and the trade-offs involved are described in Section 27.3.3.
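The bit-plane ordering and progressive property discussed in this section can be illustrated apart from the list handling. The sketch below is not the SPIHT coder: it only computes the starting bit plane as in step 1 of Figure 27.5 and shows what a decoder would know about each coefficient after receiving a given number of bit planes. It assumes integer coefficients with at least one nonzero value, and the function names are hypothetical.

```python
def top_bit_plane(coeffs):
    """n = floor(log2(max |c|)) for integer coefficients, not all zero."""
    return max(abs(c) for c in coeffs).bit_length() - 1

def truncate_to_planes(coeffs, planes):
    """Keep only the `planes` most significant bit planes of each magnitude."""
    n = top_bit_plane(coeffs)
    drop = max(0, n + 1 - planes)      # low-order planes not yet received
    out = []
    for c in coeffs:
        magnitude = (abs(c) >> drop) << drop
        out.append(magnitude if c >= 0 else -magnitude)
    return out

def total_error(coeffs, approx):
    return sum(abs(a - b) for a, b in zip(coeffs, approx))

if __name__ == "__main__":
    coeffs = [37, -21, 6, -3, 0, 12]
    for planes in range(1, top_bit_plane(coeffs) + 2):
        approx = truncate_to_planes(coeffs, planes)
        print(planes, approx, total_error(coeffs, approx))
```

Each additional bit plane can only reduce the reconstruction error, so cutting the stream after any plane still leaves a usable, coarser set of coefficients.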
27.3
DESIGN CONSIDERATIONS AND MODIFICATIONS
To fully take advantage of the high performance a custom hardware implementation of SPIHT could yield, the software specifications had to be examined and adjusted where they either performed poorly in hardware or did not make the most of the resources available. Here we review the three major factors taken into consideration while evaluating how to create a hardware implementation of the SPIHT algorithm on an adaptive computing platform. The first factor was to determine what discrete wavelet transform architecture to use. Section 27.3.1 provides a summary of the DWTs considered, showing how memory and communication requirements helped dictate the structure chosen. Section 27.3.2 describes the fixed-point precision optimization performed for each wavelet coefficient and the final data representation employed. Section 27.3.3 explains how the SPIHT algorithm was altered to vastly speed up the hardware implementation.
27.3.1
Discrete Wavelet Transform Architectures
One of the benefits of the SPIHT algorithm is its use of the discrete wavelet transform, which had existed for several years prior to this work. As a result, numerous studies on how to create a DWT hardware implementation were available for review. Much of this work on DWTs involved parallel platforms to save both memory access and computations [5, 12, 16]. The most basic of these is the folded architecture. The one-dimensional DWT entails demanding computations, which involve significant hardware resources. Since the horizontal and vertical passes use identical finite impulse response (FIR) filters, most two-dimensional DWT architectures implement folding to reuse logic for each dimension [6]. Figure 27.6 illustrates how folded architectures use a one-dimensional DWT to realize a two-dimensional DWT.
FIGURE 27.6 A folded architecture. (Blocks: row data, 1-D DWT, memory, column data.)
Although the folded architecture saves hardware resources, it suffers from high memory bandwidth. For an N × N image there are at least $2N^2$ read-and-write cycles for the first wavelet level. Additional levels require rereading previously computed coefficients, further reducing efficiency. To lower the memory bandwidth requirements needed to compute the DWT, we considered several alternative architectures. The first was the Recursive Pyramid Algorithm (RPA) [21]. RPA takes advantage of the fact that the various wavelet levels run at different clock rates. Each wavelet level requires one-quarter of the time that the previous level needed because at each level the size of the area under computation is reduced by one-half in both the horizontal and vertical dimensions. Thus, it is possible to store previously computed coefficients on-chip and intermix the next level’s computations with the current level’s. A careful analysis of the runtime yields $(4N^2)/3$ individual memory load and store operations for an image. However, the algorithm has huge on-chip memory requirements and demands a thorough scheduling process to interleave the various wavelet levels. Another method to reduce memory accesses is the partitioned DWT, which breaks the image into smaller blocks and computes several scales of the DWT at once for each block [13]. In addition, the algorithm made use of wavelet lifting to reduce the DWT’s computational complexity [18]. By partitioning an image into smaller blocks, the amount of on-chip memory storage required was significantly reduced because only the coefficients in the block needed to be stored. This approach was similar to the RPA, except that it computed over sections of the image at a time instead of the entire image at once. Figure 27.7, from Ritter and Molitor [13], illustrates how the partitioned wavelet was constructed.
Unfortunately, the partitioned approach suffers from blocking artifacts along the partition boundaries if the boundaries were treated with reflection.¹ Thus, pixels from neighboring partitions were required to smooth out these boundaries. The number of wavelet levels determined how many pixels beyond a subimage’s boundary were needed, since higher wavelet levels represent data from a larger image region. To compensate for the partition boundaries, the algorithm processed subimages along a single row to eliminate multiple reads in the horizontal direction. Overall data throughputs of up to 152 Mbytes/second were reported with the partitioned DWT.
¹ An FIR filter generally computes over several pixels at once and generates a result for the middle pixel. To calculate pixels close to an image’s edge, data points are required beyond the edge of the image. Reflection is a method that takes pixels toward the image’s edge and copies them beyond the edge of the actual image for calculation purposes.
FIGURE 27.7 The partitioned DWT.
FIGURE 27.8 A generic 2D biorthogonal DWT. (Diagram: three cascaded stages of high- and low-pass filters, each followed by ↓2 downsampling, producing subbands HH1, HL1, LH1, HH2, HL2, LH2, HH3, HL3, LH3, and LL3.)
The last architecture we considered was the generic 2D biorthogonal DWT [3]. Unlike previous designs, the generic 2D biorthogonal DWT did not require FIR filter folding or on-chip memories as the Recursive Pyramid design. Nor did it involve partitioning an image into subimages. Instead, the architecture created separate structures to calculate each wavelet level as data were presented to it, as shown in Figure 27.8. The design sequentially read in the image and computed the four DWT subbands. As the LL1 subband became available, the coefficients were passed to the next stage, which calculated the next coarser level subbands, and so on. For larger images that required several individual wavelet scales, the generic 2D biorthogonal DWT architecture consumed a tremendous amount of on-chip resources. With SPIHT, a 1024 × 1024 pixel image computes seven separate wavelet scales. The proposed architecture would employ 21 individual high- and low-pass FIR filters. Since each wavelet scale processed data at different rates, some control complexity would be inevitable. The advantage of the architecture
was much lower on-chip memory requirements and full utilization of the memory’s bandwidth, since each pixel was read and written only once.
To select a DWT, each of the architectures discussed earlier was reevaluated against our target hardware platform (discussed below). The parallel versions of the DWT saved some memory bandwidth. However, additional resources and more complex scheduling algorithms became necessary. In addition, some of the savings were minimal since each higher wavelet level is one-quarter the size of the previous wavelet level. In a 7-level DWT, the highest 4 levels compute in just 2 percent of the time it takes to compute the first level. Other factors considered were that the more complex DWT architectures simply required more resources than a single Xilinx Virtex 2000E FPGA (our target device) could accommodate, and that enough memory ports were available on our board to read and write four coefficients at a time in parallel. For these reasons, we did not select a more complex parallel DWT architecture, but instead designed a simple folded architecture that processes one dimension of a single wavelet level at a time.
In the architecture created, pixels are read in horizontally from one memory port and written directly to a second memory port. In addition, pixels are written to memory in columns, inverting the image along the 45-degree line. By utilizing the same addressing logic, pixels are again read in horizontally and written vertically. However, since the image was inverted along its diagonal, the second pass will calculate the vertical dimension of the wavelet and restore the image to its original orientation. Each dimension of the image is reduced by half, and the process iteratively continues for each wavelet level. Finally, the mean of the LL subband is calculated and subtracted from each coefficient in that subband. To speed up the DWT, the design reads and writes four rows at a time. Figure 27.9 illustrates the architecture of the DWT phase.
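The read-horizontally, write-vertically scheme described above can be illustrated with a small software sketch. It is only a model under stated assumptions: a two-tap Haar filter stands in for the FIR filters actually used, and boundary reflection, the four-row parallelism, and the LL mean subtraction are left out. The function names are hypothetical.

```python
def haar_1d(row):
    """One level of a 1-D Haar transform (a stand-in filter, not the chapter's FIR set).

    Returns the low-pass half followed by the high-pass half.
    """
    low = [(row[k] + row[k + 1]) / 2 for k in range(0, len(row), 2)]
    high = [(row[k] - row[k + 1]) / 2 for k in range(0, len(row), 2)]
    return low + high


def folded_pass(image, size):
    """Filter each row of the top-left size x size region and write it out as a column."""
    out = [r[:] for r in image]
    for i in range(size):
        filtered = haar_1d(image[i][:size])
        for j in range(size):
            out[j][i] = filtered[j]  # transposed write: row i becomes column i
    return out


def folded_dwt(image, levels):
    """Apply the same 1-D pass twice per level; the second pass undoes the transpose."""
    size = len(image)
    for _ in range(levels):
        image = folded_pass(image, size)  # horizontal filtering, region left transposed
        image = folded_pass(image, size)  # vertical filtering, orientation restored
        size //= 2                        # next level works on the LL quadrant only
    return image


stripes = [[1, 3, 1, 3] for _ in range(4)]  # variation in the horizontal direction only
one_level = folded_dwt(stripes, 1)
```

Because the same addressing pattern is used for both passes, one filter routine serves both dimensions, which is the point of folding.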
Since every pixel is read and written once and the design processes four rows at a time, for an N × N image both dimensions in the lowest wavelet level compute in $2N^2/4$ clock cycles. Similarly, each subsequent wavelet level processes the image in one-quarter the number of clock cycles of the previous level. With an infinite number of wavelet levels, the image processes in:
$$\sum_{l=1}^{\infty} \frac{2 \cdot N^2}{4^l} = \frac{2}{3} \cdot N^2 \le \frac{3}{4} \cdot N^2 \qquad (27.1)$$
Thus, the runtime of the DWT engine is bounded by three-quarters of a clock cycle per pixel in the image. This was made possible because the memory ports in the system allowed four pixels to be read and written in a single clock cycle. It is very important to note that many of the parallel architectures designed to process multiple wavelet levels simultaneously run in more than one clock cycle per pixel. Also, because of the additional resources required by a parallel implementation, computing multiple rows at once becomes impractical. Given more resources, the parallel architectures discussed previously could process multiple rows at once and yield runtimes lower than three-quarters of a clock cycle per pixel. However, the FPGAs available in the system used, although state of the art at the time, did not have such extensive resources.
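As a rough check on the cycle count, the per-level terms can be summed directly for a finite number of levels. This is a sketch of the arithmetic only, assuming the idealized model in the text (each level costs $2 \cdot \text{size}^2$ accesses spread over four rows per cycle, with no control overhead); `dwt_cycles` is a hypothetical helper name.

```python
def dwt_cycles(n, levels, rows_per_cycle=4):
    """Idealized clock cycles for a folded DWT on an n x n image.

    Each level makes two passes (horizontal and vertical) over a
    size x size region, processing rows_per_cycle pixels per cycle.
    """
    total = 0.0
    size = n
    for _ in range(levels):
        total += 2 * size * size / rows_per_cycle
        size //= 2
    return total


n, levels = 1024, 7
per_pixel = dwt_cycles(n, levels) / (n * n)

# Share of the first level's time spent on the four coarsest levels.
first_level = dwt_cycles(n, 1)
top_four = dwt_cycles(n, levels) - dwt_cycles(n, levels - 4)
top_four_share = top_four / first_level
```

Under this model the cost per pixel for a 7-level transform stays below the three-quarters bound stated above, and the four coarsest levels account for roughly 2 percent of the first level's time.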
FIGURE 27.9 A discrete wavelet transform architecture. (Blocks: read memory port and read address logic; read–write crossbar; row boundary reflection for rows 1 to 4; low-pass and high-pass filters for each row; variable fixed-point scaling; DWT-level calculation and control logic; data selection and write address logic; LL subband mean calculation and subtraction; write memory port.)
By keeping the address and control logic simple, there were enough resources on the FPGA to implement 8 distributed arithmetic FIR filters [23] from the Xilinx Core library. The FIR filters required significant FPGA resources, approximately 8 percent of the Virtex 2000E FPGA for each high- and low-pass FIR filter. We chose the distributed arithmetic FIR filters because they calculate a new coefficient every clock cycle, and this contributed to the system being able to process an image in three-quarters of a clock cycle per pixel.
27.3.2
Fixed-point Precision Analysis
The next major consideration was how to represent the wavelet coefficients in hardware. The discrete wavelet transform produces real numbers as the wavelet coefficients, which general-purpose computers realize as floating-point numbers. Traditionally, FPGAs have not employed floating-point numbers for several reasons:
• Floating-point numbers require variable shifts based on the exponential description, and variable shifters perform poorly in FPGAs.
• Floating-point numbers consume enormous hardware resources on a limited-resource FPGA.
• Floating point is often unnecessary for a known dataset.
At each wavelet level of the DWT, coefficients have a fixed range. Therefore, we opted for a fixed-point numerical representation—that is, one where the decimal point’s position is predefined. With the decimal point locked at a specific location, each bit contributes a known value to the number, which eliminates the need for variable shifters. However, the DWT’s filter bank was unbounded, meaning that the range of possible numbers increases with each additional wavelet level. We chose to use the FIR filter set from the original SPIHT implementation. An analysis of the coefficients of each filter bank showed that the two-dimensional low-pass FIR filter at most increases the range of possible numbers by a factor of 2.9054. This number is the increase found from both the horizontal and the vertical directions. It represents how much larger a coefficient at the next wavelet level could be if the previous level’s input wavelet coefficients were the maximum possible value and the correct sign to create the largest possible filter output. As a result, the coefficients at various wavelet levels require a variable number of bits above the decimal point to cover their possible ranges. Table 27.1 illustrates the various requirements placed on a numerical representation for each wavelet level. The Factor and Maximum Magnitude columns demonstrate how the range of possible numbers increases with each level for an image starting with 1 byte per pixel. The Maximum Bits column shows the maximum number of bits (with a sign bit) necessary to represent the numeric range at each wavelet level. The Maximum Bits from Data column represents the maximum number of bits required to encode over one hundred sample images obtained from NASA. These numbers were produced via software simulation on this sample dataset. In practice, the magnitude of the wavelet coefficients does not grow at the maximum theoretical rate. 
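The worst-case growth described above can be reproduced with a few lines of arithmetic. The sketch below assumes only the figures given in the text (a per-level factor of 2.9054 and an 8-bit input with maximum value 255); the helper name is hypothetical. Because the published factor is rounded to five digits, the computed magnitudes may differ from the tabulated ones by a count or two, but the bit widths are unaffected.

```python
import math

GROWTH_PER_LEVEL = 2.9054  # worst-case 2-D low-pass gain quoted in the text
INPUT_MAX = 255            # 1 byte per pixel


def worst_case_bits(levels):
    """Return (level, max magnitude, bits including sign) for each wavelet level."""
    rows = []
    for level in range(levels):
        factor = GROWTH_PER_LEVEL ** (level + 1)
        magnitude = math.ceil(INPUT_MAX * factor)
        bits = magnitude.bit_length() + 1  # magnitude bits plus one sign bit
        rows.append((level, magnitude, bits))
    return rows


table = worst_case_bits(7)
bit_widths = [bits for _, _, bits in table]
```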
TABLE 27.1 Fixed-point magnitude calculations
Wavelet level   Factor    Maximum magnitude   Maximum bits   Maximum bits from data
Input image     1         255                 8              8
0               2.9054    741                 11             11
1               8.4412    2152                13             12
2               24.525    6254                14             13
3               71.253    18170               16             14
4               207.02    52789               17             15
5               601.46    153373              19             16
6               1747.5    445605              20             17
To maximize efficiency, the Maximum Bits from Data values were used to determine what position the most significant bit must stand for. Since the theoretical maximum is not used, an overflow situation may occur.
To compensate, the system flags overflow occurrences as an error and truncates the data. However, in the hundreds of sample images examined, no overflow occurred, and the data scheme used provided enough space to capture all the required data.
If each wavelet level used the same numerical representation, they would all be required to handle numbers as large as the highest wavelet level to prevent overflow. However, since the lowest wavelet levels never encounter numbers in that range, several bits at these levels would not be used and therefore wasted. To fully utilize all of the bits for each wavelet coefficient, we introduced the concept of variable fixed-point representation. With variable fixed-point we assigned each wavelet level its own fixed-point numerical representation, optimized for that level’s expected data size. Doing so allowed us to optimize both memory storage and I/O at each wavelet level to yield maximum performance.
Once the position of the most significant bit was found for each wavelet level, the number of precision bits needed to accurately represent the wavelet coefficients had to be determined. Our goal was to provide enough bits to fully recover the image and no more. Figure 27.10 displays the average PSNRs for several recovered images from SPIHT using a range of bit widths for each coefficient. An assignment of 16 bits per coefficient most accurately matched the full-precision floating-point coefficients used in software, up through perfect reconstruction. Previous wavelet designs we looked at focused on bitrates less than 4 bits per pixel (bpp) and did not consider rounding effects on the wavelet transformation for bitrates greater than 4 bpp. These studies found this lower bitrate acceptable for lossy SPIHT compression [3].
FIGURE 27.10 PSNR versus bitrate for various coefficient sizes. (Axes: PSNR versus bit rate, 0 to 7.75 bpp; curves: Real, 16 bits, 14 bits, 12 bits, 10 bits.)
TABLE 27.2 Final variable fixed-point representation
Wavelet level   Integer bits   Fractional bits
Input image     10             6
0               11             5
1               12             4
2               13             3
3               14             2
4               15             1
5               16             0
6               17             −1
Instead, we chose a numerical representation that retains the equivalent amount of information as a full floating-point number during wavelet transformation. By doing so, it was possible to perfectly reconstruct an image given a high enough bitrate. In other words, we allowed for a lossless implementation. Table 27.2 provides the number of integer and fractional bits allocated for each wavelet level. The number of integer bits also includes 1 extra bit for the sign value. The highest wavelet level’s 16 integer bits represent positions 17 to 1, with no bit assigned for the 0 position.
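A short sketch may help make the variable fixed-point idea concrete. It uses the fractional-bit counts from Table 27.2; the function names, the rounding mode, and the choice of 6 fractional bits for the common format are assumptions made for illustration, not details taken from the design.

```python
# Fractional bits per wavelet level, from Table 27.2 ("input" is the input image).
FRACTIONAL_BITS = {"input": 6, 0: 5, 1: 4, 2: 3, 3: 2, 4: 1, 5: 0, 6: -1}

COMMON_FRACTIONAL_BITS = 6  # assumed common format: the finest resolution in the table


def to_fixed(value, frac_bits):
    """Quantize a real value to an integer code whose LSB weighs 2**-frac_bits."""
    return round(value * 2.0 ** frac_bits)


def from_fixed(code, frac_bits):
    """Recover the real value represented by an integer code."""
    return code / 2.0 ** frac_bits


def to_common(code, frac_bits, common=COMMON_FRACTIONAL_BITS):
    """Align a level-specific code to the common format with a fixed left shift."""
    return code << (common - frac_bits)


# Level 6 has -1 fractional bits, so its least significant bit weighs 2.
code = to_fixed(1000.0, FRACTIONAL_BITS[6])
aligned = to_common(code, FRACTIONAL_BITS[6])
```

The shift amount depends only on the wavelet level, so in hardware it is a constant per level rather than a data-dependent variable shift.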
27.3.3
Fixed Order SPIHT
The last major factor we took under consideration was how to parallelize the SPIHT algorithm for use in hardware. As discussed in Section 27.2, SPIHT computes a dynamic ordering of the wavelet coefficients as it progresses. By always adding pixels to the end of the LIP, LIS, and LSP, coefficients most critical to constructing a valid wavelet are generally sent first, while less critical coefficients are placed later in the lists. Such an ordering yields better image quality for bitstreams that end in the middle of a bit plane. The drawback of this ordering is that every image has a unique list order determined by the image’s wavelet coefficient values. By analyzing the SPIHT algorithm, we were able to conclude that the data a block of coefficients contributes to the final SPIHT bitstream is fully determined by the following set of localized information:
• The 2 × 2 block of coefficients
• Their immediate children
• The maximum magnitude of the four subtrees
As a result, we were able to show that every block of coefficients could be calculated independently of, and in parallel with, the others. We were also able to determine that, if we could parallelize the computation of these coefficients, the final hardware implementation would operate at a much higher throughput. However, we were not able to take advantage of this parallelism because in SPIHT
the order in which a block’s data is inserted into the bitstream is not known, since it depends on the image’s unique ordering. Only once the order is determined is it possible to produce a valid SPIHT bitstream from the information listed previously. Unfortunately, the algorithm employed to calculate the SPIHT ordering of coefficients is sequential. The computation steps over the coefficients of the image multiple times within each bit plane and dynamically inserts and removes coefficients from the LIP and LIS lists. Such an algorithm is not parallelizable in hardware. As a result, many of the speedups a custom hardware implementation may produce would be lost. Instead, any hardware implementation we could develop would need to create the lists in an identical manner as the software implementation. This process would require many clock cycles per block of coefficients, which would significantly limit the throughput of any SPIHT implementation in hardware. To remove this limitation and design a faster system, we created a modification to the original algorithm called Fixed Order SPIHT. Fixed Order SPIHT is similar to the SPIHT algorithm shown in Figure 27.5, except that the order of the LIP, LIS, and LSP lists is fixed and known beforehand. Instead of inserting blocks of coefficients at the end of the lists, they are inserted in a predetermined order. For example, block A will always appear before block B, which is always before block C, regardless of the order in which A, B, and C were added to the lists. The order of Fixed Order SPIHT is based upon the Morton scan ordering discussed in Algazi and Estes [1]. Fixed Order SPIHT removed the need to calculate the ordering of coefficients within each bit plane and allowed us to create a fully parallel version of the original SPIHT algorithm. Such a modification increased the throughput of a hardware encoder by more than an order of magnitude at the cost of a slightly lower PSNR within each bit plane. 
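To illustrate what a fixed, image-independent order looks like, the sketch below computes a Morton (Z-order) index by interleaving the bits of a block's row and column. Which coordinate supplies the lower bit of each pair is a convention assumed here, and the function names are hypothetical; the chapter does not spell out these details.

```python
def morton_index(row, col, bits=16):
    """Interleave the bits of row and col into a single Z-order index.

    Assumed convention: col supplies the even bit positions, row the odd ones.
    """
    index = 0
    for b in range(bits):
        index |= ((col >> b) & 1) << (2 * b)
        index |= ((row >> b) & 1) << (2 * b + 1)
    return index


def morton_order(n):
    """List the (row, col) positions of an n x n grid of blocks in Morton order."""
    blocks = [(r, c) for r in range(n) for c in range(n)]
    return sorted(blocks, key=lambda rc: morton_index(*rc))


# Blocks may be discovered in any order; their output position is fixed in advance.
arrival = [(1, 1), (0, 0), (1, 0), (0, 1)]
emitted = sorted(arrival, key=lambda rc: morton_index(*rc))
```

Because the index depends only on a block's position, every block knows its place in the lists without consulting the others, which is what removes the sequential dependency.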
Figure 27.11 outlines the new version of SPIHT we created. The final bitstream generated is precisely the same as the bitstream generated from the original SPIHT algorithm except that data will appear in a different order within each bit plane. By using the algorithm in Figure 27.11 instead of the original sequential algorithm in Figure 27.5, the final datastream can be computed in one pass through the image instead of multiple passes. In addition, each pixel block is coded in parallel, which yields significantly faster compression times with FPGAs. The advantage of this method is that at the end of each bit plane, the exact same data will have been transmitted, just in a different order. Thus, at the end of each bit plane the PSNR of Fixed Order SPIHT will match that of the original SPIHT algorithm, as shown in Figure 27.12. Since the length of each bitstream is fairly short within the transmitted datastream, the PSNR curve of Fixed Order SPIHT very closely matches that of the original algorithm. The maximum loss in quality between Fixed Order SPIHT and the original SPIHT algorithm found was 0.2 dB. This is the maximum loss any image in our sample set displayed over any bitrate from 0.05 to 8.00 bpp. For a more complete discussion on Fixed Order SPIHT, refer to Fry [8].
1. Bit-plane calculation: for each 2×2 block of pixels (i,j) in a Morton Scan Ordering
   1.1 for each threshold level n from the highest level to the lowest
       1.1.1 if (i,j) is a root and Max((i,j)) >= n
                 add all four pixels to the LIP
       1.1.2 if (i,j) is not a root and Max((i,j)) >= previous n
                 for each pixel p in the block
                     if p < previous n
                         add p to the LIP
                     else
                         add p to the LSP
       1.1.3 if (i,j) is not a leaf and Max((i,j)) >= n
                 add all four pixels to the LIS
                 unless (i,j) is a root, then just add the three with children
       1.1.4 if all four pixels are in LIS and at least one is not in the LIP
                 if at least one pixel will be removed from the LIS at this level
                     output a '0' to the LIS stream
                 else
                     output a '1' to the LIS stream
       1.1.5 for each pixel p in the LIP
                 if p >= n
                     output a '1' and the sign of p to the LIP stream
                     remove p from the LIP and add it to the LSP
                 else
                     output a '0' to the LIP stream
       1.1.6 for each pixel p in the LIS
                 if child max(p) >= n
                     output a '1' to the LIS stream
                     remove p from the LIS
                     for each child (k,l) of p
                         if (k,l) >= n
                             output a '1' and the sign of (k,l) to the LIS stream
                         else
                             output a '0' to the LIS stream
                 else
                     output a '0' to the LIS stream
       1.1.7 for each pixel p in the LSP
                 output the value of p at the bit plane n to the LSP stream
2. Grouping phase: for each threshold level n from the highest level to the lowest
   2.1 output the LIP stream at threshold level n to the final data stream
   2.2 output the LIS stream at threshold level n to the final data stream
   2.3 output the LSP stream at threshold level n to the final data stream
FIGURE 27.11 Fixed Order SPIHT.
FIGURE 27.12 A comparison of original SPIHT and Fixed Order SPIHT. (Axes: PSNR versus bit rate, 0.05 to 3.8 bpp; curves: Original, Fixed order.)
27.4
HARDWARE IMPLEMENTATION
In the following subsections we first describe the target hardware platform that the SPIHT algorithm was mapped onto. Next, we present an overview of the implementation and a detailed description of the three major steps of the computation. A thorough understanding of the target platform is required because it strongly influenced the SPIHT implementation created.
27.4.1
Target Hardware Platform
The target platform was the WildStar FPGA processor board developed by Annapolis Microsystems [2]. Shown in Figure 27.13, it consists of three Xilinx Virtex 2000E FPGAs—PE 0, PE 1, and PE 2—and operates at rates of up to 133 MHz. The board makes available 48 MBytes of memory through 12 individual memory ports, between 32 and 64 bits wide, yielding a throughput of up to 8.5 GBytes/sec. Four shared memory blocks connect the Virtex chips through a crossbar. By switching a crossbar, several MBytes of data are passed between the chips in just a few clock cycles. The Xilinx Virtex 2000E FPGA allows for 2 million gate designs [22]. For extra on-chip memory, the FPGAs contain 160 asynchronous dual-ported BlockRAMs. Each BlockRAM stores 4096 bits of data and is accessible in 1-, 2-, 4-, 8-, or 16-bit-wide words. Because they are dual ported, the BlockRAMs function well as first in, first outs (FIFOs). A PCI bus connects the board to a host computer.
27.4.2
Design Overview
The architecture constructed consisted of three phases: wavelet transform, maximum magnitude calculation, and Fixed Order SPIHT coding. Each phase was implemented in one of the three Virtex chips. By instantiating each phase on a separate chip, separate images could be operated on in parallel. Data was transferred from one phase to the next through the shared memories. The decision on how to break up the phases came naturally from the resources available in each FPGA and the requirements of each section. The DWT and the SPIHT coding phases each required close to the full resources of a single FPGA, and the maximum magnitude phase needed to be completed prior to the SPIHT coding phase. These characteristics of the algorithm and system naturally led to placing the three phases on the three separate FPGAs. The architecture was also designed in this manner because once processing in a phase is complete, the crossbar mode could be switched and the data calculated would be accessible to the next chip. By coding a different image in each phase simultaneously, the throughput of the system is determined by the slowest phase, while the latency of the architecture is the sum of the three phases. Figure 27.14 illustrates the architecture of the system.
FIGURE 27.13 A block diagram of the Annapolis Microsystems WildStar board. (Blocks: PE0, PE1, and PE2, connected to SRAM banks and crossbars through 32- and 64-bit ports.)
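The relationship between throughput and latency in a three-stage image pipeline can be sketched numerically. The cycles-per-pixel figures are the ones quoted in this chapter for each phase; the clock rates passed in below are purely hypothetical placeholders (the measured rates are given in Table 27.3), and `pipeline_timing` is an illustrative name.

```python
# Cycles per pixel for each phase, as stated in the text.
CYCLES_PER_PIXEL = {"wavelet": 0.75, "magnitude": 0.5, "spiht": 0.25}


def pipeline_timing(pixels, clock_hz):
    """Return per-phase times, total latency, and steady-state throughput.

    clock_hz maps each phase name to its clock rate in Hz.
    """
    phase_seconds = {
        name: cpp * pixels / clock_hz[name]
        for name, cpp in CYCLES_PER_PIXEL.items()
    }
    latency = sum(phase_seconds.values())              # one image through all phases
    throughput = pixels / max(phase_seconds.values())  # limited by the slowest phase
    return phase_seconds, latency, throughput


# Hypothetical clocks, chosen only to exercise the function.
clocks = {"wavelet": 75e6, "magnitude": 75e6, "spiht": 75e6}
phases, latency, throughput = pipeline_timing(1024 * 1024, clocks)
```

With these placeholder clocks the wavelet phase is the bottleneck, which mirrors the observation that the slowest phase sets the rate at which finished images leave the pipeline.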
27.4.3
Discrete Wavelet Transform Phase
As discussed in Section 27.3.1, after implementing each algorithm in hardware we chose a simple folded architecture, which matched the bandwidth, memory, and chip capacities of the target board well. The results of this phase are stored into memory and passed to the maximum magnitude phase.
FIGURE 27.14 An overview of the architecture. (PE1 wavelet → wavelet coefficients → PE0 magnitude → wavelet coefficients and magnitude information → PE2 SPIHT.)
27.4.4
Maximum Magnitude Phase
Once the DWT is complete, the next phase prepares and organizes the image into a form easily readable by the parallel version of the SPIHT coder. Specifically, the maximum magnitude phase calculates and rearranges the following information for the next phase:
• The maximum magnitude of each of the four child trees
• The absolute value of the 2 × 2 block of coefficients
• A sign value for each coefficient in the block
• The threshold level when the block is first inserted into the LIS by its parent
• Threshold and sign data of each of the 16 child coefficients
• The wavelet coefficients, reordered into a Morton Scan Ordering
The SPIHT coding phase shares two 64-bit memory ports with the maximum magnitude phase, allowing it to read 128 bits on each clock cycle. The data just listed fits into these two memory ports, so on every clock cycle the SPIHT coding phase is able to read and process an entire block of data. The data that the maximum magnitude phase calculates is shown in Figure 27.15. To calculate the maximum magnitude of all coefficients below a node in the spatial orientation trees, the image must be scanned in depth-first search order [7]. With a depth-first search, whenever a new coefficient is read and considered, all of its children will have already been read and the maximum coefficient so far is known. On every clock cycle the new coefficient is compared to and updates the current maximum. Because PE 0 (the maximum magnitude phase) uses 32-bit-wide memory ports, it can read half a block at a time. The state machine, which controls how the spatial orientation trees are traversed, reads one-half of a block as it descends the tree, and the other half as it ascends the tree. By doing so all of the data needed to compute the maximum magnitude for the current block is available as the state machine ascends back up the spatial orientation tree. In addition, the four most recent blocks of each level are saved onto a stack so that all 16 child coefficients are available to the parent block. Figure 27.16 demonstrates the algorithm. The current block, maximum magnitude for each child, and 16 child coefficients are shown on the stack. Light gray blocks are coefficients previously read and processed. Dark gray blocks are coefficients currently being read. In this example, the state machine has just finished reading the lowest level and has ascended to the second wavelet level.
FIGURE 27.15 Data passed to the SPIHT coder to calculate a single block. (Left memory port: four 16-bit coefficients, each a sign bit followed by a 15-bit coefficient magnitude. Right memory port: threshold and sign data for the 16 children, and threshold data for the four children and the parent.)
FIGURE 27.16 A depth-first search of the spatial orientation trees. (Panels: spatial orientation tree; stack holding child coefficients, child maximum magnitudes, and current coefficients.)
The second block in the second level is now complete, and its maximum magnitude can now be calculated, shown as the dark gray block in the stack’s highest level. In addition, the 16 child coefficients in the lowest level were saved and are available. There are no child values for the lowest level since there are no children. Another benefit of scanning the image in a depth-first search order is that Morton Scan Ordering is naturally realized within each level, although it is intermixed between levels. By writing data from each level to a separate area of memory and later reading the data from the highest wavelet level to the lowest, the Morton Scan Ordering is naturally realized. A block diagram of the maximum magnitude phase is provided in Figure 27.17. Since two pixels are read together and the image is scanned only once, the runtime of this phase is half a clock cycle per pixel. Because the maximum magnitude phase computes in less time than the wavelet phase, the throughput of the overall system is not affected.
FIGURE 27.17 A block diagram of the SPIHT maximum magnitude phase. (Blocks: read memory port; magnitude calculation; coefficient stack and maximum magnitude calculation; encode maximum magnitudes and group block data; depth-first search state machine and control logic; memory buffer and address generator; write memory ports 1 and 2.)
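The depth-first calculation of subtree maxima can be sketched in software as a post-order traversal. The example assumes the common parent-child rule in which coefficient (i, j) has children (2i, 2j), (2i, 2j+1), (2i+1, 2j), and (2i+1, 2j+1); the special handling of the coarsest LL band, the half-block reads, and the hardware stack are not modeled, and the function name is hypothetical.

```python
def descendant_max(coeffs, i, j):
    """Largest magnitude among all descendants of coefficient (i, j).

    Children are visited before their parent's result is returned, so each
    coefficient is examined once, as in a depth-first scan.
    """
    if (i, j) == (0, 0):
        raise ValueError("the (0, 0) root needs band-specific rules not modeled here")
    n = len(coeffs)
    if 2 * i >= n or 2 * j >= n:
        return 0  # finest level: no children
    best = 0
    for ci, cj in ((2 * i, 2 * j), (2 * i, 2 * j + 1),
                   (2 * i + 1, 2 * j), (2 * i + 1, 2 * j + 1)):
        best = max(best, abs(coeffs[ci][cj]), descendant_max(coeffs, ci, cj))
    return best


coeffs = [
    [9, 1, -7, 2],
    [3, 4, 0, 5],
    [6, -8, 1, 1],
    [2, 0, -3, 2],
]
subtree_maxima = {pos: descendant_max(coeffs, *pos) for pos in ((0, 1), (1, 0), (1, 1))}
```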
27.4.5
The SPIHT Coding Phase
The final SPIHT coding phase performs the Fixed Order SPIHT encoding in parallel, based on the data from the maximum magnitude phase. Coefficient blocks are read from the highest wavelet level to the lowest. As information is loaded from memory it is shifted from the variable fixed-point representation to a common fixed-point representation for every wavelet level. Once each block has been adjusted to an identical numerical representation, the parallel version of SPIHT is used to calculate what information each block will contribute to each bit plane. The information is grouped and counted before being added to three separate variable FIFOs for each bit plane. The data that the variable FIFO components receive range in size from 0 to 37 bits, and the variable FIFOs arrange the block data into regular sized 32-bit words for memory access. Care is also taken to stall the algorithm if any of the variable FIFOs becomes too full. Data from each buffer is output to a fixed location in memory and the number of bits in each bitstream is output as well. Given that data is added dynamically to each bitstream, there needs to be a dynamic scheduler to select which buffer
should be written to memory. Since there are a large number of FIFOs that all require a BlockRAM, the FIFOs are spread across the FPGA, and some type of staging is required to prevent a signal from traveling too far. The scheduler selects which FIFO to read based on both how full a FIFO is and when it was last accessed. Our studies showed that the LSP bitstream is roughly the same size of the LIP and LIS streams combined. Because of this the LSP bitstreams transfer more data to memory than the other two lists. In our design the LIP and LIS bitstreams share a memory port while the LSP stream writes to a separate memory port. Since a 2 × 2 block of coefficients is processed every clock cycle, the design takes one-quarter of a clock cycle per pixel, which is far less than the three-quarters of a clock cycle per pixel for the DWT. The block diagram for the SPIHT coding phase is given in Figure 27.18. With 22 total bit planes to calculate, the design involves 66 individual data grouping and variable FIFO blocks. Although none consume a significant amount of FPGA resources individually, 66 blocks do. The entire design required 160 percent of the resources in a Virtex 2000E, and would not fit in the target system. However, by removing the lower bit planes, less FPGA resources are needed, and the architecture can easily be adjusted to fit the FPGA being used. Depending on the size of the final bitstream required, the FPGA size used in the SPIHT phase can be varied to handle the number of intermediate bitstreams generated. Removing lower bit planes is possible since the final bitstream transmits data from the highest bit plane to the lowest. In our design the lower 9-bit planes
FIGURE 27.18 A block diagram of the SPIHT coding phase.
were eliminated. Yet, without these lower planes, bitrates of up to 6 bpp can still be achieved. We found the constraint to be acceptable because we are interested in high compression ratios using low bitrates, and 6 bpp is practically a lossless signal. Since SPIHT is optimized for lower bitrates, the ability to calculate higher bitrates was not considered necessary. Alternatively, the use of a larger FPGA would alleviate the size constraint.
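The variable FIFOs described above accept groups of 0 to 37 bits and regroup them into regular 32-bit words for memory access. The following is a minimal software sketch of that packing step only, not the hardware design: the class name `VariableFifo`, its methods, and the choice to place the oldest bits in the most significant positions are assumptions made for illustration, and the stall logic for nearly full FIFOs is not modeled.

```python
class VariableFifo:
    """Software model: pack variable-length bit groups into 32-bit words."""

    WORD_BITS = 32

    def __init__(self):
        self.acc = 0         # pending bits, oldest bits in the high positions
        self.acc_bits = 0    # number of valid bits currently held in acc
        self.words = []      # completed 32-bit words, ready for a memory write
        self.total_bits = 0  # bitstream length, reported alongside the data

    def push(self, value, nbits):
        """Append the low `nbits` bits of `value` (0 <= nbits <= 37)."""
        if not 0 <= nbits <= 37:
            raise ValueError("group size outside the 0..37 bit range")
        self.acc = (self.acc << nbits) | (value & ((1 << nbits) - 1))
        self.acc_bits += nbits
        self.total_bits += nbits
        # Emit a word whenever at least 32 bits have accumulated.
        while self.acc_bits >= self.WORD_BITS:
            shift = self.acc_bits - self.WORD_BITS
            self.words.append(self.acc >> shift)
            self.acc &= (1 << shift) - 1
            self.acc_bits = shift

    def flush(self):
        """Zero-pad any leftover bits into a final word and return all words."""
        if self.acc_bits:
            self.words.append(self.acc << (self.WORD_BITS - self.acc_bits))
            self.acc = 0
            self.acc_bits = 0
        return self.words


fifo = VariableFifo()
fifo.push(0b101, 3)            # a 3-bit group
fifo.push(0, 0)                # an empty group contributes nothing
fifo.push((1 << 37) - 1, 37)   # the largest group size mentioned in the text
words = fifo.flush()
```

Keeping a running bit count next to the packed words mirrors the statement that the number of bits in each bitstream is output along with the data.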
27.5
DESIGN RESULTS
The system was designed using VHDL with models provided by Annapolis Micro Systems to access the PCI bus and memory ports. Simulations for debugging purposes were carried out with ModelSim EE 5.4e from Mentor Graphics. Synplify 6.2 from Synplicity was used to compile the VHDL code and generate a netlist. The Xilinx Foundation Series 3.1i tool set was used to place and route the design. Lastly, the peutil.exe utility from Annapolis Micro Systems generated the FPGA configuration streams.
Table 27.3 shows the speed and runtime specifications of the final architecture. All performance numbers are measured results from the actual hardware implementation. Each phase computes on separate memory blocks, which can operate at different clock rates. The design can process any square image where the dimensions are a power of 2: 16 × 16, 32 × 32, up to 1024 × 1024.
Since the WildStar board is connected to the host computer by a relatively slow PCI bus, the throughput of the entire system we built is constrained by the throughput of the PCI bus. However, since the study is on how image compression routines could be implemented on a satellite, such a system would be designed differently, and would not contain a reconfigurable board connected to some host platform through a PCI bus. Instead, the image compression routines would be inserted directly into the data path and the data transfer times would not be the bottleneck of the system. For this reason we analyzed the throughput of just the SPIHT compression engine and analyzed how quickly the FPGAs can process the images.
The throughput of the system was constrained by the discrete wavelet transform at 100 MPixels/sec. One method to increase this rate is to compute more rows in parallel. If the available memory ports accessed 128 bits of data instead of the 64 bits with our WildStar board, the number of clock cycles per pixel could be reduced by half and the throughput could double.
TABLE 27.3 Performance numbers

Phase      Clock cycles per 512 × 512 image   Clock cycles per pixel   Clock rate   Throughput        FPGA area (%)
Wavelet    182465                             3/4                      75 MHz       100 MPixels/sec   62
Magnitude  131132                             1/2                      73 MHz       146 MPixels/sec   34
SPIHT      65793                              1/4                      56 MHz       224 MPixels/sec   98
Assuming the original image consists of 8 bpp, images are processed at a rate of 800 Mbits/sec. The entire throughput of the architecture is less than one clock cycle for every pixel, which is lower than parallel versions of the DWT. Parallel versions of the DWT used complex scheduling to compute multiple wavelet levels simultaneously, which left limited resources to process multiple rows at a time. Given more resources though, they would obtain higher data rates than our architecture by processing multiple rows simultaneously. In the future, a DWT architecture other than the one we implemented could be selected for additional speed improvements. We compared our results to the original software version of SPIHT provided on the SPIHT web site [15]. The comparison was made without arithmetic coding since our hardware implementation does not perform any arithmetic coding on the final bitstream. Additionally, in our testing on sample NASA images, arithmetic coding added little to overall compression rates and thus was dropped [11]. An IBM RS/6000 Model 270 workstation was used for the comparison, and we used a combination of standard image compression benchmark images and satellite images from NASA’s web site. The software version of SPIHT compressed a 512 × 512 image in 1.101 seconds on average without including disk access. The wavelet phase, which constrains the hardware implementation, computes in 2.48 milliseconds, yielding a speedup of 443 times for the SPIHT engine. In addition, by creating a parallel implementation of the wavelet phase, further improvements to the runtimes of the SPIHT engine are possible. While this is the speedup we will obtain if the data transfer times are not a factor, the design may be used to speed up SPIHT on a general-purpose processor. On such a system the time to read and write data must be included as well. 
Our WildStar board is connected to the host processor over a PCI bus, which writes images in 13 milliseconds and reads the final datastream in 20.75 milliseconds. Even with the data transfer delay, the total speedup still yields an improvement of 31.4 times. Both the magnitude and SPIHT phases yield higher throughputs than the wavelet phase, even though they operate at lower clock rates. The reason for the higher throughputs is that both of these phases need fewer clock cycles per pixel to compute an image. The magnitude phase takes half a clock cycle per pixel and the SPIHT phase requires just a quarter. The fact that the SPIHT phase computes in less than one clock cycle per pixel, let alone a quarter, is a striking result considering that the original SPIHT algorithm is very sequential in nature and had to consider each pixel in an image multiple times per bit plane.
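The relationship between the quantities quoted above can be checked with a short calculation: throughput is the clock rate divided by the clock cycles per pixel, the slowest phase sets the engine throughput, and the engine speedup is the software time divided by the wavelet-phase time. The sketch below uses only figures quoted in Table 27.3 and the text; the names `PHASES` and `throughput_pixels_per_sec` are illustrative.

```python
from fractions import Fraction

# (clock rate in Hz, clock cycles per pixel), as listed in Table 27.3
PHASES = {
    "Wavelet": (75_000_000, Fraction(3, 4)),
    "Magnitude": (73_000_000, Fraction(1, 2)),
    "SPIHT": (56_000_000, Fraction(1, 4)),
}


def throughput_pixels_per_sec(clock_hz, cycles_per_pixel):
    """Pixels processed per second by one phase."""
    return Fraction(clock_hz) / cycles_per_pixel


rates = {name: throughput_pixels_per_sec(*spec) for name, spec in PHASES.items()}

# The slowest phase constrains the whole compression engine.
bottleneck = min(rates, key=rates.get)

# With 8 bits per pixel at the input, the pixel rate converts to a bit rate.
bits_per_sec = rates[bottleneck] * 8

# Software time (1.101 s) over hardware wavelet-phase time (2.48 ms).
speedup = 1.101 / 2.48e-3

print(bottleneck, float(rates[bottleneck]), float(bits_per_sec), speedup)
```

The computed ratio is slightly under 444, which agrees with the quoted speedup of 443 times when the fraction is truncated.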
27.6
SUMMARY AND FUTURE WORK
In this chapter we demonstrated a viable image compression routine on a reconfigurable platform. We showed how by analyzing the range of data processed by each section of the algorithm, it is advantageous to create optimized memory
structures as with our variable fixed-point work. Doing so minimizes memory usage and yields efficient data transfers. Here each bit transferred between memory and the processor board directly impacted the final results. In addition, our Fixed Order SPIHT modifications illustrate how, by making slight adjustments to an existing algorithm, it is possible to dramatically increase the performance in a custom hardware implementation and simultaneously yield essentially identical results. With Fixed Order SPIHT the throughput of the system increased by over an order of magnitude while still matching the original algorithm’s PSNR curve. This SPIHT work was part of a development effort funded by NASA.
References
[1] V. R. Algazi, R. R. Estes. Analysis-based coding of image transform and subband coefficients. Applications of Digital Image Processing XVIII, SPIE Proceedings 2564, 1995.
[2] Annapolis Microsystems. WildStar Reference Manual, Annapolis Microsystems, 2000.
[3] A. Benkrid, D. Crookes, K. Benkrid. Design and implementation of generic 2D biorthogonal discrete wavelet transform on an FPGA. IEEE Symposium on Field-Programmable Custom Computing Machines, April 2001.
[4] M. Carraeu. Hubble Servicing Mission: Hubble is fitted with a new “eye.” http://www.chron.com/content/interactive/space/missions/sts-103/hubble/archive/931207.html, December 7, 1993.
[5] C. M. Chakrabarti, M. Vishwanath. Efficient realization of the discrete and continuous wavelet transforms: From single chip implementations to mappings in SIMD array computers. IEEE Transactions on Signal Processing 43, March 1995.
[6] C. M. Chakrabarti, M. Vishwanath, R. M. Owens. Architectures for wavelet transforms: A survey. Journal of VLSI Signal Processing 14, 1996.
[7] T. Cormen, C. Leiserson, R. Rivest. Introduction to Algorithms, MIT Press, 1997.
[8] T. W. Fry. Hyper Spectral Image Compression on Reconfigurable Platforms, Master’s thesis, University of Washington, Seattle, 2001.
[9] R. C. Gonzalez, R. E. Woods. Digital Image Processing, Addison-Wesley, 1993.
[10] A. Graps. An introduction to wavelets. IEEE Computational Science and Engineering 2(2), 1995.
[11] T. Owen, S. Hauck. Arithmetic Compression on SPIHT Encoded Images, Technical report UWEETR-2002–2007, Department of Electrical Engineering, University of Washington, Seattle, 2002.
[12] K. K. Parhi, T. Nishitani. VLSI architectures for discrete wavelet transforms. IEEE Transactions on VLSI Systems 1(2), 1993.
[13] J. Ritter, P. Molitor. A pipelined architecture for partitioned DWT based lossy image compression using FPGAs. ACM/SIGDA Ninth International Symposium on Field-Programmable Gate Arrays, February 2001.
[14] A. Said, W. A. Pearlman. A new fast and efficient image codec based on set partitioning in hierarchical trees. IEEE Transactions on Circuits and Systems for Video Technology 6, June 1996.
[15] A. Said, W. A. Pearlman. SPIHT image compression: Properties of the method. http://www.cipr.rpi.edu/research/SPIHT/spiht1.html.
[16] H. Sava, M. Fleury, A. C. Downton, A. Clark. Parallel pipeline implementations of wavelet transforms. IEEE Proceedings Part 1 (Vision, Image and Signal Processing) 144(6), 1997.
[17] J. M. Shapiro. Embedded image coding using zero trees of wavelet coefficients. IEEE Transactions on Signal Processing 41(12), 1993.
[18] W. Sweldens. The Lifting Scheme: A new philosophy in biorthogonal wavelet constructions. Wavelet Applications in Signal and Image Processing 3, 1995.
[19] NASA. TERRA: The EOS flagship. The EOS Data and Information System (EOSDIS). http://terra.nasa.gov/Brochure/Sect 5-1.html.
[20] C. Valens. A really friendly guide to wavelets. http://perso.wanadoo.fr/polyvalens/clemens/wavelets/wavelets.html.
[21] M. Vishwanath, R. M. Owens, M. J. Irwin. VLSI architectures for the discrete wavelet transform. IEEE Transactions on Circuits and Systems, Part II, May 1995.
[22] Xilinx, Inc. The Programmable Logic Data Book, Xilinx, Inc., 2000.
[23] Xilinx, Inc. Serial Distributed Arithmetic FIR Filter, Xilinx, Inc., 1998.
CHAPTER
28
AUTOMATIC TARGET RECOGNITION SYSTEMS ON RECONFIGURABLE DEVICES
Young H. Cho
Open Acceleration Systems Research
An Automatic Target Recognition (ATR) system analyzes a digital image or video sequence to locate and identify all objects of a certain class. There are several ways to implement ATR systems, and the right one is dependent, in large part, on the operating environment and the signal source. In this chapter we focus on the implementations of reconfigurable ATR designs based on the algorithms from Sandia National Laboratories (SNL) for the U.S. Department of Defense Joint STARS airborne radar imaging platform. STARS is similar to an aircraft AWACS system, but detects ground targets. ATR in Synthetic Aperture Radar (SAR) imagery requires tremendous processing throughput. In this application, data come from high-bandwidth sensors, and the processing is time critical. On the other hand, there is limited space and power for processing the data in the sensor platforms. One way to meet the high computational requirement is to build custom circuits as an ASIC. However, very high nonrecurring engineering (NRE) costs for low-volume ASICs, and often evolving algorithms, limit the feasibility of using custom hardware. Therefore, reconfigurable devices can play a prominent role in meeting the challenges with greater flexibility and lower costs. This chapter is organized as follows: Section 28.1 describes a highly parallelizable Automatic Target Recognition (ATR) algorithm. The system based on it is implemented using a mix of software and hardware processing, where the most computationally demanding tasks are accelerated using field-programmable gate arrays (FPGAs). We present two high-performance implementations that exercise the FPGA’s benefits. Section 28.2 describes the system that automatically builds algorithm-specific and resource-efficient “hardwired” accelerators. It relies on the dynamic reconfiguration feature of FPGAs to obtain high performance using limited logic resources. 
The system in Section 28.3 is based on an architecture that does not require frequent reconfiguration. The architecture is modular, easily scalable, and highly tuned for the ATR application. These application-specific processors are automatically generated based on application and environment parameters. In Section 28.4 we compare the implementations to discuss the benefits and the trade-offs of designing ATR systems using FPGAs. In Section 28.5, we draw our conclusions on FPGA-based ATR system design.
28.1
AUTOMATIC TARGET RECOGNITION ALGORITHMS
Sandia real-time SAR ATR systems use a hierarchy of algorithms to reduce the processing demands for SAR images in order to yield a high probability of detection (PD) and a low false alarm rate (FAR).
28.1.1
Focus of Attention
As shown in Figure 28.1, the first step in the SNL algorithm is a Focus of Attention (FOA) algorithm that runs over a downsampled version of the entire image to find regions of interest that are of approximately the right size and brightness. These regions are then extracted and processed by an indexing stage to further reduce the datastream, which includes target hypotheses, orientation estimations, and target center locations. The surviving hypotheses have the full resolution data sent to an identification executive that schedules multiple identification algorithms and then fuses their results. The FOA stage identifies interesting image areas called “chips.” Then it composes a list of targets suspected to be in a chip. Having access to range and altitude information, the FOA algorithm also determines the elevation for the chip, without having to identify the target first. It then tasks the next stage with evaluating the likelihood that the suspected targets are actually in the given image chip and exactly where.
28.1.2
Second-level Detection
The next stage of the algorithm, called Second Level Detection (SLD), takes the extracted imagery (an image chip), matches it against a list of provided target
FIGURE 28.1 The Sandia Automatic Target Recognition algorithm. (Blocks: synthetic aperture radar sensors, focus of attention, second-level detection driver, reporting module. Example report: M-47 Tank, Angle: 355°, Elevation: 10 ft.)
hypotheses, and returns the hit information for each image chip consisting of the best two orientation matches and other relevant information.
The system has a database of target models. For each target, and for each of its three different elevations, 72 templates are defined corresponding to its all-around views. The orientations of adjacent views are separated by 5 degrees. SLD is a binary silhouette matcher that has a bright mask and a surround mask that are mutually exclusive. Each template is composed of several parameters along with a “bright mask” and a “surround mask,” where the former defines the image pixels that should be bright for a match, and the latter defines the ones that should not. The bright and surround masks are 32×32 bitmaps, each with about 100 asserted bits. “Bright” is defined relative to a dynamic threshold.
On receiving tasks from the FOA, the SLD unit compares all of the stored templates for this target and elevation and the applicable orientations with the image chip, and computes the level of matching (the “hit quality”). The two hits with the highest quality are reported to the SLD driver as the most likely candidates to include targets. For each hit, the template index number, the exact position of the hit in the search area, and the hit quality are provided. After receiving this information, the SLD driver reports it to the ATR system.
The purpose of the first step in the SLD algorithm, called the shape sum, is to distinguish the target from its surrounding background. This consists of adaptively estimating the illumination for each position in the search area, assuming that the target is at that orientation and location. If the energy is too little or too much, no further processing for that position for that template match is required. Hence, for each mask position in the search area, a specific threshold value is computed as in equation 28.1.

$$SM_{x,y} = \sum_{u=0}^{31} \sum_{v=0}^{31} B_{u,v}\, M_{x+u,\,y+v} \tag{28.1}$$

$$TH_{x,y} = \frac{SM_{x,y}}{BC} - Bias \tag{28.2}$$
The next step in the algorithm distinguishes the target from the background by thresholding each image pixel with respect to the threshold of the current mask position, as computed before. The same pixel may be above the threshold for some mask positions but below it for others. This threshold calculation determines the actual bright and surround pixel for each position. As shown in equation 28.2, it consists of dividing the shape sum by the number of pixels in the bright mask and subtracting a template-specific Bias constant. As shown in equation 28.3, the pixel values under the bright mask that are greater than or equal to the threshold are counted; if this count exceeds the minimal bright sum, the processing continues. On the other hand, the pixel
values under the surround mask that are less than the threshold are counted to calculate the surround sum as shown in equation 28.4. If this count exceeds the minimal surround sum, it is declared a hit.

$$BS_{x,y} = \sum_{u=0}^{31} \sum_{v=0}^{31} B_{u,v}\, \bigl( M_{x+u,\,y+v} \ge TH_{x,y} \bigr) \tag{28.3}$$

$$SS_{x,y} = \sum_{u=0}^{31} \sum_{v=0}^{31} S_{u,v}\, \bigl( M_{x+u,\,y+v} < TH_{x,y} \bigr) \tag{28.4}$$
Once the position of the hit is determined, we can calculate its quality by taking the average of bright and surround pixels that were correct, as shown in equation 28.5. This quality value is sent back to the driver with the position to determine the two best targets.

$$Q_{x,y} = \frac{1}{2} \left( \frac{BS_{x,y}}{BC} + \frac{SS_{x,y}}{SC} \right) \tag{28.5}$$
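The chain of equations 28.1 through 28.5 can be sketched in software for a single mask position. This is a minimal illustration under stated assumptions, not the Sandia implementation: the function name `sld_match`, the parameter names, the nested-list data layout, and the `chip[x + u][y + v]` index order are all chosen here for readability. The early energy check mentioned in the text is omitted because its limits are not given, and BC and SC are taken to be the numbers of asserted bits in the bright and surround masks.

```python
def sld_match(chip, bright, surround, x, y, bias, min_bright, min_surround):
    """Evaluate one template at offset (x, y).

    Returns the hit quality Q (equation 28.5), or None if the position is
    rejected by the bright-sum or surround-sum test.
    """
    n = len(bright)
    offsets = [(u, v) for u in range(n) for v in range(n)]
    bc = sum(bright[u][v] for u, v in offsets)     # asserted bright-mask bits
    sc = sum(surround[u][v] for u, v in offsets)   # asserted surround-mask bits

    # Equation 28.1: shape sum of the pixels under the bright mask.
    sm = sum(bright[u][v] * chip[x + u][y + v] for u, v in offsets)
    # Equation 28.2: position-specific threshold.
    th = sm / bc - bias

    # Equation 28.3: bright-mask pixels at or above the threshold.
    bs = sum(bright[u][v] for u, v in offsets if chip[x + u][y + v] >= th)
    if bs <= min_bright:
        return None
    # Equation 28.4: surround-mask pixels below the threshold.
    ss = sum(surround[u][v] for u, v in offsets if chip[x + u][y + v] < th)
    if ss <= min_surround:
        return None

    # Equation 28.5: average fraction of correctly classified pixels.
    return 0.5 * (bs / bc + ss / sc)


# Tiny 2 x 2 masks on a 3 x 3 chip, purely for illustration.
bright = [[1, 0], [0, 1]]
surround = [[0, 1], [1, 0]]
chip = [[200, 10, 0], [20, 220, 0], [0, 0, 0]]
quality = sld_match(chip, bright, surround, 0, 0,
                    bias=20, min_bright=1, min_surround=1)
```

A full matcher would call this routine for every position in the search area and every applicable template, keeping the two highest-quality hits.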
28.2
DYNAMICALLY RECONFIGURABLE DESIGNS
FPGAs can be reconfigured to perform multiple functions with the same logic resources by providing a number of corresponding configuration bit files. This ability allows us to develop dynamically reconfigurable designs. In this section, we present an ATR system implementation of UCLA’s Mojave project that uses an FPGA’s dynamic reconfigurability.
28.2.1
Algorithm Modifications
As described previously, the current Sandia system uses 64 × 64 pixel chips and 32 × 32 pixel templates. However, the Mojave system uses chip sizes of 128 × 128 pixels and template sizes of 8 × 8 pixels. It uses different chip and template sizes in order to map into existing FPGA devices that are relatively small. A single template moves through a single chip to yield 14,641 (121 × 121) image correlation results. Assuming that each output can be represented with 6 bits, the system produces 87,846 bits per template. There is also a divide step in the Sandia algorithm that follows the shape sum operation and guides the selection of the threshold bin for the chip. This system does not implement the divide, mainly because it is expensive relative to the FPGA resources available on the design platform.
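The output counts quoted above follow from the number of positions at which a template fits entirely inside a chip. The short check below assumes that only fully overlapping positions are counted, which is consistent with the 121 × 121 figure; the function name `correlation_outputs` is illustrative.

```python
def correlation_outputs(chip_size, template_size):
    """Number of positions where a square template fits inside a square chip."""
    positions = chip_size - template_size + 1
    return positions * positions


outputs = correlation_outputs(128, 8)   # 121 positions along each axis
total_bits = outputs * 6                # 6 bits per correlation output
print(outputs, total_bits)
```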
28.2.2
Image Correlation Circuit
FPGAs offer an extremely attractive solution to the correlation problem. First of all, the operations being performed occur directly at the bit level and are dominated by shifts and adds, making them easy to map into the hardware provided by the FPGA. This contrasts, for example, with multiply-intensive algorithms
that would make relatively poor utilization of FPGA resources. More important, the sparse nature of the templates can be utilized to achieve a far more efficient implementation in the FPGA than could be realized in a general-purpose correlation device. This can be illustrated using the example of the simple template shown in Figure 28.2. In the example template shown in the figure, only 5 of the 20 pixels are asserted. At any given relative offset between the template and the chip, the correlation output is the sum of the 5 binary pixels in the chip that match the asserted bits in the template. The template can therefore be implemented in the FPGA as a simple multiple-port adder. The chip pixel values can be stored in flip-flops and are shifted to the right by one flip-flop with each clock cycle. Though correlation of a large image with a small mask is often understood conceptually in terms of the mask being scanned across the image, in this case the opposite is occurring: the template is hardwired into the FPGA while the image pixels are clocked past it.
Another important opportunity for increased efficiency lies in the potential to combine multiple templates on a single FPGA. The simplest way to do this is to spatially partition the FPGA into several smaller blocks, each of which handles the logic for a single template. Alternatively, we can try to identify templates that have some topological commonality and can therefore share parts of their adder trees. This is illustrated in Figure 28.3, which shows two templates sharing several pixels that can be mapped using a set of adder trees to leverage this overlap.
A potential advantage FPGAs have over ASICs is that they can be dynamically optimized at the gate level to exploit template characteristics. For our application, a programmable ASIC design would need to provide large general-purpose adder trees to handle the worst-case condition of summing all possible template bits, as shown in Figure 28.4. In contrast, an FPGA exploits the sparse nature of the templates and constructs only the small adder trees required. Additionally, FPGAs can optimize the design based on other application-specific characteristics.
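The idea of hardwiring a sparse template can be mimicked in software by summing only the image pixels that lie under asserted template bits. The sketch below is a behavioral model rather than a description of the circuit: the function names and the small binary template and image are hypothetical, and the list of asserted offsets stands in for the inputs of the template-specific adder tree.

```python
def template_offsets(template):
    """Return the (row, col) positions of the asserted template bits."""
    return [(r, c)
            for r, row in enumerate(template)
            for c, bit in enumerate(row) if bit]


def correlate_sparse(image, template):
    """Correlate a binary image with a binary template at every full overlap.

    Only the asserted template positions are visited, so the work per output
    scales with the number of asserted bits, not with the template area.
    """
    taps = template_offsets(template)   # plays the role of the adder inputs
    t_rows, t_cols = len(template), len(template[0])
    out_rows = len(image) - t_rows + 1
    out_cols = len(image[0]) - t_cols + 1
    return [[sum(image[r + dr][c + dc] for dr, dc in taps)
             for c in range(out_cols)]
            for r in range(out_rows)]


# Hypothetical template with five asserted bits and a small binary image.
template = [[1, 1, 1, 0],
            [1, 0, 1, 0]]
image = [[1, 0, 1, 1, 0],
         [1, 1, 1, 0, 1],
         [0, 1, 0, 1, 1]]
result = correlate_sparse(image, template)
```

A general-purpose correlator would instead multiply and accumulate over every template position, which corresponds to the large worst-case adder tree described for the programmable ASIC.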
FIGURE 28.2 An example template and a corresponding register chain with an adder tree.
FIGURE 28.3 Common hardware shared between two templates.
FIGURE 28.4 The ASIC version of the equivalent function.
28.2.3
Performance Analysis
Using a template-specific adder tree achieves significant reduction in routing complexity over a general correlation device, which must include logic to support arbitrary templates. The extent of this reduction is inversely proportional to the fraction of asserted pixels in the template. While this complexity reduction is important, alone it is not sufficient to lead to efficient implementations on FPGAs. The number of D-flip-flops required for storing the data points can cause inefficiencies in the design. Implementing these on the FPGA using the usual flip-flop–based shift registers is inefficient. This problem can be resolved by collapsing the long strings of image pixels— those not being actively correlated against a template—into shift registers, which can be implemented very efficiently on some lookup table (LUT)–based FPGAs. For example, LUTs in the Xilinx XC4000 library can be used as shift registers that delay data by some predetermined number of clock cycles. Each 16×1-bit
LUT can implement an element that is effectively a 16-bit shift register in which the internal bits cannot be accessed. A flip-flop is also needed at the output of each RAM to act as a buffer and synchronizer. A single control circuit is used to control the stepping of the address lines and the timely assertion of the write-enable and output-enable signals for all RAM-based shift register elements. This is a small price to pay for the savings in configurable logic block (CLB) usage relative to a brute-force implementation using flip-flops.
In contrast, the 256-pixel template images, like those shown in Figure 28.5, can be stored easily using flip-flop–based registers. This is because sufficient flip-flops are available to do this, and the adder tree structures do not consume them. Also, using standard flip-flop–based shift registers for image pixels in the template simplifies the mapping process by allowing every pixel to be accessed. New templates can be implemented by simply connecting the template pixels of concern to the inputs of the adder tree structures. This leads to significant simplification of automated template-mapping tools.
The resources used by the two components of target correlation, namely storage of active pixels on the FPGA and implementation of the adder tree corresponding to the templates, are independent of each other. The resources used by the pixel storage are determined by the template size and are independent of the number of templates being implemented. Adding templates involves adding new adder tree structures and hence increases the number of function generators being used. The total number of templates on an FPGA is bounded by the number of usable function generators.
The experimental results suggest that in practice we can expect to fit 6 to 10 surround templates having a higher number of overlapping pixels onto a 13,000-gate FPGA. However, intelligent grouping of compatible templates is important.
Because the bright templates are less populated than the surround templates, we estimate that 15 to 20 of them can be mapped onto the same FPGA.
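The RAM-based shift register described above can be modeled as a small memory addressed by a wrapping counter, with a single output register. The sketch below is a behavioral approximation only: the class name `RamShiftRegister` is hypothetical, the read-before-write ordering is an assumption, and the exact latency contributed by the output flip-flop in real hardware is not modeled.

```python
class RamShiftRegister:
    """Behavioral model of a 16 x 1-bit RAM used as a fixed delay line."""

    def __init__(self, depth=16):
        self.ram = [0] * depth   # storage; internal bits are not tapped
        self.addr = 0            # address stepped by the shared control circuit
        self.out = 0             # output flip-flop

    def clock(self, bit):
        """Advance one clock: read the oldest bit, then overwrite it."""
        self.out = self.ram[self.addr]
        self.ram[self.addr] = bit
        self.addr = (self.addr + 1) % len(self.ram)
        return self.out


delay = RamShiftRegister(16)
outputs = [delay.clock(b) for b in [1] + [0] * 20]
```

In this model a bit written on one clock reappears at the output 16 clocks later, and no intermediate value is visible, which is why the pixels that feed the adder trees are kept in ordinary flip-flops instead.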
FIGURE 28.5 Example of eight rotation templates of a SAR 16 × 16 bitmap image.
28.2.4
Template Partitioning
To minimize the number of FPGA reconfigurations necessary to correlate a given target image against the entire set of templates, it is necessary to maximize the number of templates in every configuration of the FPGA. To accomplish this optimization goal, we want to partition the set of templates into groups that can share adder trees so that fewer resources are used per template. The set of templates may number in the thousands, and the goal may be to place 10 to 20 of them per configuration; thus, exhaustive enumeration of all of the possible groupings is not an option. Instead, we use a heuristic method that furnishes a good, although perhaps suboptimal, solution. Correlation between two templates can establish the number of pixels in common, and it is a good starting point for comparing and selecting templates. However, some extra analysis, beyond iterative correlations on the template set, is necessary. For example, a template with many pixels correlates well with several smaller templates, perhaps even completely subsuming them, but the smaller templates may not correlate with each other and involve no redundant computations. There are two possible solutions to this. The first is to ensure that any template added to an existing group is approximately the same size as the templates already in it. The second is to compute the number of additions required each time a new template is brought in—effectively recomputing the adder tree each time. Recomputing the entire adder tree is computationally expensive and not a good method of partitioning a set of templates into subsets. However, one of the heuristics used in deciding whether or not to include a template in a newly formed partition is to determine the number of new terms that its inclusion would create in the partition’s adder tree. The assumption is that more terms would result in a significant number of new additions, resulting in a wider and deeper adder tree. 
Thus, by keeping to a minimum the number of new terms created, newly added templates do not increase the number of additions by a significant amount. Using C++, we have created a design tool to implement the partitioning process that uses an iterative approach to partitioning templates. Templates that compare well to a chosen “base” template (usually selected by largest area) are removed from the main template set and placed in a separate partition. This process is repeated until all templates are partitioned. After the partitions have been selected, the tool computes the adder tree for each partition. Figure 28.6 shows the creation of an adder tree from the templates in a partition. Within each partition, the templates are searched for shared subsets of pixels. Called terms, these subsets can be automatically added together, leading to a template description that uses terms instead of pixels. The most common addition of two terms is chosen to be grouped together, to form a new term that can be used by the templates. In this way, each template is rebuilt by combining terms in such a way that the most redundant additions are shared between templates; the final result is terms that compute entire templates. For the sample templates shown in Figure 28.6, 39 additions would be required to compute the correlations for all 5 in a naive approach. However,
FIGURE 28.6 Example of template grouping and rewritten as sums of terms.
after combining the templates through the process just described, only 17 additions are required.
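The term-sharing step described above, in which the most common addition of two terms is repeatedly grouped into a new term, can be sketched as a greedy pairing procedure. This is an illustrative reconstruction and not the C++ design tool from the text: the function names, the representation of templates as sets of integer pixel identifiers, and the tie-breaking rule are assumptions, and the example templates are hypothetical rather than those of Figure 28.6.

```python
from collections import Counter
from itertools import combinations


def naive_additions(templates):
    """Additions needed when every template gets its own adder tree."""
    return sum(len(t) - 1 for t in templates if t)


def shared_additions(templates):
    """Greedy estimate of additions when common pairs of terms are shared."""
    work = [set(t) for t in templates]
    next_term = max((p for t in work for p in t), default=0) + 1
    additions = 0
    while any(len(t) > 1 for t in work):
        # Count how many templates contain each pair of terms.
        counts = Counter()
        for t in work:
            counts.update(combinations(sorted(t), 2))
        # Pick the most common pair; ties go to the smallest pair.
        a, b = min(counts, key=lambda pair: (-counts[pair], pair))
        # One adder computes a + b; every template using both shares it.
        for t in work:
            if a in t and b in t:
                t -= {a, b}
                t.add(next_term)
        next_term += 1
        additions += 1
    return additions


# Hypothetical templates given as sets of asserted pixel identifiers.
templates = [{1, 2, 3}, {1, 2, 4}, {1, 2, 3, 4}]
print(naive_additions(templates), shared_additions(templates))
```

Each new term costs one adder, so the loop counter is the size of the shared adder network; the same kind of saving is what reduces the count from 39 to 17 additions in the example of the text.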
28.2.5
Implementation Method
For a configurable computing system, the problem of dividing hardware and software is particularly interesting because it is both a hardware and a software issue. Consider the two methods for performing addition shown in Figure 28.7. Method A, a straightforward parallel implementation requiring several FPGAs, has several drawbacks. First, the outputs from several FPGAs converge at the addition operation, which may create a severe I/O bottleneck. Second, the system is not scalable—if it requires more precision, and therefore more bit planes, more FPGAs must be added. Method B in Figure 28.7 illustrates our approach. Each bit plane is correlated individually and then added to the previous results in temporary storage. It is completely scalable to any image or template precision, and it can implement all correlation, normalization, and peak detection routines required for ATR. One drawback of method B is the cost and power required for the resulting wide temporary SRAM. Another possible drawback is the extra execution time required to run ATR correlations in serial. The ratio of performance to number of FPGAs is roughly equivalent for the two methods, and the performance gap can be closed simply by using more of the smaller method B boards. The approach of a reconfigurable FPGA connected to an intermediate memory allows us a fairly complicated flow of control. For example, the sum calculation in ATR tends to be more difficult than the image–template correlation. Thus, we may want a program that performs two sum operations and forwards the results to a single correlation. Reconfigurations for 10K-gate FPGAs are typically around 20 kB in length. Reconfiguring every 20 milliseconds gives a reconfiguration bandwidth of approximately 1 MB per FPGA per second. Coupled with the complexity of the
FIGURE 28.7 Each of eight FPGAs correlating each bit plane of the template (a). A single FPGA correlating bit planes and adding the partial sums serially (b).
flow control, this reconfiguration bandwidth can be handled by placing a small microcontroller and configuration RAM next to every FPGA. The microcontroller permits complicated flow of control, and since it addresses the configuration RAM, it frees up valuable I/O on the FPGA. The microcontroller is also important for debugging, which is a major issue in configurable systems because the many different hardware configurations can make it difficult to isolate problems. The principal components include a “dynamic” FPGA, which is reconfigured on the fly and performs most of the computing functions, and a “static” FPGA, which is configured only once and performs control and some computational functions. The EPROM holds configuration bitstreams, and the SRAM holds the input image data (e.g., the chip). Because the correlation operation involves the application of a small target template to a large chip, a first in, first out (FIFO) buffer is needed to hold the pixels being wrapped around to the next row of the template mask. The templates used in this implementation are of size 8 × 8, whereas the correlation image is 128 × 128. Each configuration of the dynamic FPGA implements a total of four template pairs (four bright templates and four surround templates). The large number of sum operations in the algorithm can be performed in parallel. This requires a total of D clock cycles, where D is each pixel’s depth of representation. Once the sum results are obtained, the correlation outputs are produced at the rate of 1 per clock cycle. Parallelism cannot be as directly exploited in this step because different pixels are asserted for different templates. However, in the limit of very large FPGAs the number of clock cycles to compute the correlation is upper-bounded by the number of possible thresholds, as opposed to the number of templates.
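The serial scheme of method B, in which each bit plane is correlated individually and added to the previous results, can be illustrated for the shape sum. The sketch below assumes that the partial sum for bit plane b is weighted by 2^b before accumulation; the text does not spell out the weighting, so this is the natural reading and not a confirmed detail. The function names and the small data set are hypothetical.

```python
def bit_plane(image, b):
    """Extract bit b of every pixel as a binary image."""
    return [[(pixel >> b) & 1 for pixel in row] for row in image]


def masked_sum(plane, mask, x, y):
    """Sum the values of `plane` under the asserted bits of `mask` at (x, y)."""
    return sum(mask[u][v] * plane[x + u][y + v]
               for u in range(len(mask))
               for v in range(len(mask[0])))


def shape_sum_serial(image, mask, x, y, depth=8):
    """Accumulate one weighted partial sum per bit plane, as in method B."""
    total = 0
    for b in range(depth):
        total += masked_sum(bit_plane(image, b), mask, x, y) << b
    return total


image = [[200, 10], [20, 220]]   # 8-bit pixels
mask = [[1, 0], [0, 1]]
serial = shape_sum_serial(image, mask, 0, 0)
direct = masked_sum(image, mask, 0, 0)
```

Because the sum is linear in the pixel values, the serial result equals the direct multi-bit sum, and one pass per bit plane matches the statement that the sum takes D clock cycles for a pixel depth of D.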
28.3 RECONFIGURABLE STATIC DESIGN

Although the idea of reusing reconfigurable hardware to dynamically perform different functions is unique to FPGAs, the main weaknesses of dynamic FPGA reconfiguration are the lengthy time and additional resources required for FPGA reconfiguration and design compilation. Although reconfiguration time has improved dramatically over the years, any time spent on reconfiguration is time that could be used to process more data.

Unlike the dynamically reconfigurable architecture described in the previous section, the FPGA design described here does not require complete design reconfiguration. However, like the previous system, it uses a number of parameters to build a highly pipelined custom design that maximizes utilization of the design space and exploits the parallelism in the algorithm.
28.3.1 Design-specific Parameters
To verify our understanding of the algorithm, we first implemented a software simulator and ran it on a sample dataset. Our simulations reproduced the expected results. Over time this algorithm simulator became a full hardware simulator and verifier. It also allowed us to investigate various design options before implementing them in hardware. The dataset includes 2 targets, each with 72 templates for 5-degree orientation intervals. In total, then, we have 144 bright masks and 144 surround masks, each a 32 × 32 bitmap. The dataset also includes 16 image chips, each with 64 × 64 pixels at 1 byte per pixel. Given a template and an image, there are 441 matrix correlations that must take place for each mask. This corresponds to 21 search rows, each 21 positions wide. The total number of search row correlations for the sample data and templates is thus 48,384. The behavior of the simulator on the sample dataset revealed a number of algorithm-specific characteristics. Because the design architecture was developed for reconfigurable devices, these characteristics are incorporated to tune the hardware engine for the best cost and performance.
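The counts quoted for the sample dataset can be reproduced with a few lines of arithmetic. This is a sketch under one assumption that the text does not state explicitly: that the 48,384 figure counts one search-row correlation per template, per image chip, per search row. The constant names are illustrative only.

```python
TARGETS = 2
ORIENTATIONS = 72        # 360 degrees covered in 5-degree intervals
CHIPS = 16               # image chips in the sample dataset
SEARCH_ROWS = 21
POSITIONS_PER_ROW = 21

templates = TARGETS * ORIENTATIONS                     # bright/surround pairs
correlations_per_mask = SEARCH_ROWS * POSITIONS_PER_ROW
# Assumption: one search-row correlation per template, chip, and search row.
search_row_correlations = templates * CHIPS * SEARCH_ROWS
```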
28.3.2 Order of Correlation Tasks
Correlation tasks for threshold calculation (equation 28.2), bright sum (equation 28.3), and surround sum (equation 28.4) are very closely related. Valid results for all three must exist in order to calculate the quality of the hit, so an invalid result from any one of them makes the other calculations unnecessary. For the data samples, about 60 percent of the surround sums and 40 percent of the threshold results were invalid, while all of the bright sum results were valid. The low rejection rate by bright sum is the result of the threshold being computed using only the bright mask, regardless of the surround mask. The threshold is computed from the same pixels used for computing bright sum, so we find that, for a typical dataset, checking for invalid surround sums before the other calculations drastically reduces the total number of calculations needed.

Zero mask rows

Each mask has 32 rows. However, many have all-zero rows that can be skipped. By storing with each template a pointer to its first nonzero row we can skip directly to that row “for free.” Embedded all-zero rows are also skipped. The simulation tools showed that, for our template set, this optimization significantly reduces the total computational requirements. For the sample template set, there are a total of 4608 bitmap rows of each mask type to use in the correlation tasks. Out of 4608 bright rows, only 2206 are nonzero, and out of 4608 surround rows, 2815 are nonzero. Since the bright mask is used for both threshold and bright sum calculations, and the surround mask is used once, skipping the zero rows reduces the number of row operations from 13,824 to 7227, or about 52 percent of the original count (a savings of about 48 percent).

It is also possible to reduce the computation by skipping zero columns. However, as will be described in the following section, the FPGA implementation works on an entire search row concurrently. Hence, skipping rows reduces time, but skipping columns only reduces the number of active elements that work in parallel, yielding no savings.
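The zero-row optimization and the row-operation counts can be sketched in software as follows. This is a minimal model, assuming masks are stored as lists of 0/1 rows; the helper names are hypothetical and the loop stands in for the row pointer mechanism described above rather than reproducing it.

```python
def nonzero_rows(mask):
    """Indices of the mask rows that contain at least one 1 bit."""
    return [r for r, row in enumerate(mask) if any(row)]


def masked_sum_skipping(image, mask, top, left):
    """Sum the image pixels under the 1 bits of mask, visiting nonzero rows only."""
    total = 0
    for r in nonzero_rows(mask):          # all-zero rows are never touched
        for c, bit in enumerate(mask[r]):
            if bit:
                total += image[top + r][left + c]
    return total


# Row-operation counts quoted in the text.
TOTAL_ROWS = 144 * 32                     # 4608 rows for each mask type
BRIGHT_NONZERO, SURROUND_NONZERO = 2206, 2815
before = 3 * TOTAL_ROWS                   # bright mask used twice, surround once
after = 2 * BRIGHT_NONZERO + SURROUND_NONZERO
remaining = after / before                # fraction of row operations still needed
```

Evaluating `remaining` gives roughly 0.52, so a little under half of the row operations are skipped for this template set.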
28.3.3 Reconfigurable Image Correlator
Although it is possible to reconfigure FPGAs dynamically, the time spent on context switching and reconfiguration could be used instead to process data on a register-based static design. For this reason, minimizing reconfiguration time during computation is essential in effective FPGA use. Nevertheless, when we use FPGAs as compute engines, reconfiguration allows the hardware to take on a large range of task parameters.

The SLD tasks represented in equations 28.1, 28.3, and 28.4 are image correlation calculations on sliding template masks with radar images. To explain our design strategies, we examine each equation by applying the algorithm on a small dataset consisting of a 6 × 6 pixel image, a 3 × 3 mask bitmap, and a 4 × 4 result matrix. For this dataset, the shape sum calculation for a mask requires multiplying all 9 mask bits with the corresponding image pixels and summing them to find 1 of 16 results.

To build an efficient circuit for the sum equations 28.3 and 28.4, we write out a subset of both equations as shown in Table 28.1. By expanding the summation equations, we expose opportunities for hardware to optimize the calculations. First, the same Buv is used to calculate the nth term of all of the shape sum results. Thus, when the summation calculations are done in parallel, the Buv coefficient can be broadcast to all of the units that calculate each result. Second, the image pixel in the nth term of SMxy also appears in the (n + 1)th term of SMx,y−1, except that when v returns to 0 the image pixel is located in the subsequent row. This is useful in implementing the pipeline datapath for the image pixels through the parallel summation units.

TABLE 28.1 Expanded sum equations 28.3 and 28.4

Term     1          2          3          4          5          6          7          8          9
u        0          0          0          1          1          1          2          2          2
v        0          1          2          0          1          2          0          1          2
SM00 =   B00 M00 +  B01 M01 +  B02 M02 +  B10 M10 +  B11 M11 +  B12 M12 +  B20 M20 +  B21 M21 +  B22 M22
SM01 =   B00 M01 +  B01 M02 +  B02 M03 +  B10 M11 +  B11 M12 +  B12 M13 +  B20 M21 +  B21 M22 +  B22 M23
SM02 =   B00 M02 +  B01 M03 +  B02 M04 +  B10 M12 +  B11 M13 +  B12 M14 +  B20 M22 +  B21 M23 +  B22 M24
SM03 =   B00 M03 +  B01 M04 +  B02 M05 +  B10 M13 +  B11 M14 +  B12 M15 +  B20 M23 +  B21 M24 +  B22 M25
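The expanded sums for the small dataset can be written directly as a double loop over u and v. The sketch below assumes the indexing implied by the expanded terms, where the result at (x, y) multiplies Buv by the pixel at row x + u and column y + v; the image and mask values are hypothetical and chosen only so the result is easy to check by hand.

```python
def shape_sums(image, mask):
    """SM[x][y] = sum over u, v of mask[u][v] * image[x + u][y + v]."""
    n, k = len(image), len(mask)
    size = n - k + 1                      # 6 - 3 + 1 = 4 results per axis
    return [[sum(mask[u][v] * image[x + u][y + v]
                 for u in range(k) for v in range(k))
             for y in range(size)]
            for x in range(size)]


# Hypothetical data: pixel value 6*r + c at row r, column c.
image = [[6 * r + c for c in range(6)] for r in range(6)]
mask = [[1, 0, 1],
        [0, 1, 0],
        [1, 0, 1]]
sm = shape_sums(image, mask)
```

Each of the 16 entries of `sm` uses all 9 mask bits, matching the 4 × 4 result matrix described for the 6 × 6 image and 3 × 3 mask.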
FIGURE 28.8 A systolic image array pipeline. [Diagram labels: 4-byte image word pipeline holding M00 to M05 and M10 to M17; 1-byte (pixel) pipeline through units U0, U1, U2, U3; 1-bit (mask) broadcast of B00, B01, B02, B10, B11, ...]
Based on the characteristics of the expanded equations, we can build a systolic computation unit as in Figure 28.8. To save time while changing the rows of pixels, the pixel pipeline can either operate as a pipeline or be directly loaded from another set of registers. At every clock cycle, each Uy unit performs one operation, v is incremented modulo 3, and the pixel pipeline shifts by one stage (U1 to U0 , U2 to U1 , . . . ). When v returns to 0, u is incremented modulo 3, and the pixel pipeline is loaded with the entire (u + x)th row of the image. When u returns to 0, the results are offloaded from the Uy stage, their accumulators are cleared, and x is incremented modulo 4. When x returns to 0, this computing task is completed. The initial loading of the image pixel pipeline is from the image word pipeline, which is word wide and so four times faster than the image pixel pipeline. This speed advantage guarantees that the pipeline will be ready with the next image row data when u returns to 0.
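The control sequence described above can be checked with a small software model. This is a sketch under simplifying assumptions, not the hardware design: the pixel pipeline is a Python list reloaded with a whole image row whenever v returns to 0, the word-wide load pipeline is not modeled, and the function names and sample data are hypothetical. A direct correlation is included only as a reference to compare against.

```python
def direct_shape_sums(image, mask):
    """Reference: SM[x][y] = sum of mask[u][v] * image[x + u][y + v]."""
    k = len(mask)
    rows = len(image) - k + 1
    cols = len(image[0]) - k + 1
    return [[sum(mask[u][v] * image[x + u][y + v]
                 for u in range(k) for v in range(k))
             for y in range(cols)]
            for x in range(rows)]


def systolic_shape_sums(image, mask):
    """Model of units U0..U3 fed by a shifting pixel pipeline."""
    k = len(mask)
    rows = len(image) - k + 1
    units = len(image[0]) - k + 1
    results = []
    for x in range(rows):                    # x advances after each offload
        acc = [0] * units                    # accumulators cleared for this x
        for u in range(k):
            pipeline = list(image[x + u])    # load the entire (u + x)th row
            for v in range(k):
                bit = mask[u][v]             # Buv broadcast to every unit
                for y in range(units):
                    acc[y] += bit * pipeline[y]   # unit Uy sees M[x+u][y+v]
                pipeline = pipeline[1:] + [0]     # shift: U1 to U0, U2 to U1, ...
        results.append(acc)                  # offload the results for this x
    return results


image = [[6 * r + c for c in range(6)] for r in range(6)]  # hypothetical pixels
mask = [[1, 0, 1], [0, 1, 0], [1, 0, 1]]                   # hypothetical mask
```

The inner loop over `y` is what the hardware performs in a single clock cycle, so the model takes 9 steps per result row where a sequential implementation would take 36.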
28.3.4 Application-specific Computation Unit
Developing different FPGA mappings for equations 28.1, 28.3, and 28.4 in parallel processing units is one way to implement the design. At the end of each stage, the FPGA device is reconfigured with the optimal structure for the next task. As appealing as this may sound, current FPGA devices have typical reconfiguration times of tens of milliseconds, during which the reconfiguring logic cannot be used for computation. As presented in Section 28.3, each set of template configurations also has to be designed and compiled before any computation can take place. This can be a time-consuming procedure that does not allow dynamic template sets to be immediately used in the system.

Fortunately, we can rely on the fact that FPGAs can be tuned to target-specific applications. From the equations, we derived one compact structure, shown in Figure 28.9, that can efficiently perform all ATR tasks. Since the target ATR system can be seen as “embarrassingly parallel,” the performance of the FPGA design is linearly scalable to the number of the application-specific units.

FIGURE 28.9 Computation logic for equations 28.1, 28.3, and 28.4. [Diagram labels: 8-bit image pixel Mx+u,y+v; mask bits Bu,v and Su,v; carry; LU; two accumulators; register; outputs: surround sum (SSx,y), shape sum (SMx,y), bright sum (BSx,y).]
28.4 ATR IMPLEMENTATIONS

In this section we present the implementation results of two reconfigurable Sandia ATR systems, researched and developed on different reconfigurable platforms. Both designs leverage the unique characteristics of reconfigurable devices to accelerate ATR algorithms while making efficient use of available resources. Therefore, they both outperformed existing software as well as custom ASIC solutions. By analyzing the results of the reconfigurable solutions, we examine design trade-offs in cost and performance.
28.4.1 A Dynamically Reconfigurable System
All of the component technologies described in this chapter have been designed, implemented, tested, and debugged using the Mojave board shown in Figure 28.10. This section discusses various performance aspects of the complete system, from abstract template sets through application-specific CAD tools and finally down to the embedded processor and dynamic FPGA. The current hardware is connected to a video camera rather than a SAR data source, though this is only necessary for testing and early evaluation.

FIGURE 28.10 Photograph of second-generation Mojave ATR system.

The results presented here are based on routing circuits to two devices: the Xilinx 4013PG223-4 FPGA and the Xilinx 4036. Xilinx rates the capacity of these parts as 13K and 36K equivalent gates. Table 28.2 presents data on the effectiveness of the template-partitioning phase. Twelve templates were considered for this comparison: in one case they were randomly divided into three partitions; in the other, the CAD tool was used to guide the process. The randomly selected partitions required 33 percent more CLBs than those produced by the intelligent partitioning tool. These numbers account for the hardware requirements of the entire design, including the control hardware that is common to all designs as well as the template-specific adder trees. Relative savings in the adder trees alone are higher.

TABLE 28.2 Comparison of scored and random partitioning on a Xilinx 4036

Random grouping CLB count    Initial partitioning CLB count
1961                         1491
1959                         1449
1958                         1487

Table 28.3 lists the overall resources used for both FPGAs in the system: the dynamic device used for correlation and the static support device used to implement system control features. Because the image source is a standard video camera rather than a SAR sensor, the surround template is the complement of the bright template, resulting in more hardware than would be required for true SAR templates. The majority of the flip-flops in the dynamic FPGA are assigned to holding the 8-bit chip data in a set of shift registers. This load increases as a linear function of the template size.

TABLE 28.3 Comparison of resources used for the dynamic and static FPGAs

                Flip-flops    Function generators    I/O pins
Dynamic FPGA    532           939                    54
Support FPGA    196           217                    96
Available       1536          1536                   192

Each configuration of the dynamic FPGA requires 16 milliseconds to complete an evaluation of the entire chip for four template pairs. The Xilinx 4013PG223-4 requires 30 milliseconds for reconfiguration. Thus, a total of 4 template pairs can be evaluated in 46 milliseconds, or 84 template pairs per second. This timing will increase logarithmically with the template size.

Comparing configurable machines with traditional ASIC solutions is necessary but complicated. Clearly, for almost any application, a bank of ASICs could be designed that used the same techniques as the multiple configurations of the FPGA and would likely achieve higher performance and consume less power. The principal advantage of configurable computing is that a single FPGA may act as many ASICs without the cost of manufacturing each device. If the comparison is restricted to a single IC (e.g., a single FPGA against a single ASIC of similar size), relative performance becomes a function of the hardware savings enabled by data specificity. For example, in the ATR application the templates used are quite sparse (only 5 to 10 percent of the pixels are important in the computation), which translates directly into a hardware savings that is much more difficult to realize in an ASIC. Further savings in the ATR application are possible by leveraging topological similarities across templates. Again, this is an advantage that ASICs cannot easily exploit. If the power and speed advantages of ASICs over FPGAs are estimated at a factor of 10, the configurable computing approach achieves a factor of improvement anywhere from 2 to 10 (depending on sparseness and topological properties) for the ATR application.
28.4.2 A Statically Reconfigurable System
The FPGA nodes developed by Myricom integrate reconfigurable computing with a 2-level multicomputer to promote flexibility of programmable computational components in a highly scalable network architecture. The Myricom FPGA nodes and their motherboard are shown in Figure 28.11. The daughter nodes are 2-level multicomputers whose first level provides the general-purpose infrastructure of the Myrinet network using the LANai RISC microprocessor. The FPGA functions as a second-level processor responsible for application-specific tasks. The host is a SparcStation IPX running SunOS 4.1.3 with a Myrinet interface board having a 512K memory. The FPGA node, consisting of Lucent Technologies’ ORCA FPGA 40K and Myricom’s LANai 4.1 running at 3.3 V and 40 MHz, communicates with the host through an 8-port Myrinet switch.

FIGURE 28.11 A Myrinet 8-port switch motherboard with Myricom ORCA FPGA daughter nodes. Four FPGA nodes can be plugged into a single motherboard.

Without additional optimization, static implementation of the complete ATR algorithm on one FPGA node processes more than 900 templates per second. Each template requires about 450,000 iterations of 1-bit conditional accumulate for the complete shape sum calculation. The threshold calculation requires one division followed by subtraction. The bright and surround sums compare all the image pixels against the threshold results. Next, a 1-bit conditional accumulate is executed for each sum. Then the quality values are calculated using two divides, an add, and a multiply.

Given that 1-bit conditional accumulate, subtract, divide, multiply, and 8-bit compare each count as one operation, the total number of 8-bit operations to process one 32 × 32 template over a 64 × 64 image is approximately 3.1 million. Each FPGA node executes over 2.8 billion 8-bit operations per second (GOPS).

After the simulations, we found that the sparseness of the actual templates reduced their average valid rows to approximately one-half the number of total template rows. This optimization was implemented to increase the throughput by 40 percent. Further simulations revealed more room for improvements, such as dividing the shape sum in the FPGA, transposing narrow template masks, and skipping invalid threshold lines. Although these optimizations were not implemented in the FPGA, the simulation results indicated an additional 94 percent increase in throughput. Implementing all optimizations would yield a result equivalent to about a 7.75 GOPS correlator.
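The throughput figures can be sanity-checked with a short calculation. This is a rough sketch, and two things in it are assumptions rather than statements from the text: that the shape sum iteration count is the product of the 441 search positions and the 1024 mask bits, and that the 40 percent and 94 percent gains compound multiplicatively. All constant names are illustrative.

```python
POSITIONS = 21 * 21             # search positions per template
MASK_BITS = 32 * 32             # bits in one template mask
# Assumption: one 1-bit conditional accumulate per mask bit per position.
shape_sum_iterations = POSITIONS * MASK_BITS

TEMPLATES_PER_SECOND = 900      # "more than 900" in the text
OPS_PER_TEMPLATE = 3.1e6        # "approximately 3.1 million"
ops_per_second = TEMPLATES_PER_SECOND * OPS_PER_TEMPLATE

BASE_GOPS = 2.8                 # "over 2.8" in the text
# Assumption: the two reported gains multiply.
projected_gops = BASE_GOPS * 1.40 * 1.94
```

Under these assumptions the iteration count comes to 451,584 and the projected rate to roughly 7.6 GOPS, which is in line with the quoted figures given that the base rate is stated only as a lower bound.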
28.4.3 Reconfigurable Computing Models
The increased performance of configurable systems comes with several costs. These include the time and bandwidth required for reconfiguration, the memory and I/O required for intermediate results, and the additional hardware required for efficient implementation and debugging. Minimizing these costs requires innovative approaches to system design.

Figure 28.12 illustrates the fundamental difference between a traditional computing model and the two reconfigurable computing architectures discussed in this chapter. The traditional processor receives simple operands from data memory, performs a simple operation in the program, and returns the result to data memory. Similarly, dynamic computing uses a small number of rapidly reconfiguring FPGAs tightly coupled to an intermediate result memory, data memory, and configuration memory. A reconfigurable custom computer is similar to a fixed ASIC device in that, usually, only one highly tuned design is configured on the FPGA; there is no need to reconfigure to perform a needed function.

FIGURE 28.12 A comparison of a traditional computing model (a) with a dynamically reconfigurable model (b) and a statically reconfigurable custom model (c). [Diagram labels: (a) instruction memory, traditional processor, data memory; (b) configuration memory, FPGA, intermediate result storage, data memory; (c) configuration memory, FPGA, data memory.]

In most cases, a custom ASIC performs far better than a traditional processor. However, traditional processors continue to be used for their programmability. FPGAs attempt to bridge the gap between custom ASICs and software by allowing designers to build custom hardware using programmable firmware. Therefore, unlike in pure ASIC designs, configuration memory is used to program the reconfigurable hardware, much as instructions in a traditional processor dictate the functionality of a program. Unlike software, once the FPGA is configured, it can function just like a custom device.

As shown in previous sections, an ATR was implemented in an FPGA using two different methods. The first implementation uses the dynamic computer model, where parts of the entire algorithm are dynamically configured to produce the final results. The second design uses simulation results to produce a highly tuned fixed design in the FPGA that does not require more than a single reconfiguration. Because of algorithm modifications made to the first design, there is no clear way to compare the two designs. However, looking deeper, we find that there is not a drastic difference in the subcomponents or the algorithm; in fact, the number of required operations for the algorithm in either design should be the same.

The adders make up the critical path of both designs. Because both designs are reconfigurable, we expect the adders used to have approximately the same performance as long as pipelining is done properly. Clever use of adders in the static design allows it to execute more than one calculation simultaneously. However, it is possible to make similar use of the hardware to increase performance in the dynamic design.

The first design optimizes the use of adders to skip all unnecessary calculations, also making each configuration completely custom. The second design has to be more general to allow some programmability. Therefore, depending on the template data, not all of the adders may be in use at all times. If all of the templates for the first design can be mapped onto a single FPGA, the first method results in more resource efficiency than the second. The detrimental effect of idle adders in the static design becomes increasingly prominent as template bitmap rows grow more sparse. On the other hand, if the templates do not all fit in a single FPGA, the first method adds a relatively large overhead because of reconfiguration latency.

Unfortunately, the customized method of the second design works against making the design smaller. Every bit in the template maps to a port of the adder engine, so the total size of the design is proportional to the total number of bits in all of the templates. Therefore, as the number of templates increases, the total design size must also increase. Ultimately, the design must be divided into several smaller configurations that are dynamically reconfigured to share a single device. From these results, we observe the strengths and weaknesses of dynamic reconfiguration in such applications. Dynamic reconfiguration allows a large custom design to successfully run in a smaller FPGA device. The trade-off is significant time overhead in the system.
28.5 SUMMARY

Like many streaming image correlation algorithms, the Sandia ATR system discussed in this chapter can be efficiently implemented on an FPGA. Because of the high degree of parallelism in the algorithm, designers can take full advantage of parallel processing in hardware while linearly scaling total throughput with available hardware resources. In this chapter we presented two different ways of implementing such a system.

The first system employs a dynamic computing model to effectively implement a large custom design using a smaller reconfigurable device. To fit, high-performance custom designs can be divided into subcomponents, which can then share a single FPGA to execute parts of the algorithm at a high speed. For the ATR algorithm, this process produced a resource-efficient design that exceeded the performance of previous custom ASIC-based systems.

The second system is based on a more generic architecture highly tuned for a given set of templates. Through extensive simulations, many parameters of the algorithm are tuned to efficiently process the incoming data. With algorithm-specific optimizations, the throughput of the system increased threefold from an initial naive implementation. Because of the highly pipelined structure of the design, the maximum clock frequency is more than three times that of the dynamic computer design. Furthermore, a larger FPGA on the platform allowed the generic processing architecture to duplicate the specifications of the original algorithm. Therefore, the raw performance of the static design was faster than the dynamically reconfigurable system.

Although the second system is a static design, it is best suited for reconfigurable platforms because of its highly tuned parameters. Since this system is reconfigurable, it is conceivable that the dynamic computational model can be applied on top of it. Thus, the highly tuned design may be implemented efficiently, even on a device with enough resources for only a fraction of the entire design.

Acknowledgments

I would like to acknowledge Professor William H. Mangione-Smith for permission to integrate publications on the Mojave project into this chapter.
References

[1] P. M. Athanas, H. F. Silverman. Processor reconfiguration through instruction-set metamorphosis. IEEE Computer 26, 1993.
[2] J. G. Eldredge, B. L. Hutchings. Run-time reconfiguration: A method for enhancing the functional density of SRAM-based FPGAs. Journal of VLSI Signal Processing 12, 1996.
[3] J. Villasenor, W. H. Mangione-Smith. Configurable computing. Scientific American 276, 1997.
[4] E. Mirsky, A. DeHon. MATRIX: A reconfigurable computing architecture with configurable instruction distribution and deployable resources. Proceedings of the IEEE International Symposium on Field-Programmable Custom Computing Machines, 1996.
[5] R. Razdan, M. D. Smith. A high-performance microarchitecture with hardware-programmable functional units. Proceedings of the 27th Annual International Symposium on Microarchitecture, pp. 172–180, 1994.
[6] G. Estrin. Organization of computer systems: The fixed plus variable structure computer. Proceedings of the Western Joint Computer Conference, 1960.
[7] M. Shand, J. Vuillemin. Fast implementations of RSA cryptography. Proceedings of the Symposium on Computer Arithmetic, 1993.
[8] K. W. Tse, T. I. Yuk, S. S. Chan. Implementation of the data encryption standard algorithm with FPGAs. Proceedings of International Symposium on Field-Programmable Logic and Applications, 1993.
[9] J. Leonard, W. H. Mangione-Smith. A case study of partially evaluated hardware circuits: Key-specific DES. Proceedings of the 7th International Workshop on Field-Programmable Logic and Applications 1304:151–160, 1997.
[10] P. M. Athanas, A. L. Abbott. Real-time image processing on a custom computing platform. IEEE Computer 28, 1995.
[11] J. G. Eldredge, B. L. Hutchings. Density enhancement of a neural network using FPGAs and run-time reconfiguration. Proceedings of the IEEE International Symposium on Field-Programmable Custom Computing Machines, 1994.
[12] J. G. Eldredge, B. L. Hutchings. RRANN: The run-time reconfiguration artificial neural network. Proceedings of the Custom Integrated Circuits Conference, 1994.
[13] B. Schoner, C. Jones, J. Villasenor. Issues in wireless coding using run-time-reconfigurable FPGAs. Proceedings of the IEEE International Symposium on Field-Programmable Custom Computing Machines, 1995.
[14] C. Chou, S. Mohanakrishnan, J. B. Evans. FPGA implementation of digital filters. Proceedings of the Fourth International Conference on Signal Processing Applications and Technology, pp. 80–88, 1993.
[15] G. Estrin, B. Bussell, R. Turn, J. Bibb. Parallel processing in a restructurable computer system. IEEE Transactions on Electronic Computers EC-12(5):747–755, December 1963.
[16] G. Estrin, R. Turn. Automatic assignment of computations in a variable structure computer system. IEEE Transactions on Electronic Computers EC-12(6):755–773, December 1963.
[17] M. J. Wirthlin, B. L. Hutchings. Improving functional density through run-time constant propagation. Proceedings of the 1997 ACM Fifth International Symposium on Field-Programmable Gate Arrays, 1997.
[18] P. Lee, M. Leone. Optimizing ML with run-time code generation. Proceedings of Programming Language Design and Implementation, 1996.
[19] D. R. Engler, T. A. Proebsting. DCG: An efficient, retargetable dynamic code generation system. Proceedings of the Sixth International Symposium on Architectural Support for Programming Languages and Operating Systems, 1994.
[20] H. Massalin. Synthesis: An Efficient Implementation of Fundamental Operating System Services, Ph.D. thesis, Columbia University, Department of Computer Science, 1992.
[21] W. H. Mangione-Smith, B. Hutchings. Configurable computing: The road ahead. Proceedings of the Reconfigurable Architectures Workshop, 1997.
[22] P. Bertin, H. Touati. PAM programming environments: Practice and experience. Proceedings of the International Symposium on Field-Programmable Custom Computing Machines, April 1994.
[23] Y. H. Cho. Optimized automatic target recognition algorithm on scalable Myrinet/field programmable array nodes. Thirty-fourth IEEE Asilomar Conference on Signals, Systems, and Computers, October 2000.
[24] K. N. Chia, H. J. Kim, S. Lansing, W. H. Mangione-Smith, J. Villasenor. High-performance automatic target recognition through data-specific very large scale integration. IEEE Transactions on Very Large Scale Integration Systems 6(3), 1998.
[25] J. Villasenor, B. Schoner, K. N. Chia, C. Zapata, H. J. Kim, C. Jones, S. Lansing, W. H. Mangione-Smith. Configurable computing solutions for automatic target recognition. Proceedings of the IEEE International Symposium on FPGAs for Custom Computing Machines, April 1996.
[26] R. Sivilotti, Y. Cho, D. Cohen, W. Su, B. Bray. Scalable network based FPGA accelerators for an automatic target recognition application. Proceedings of the International Symposium on Field-Programmable Custom Computing Machines, April 1998.
[27] R. Sivilotti, Y. Cho, W. Su, D. Cohen. Scalable, Network-connected, Reconfigurable, Hardware Accelerators for an Automatic-Target-Recognition Application, Myricom technical report, May 1998.
[28] R. Sivilotti, Y. Cho, W. Su, D. Cohen. Myricom’s FPGA-based Approach to ATR/SLD, DARPA ACS PI meeting slide presentation, November 1997.
[29] R. Sivilotti, Y. Cho, W. Su, D. Cohen. Production-quality, LANai-4-based quad-FPGA-node VME boards. http://www.myri.com/research/darpa/97a-fpga.html, October 1997.
[30] C. L. Seitz. Tactical network and multicomputer technology. http://www.myri.com/research/darpa/index.html, March 1997, July 1997, August 1998.
[31] C. L. Seitz. Two-level-multicomputer project: Summary. http://www.myri.com/research/darpa/96summary.html, July 1996.
[32] W. C. Athas, L. Seitz. Multicomputers: Message-passing concurrent computers. IEEE Computer 21, 1988.
[33] M. Shand, J. Vuillemin. Fast implementations of RSA cryptography. Proceedings of 11th Symposium on Computer Arithmetic, 1993.
[34] J. G. Eldredge, B. L. Hutchings. RRANN: The run-time reconfiguration artificial neural network. Proceedings of the IEEE Custom Integrated Circuits Conference, 1994.
[35] Xilinx, Inc. RAM-based Shift Register v9.0, LogiCORE Datasheet, Xilinx, Inc., July 13, 2006.
CHAPTER 29

BOOLEAN SATISFIABILITY: CREATING SOLVERS OPTIMIZED FOR SPECIFIC PROBLEM INSTANCES

Peixin Zhong
Department of Electrical and Computer Engineering
Michigan State University

Margaret Martonosi, Sharad Malik
Department of Electrical Engineering
Princeton University
Boolean satisfiability (SAT) is a classic NP-complete problem with a broad range of applications. There have been many projects that use reconfigurable computing to solve it. This chapter presents a review of the subject with emphasis on a particular approach that employs a backtrack search algorithm and generates solver hardware according to the problem instance. This approach utilizes the reconfigurability and fine-grained parallelism provided by FPGAs. The chapter is organized as follows: Section 29.1 is an introduction to the SAT formulation and applications. Section 29.2 describes the algorithms to solve the SAT problem. Sections 29.3 and 29.4 describe in detail two SAT solvers that use reconfigurable computing, and Section 29.5 provides a broader discussion.
29.1 BOOLEAN SATISFIABILITY BASICS

The Boolean satisfiability problem is well known in computer science [1]. Given a Boolean formula, the goal is to find an assignment to the variables so that the formula evaluates to true or 1 (it satisfies the formula), or to prove that such an assignment does not exist (the formula is not satisfiable). It has many applications, including theorem proving [5], automatic test pattern generation [2], and formal verification [3, 4].
29.1.1 Problem Formulation
The Boolean formula in an SAT problem is typically represented in conjunctive normal form (CNF), also known as product-of-sums. Each sum of literals is called a clause. A literal is either a variable or the negation of a variable, denoted with a negation symbol or a bar (such as ¬v1 or v̄1). Equations 29.1 and 29.2 are examples of simple CNFs.

$$(v_1 + v_2 + v_3)(\bar{v}_1 + v_2 + v_3)(v_1 + v_2 + \bar{v}_3)(\bar{v}_2 + \bar{v}_3) \tag{29.1}$$

or

$$(v_1 \lor v_2 \lor v_3) \land (\lnot v_1 \lor v_2 \lor v_3) \land (v_1 \lor v_2 \lor \lnot v_3) \land (\lnot v_2 \lor \lnot v_3) \tag{29.2}$$

Each sum term, such as (v1 + v2 + v3), is a clause. In the clause, v1 or ¬v1 is called a literal. It can be easily tested that v1 = 1, v2 = 1, v3 = 0 is a solution to the problem.

The SAT clauses represent implication relationships between variables. To satisfy the CNF, each clause should be satisfied (i.e., at least one literal in each clause should be 1). For a given partial assignment, if only one literal in a clause is not assigned but all others are assigned to 0, the unassigned literal is implied to be 1 to satisfy the clause. The first clause in equation 29.1 contains three possible implications. If v1 = 0 and v2 = 0, v3 is implied to be 1, denoted as ¬v1 ¬v2 ⊃ v3. Similarly, v1 = 0 and v3 = 0 imply v2 = 1, and v2 = 0 and v3 = 0 imply v1 = 1. Such implications can be used to construct powerful logic expressions. They are also the key to SAT-solving algorithms.
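The clause and implication rules can be made concrete with a short script. The sketch below encodes the formula of equation 29.2 using a signed-integer convention for literals (k for vk, −k for its negation); that encoding and the function names are choices made for this illustration, not notation from the chapter.

```python
# Each clause is a list of nonzero integers: k means v_k, -k means NOT v_k.
CNF = [[1, 2, 3], [-1, 2, 3], [1, 2, -3], [-2, -3]]


def literal_value(lit, assignment):
    """True/False for an assigned literal, None if its variable is unassigned."""
    value = assignment.get(abs(lit))
    if value is None:
        return None
    return value if lit > 0 else not value


def is_satisfied(cnf, assignment):
    """A CNF is satisfied when every clause has at least one true literal."""
    return all(any(literal_value(lit, assignment) is True for lit in clause)
               for clause in cnf)


def implied_literal(clause, assignment):
    """Return the literal forced to 1, if exactly one literal is unassigned
    and every other literal in the clause is 0; otherwise return None."""
    values = [literal_value(lit, assignment) for lit in clause]
    others_false = all(v is False for v in values if v is not None)
    if values.count(None) == 1 and others_false:
        return clause[values.index(None)]
    return None


solution = {1: True, 2: True, 3: False}   # v1 = 1, v2 = 1, v3 = 0
```

Calling `implied_literal(CNF[0], {1: False, 2: False})` returns the literal for v3, which is the first of the three implications listed for the first clause.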
29.1.2
SAT Applications
The many applications of SAT include test pattern generation [2] and model checking [3, 4]. The logic relations of a digital circuit can also be represented in SAT CNF. Each logic gate is represented by a group of clauses, with each signal represented by a variable with two possible values, 1 or 0. A circuit is represented by a conjunction of clauses representing all gates in the circuit. What follows is the transformation from simple gates to clauses: AND gate, z 100K
7.9  0.89 (144x)  8.8 (14x)
dubois20  986  11377  70.8 (14x)  11447  2.3  8.44 (117x)  10.7 (92x)
par8-1-c  0.02  12834  0.000011 (1818x)  12834  2.7  0.000035 (571x)
2.7 (
i) & 1) == 1 ? not(c_n) : c_n;
Wire cbit1 = ((c1 >> i) & 1) == 1 ? not(c_n) : c_n;
Wire constant_result = mux(cbit0,cbit1,s);
// Generate the sum bit.
Wire sum = mux(constant_result, xor(a.gw(i),and(s,b.gw(i))),c_n);
// Map all the above logic in a single LUT
Cell x = map(c_n,s,a.gw(i),b.gw(i),s_partial);
place(x,0,virtex ? maxrow - i/2 : i/2);
Wire mult_and_out = wire(1,"mult_and_out" +i);
x = new mult_and(this,c_n,a.gw(i),mult_and_out);
place(x,0,virtex ? maxrow - i/2 : i/2);
x = new muxcy(this,mult_and_out,cin,s_partial,cout);
place(x,0,i/2);
x = new xorcy(this,s_partial, cin, output.gw(i));
place(x,0,i/2);
(b)
FIGURE 31.2 Logic (a) and JHDL code (b) for the ith bit of the passAddOrConstant.
first mapped into LUTs using the map function, then relationally placed using the place function. The same place function is used to relationally place the lower-level blocks at each level of hierarchy. The overall unit is placed into a rectangular area so that it can be easily tiled in a design (see the descriptions of the adder and multiplier in Sections 31.1.2 and 31.1.3). In addition to concerns about efficiently using the LUT and providing good placement directives, there are concerns about where to pipeline the units. The major concern that largely determined the pipelining of the units presented here involves the carry-chain logic. In the Virtex family, the times to initialize and finalize the carry chain are large relative to the per-bit propagation time on the
31.1 Why Is Floating Point Difficult?
carry chain. Thus, it is necessary to avoid having cascaded carry chains in the same stage. In most cases, this constraint determines the stage mapping.
31.1.2
Adder Implementation
The most noticeable difference between integer operations and floating-point operations is in the implementation of the adder. A 64-bit registered integer adder requires 64 4-LUTs, 64 flip-flops, and the associated carry-chain logic. It can be packed into 32 slices in a Xilinx Virtex-4 (see footnote 1) or similar family. In stark contrast, a 64-bit floating-point adder requires hundreds of 4-LUTs, hundreds of flip-flops, and nearly 700 slices. The core of the differences can be seen in Figure 31.3(a). The fundamental problem is that two numbers of the form

$(-1)^{S_0} \times 2^{\mathit{exp}_0 - \mathit{bias}} \times 1.\mathit{mantissa}_0$ (31.3)

and

$(-1)^{S_1} \times 2^{\mathit{exp}_1 - \mathit{bias}} \times 1.\mathit{mantissa}_1$ (31.4)
must be added together. The signs can be the same or different, so the actual operation may be an addition or a subtraction. Worse, the exponents can differ (dramatically), so the two mantissas must be aligned before the operation can proceed. When the two are combined (different signs and different exponents), it becomes necessary to determine which number is larger so that they are subtracted in the right order. If the exponents are the same but the signs are different, the result can yield a very small mantissa, which must be normalized (i.e., the leftmost one is moved to the leftmost position) before it can be stored. Looking again at Figure 31.3(a), we can see the impact of the extra format. Each horizontal dashed line represents a register, and the vertical dashed line separates the exponent path from the mantissa path. Note that the first two stages are spent inspecting and preparing the numbers and determining whether either of the inputs is one of the special values. The third and fourth stages are needed to align the mantissas, and it is not until the fifth stage that the actual operation occurs. In the exponent path, stages six through nine clean up the exponent to handle a variety of exception conditions. The sixth and seventh mantissa stages have two parallel paths: one for rounding the result and one for computing the shift value if the result must be renormalized. The last two stages are used to renormalize the result (if needed). Figure 31.3(b) shows the approximate layout of the logic used in an implementation of the floating-point adder. For the adder implementation, it is possible to place all pipelining registers in the same slices as the logic, though some registers are placed in slices with unrelated logic. Of the total area, approximately 39 percent is used to align the mantissas prior to the actual add or subtract operation; this area includes right-shift logic and swap logic. 
These operations would be required for any floating-point format; however, the left-shift on the backend is only required because of the existence of the implicit 1 in the format. This case arises during a loss of precision when two numbers with
1 A slice is two 4-LUTs, two flip-flops, and the associated carry-chain logic in this generation.
Chapter 31 The Implications of Floating Point for FPGAs
FIGURE 31.3 Adder block (a) and adder layout (b) diagrams. Panel (a) takes inputs E0, E1, M0, and M1 through pipeline stages 1 to 9 (greater than, difference, mux, swap, shift value, overshift?, right shift, add/sub, priority encoder, denormal?, round, left shift value, left shift, subOrConstant) to outputs E and M.
identical, or very close, exponents are subtracted and require normalization. The normalization logic, including a priority encoder to locate the first 1, uses another 39 percent of the logic. For comparison, the actual add and round logic consumes only 9 percent of the area.
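The swap, align, add/subtract, and normalize steps described in this section can be traced in software. The following Python sketch is a simplification made for illustration and is not the hardware design: it assumes finite, normalized, nonzero double-precision inputs, truncates instead of rounding, and ignores denormals, overflow, and the special values. The function names are hypothetical.

```python
import struct

FRAC_BITS = 52  # explicit fraction bits in an IEEE double


def unpack(x):
    """Split a double into (sign, biased exponent, 53-bit mantissa with implicit 1)."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63
    exp = (bits >> FRAC_BITS) & 0x7FF
    mant = (bits & ((1 << FRAC_BITS) - 1)) | (1 << FRAC_BITS)
    return sign, exp, mant


def pack(sign, exp, mant):
    """Reassemble a double, dropping the implicit leading 1."""
    bits = (sign << 63) | (exp << FRAC_BITS) | (mant & ((1 << FRAC_BITS) - 1))
    return struct.unpack(">d", struct.pack(">Q", bits))[0]


def fp_add(a, b):
    sa, ea, ma = unpack(a)
    sb, eb, mb = unpack(b)
    # Swap so that the first operand has the larger magnitude.
    if (ea, ma) < (eb, mb):
        sa, ea, ma, sb, eb, mb = sb, eb, mb, sa, ea, ma
    # Align: right-shift the smaller mantissa by the exponent difference.
    mb >>= ea - eb
    # Same signs add; different signs subtract (larger minus smaller).
    m = ma + mb if sa == sb else ma - mb
    if m == 0:
        return 0.0
    # Normalize: bring the leading 1 back to bit 52, adjusting the exponent.
    while m >= (1 << (FRAC_BITS + 1)):
        m >>= 1
        ea += 1
    while m < (1 << FRAC_BITS):
        m <<= 1
        ea -= 1
    return pack(sa, ea, m)


print(fp_add(1.5, 2.25))
print(fp_add(1.0, -0.75))  # cancellation: the result needs a left shift
```

The second call shows the case discussed in the text: subtracting numbers with close exponents leaves a small mantissa, so the normalization loop must shift left.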
31.1.3
Multiplier Implementation
The relationship between a floating-point multiplication and a fixed-point multiplication is a little more unusual. A fixed-point multiplier grows with the square of the width of the input. At the core of a floating-point multiplier is a fixed-point multiplier that multiplies the mantissas. Since the mantissa is significantly narrower than the floating-point number, a 64-bit fixed-point multiplier actually has a much larger core operation than a 64-bit floating-point multiplier because the floating-point multiplier only has to multiply two 53-bit mantissas. It does, however, have a lot of other work to do that more than makes up for the difference. Floating-point multiplication starts with two numbers:

$(-1)^{S_0} \times 2^{\mathit{exp}_0 - \mathit{bias}} \times 1.\mathit{mantissa}_0$ (31.5)

$(-1)^{S_1} \times 2^{\mathit{exp}_1 - \mathit{bias}} \times 1.\mathit{mantissa}_1$ (31.6)

and produces the result:

$(-1)^{S_0 \oplus S_1} \times 2^{(\mathit{exp}_0 - \mathit{bias}) + (\mathit{exp}_1 - \mathit{bias})} \times 1.\mathit{mantissa}_0 \times 1.\mathit{mantissa}_1$ (31.7)
Conceptually, the dataflow shown in Figure 31.4(a) is quite simple. The first three stages unpack the IEEE format looking for special cases and preparing a possible denormal mantissa for the multiplier core. Stages F4 through F6 operate concurrently with the multiplier core and compute the resulting exponent and determine whether the result is denormal. The four backend stages provide shifting for creating denormal numbers, rounding, and normalization, which includes adjusting the exponent when required. Figure 31.4(b) gives the approximate layout of the logic for the front- and backends of the multiplier. The multiplier core (not shown in the figure) uses nine 17 × 17 multiplier blocks plus additional logic to sum the partial products to create a 53 × 53 multiplier core. The logic used in the core is about 40 percent of the total multiplier logic. Unlike the adder, it is not possible to place all of the required pipelining registers in slices used by the logic. The black regions in Figure 31.4(b) are either unused or used by pipelining registers. The logic required to support the IEEE format is nontrivial. Support for denormals consumes 40 percent of the multiplier area and includes logic to gather information about the mantissa, swap the mantissa, and shift the mantissa. Thus, supporting denormals requires approximately the same amount of logic resources as the multiplier core. An additional 7 percent of the area is used for rounding and normalization to put the number back into the IEEE format.
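The field-level arithmetic of the multiplication (sign XOR, exponent addition, 53 × 53 mantissa product, and a one-bit normalization) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the pipelined design: inputs are finite, normalized, and nonzero; the product is truncated rather than rounded; and denormal results, overflow, and special values are ignored. The function names are hypothetical.

```python
import struct

BIAS = 1023      # exponent bias of an IEEE double
FRAC_BITS = 52   # explicit fraction bits


def unpack(x):
    """Split a double into (sign, biased exponent, 53-bit mantissa with implicit 1)."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63
    exp = (bits >> FRAC_BITS) & 0x7FF
    mant = (bits & ((1 << FRAC_BITS) - 1)) | (1 << FRAC_BITS)
    return sign, exp, mant


def pack(sign, exp, mant):
    """Reassemble a double, dropping the implicit leading 1."""
    bits = (sign << 63) | (exp << FRAC_BITS) | (mant & ((1 << FRAC_BITS) - 1))
    return struct.unpack(">d", struct.pack(">Q", bits))[0]


def fp_mul(a, b):
    sa, ea, ma = unpack(a)
    sb, eb, mb = unpack(b)
    sign = sa ^ sb             # sign of the result is S0 xor S1
    exp = ea + eb - BIAS       # (exp0 - bias) + (exp1 - bias), re-biased
    prod = ma * mb             # 53 x 53 bit core multiply, up to 106 bits
    # Two mantissas in [1, 2) give a product in [1, 4): renormalize by one bit.
    if prod >= (1 << (2 * FRAC_BITS + 1)):
        prod >>= 1
        exp += 1
    mant = prod >> FRAC_BITS   # keep the top 53 bits (truncation, no rounding)
    return pack(sign, exp, mant)


print(fp_mul(1.5, 2.5))
print(fp_mul(-3.0, 7.0))
```

Everything the sketch leaves out (denormal handling, rounding, exception flags) corresponds to the front-end and back-end stages that the text identifies as a large share of the multiplier area.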
FIGURE 31.4 Multiplier block (a) and multiplier layout (b) diagrams. Panel (a) takes inputs E0, E1, M0, and M1 through front-end stages F1 to F6, the multiplier core, and back-end stages B1 to B4 to outputs E and M.
31.2
FLOATING-POINT APPLICATION CASE STUDIES
Floating-point applications that are appropriate to map to FPGAs differ dramatically from the integer applications that are typically mapped to FPGAs. The differences can be understood by realizing that a single floating-point operation can easily consist of 30 integer operations; thus, where a 2005-era FPGA can easily implement 1000 integer operations, it is more likely to implement only 32 double-precision floating-point operations. Furthermore, floating-point operations have much higher latency than the corresponding integer operations, which significantly affects designs.
This section considers three kernel operations implemented with double-precision floating point to demonstrate three important considerations when using floating-point operations on FPGAs. The first operation is matrix multiply, which demonstrates the FPGA's ability to exploit high degrees of parallelism and to programmably manage local storage to significantly reduce the amount of external RAM bandwidth needed. The second kernel is a vector dot product, which highlights the ability of the FPGA to provide large amounts of RAM bandwidth as well as the limitations introduced by the high latency of the floating-point units. The third kernel is the fast Fourier transform (FFT), which, like the matrix multiply, can reduce the need for memory bandwidth but, like the dot product, is limited by the latency of the floating-point units.
31.2.1
Matrix Multiply
The standard matrix multiply (the DGEMM BLAS routine) is defined as:

$C_{ij} \mathrel{+}= \sum_{k=0}^{N-1} A_{ik} B_{kj}$ (31.8)
The operation multiplies two matrices and adds the product to a third (in place). Conceptually, this means performing the dot product of a single row of A with a single column of B and adding the result to a single point of C. Each dot product is completely independent, which means there are $N^2$ independent dot products. In practice, neither microprocessors nor FPGAs implement it this way because of the nature of modern memory hierarchies. In all modern systems (including FPGAs), main memory is “far away” and there are one or more caches significantly “closer.” The primary performance characteristic of matrix multiply is that it does $O(N^3)$ operations on $O(N^2)$ data. Thus, for every data item loaded from memory, it should be hypothetically possible to do $O(N)$ operations. Performing matrix multiplication as a series of independent dot products would throw away this advantage; thus, all matrix multiply implementations attempt to exploit some form of locality within the cache structure.
FIGURE 31.5 Block decomposition of a matrix multiply: the blocks [C1 C2; C3 C4] are computed as [A1 A2; A3 A4] × [B1 B2; B3 B4] + [C1 C2; C3 C4].
FPGA implementation
To understand an FPGA implementation of matrix multiply, it helps to first understand how it is done on a microprocessor. To exploit (or rather compensate for) the nature of modern memory hierarchies, the typical approach to matrix multiplication on a microprocessor breaks the matrices into smaller S × S blocks [16]. A given block from each matrix is loaded into the processor, a matrix multiply is performed on the block, and partial results are stored. An example for an 8 × 8 matrix multiply is shown in Figure 31.5. Each matrix is broken into four regions that are 4 × 4. A row of these blocks is then multiplied by a column of these blocks to create a 4 × 4 block of the result; thus, C1 = A1∗B1 + A2∗B3 + C1. In the process, the partial result (a 4 × 4 block) is updated two times (although typically in local storage or cache).
The same approach can be used on FPGAs. After all, FPGAs and microprocessors are similar in that they have a small amount of local memory with high bandwidth and a large amount of external, slower memory. FPGAs differ, however, in that they have a drastically larger number of floating-point units that should be kept fully utilized. Whereas microprocessors must supply inputs to two functional units per cycle, FPGAs must supply inputs to 32 functional units (in a 2005 FPGA).
A matrix multiply can be decomposed into a series of multiply–accumulate (MACC) operations that multiply the individual elements of a row with elements of a column and accumulate the result into one element of the final matrix. The MACC unit has a multiplier, an adder, and a feedback path. In an FPGA, 16 MACC units are operating concurrently. Unfortunately, the latency of the adder is very high (10 cycles). This means that we must keep at least 10 concurrent operations (row × column operations) in progress at all times to hide the latency of the adder. In a perfect world, each unit could work on a block of the matrix, with the concurrent operations happening on the independent row–column dot products in that block. Unfortunately, this would require far more internal memory than is available in typical FPGAs.
To exploit the parallelism available in FPGAs without exhausting the limited internal memory, we can further decompose the view of the problem. A simple way to view one block-level matrix multiplication is as a collection of S matrix–vector multiplications. As such, significantly more parallelism is obvious. Figure 31.6 shows an FPGA-based implementation that first decomposes the problem into blocks and then distributes portions of the work to multiply the two blocks as matrix–vector multiplications.
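The block decomposition described above can be sketched in software as a loop nest over S × S blocks. This is only an illustrative model of the blocking idea, assuming square matrices stored as lists of lists; it does not model the MACC units, FIFOs, or replication of the FPGA design, and the function name is hypothetical.

```python
def blocked_matmul_accumulate(A, B, C, S):
    """Compute C += A * B in place, one S x S block at a time."""
    n = len(A)
    for bi in range(0, n, S):            # block row of C
        for bj in range(0, n, S):        # block column of C
            for bk in range(0, n, S):    # blocks along the shared dimension
                # Multiply one block of A by one block of B and update the
                # partial result held in the corresponding block of C.
                for i in range(bi, min(bi + S, n)):
                    for j in range(bj, min(bj + S, n)):
                        acc = C[i][j]
                        for k in range(bk, min(bk + S, n)):
                            acc += A[i][k] * B[k][j]  # one multiply-accumulate
                        C[i][j] = acc
    return C


A = [[i + j for j in range(4)] for i in range(4)]
B = [[i * j + 1 for j in range(4)] for i in range(4)]
C = [[1] * 4 for _ in range(4)]
print(blocked_matmul_accumulate(A, B, C, 2))
```

With an 8 × 8 input and S = 4, each block of C is updated twice, once per block along the shared dimension, which matches the C1 = A1∗B1 + A2∗B3 + C1 example in the text.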
FIGURE 31.6 Matrix multiply implementation. Labeled elements: the A, B, and C inputs, FIFOs, replicate units, multiply (×) and add (+) units, and Store C.
To perform the full matrix multiplication, each matrix is decomposed into S × S blocks. In Figure 31.6, S is 4, but in practice, S is typically set large enough to cover the adder’s latency (currently 10 cycles). Blocks of B are broken into m columns, where m is the number of MACC units (m is assumed to be 4 in the figure); thus, independent columns of a block of B go to each MACC unit. All the blocks of A are broadcast to all MACC units. Thus, in Figure 31.6, one column of block B is multiplied by all four rows from the A block. This requires that four copies (in the general case S copies) of the B block be made by the replicate unit. This creates the concurrency needed to cover the latency of the adder. Matrix C is managed similarly. A block of C is loaded and distributed in the same order as the block of B, but there is no need to replicate it. In addition, taking the example from Figure 31.5, two A blocks and two B blocks are needed for each C block. Thus, A1, B1, and C1 are loaded and used to create an intermediate product C1 − 2 that is used as the C block when A2 and B3 are multiplied. Overall, this requires no more than $6S^2$ elements of storage at 8 bytes
per element. This includes two copies of each matrix block: one to operate on and one to change it from row-major to column-major order.
Performance
By nature, a matrix multiply requires at least $4N^2$ memory accesses (see footnote 2) and performs $2N^3$ floating-point operations. This yields $N/2$ floating-point operations for each element retrieved from memory, but it assumes that two matrices (A and B) can be kept resident in the chip (processor or FPGA) for the entire operation. In the perfect scenario, the maximum sustainable floating-point rate would be

$\mathrm{FLOPs} = \frac{N}{2} \times \frac{BW}{8}$ (31.9)

where BW is the memory bandwidth in bytes per second, N is the dimension of the matrix, and 8 bytes are required to store a double-precision floating-point number. While this is unrealistic for all but relatively small matrices, using blocking techniques [16] to manage the local storage makes it possible to sustain a high percentage of peak performance with relatively low memory bandwidth. The result is that the matrices are fetched several times more than would otherwise be necessary. For blocks of dimension S, this yields a factor of $N/S$ increase in accesses to the A and B matrices, leading to $2N^2 + 2N^3/S$ memory accesses. For large matrices, this approaches a floating-point rate of

$\mathrm{FLOPs} = S \times \frac{BW}{8}$ (31.10)
This is shown in Figure 31.7(a) as MFLOP/s versus MB/s on a log–log graph. Delineations that map memory bandwidth needs to the generation of FPGAs are provided for clarity, based on earlier work [14, 15]. A slightly different perspective is presented in Figure 31.7(b) where the total amount of on-chip memory needed to sustain peak performance is graphed. What is notable about these graphs is the relatively small amount of memory and relatively small amount of memory bandwidth needed to sustain peak performance on FPGAs. This stands in stark contrast to modern microprocessors (2005 era) that only sustain 85 to 90 percent of peak performance on a matrix multiply using several times as much on-chip memory and off-chip memory bandwidth. This is a product of the ability of the FPGA to directly manage local storage and to separate data prefetching from computation. We can also compare performance over time using data from 2004 [see 14, 15]. Table 31.2 shows parts used for comparison. The performance of FPGAs gained rapidly on microprocessors during this era, as shown in Figure 31.8.
2 This assumes square matrices and includes retrieving three matrices and storing one matrix.
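The two bandwidth-bound rates can be checked numerically. The sketch below reads equation 31.9 as FLOPs = (N/2) × (BW/8) and equation 31.10 as FLOPs = S × (BW/8), with the blocked access count taken as 2N² + 2N³/S; the function names and the bandwidth, matrix size, and block size in the example call are hypothetical values chosen for illustration.

```python
def ideal_flops(n, bw_bytes_per_s):
    """Equation 31.9: A and B stay resident on chip for the whole operation."""
    return (n / 2) * (bw_bytes_per_s / 8)


def blocked_accesses(n, s):
    """Memory accesses with S x S blocking: 2N^2 for C plus 2N^3/S for A and B."""
    return 2 * n**2 + 2 * n**3 / s


def blocked_flops(n, s, bw_bytes_per_s):
    """Sustained rate: 2N^3 operations spread over the blocked memory traffic."""
    words_per_s = bw_bytes_per_s / 8  # 8 bytes per double-precision word
    return 2 * n**3 / blocked_accesses(n, s) * words_per_s


# Hypothetical 8 GB/s memory system, N = 10^6, S = 100.
bw = 8e9
print(blocked_flops(10**6, 100, bw))  # close to the large-N limit below
print(100 * bw / 8)                   # equation 31.10: S * BW / 8
```

The first value approaches the second as N grows relative to S, which is the sense in which the text says the rate "approaches" equation 31.10 for large matrices.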
FIGURE 31.7 Maximum achievable performance versus memory bandwidth and block size (a); on-chip memory needed versus memory bandwidth and block size (b). Axes: performance (MFLOP/s) in (a) and cache required (MB) in (b), each plotted against memory bandwidth (MB/s) and block size (elements). Regions are marked as insufficient to sustain FPGA peak in 2003, 2005, 2007, and 2009, or sufficient to sustain FPGA peak in 2009.
31.2.2
Dot Product
The standard vector dot product (the DDOT BLAS routine) is the sum of the pairwise products of two vectors, or

$p = \sum_{i=0}^{N-1} x_i y_i$ (31.11)
TABLE 31.2 Parts used for performance comparison
Year  FPGA
1997  XC4085XLA-09
1999  Virtex 1000-5
2000  Virtex-E 3200-7
2001  Virtex-II 6000-5
2003  Virtex-II Pro 100-6
CPU: Pentium 266 MHz, Athlon 1.2 GHz, Pentium-4 3.2 GHz

FIGURE 31.8 Matrix multiply performance of FPGAs and microprocessors from 1997–2003. Axes: MFLOP/s (logarithmic, 10 to 100000) versus year. Series: CPU matrix multiply, CPU matrix multiply trend, FPGA matrix multiply, FPGA matrix multiply trend.
which requires 2N memory accesses to perform 2N floating-point operations. This means that a double-precision floating-point number (8 bytes) must be fetched from memory for every floating-point operation that will be done. Modern processors are not built with this type of balance between memory bandwidth and floating-point capability. A processor capable of providing 5 GFLOP/s may only have 6.4 GB/s of memory bandwidth. Streaming problems (like this one) provide FPGAs an opportunity to excel: processors have a fixed memory bandwidth that is configured based on a balance between the requirements for various markets and the cost of providing that bandwidth. In contrast, each board containing an FPGA can decide how many FPGA pins are used for memory bandwidth, including dedicating almost all available user pins to memory connections.
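As a worked check of the imbalance in the example processor above, assuming one 8-byte operand must be fetched per floating-point operation and that memory bandwidth is the only limit:

```latex
\frac{6.4 \times 10^{9}\ \text{bytes/s}}{8\ \text{bytes per operand}}
  = 0.8 \times 10^{9}\ \text{operands/s}
\quad\Longrightarrow\quad
\text{sustained rate} \approx 0.8\ \text{GFLOP/s}
  = \frac{0.8}{5} = 16\%\ \text{of the 5 GFLOP/s peak}
```

Under these assumptions the dot product on that processor is bound by memory bandwidth long before it is bound by the floating-point units.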
FPGA implementation
Although the potential for increased memory bandwidth on an FPGA gives it a distinct advantage, it also faces significant challenges imposed by the large number of functional units and the high latency of the units. Like many BLAS routines, DDOT is based on multiply–accumulate operations; however, it differs from many BLAS routines in that it exposes a relatively limited amount of parallelism. Where a DGEMM operation computes $N^2$ independent results and a DGEMV operation computes N independent results, a DDOT operation produces a single number as the final result. This means that any partial products must be reduced through a long, slow pipeline. The nature of the problem is best realized through a comparison to microprocessors. Current microprocessors typically have a floating-point pipeline depth of four to six cycles for the functional unit running at 2 GHz or more. Obviously, we would not want every addition to depend on the previous addition, so the microprocessor can easily keep six running sums in progress and then reduce those sums to one result. This leads to several pipeline stalls in the final reduction, but the total time is a small number of nanoseconds. In contrast, FPGAs differ in three dramatic ways:
• The adder pipeline is deeper.
• Multiple MACC units are required to fully utilize high-bandwidth memory.
• The clock rate is lower.
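The running-sums strategy described above can be sketched in Python. The sketch only models the dependence structure, keeping `lanes` independent accumulators so that no addition depends on the one issued just before it and then reducing the partial sums at the end; it does not model pipeline timing. The function name and the `lanes` parameter are hypothetical, with six lanes corresponding to the microprocessor example in the text.

```python
def dot_interleaved(x, y, lanes):
    """Dot product using `lanes` independent running sums, reduced at the end."""
    partial = [0.0] * lanes
    for i, (xi, yi) in enumerate(zip(x, y)):
        # Consecutive products go to different lanes, so each accumulator
        # is updated only once every `lanes` iterations.
        partial[i % lanes] += xi * yi
    # Final reduction: pairwise sums until a single value remains.
    while len(partial) > 1:
        if len(partial) % 2:
            partial.append(0.0)
        partial = [partial[i] + partial[i + 1] for i in range(0, len(partial), 2)]
    return partial[0]


x = [float(i) for i in range(1, 11)]
y = [2.0] * 10
print(dot_interleaved(x, y, 6))
```

When the vector is shorter than the number of lanes, most accumulators stay at zero and the work is dominated by the final reduction, which is why a large number of concurrent summations is hard to exploit for short vectors.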
A modern FPGA would have tens of functional units with a pipeline depth of 10 cycles running at approximately 300 MHz. Assuming 16 adders with a pipeline depth of 10 cycles means that there must be 160 concurrent summations. This is impossible for short vectors and challenging even for longer vectors. Furthermore, the process of reducing these partial sums to a single result is slow and cumbersome. To achieve reasonable performance, additional control logic is required inside and outside the multiply–add and MACC units. First, a multiplier bypass multiplexer (labeled MB) is required in the multiply–add (Figure 31.9(b)) to reuse the adder for portions of the final summation. Second, the adder has a 10-cycle latency; thus, the MACC must perform 10 concurrent operations to keep the adder pipeline filled. This requires a second feedback path (with associated control) through the FP multiplexer in the MACC (Figure 31.9(c)) to sum the 10 results. The added logic is shown with dashed lines in Figure 31.9(b) and (c).
Performance
If we work from the memory bandwidth as the typical limiting factor, the maximum sustainable floating-point rate is

$\mathrm{FLOPs} = \frac{BW}{8}$ (31.12)
where BW is the memory bandwidth in bytes per second and 8 bytes are required to store a floating-point number. This is graphed in Figure 31.10(a)
FIGURE 31.9 A standard multiply–accumulate (a); a modified multiply–add for the dot product (b); a modified multiply–accumulate for the dot product (c). Labeled elements: multiply (×) and add (+) units, FIFOs, and the MB and FP multiplexers.
on a log–log graph. Like Figure 31.8, Figure 31.10(b) compares performance projections for both FPGAs and microprocessors [14, 15]. In this case, however, the FPGA shows a much more dramatic advantage over a microprocessor. This is because large FPGAs provide sufficient I/O resources to obtain much higher memory bandwidths than commodity microprocessors offer. Since this is a memory bandwidth-limited problem, the platform with the most memory bandwidth wins. The other notable feature of Figure 31.10(b) is that it is somewhat more crowded than the matrix multiply comparison. This is because FPGAs face a second challenge in implementing the dot product operation: the latency of the floating-point unit. Thus, the size of the vector has a much greater impact on sustained performance on the FPGA than on the microprocessor. The top FPGA line represents a scenario whereby the FPGA achieves 90 percent of its peak performance, but this requires a nearly 6000-element vector (see footnote 3). The second FPGA line shows the FPGA achieving 50 percent of peak performance by using an 800-element vector. Despite this hefty penalty, the FPGA still has a remarkable advantage (4× in 2003) over the microprocessor.
31.2.3
Fast Fourier Transform
The fast Fourier transform (FFT) is a reduced-complexity implementation of the discrete Fourier transform (DFT), which takes as input N complex numbers and returns as output N complex numbers where each of the outputs is determined by the following equation:
3 Earlier work by Underwood and Hemmert [15] specified a 7500-element vector, but the floating-point unit latency has been optimized since then.
FIGURE 31.10 Maximum achievable performance versus memory bandwidth (a) and dot product performance on FPGAs and microprocessors from 1997–2003 (b). Axes: performance (MFLOP/s) versus memory bandwidth (MB/s) in (a), with series for maximum sustained dot product performance and maximum sustained matrix vector multiply performance; MFLOP/s versus year in (b), with series for CPU dot product, peak single FPGA dot product, and 50% peak single FPGA dot product, each with an extrapolated counterpart.
$Y[j] = \sum_{k=0}^{N-1} X[k]\, W_N^{jk}$ (31.13)

where $W_N^{jk} = e^{-i 2\pi jk / N}$. The FFT exploits symmetries in the DFT and is implemented in stages, where each stage combines r items to create r outputs. The value r is known as the radix. For the implementation discussed here, r = 2 (radix-2). For the radix-2
FFT, each stage operates pairwise on the data, although there are different formulations of the algorithm that determine how the data are combined. These operations are commonly referred to as butterflies, and in the formulation used in this example, each pairwise operation is identical and consists of one complex multiply and two complex adds. This is shown graphically in Figure 31.11(a). Even after selecting the formulation that gives the structure of the butterfly, there is some flexibility in the structure of the stages. The basic stage structures are shown in Figure 31.12. Both structures require data reordering, either on the frontend or backend, and produce the identical set of computations (though in different orders). This example uses the ordering shown in Figure 31.12(b), because this structure provides an increasing number of independent datasets as the computation progresses. This approach is easier for implementations that use units in parallel to process data within a single stage since all interunit communication can reside at the front of the pipeline.

FIGURE 31.11 Basic butterfly operation (a) and basic butterfly datapath (b). The component S is a switch that directs inputs to alternate outputs. The components marked as R replicate the input once and C is a crossover to facilitate the complex multiply.

FIGURE 31.12 Variations of the 8-point, radix-2 FFTs with reordered inputs (a) and reordered outputs (b).

FPGA implementation
The butterfly computation requires four multiplications and six additions to implement one complex multiply and two complex adds. The hardware presented here uses two double-precision multiplies and three double-precision adds (see Figure 31.11(b)). Each floating-point unit is used twice for each set of inputs, which results in an average throughput of one data item per clock cycle. Although it is possible to design a datapath that accepts two data items per clock cycle, this design was chosen because it matches the available bandwidth of internal RAM blocks in the target architecture and because it provides the greatest flexibility when scaling the parallelism of the final implementation. Parallelism in the FFT computation can be exploited in two ways: (1) pipelined units, or parallelism in the stages (S), and (2) parallel units, or parallelism (P) within a stage. Three architectures, which exploit the two types of parallelism to differing degrees, are explored.

Parallel architecture
The parallel implementation exploits only parallelism within a stage (P). This is shown in Figure 31.13(a). In this implementation, data are read from external memory, processed iteratively, and written back to external memory. Each of the butterfly units operates on a subset of the data and is able to work independently of the other units for a large part of the computation (the datasets are completely independent after log2(P) stages).
The advantages of this architecture are that the utilization of the units is high, because the pipeline depth is short, and that it can take advantage of higher memory bandwidths. The disadvantages of this architecture as implemented are that it requires a large amount of internal memory and it requires a parallelism that is a power of 2. This second restriction is important because it can limit the number of butterfly units that can be used. For example, if six butterfly units fit in an FPGA, the parallel architecture is still only able to use four.

Pipelined architecture
At the other extreme, one butterfly unit can be dedicated to each of the stages of the FFT in a pipelined fashion, as illustrated in Figure 31.13(c). Data are read from memory and passed through a series of butterfly units before being written back to memory. Data delays and permutations are needed between each of the stages and between the pipelined FFT unit and DRAM memory. When the number of stages, S, that can be implemented in the FPGA is less than the number of stages needed by the FFT (log2(N)), then ⌈log2(N)/S⌉ passes to memory are needed, with the final pass using a subset, R, of the S stages. For each pass to memory, data must be read and written in a particular permutation to optimize the delay and storage requirements in the pipeline.
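The butterfly and the stage structure described in this section can be illustrated with a small software FFT. The sketch below uses a common recursive decimation-in-time formulation, which is an assumption made for the example and is not necessarily the formulation or stage ordering used in the hardware; the function names are hypothetical. Each butterfly performs one complex multiply and two complex adds, as in the text.

```python
import cmath


def butterfly(xi, xj, w):
    """One radix-2 butterfly: one complex multiply and two complex adds."""
    t = xj * w
    return xi + t, xi - t


def fft_radix2(x):
    """Recursive radix-2 FFT; len(x) must be a power of 2."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n)  # twiddle factor W_N^k
        out[k], out[k + n // 2] = butterfly(even[k], odd[k], w)
    return out


def dft(x):
    """Direct evaluation of equation 31.13, used here only as a reference."""
    n = len(x)
    return [
        sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
        for j in range(n)
    ]


data = [1, 2, 3, 4, 0, 0, 0, 0]
print(max(abs(a - b) for a, b in zip(fft_radix2(data), dft(data))))
```

An N-point transform built this way executes N/2 butterflies in each of its log2(N) levels of recursion, which corresponds to the per-stage work that the three architectures distribute across butterfly units.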
FIGURE 31.13 Three architectures: (a) parallel, (b) parallel–pipelined, and (c) pipelined for exploiting parallelism in the FFT, from using all parallelism within a single stage to using all parallelism in the stages. Each panel connects off-chip DRAM to butterfly datapaths through dataflow control; panel (a) adds on-chip data storage per butterfly datapath and shows P = 4. P is the degree of parallelism and S is the number of stages; the bandwidth is BW = 2P in (b) and BW = 2 in (c). In (b), the first log(P) stages must be able to communicate data between butterfly units in the stage.
The pipelined architecture works well when streaming a large number of small FFTs. This is because the architecture gets good performance with minimal memory bandwidth requirements. Another benefit of this architecture is that it can take advantage of parallelism at a finer granularity than the parallel version (i.e., it can use a number of processors that is not a power of 2). However, there are some major disadvantages to this architecture. First, for single FFTs, the unit utilization is low because of the depth of the overall pipeline. Second, it is unable to take advantage of higher memory bandwidth. Last, the buffer space required between stages for data reordering grows as $2^S$, where S is the number of stages in the circuit. For a large number of stages, the memory required for buffering can easily exceed available on-chip memory.

Parallel–pipelined architecture
Figure 31.13(b) is a cross between the two previous architectures. Data moves from external memory, through a set of P parallel pipelines, each with S stages, and back to external memory. The first log2(P) stages must have additional data exchange circuits (for the first pass through the pipeline) because these stages have data dependencies between the pipelines. This approach leverages the ability of the pipelined architecture to reduce bandwidth demands and the ability of the parallel architecture to tolerate shorter input vectors (as well as a wider variety of vector lengths) than the pure pipelined approach. In contrast, the parallel–pipelined hybrid has a higher bandwidth demand than the purely pipelined approach and less tolerance of short vectors than the parallel approach.

Performance
In evaluating the performance of the FFT, the floating-point operation count that is typically used is $5N\log_2(N)$; there are $\log_2(N)$ stages that each contain 5N computations (four multiplies and six additions for each pair of data).
To determine performance, it is necessary to know how long it will take the FPGA to compute the FFT. For the parallel version, the number of cycles required to complete the FFT is given by the following equation:
$$T = \frac{32N}{BW} + BL + \left(\frac{N}{P} + BL\right)\left(\log_2(N) - 2\right) \tag{31.14}$$
The first term of equation 31.14 is the time to read and then write N items based on the memory bandwidth, BW, in bytes per cycle. The usable bandwidth is limited to the number of units, P. The second term is the latency of passing through the butterfly units during the read from memory. The third term is the time to perform the iterations—using P butterfly units of latency BL for log2 (N) − 2 iterations, assuming that the first and last iterations are performed as part of reading and writing the data. The pipelined and parallel–pipelined architectures share the same equation for determining the number of clock cycles required to complete the operation. The only difference is that the pipelined architecture is limited to a
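bandwidth (BW) of 2. The number of cycles to compute the FFT for these architectures is
$$T = P(S) \times \left\lfloor \frac{\log_2(N)}{S} \right\rfloor + P(R) \tag{31.15}$$
$$P(J) = BL \times J + I(J) + \frac{2N}{BW} + (B - 1) \times 2^{J} \tag{31.16}$$
$$I(K) = \sum_{i=0}^{K-1} B \times 2^{i} \approx B \times 2^{K} \tag{31.17}$$
$$R = \log_2(N) \bmod S \tag{31.18}$$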
Each pass, P(J), through J butterfly stages (each having a latency of BL) requires the time shown in equation 31.16. Data dependencies between the stages introduce a delay that doubles at each stage and create a total interstage delay given by I(K). Using standard DRAM memories adds a penalty, associated with the burst length (B) required to maintain full memory bandwidth, to both the interstage delay and a back-end reordering time. The time to retrieve the data from memory and write them back is defined by 2N/BW. The final term of equation 31.15 represents the final pass through a subset of the stages, R, with the corresponding delays.
The preceding equations point to the fact that the best implementation for the FFT depends on many factors: memory bandwidth, size of the FFT, and size of the FPGA. The performance (in FLOPs per cycle) for a single FFT of the different FPGA architectures on a Xilinx Virtex-II Pro (a late 2005 part) is shown in Figure 31.14(a). For single short vectors, the parallel architecture provides the best performance because of the high utilization of the floating-point units. For longer FFTs, all three architectures provide good performance, though the pipelined version requires less external memory bandwidth. Figure 31.14(b) shows that the FPGA implementations (running at 160 MHz) compare favorably to microprocessors for large FFTs.
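As a rough illustration of how a cycle count translates into the quantities plotted in Figure 31.14, the following Python sketch evaluates the 5N log2(N) operation count and equation 31.14 for the parallel architecture. It is only a sketch under stated assumptions: the function names are hypothetical, N is assumed to be a power of 2, BW is taken in bytes per cycle as in equation 31.14, and the parameter values in the usage note are invented for the example.

```python
import math


def fft_flops(n):
    """Conventional FFT operation count, 5 * N * log2(N)."""
    return 5 * n * math.log2(n)


def parallel_cycles(n, bw, p, bl):
    """Cycle count for the parallel architecture, following equation 31.14.

    n  -- FFT size in elements (assumed to be a power of 2)
    bw -- memory bandwidth in bytes per cycle
    p  -- number of butterfly units
    bl -- latency of one butterfly unit in cycles
    """
    stages = math.log2(n)
    read_write = 32 * n / bw                  # read, then write, N items
    fill = bl                                 # butterfly latency during the read
    iterations = (n / p + bl) * (stages - 2)  # remaining log2(N) - 2 iterations
    return read_write + fill + iterations


def flops_per_cycle(n, cycles):
    """Sustained floating-point operations per clock cycle."""
    return fft_flops(n) / cycles


def mflops(n, cycles, clock_mhz):
    """FLOPs per cycle times millions of cycles per second."""
    return flops_per_cycle(n, cycles) * clock_mhz
```

For instance, with the purely illustrative values N = 1024, BW = 32, P = 4, and BL = 20, `parallel_cycles` gives 3252 cycles, which corresponds to about 15.7 FLOPs per cycle, or roughly 2500 MFLOPs at the 160 MHz clock rate quoted for the FPGA implementations.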
FIGURE 31.14 A comparison of performance for different FFT architectures in FLOPs per cycle (a) and a comparison of FFT implementation on FPGAs and CPUs (b). Panel (a) plots performance (FLOPs/cycle) against FFT size (64 to 262144 elements) for Parallel BW-2 units-4, Parallel BW-4 units-4, Pipelined BW-2 units-6, Parallel-pipelined BW-2 units-6 (P = 2), and Parallel-pipelined BW-4 units-6. Panel (b) plots performance (MFLOPs) against FFT size for Pentium-4 2.8 GHz, Pentium-4 3.8 GHz (estimated), Parallel BW-4 units-4, Pipelined BW-2 units-6, and Parallel-pipelined BW-4 units-6.
31.3
SUMMARY
Implementing floating-point arithmetic on FPGAs requires significant effort. Supporting the IEEE-754 standard poses unique challenges, but much of the effort is expended in coping with the interaction between exponent logic and mantissa logic. Great care is required to minimize the latency through the unit without significantly decreasing the clock rate by placing two dependent carry chains in a single pipeline stage. Even with effort, floating-point operations are significantly bigger and have significantly deeper pipelines than their fixed-point counterparts.
This adds challenges to the design of applications. Although FPGAs can now deliver impressive performance on double-precision floating-point operations, achieving it requires a very different mind-set from working with fixed-point arithmetic. Increased operation latency leads to a need to find more parallelism to exploit in paths with the cyclic data dependencies typical of iterative solutions. Simultaneously, the increased size of a single operation reduces the portion of a given dataflow graph that can be implemented directly and pushes a designer toward more iterative solutions. The dot product is an excellent example because it is forced to reuse adders to compute a summation that would typically be done as a tree of adders in a fixed-point solution. The result is that only longer vectors make sense.⁴
Even simple feedforward paths incur a penalty from the high latency of the floating-point units. The FFT provides an example in which the latency of a single butterfly path can approach the length of a short vector. Thus, if the FFT implementation in an FPGA is not used for a long FFT or a series of short FFTs, it cannot offer competitive performance.
There are, however, floating-point kernels that offer abundant parallelism for the FPGAs to exploit. Matrix multiply (DGEMM), for example, is an N^3 operation with minimal data dependencies. Similar things can be said about LU solvers, which form the basis of the traditional Linpack benchmark [4]. Three-dimensional FFTs are another example in which hundreds of one-dimensional FFTs can be carried out simultaneously.
⁴ A long series of short vectors can also be made to work using an appropriate architecture.
References
[1] P. Belanovic, M. Leeser. A library of parameterized floating-point modules and their use. Proceedings of the International Conference on Field-Programmable Logic and Applications, 2002.
[2] M. deLorimier, A. DeHon. Floating point sparse matrix-vector multiply for FPGAs. Proceedings of the ACM International Symposium on Field-Programmable Gate Arrays, February 2005.
[3] J. Dido, N. Geraudie, L. Loiseau, O. Payeur, Y. Savaria, D. Poirier. A flexible floating-point format for optimizing data-paths and operators in FPGA based DSPs. Proceedings of the ACM International Symposium on Field-Programmable Gate Arrays, February 2002.
[4] J. J. Dongarra. The linpack benchmark: An explanation. First International Conference on Supercomputing, June 1987.
[5] Y. Dou, S. Vassiliadis, G. Kuzmanov, G. Gaydadjiev. 64-bit floating-point FPGA matrix multiplication. Proceedings of the ACM International Symposium on Field-Programmable Gate Arrays, February 2005.
[6] B. Fagin, C. Renard. Field-programmable gate arrays and floating point arithmetic. IEEE Transactions on VLSI 2(3), 1994.
[7] A. A. Gaffar, W. Luk, P. Y. Cheung, N. Shirazi, J. Hwang. Automating customisation of floating-point designs. Proceedings of the International Conference on Field-Programmable Logic and Applications, 2002.
[8] G. Govindu, S. Choi, V. K. Prasanna, V. Daga, S. Gangadharpalli, V. Sridhar. A high-performance and energy-efficient architecture for floating-point based LU decomposition on FPGAs. Proceedings of the 11th Reconfigurable Architectures Workshop (RAW), April 2004.
[9] G. Govindu, L. Zhuo, S. Choi, P. Gundala, V. K. Prasanna. Area and power performance analysis of a floating-point based application on FPGAs. Proceedings of the Seventh Annual Workshop on High-Performance Embedded Computing, September 2003.
[10] K. S. Hemmert, K. D. Underwood. An analysis of the double-precision floating-point FFT on FPGAs. Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines, April 2005.
[11] IEEE Standards Board. IEEE Standard for Binary Floating-Point Arithmetic. Technical Report ANSI/IEEE Std. 754-1985, The Institute of Electrical and Electronics Engineers, 1985.
[12] L. Louca, T. A. Cook, W. H. Johnson. Implementation of IEEE single precision floating point addition and multiplication on FPGAs. Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines, 1996.
[13] N. Shirazi, A. Walters, P. Athanas. Quantitative analysis of floating-point arithmetic on FPGA based custom computing machines. Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines, 1995.
[14] K. D. Underwood. FPGAs vs. CPUs: Trends in peak floating-point performance. Proceedings of the ACM International Symposium on Field-Programmable Gate Arrays, February 2004.
[15] K. D. Underwood, K. S. Hemmert. Closing the gap: CPU and FPGA trends in sustainable floating-point BLAS performance. Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines, April 2004.
[16] R. C. Whaley, A. Petitet, J. J. Dongarra. Automated empirical optimizations of software and the ATLAS project. Parallel Computing 27(1–2), 2001.
[17] L. Zhuo, V. K. Prasanna. Scalable and modular algorithms for floating-point matrix multiplication on FPGAs. 18th International Parallel and Distributed Processing Symposium, April 2004.
[18] L. Zhuo, V. K. Prasanna. Sparse matrix–vector multiplication on FPGAs. Proceedings of the ACM International Symposium on Field-Programmable Gate Arrays, February 2005.
CHAPTER 32
FINITE DIFFERENCE TIME DOMAIN: A CASE STUDY USING FPGAs
Wang Chen, Miriam Leeser
Department of Electrical and Computer Engineering
Northeastern University
This chapter presents a reconfigurable hardware accelerator that implements the FDTD method. We first present background, including applications of the FDTD method. We then provide analysis and design details of the FPGA accelerator for FDTD.
32.1
THE FDTD METHOD
Modeling electromagnetic behavior has become a requirement for key technologies such as cellular phones, mobile computing, lasers, and photonic circuits. The finite-difference time-domain (FDTD) method, which provides a direct, time-domain solution to Maxwell’s equations in differential form with relatively good accuracy and flexibility, has become a powerful method for solving a wide variety of electromagnetic problems [1–3]. The main drawback to FDTD is its high computational complexity.
32.1.1
Background
The discovery of Maxwell’s equations was one of the outstanding achievements of nineteenth-century science. The equations give a unified and complete theory for understanding electromagnetic (EM) wave phenomena. Solving Maxwell’s equations is an important method for investigating the propagation, radiation, and scattering of EM waves. The FDTD method, first introduced by Yee in 1966 [4], is a way to solve Maxwell’s equations. The differential form of these equations and constitutive relations can be written as follows:
$$\nabla \times \vec{E} = -\frac{\partial \vec{B}}{\partial t} - \sigma_m \vec{H} - \vec{M} \tag{32.1}$$
$$\nabla \times \vec{H} = \frac{\partial \vec{D}}{\partial t} + \sigma_e \vec{E} + \vec{J} \tag{32.2}$$
$$\nabla \cdot \vec{D} = \rho_e; \qquad \nabla \cdot \vec{B} = \rho_m \tag{32.3}$$
$$\vec{D} = \epsilon \vec{E}; \qquad \vec{B} = \mu \vec{H} \tag{32.4}$$
In equations 32.1 through 32.4, the following symbols are used:
E: electric field
D: electric flux density
J: electric current density
ε: electrical permittivity
σe: electric conductivity
H: magnetic field
B: magnetic flux density
M: equivalent magnetic current density
μ: magnetic permeability
σm: equivalent magnetic conductivity
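First, the FDTD method replaces D and B in equations 32.1 and 32.2 with E and H according to the constitutive relations in equation 32.4, which yields Maxwell’s curl equations:
$$\mu \frac{\partial \vec{H}}{\partial t} = -\nabla \times \vec{E} - \sigma_m \vec{H} - \vec{M}; \qquad \epsilon \frac{\partial \vec{E}}{\partial t} = \nabla \times \vec{H} - \sigma_e \vec{E} - \vec{J} \tag{32.5}$$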
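All of the curl operators are then written in differential form and replaced by partial derivative operators, as shown in equation 32.6, with the E and H vectors each separated into three components in three dimensions (i.e., E is separated into Ex, Ey, Ez, and H is separated into Hx, Hy, Hz):
$$\operatorname{curl} \vec{F} = \nabla \times \vec{F} = \hat{x}\left(\frac{\partial F_z}{\partial y} - \frac{\partial F_y}{\partial z}\right) + \hat{y}\left(\frac{\partial F_x}{\partial z} - \frac{\partial F_z}{\partial x}\right) + \hat{z}\left(\frac{\partial F_y}{\partial x} - \frac{\partial F_x}{\partial y}\right) \tag{32.6}$$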
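We can then rewrite Maxwell’s curl equations as six equations in differential form in rectangular coordinates.
$$\mu \frac{\partial H_x}{\partial t} = \frac{\partial E_y}{\partial z} - \frac{\partial E_z}{\partial y} - \sigma_m H_x - M_x; \qquad \epsilon \frac{\partial E_x}{\partial t} = \frac{\partial H_z}{\partial y} - \frac{\partial H_y}{\partial z} - \sigma_e E_x - J_x \tag{32.7}$$
$$\mu \frac{\partial H_y}{\partial t} = \frac{\partial E_z}{\partial x} - \frac{\partial E_x}{\partial z} - \sigma_m H_y - M_y; \qquad \epsilon \frac{\partial E_y}{\partial t} = \frac{\partial H_x}{\partial z} - \frac{\partial H_z}{\partial x} - \sigma_e E_y - J_y \tag{32.8}$$
$$\mu \frac{\partial H_z}{\partial t} = \frac{\partial E_x}{\partial y} - \frac{\partial E_y}{\partial x} - \sigma_m H_z - M_z; \qquad \epsilon \frac{\partial E_z}{\partial t} = \frac{\partial H_y}{\partial x} - \frac{\partial H_x}{\partial y} - \sigma_e E_z - J_z \tag{32.9}$$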
Second, in preparation for “discretizing” the model in the next step, the model size, unit size, and unit timestep must be determined. The FDTD method establishes a model space, which is the physical region where Maxwell’s equations are solved or the simulation is performed. The model space is then discretized to a number of cells, and the time duration, t, is discretized to a number of timesteps. The unit cell size should be small enough to ensure the accuracy of the result, but large enough to minimize the number of cells in order to save computation resources.
Although half of the EM wavelength is an upper bound on the cell size by the Nyquist sampling theorem, the cell size is often set to less than one-tenth of the EM wavelength for better results [1]. The model size depends on the number of cells in the model space, which is usually inversely proportional to the size of the unit cell. The unit timestep is calculated by following the Courant condition, which states that it must be less than the time the EM wave spends traveling to the adjacent unit cell. For a ground-penetrating radar example, assuming a central frequency of 1.25 GHz, the central wavelength is 0.24 m. We set the unit cell size to 0.012 m, which is one-twentieth of the central wavelength, for good simulation quality. The timestep can be set to 0.02 ns, which meets the Courant condition.
Every cell in the model space has its associated electric and magnetic fields. The material type of each cell is specified by its permittivity ε, permeability μ, and conductivity σ. The three-dimensional grid shown in Figure 32.1, the “Yee cell” [4], is helpful for understanding the discretized EM model space. The Yee cell is a small cube that can be treated as a single cell picked from the discretized model space; Δx, Δy, and Δz are the three dimensions of this cube. We use (i, j, k) to denote the point whose real coordinate is (iΔx, jΔy, kΔz) in the model space. Instead of being placed in the center of each cell, the E and H field components are interlaced so that every E component is surrounded by four circulating H components and every H component is surrounded by four circulating E components. Maxwell’s equations in rectangular coordinates (equations 32.7 through 32.9) can be clearly illustrated by Yee’s cell.
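The sizing rules in the ground-penetrating radar example can be checked numerically. The sketch below is a minimal Python illustration, not part of the authors' design flow: the function names are hypothetical, free-space propagation is assumed, and the three-dimensional Courant bound used here is the standard textbook form, Δt ≤ 1/(c·sqrt(1/Δx² + 1/Δy² + 1/Δz²)), which is stricter than the one-cell travel time Δx/c mentioned in the text.

```python
import math

C0 = 299_792_458.0  # speed of light in free space, m/s


def wavelength(freq_hz):
    """Free-space wavelength in meters."""
    return C0 / freq_hz


def cells_per_wavelength(freq_hz, cell_size_m):
    """How finely the grid samples one wavelength."""
    return wavelength(freq_hz) / cell_size_m


def courant_limit_3d(dx, dy, dz):
    """Largest stable timestep for a 3-D grid (standard Courant bound, assumed)."""
    return 1.0 / (C0 * math.sqrt(1.0 / dx**2 + 1.0 / dy**2 + 1.0 / dz**2))


def timestep_is_stable(dt, dx, dy, dz):
    """True when dt lies below the 3-D Courant bound."""
    return dt < courant_limit_3d(dx, dy, dz)
```

With the values from the text (1.25 GHz, 0.012 m cells, 0.02 ns timestep), `wavelength(1.25e9)` is about 0.24 m, the grid has about 20 cells per wavelength, and the Courant bound is about 0.023 ns, so the 0.02 ns timestep passes the check.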
For example, the Hx component located at point (i, j + 1/2, k + 1/2) is surrounded by four circulating E components, two Ey components and two Ez components, matching equation 32.7, which states that the Hx component increases directly in response to a curl of E components in the x direction. Similarly, the Ex component increases directly in response to the curl of the H components, as shown in Figure 32.2, also matching equation 32.7. We represent an electric component Ez at the discretized three-dimensional coordinate (iΔx, jΔy, (k + 1/2)Δz) as $E_z|_{i,\,j,\,k+\frac{1}{2}}$, and when the current time is in the discretized Nth timestep, we denote the same component as $E_z|^{N}_{i,\,j,\,k+\frac{1}{2}}$.
FIGURE 32.1 The geometrical representation of the three-dimensional Yee cell. Labeled elements: the x-, y-, and z-axes; the cell dimensions Δx, Δy, and Δz; the field components Ex, Ey, Ez, Hx, Hy, and Hz; and the points (i, j, k) and (i, j + 1/2, k + 1/2).
FIGURE 32.2 Example of electric and magnetic components on a 4-cell grid. Labeled elements: an Ex component with two Hy and two Hz components around it.
Third, all of the partial derivative operators in equations 32.7 through 32.9 are replaced by their central difference approximations, as illustrated in equation 32.10. The second-order part of the Taylor series expansion is discarded to keep the algorithm simple and reduce the computational cost. Also, a variable without a partial derivative can be approximated by time averaging, as shown in equation 32.11, which has a structure similar to that of the central difference approximation.
$$\frac{\partial f(u_o)}{\partial u} = \frac{f(u_o + \Delta u) - f(u_o - \Delta u)}{2\Delta u} + O[(\Delta u)^2] \tag{32.10}$$
$$f(u_o) = \frac{f(u_o + \Delta u) + f(u_o - \Delta u)}{2} \tag{32.11}$$
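To make the accuracy claim in equation 32.10 concrete, the following Python sketch applies the central difference and the averaging approximation of equation 32.11 to a known function. The helper names and the choice of sin(u) as a test function are assumptions made purely for illustration.

```python
import math


def central_difference(f, u0, du):
    """Approximate df/du at u0 as in equation 32.10 (error term dropped)."""
    return (f(u0 + du) - f(u0 - du)) / (2.0 * du)


def central_average(f, u0, du):
    """Approximate f(u0) from its two neighbors as in equation 32.11."""
    return (f(u0 + du) + f(u0 - du)) / 2.0


def derivative_error(du, u0=1.0):
    """Error of the central difference for sin(u), whose derivative is cos(u)."""
    return abs(central_difference(math.sin, u0, du) - math.cos(u0))
```

Shrinking Δu by a factor of 10 should shrink the derivative error by roughly a factor of 100, consistent with the O[(Δu)²] term; comparing `derivative_error(1e-2)` with `derivative_error(1e-3)` shows this.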
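For example, equation 32.7 is changed to
$$\mu \frac{H_x(t_0 + \frac{\Delta t}{2}) - H_x(t_0 - \frac{\Delta t}{2})}{\Delta t} = \frac{E_y(z_0 + \frac{\Delta z}{2}) - E_y(z_0 - \frac{\Delta z}{2})}{\Delta z} - \frac{E_z(y_0 + \frac{\Delta y}{2}) - E_z(y_0 - \frac{\Delta y}{2})}{\Delta y} - \sigma_m \frac{H_x(t_0 + \frac{\Delta t}{2}) + H_x(t_0 - \frac{\Delta t}{2})}{2} - M_x \tag{32.12}$$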
After these modifications, the FDTD method turns Maxwell’s equations into a set of linear equations from which we can calculate the electric and magnetic fields in every cell in the model space. We call these equations the electric and magnetic field updating algorithms. Six field-updating algorithms form the basis of the FDTD method. For example, the field-updating algorithm for the Hx
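component, derived from equation 32.12 or equation 32.7, is given by
$$\left(\frac{\mu}{\Delta t} + \frac{\sigma_m}{2}\right) H_x\big|^{N+\frac{1}{2}}_{i,\,j+\frac{1}{2},\,k+\frac{1}{2}} = \left(\frac{\mu}{\Delta t} - \frac{\sigma_m}{2}\right) H_x\big|^{N-\frac{1}{2}}_{i,\,j+\frac{1}{2},\,k+\frac{1}{2}} + \frac{1}{\Delta z}\left(E_y\big|^{N}_{i,\,j+\frac{1}{2},\,k+1} - E_y\big|^{N}_{i,\,j+\frac{1}{2},\,k}\right) - \frac{1}{\Delta y}\left(E_z\big|^{N}_{i,\,j+1,\,k+\frac{1}{2}} - E_z\big|^{N}_{i,\,j,\,k+\frac{1}{2}}\right) - M_x\big|^{N}_{i,\,j+\frac{1}{2},\,k+\frac{1}{2}} \tag{32.13}$$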
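Equation 32.13 is solved for the new Hx value in each cell. As an illustration only, the Python sketch below performs that update for a single cell. The function name and argument names are hypothetical, the four neighboring E values are passed in explicitly rather than read from a grid, and uniform μ and σm are assumed.

```python
def update_hx(hx_old, ey_k_plus, ey_k, ez_j_plus, ez_j, mx,
              mu, sigma_m, dt, dy, dz):
    """Return Hx at timestep N + 1/2 for the cell at (i, j + 1/2, k + 1/2).

    hx_old    -- Hx at timestep N - 1/2 in the same cell
    ey_k_plus -- Ey at (i, j + 1/2, k + 1), timestep N
    ey_k      -- Ey at (i, j + 1/2, k),     timestep N
    ez_j_plus -- Ez at (i, j + 1, k + 1/2), timestep N
    ez_j      -- Ez at (i, j,     k + 1/2), timestep N
    mx        -- equivalent magnetic current density at the cell, timestep N
    """
    keep = mu / dt - sigma_m / 2.0    # coefficient of the old Hx value
    scale = mu / dt + sigma_m / 2.0   # coefficient of the new Hx value
    curl_term = (ey_k_plus - ey_k) / dz - (ez_j_plus - ez_j) / dy
    return (keep * hx_old + curl_term - mx) / scale
```

In a lossless cell (σm = 0) with no magnetic current, a zero E curl leaves Hx unchanged, while a nonzero σm damps it from one timestep to the next.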
32.1.2
The FDTD Algorithm
The FDTD algorithm, whose flow diagram is shown in Figure 32.3, is based on these equations. It first establishes the model space and specifies the material properties and the excitation source. The source can be a point source, a plane wave, an electric field, or another option depending on the application. The algorithm then runs through the electric and magnetic updating algorithms on every cell in the model space and loops through every timestep. The output of the FDTD algorithm can be any electric or magnetic field data from any cell in any timestep.
The electric and magnetic fields depend on each other. As we can see from equation 32.13, the current timestep’s magnetic field depends on the electric fields in the surrounding cells. Similarly, the current timestep’s electric field depends on the magnetic fields in the surrounding cells. Because of this dependence between the electric and magnetic fields, we cannot update them in parallel. So the FDTD algorithm updates the electric and magnetic fields in an interlaced manner, timestep by timestep, until the job finishes. First, all the magnetic fields in all cells in the model space are updated; next, all the electric fields in all cells are updated; then the boundary conditions and source excitation are applied to the model space; finally, the algorithm goes to the next timestep and starts from the magnetic fields again.
FIGURE 32.3 The flow diagram of the FDTD algorithm. Initialization establishes the model space and specifies the material properties and the excitation source. Each timestep then updates the magnetic field data on all cells, updates the electric field data on all cells, applies the boundary conditions, and updates the source excitation. If the time is not over, the algorithm goes to the next timestep; otherwise it ends.
The boundary condition computation consists of special algorithms to deal with the unit cells located on the boundary of the model space. The preceding electric and magnetic updating algorithms work accurately in the interior of the model space; however, because the cells on the boundary do not have the adjacent cells needed, the algorithms do not work properly there and, as a result, there are algorithm-introduced reflections on the boundary. Special techniques, called absorbing boundary conditions (ABCs), are necessary to deal with boundary cells, to prevent nonphysical reflections from outgoing waves, and to simulate the extension of the model space to infinity. The development of efficient ABCs is very important for the FDTD method. The perfectly matched layer (PML) ABC [5] sets the outer boundary of the model space to an absorbing material layer, which absorbs most of the impinging wave and has low reflection for most incidence angles. The UPML (uniaxial PML) ABC [3], a modification of PML, uses a generalized formulation on the entire FDTD model space that integrates the boundary condition and electric field updating algorithms, simplifies the FDTD algorithm, and makes a good model for hardware datapath design.
Although UPML introduces extra computation and memory consumption, the quality of the uniaxial PML is especially good for dispersive media, which is useful in solving many realistic problems (e.g., the dispersive soil found in modeling ground-penetrating radar and medical studies of EM waves’ effects on dispersive human tissue).
The FDTD algorithm is an accurate and successful modeling technique for computational electromagnetics. It is flexible, allowing the user to model a wide variety of EM materials and environments on most scales. It is also easy to understand, with its clear structure and direct time-domain calculation. However, FDTD is both data intense and computationally intense. It needs to visit all the cells in every step of the calculation, forcing a large working set. The amount of data in the FDTD model space can be very large for large model sizes, creating a heavy burden on both memory storage and access. The computation is also intense for each cell in the FDTD model space, including updating six electric and magnetic fields and the boundary conditions. This complexity makes the FDTD algorithm run slowly on a single processor; modeling an electromagnetic problem using the FDTD method can easily require several hours. Without powerful computational resources, FDTD models are too time consuming to be implemented on a single computer node. Accelerating FDTD with inexpensive and compact hardware will greatly expand its application and popularity, which is the purpose of an FPGA implementation.
The FDTD algorithm can be viewed as a cellular automaton (CA) (see Section 5.2.5). A cellular automaton is a discrete model that consists of an infinite or finite grid of cells, where the state of every cell at discrete time t is a
function of the states of a finite number of neighborhood cells at discrete time t − 1. Every cell has the same rule for updating. The updating algorithm loops through the whole discrete model and then goes to the next discrete time t + 1. The FDTD algorithm fits the definition of a CA exactly. First, it creates a discrete model space, discretizing both physical space and time with a uniform grid. Second, every cell in the model space follows the same rule (six uniform updating algorithms) for updating the electric and magnetic fields. Finally, the calculation loops through cells to simulate the phenomenon of the whole model space through time. A hardware implementation of the FDTD method is thus a template hardware design for most CA problems.
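The update order described in this section (magnetic fields, then electric fields, then boundary conditions and source, then the next timestep) can be sketched in a few lines. The Python code below is a deliberately simplified one-dimensional, lossless, normalized-unit version written for illustration; it is not the UPML formulation or the hardware design discussed in this chapter. The function names, the Gaussian source, the Courant number of 0.5, and the reflecting (Ez = 0) boundaries are all assumptions.

```python
import math


def gaussian_source(step, center=30.0, spread=8.0):
    """Hypothetical soft source: a Gaussian pulse in time."""
    return math.exp(-0.5 * ((step - center) / spread) ** 2)


def run_fdtd_1d(num_cells, num_steps, source_cell, courant=0.5):
    """Leapfrog Ez/Hy update on a 1-D grid; returns the final Ez values."""
    ez = [0.0] * num_cells
    hy = [0.0] * (num_cells - 1)   # hy[k] sits between ez[k] and ez[k + 1]

    for step in range(num_steps):
        # 1. Update the magnetic field in every cell from the current E field.
        for k in range(num_cells - 1):
            hy[k] += courant * (ez[k + 1] - ez[k])

        # 2. Update the electric field in every interior cell from the new H field.
        for k in range(1, num_cells - 1):
            ez[k] += courant * (hy[k] - hy[k - 1])

        # 3. Boundary conditions: hold the end cells at zero. This reflects
        #    waves; a real model would use an absorbing boundary instead.
        ez[0] = 0.0
        ez[-1] = 0.0

        # 4. Update the source excitation, then move to the next timestep.
        ez[source_cell] += gaussian_source(step)

    return ez
```

Every cell applies the same rule and reads only its nearest neighbors, which is the cellular automaton structure noted above. Calling `run_fdtd_1d(201, 150, 100)` launches a pulse from the center cell that spreads symmetrically in both directions.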
32.1.3
FDTD Applications
The FDTD method is an important tool for investigating the propagation, radiation, and scattering of EM waves. Before the 1990s the cost of solving Maxwell’s equations directly was high, and most of the related research was for military-defense purposes. For example, engineers used huge parallel supercomputing arrays to model the radar wave reflection of airplanes by solving Maxwell’s equations, trying to develop an airplane with a low radar cross-section [6]. The difficult task of solving Maxwell’s equations has had more economical solutions since 1990 with the development of fast computing resources applied to the FDTD method. Now FDTD has spread to many areas, including discrete scattering analysis, antenna and radar design [3], EM wave phenomena analysis on multilayer circuit boards [6], subsurface sensing and ground-penetrating radar (GPR) detection [7, 8], studies of EM wave phenomena in the human body, and the study of breast cancer detection using EM waves [9, 10]. We apply our FDTD solution to landmine detection using GPR, breast cancer detection, and spiral antenna modeling.
Ground-penetrating radar
The FDTD method has been used to simulate GPR applications for buried landmine detection [7, 8]. A three-dimensional FDTD model, as shown in Figure 32.4, simulates the wave propagation and scatter response of three-dimensional GPR geometries with realistic dispersive soil along with air, metal, and dielectric media. The UPML ABC produces good results for this application. The three-dimensional model has been validated by experiments performed with a commercially available GPR system and realistic soil.
Breast cancer detection
Because of the large difference in electromagnetic properties between malignant tumor tissue and normal fatty breast tissue, microwave breast cancer detection has attracted much interest because it may overcome some of the shortcomings of X-ray detection.
Accurate computational modeling of microwaves in human tissue with the FDTD method is promising for breast cancer detection research. Researchers built a three-dimensional model of the human breast [9, 10], shown in Figure 32.5, that includes a semi-ellipsoid geometric representation of the breast and a planar chest wall. The modeling is in the range of 30 MHz to 20 GHz, and the UPML ABC is implemented.
FIGURE 32.4 A three-dimensional FDTD application of landmine detection using GPR. Labeled elements: transmitting antenna, receiving antenna, air, soil, landmine, and the x-, y-, and z-axes.
FIGURE 32.5 Three-dimensional FDTD application of microwave breast cancer detection: (a) geometry map; (b) simulated model space.
Spiral antenna model
The spiral antenna is a popular frequency-independent antenna. As shown in Figure 32.6, we use the FDTD method to simulate the radiation of the Archimedean spiral antenna as an example of its application to antenna design.
FIGURE 32.6 (a) The floorplan of the spiral antenna model; (b) an FDTD-simulated two-dimensional space.
Clearly, FDTD is a powerful tool that can be used in many different applications. However, its data-intense and computationally intense properties make it run slowly on a single processor. The reconfigurable hardware implementation of the FDTD method can greatly accelerate the running speed of the algorithm and maintain its accuracy and flexibility. For example, the breast cancer detection FDTD algorithm running on a single processor may require hours, while the hardware implementation delivers results in minutes, enabling a medical device that delivers an answer during the examination. With the help of faster computing technology, the FDTD method will be applied to more research areas and applications.
32.1.4
The Advantages of FDTD on an FPGA
Compared to software running on a general-purpose processor, the advantages of an FPGA implementation are evident: faster speed, smaller size, and lower power consumption. The last two advantages are especially significant compared to a large computer cluster.
Compared to an ASIC finite-difference time-domain design, the FDTD field-programmable gate array (FPGA) implementation has the advantage of flexibility while accelerating the algorithm. The FDTD method models a wide variety of electromagnetic problems that are difficult to cover with a single hardware design. With an FPGA, a designer can modify the model size, the materials, and the parameters and even introduce new updating algorithms and boundary conditions easily. While the ASIC may outperform the FPGA as to speed, size, and power, the reconfigurable property of an FPGA makes it more suitable for the FDTD algorithm.
We can achieve fast computation in an FPGA for finite-difference time-domain because FDTD has properties that make it very suitable for hardware implementation: a favorable structure for pipelining and parallelism, and constrained data ranges, which are good for fixed-point representation.
Parallelism and deep pipelining
The FDTD algorithm repeats the same electric and magnetic updating algorithms on every cell of the model space. These calculations are independent between each cell. As long as there are adequate hardware resources, the
fields for several cells can be calculated in parallel. Also, although the electric and magnetic updating algorithms depend on each other, the hardware design can still run these calculations in parallel with a carefully designed memory interface. The parallelism between electric and magnetic fields and the parallelism between space cells make the FDTD algorithm very suitable for parallel hardware implementation, which is a key method for hardware acceleration.
The six electric and magnetic updating algorithms can also be constructed with deep pipelining because they repeat the same calculation on each cell. Deep pipelining, another key method for hardware acceleration, maximizes data throughput and greatly increases overall design performance.
Most cellular automata have properties similar to the FDTD algorithm, with repeated, independent computation on every cell of the model space. The CA computation can be constructed with deep pipelining, and the parallelism between discrete cells is the same as that available in any CA problem.
Fixed-point arithmetic
Floating-point representation provides high resolution and large dynamic range, but it can be costly. In hardware design, floating point uses slower arithmetic components and consumes more area. In contrast, fixed-point components are much faster and occupy less area. In applications where data resolution and dynamic range can be constrained, such as the FDTD algorithm, fixed-point arithmetic can provide similar precision at much higher speed than floating-point arithmetic.
The majority of the data in the FDTD algorithm is the six EM field variables and nine intermediate field variables for each cell in the model space. Since all the calculations in the FDTD method are linear, we can maintain the EM field data at a certain level of magnitude by normalizing the incoming source field magnitude. For example, if the source fields are between −1 and 1, all the EM field variables are between −1 and 1.
In rare cases, we simulate the model space with a focus lens to magnify the EM field data. In this case we can estimate the EM data range and still keep the variables between −1 and 1 by normalizing the source field. Since all the EM field variables can be controlled in a fixed range, a fixed-point representation can be used for better performance with a relatively low error rate. The uniaxial PML FDTD algorithm must be optimized for fixed-point representation. Several parameters in the algorithm have a much different order of magnitude than the EM fields. They may not be representable in fixed point directly or may result in a large error when quantized. Additional error can arise from arithmetic calculations with these parameters in fixed point. These errors can be canceled by making a few changes to the original FDTD algorithm. For example, very large and small coefficients can be multiplied together to create a medium-value coefficient to be used in the new equation. The modification has no effect on the result of the algorithm. Careful analysis is important for fixed-point quantization to avoid errors. For normalized EM field values that range between −1 and 1, the data tends to be accurate to a relative error of 0.5 percent. The resolution of the fixed-point
representation is determined by its data bit width. The longer the bit width, the higher the resolution, so the smaller the error. However, longer bit-width data uses more hardware resources. After careful study of the FDTD algorithm and representative data, we can pick a suitable bit-width with relatively small error (see also Chapter 23). In conclusion, the FDTD algorithm is very suitable for hardware implementation. The FPGA implementation of the finite-difference time-domain method will empower many FDTD applications in medical, military, and other areas by providing fast, small, low-power, and inexpensive implementations. Many cellular automata, which share similar properties, are also suitable for FPGA hardware implementation. The FDTD hardware design we present in the next section is a good example of hardware implementations for CA.
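To illustrate the bit-width trade-off described above, the following Python sketch quantizes normalized field values to a signed fixed-point format and reports the worst-case rounding error. The format (one sign bit plus a chosen number of fractional bits), the saturation behavior, and all function names are assumptions for illustration; they are not the number format of the design presented in this chapter.

```python
def to_fixed(value, frac_bits):
    """Quantize a value in [-1, 1) to a signed integer with frac_bits fractional bits."""
    scale = 1 << frac_bits
    code = round(value * scale)
    # Saturate to the representable range of a (frac_bits + 1)-bit signed word.
    return max(-scale, min(scale - 1, code))


def from_fixed(code, frac_bits):
    """Convert a fixed-point integer code back to a real value."""
    return code / (1 << frac_bits)


def max_quantization_error(values, frac_bits):
    """Largest absolute round-trip error over a collection of values."""
    return max(abs(v - from_fixed(to_fixed(v, frac_bits), frac_bits))
               for v in values)


def resolution(frac_bits):
    """Size of one least significant bit."""
    return 2.0 ** -frac_bits
```

For values inside the representable range, the rounding error is at most half of one least significant bit, so each additional bit halves the error bound while widening every adder, multiplier, and memory word in the datapath.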
32.2
FDTD HARDWARE DESIGN CASE STUDY
The FDTD algorithm has a clear structure for hardware design. For each cell in the model space, it reads the electric and magnetic data out of the memories, passes them through the updating algorithms, and writes the results back to the memories. The algorithm repeats this processing until it completes the model space; then it goes to the next timestep and does the same calculations again. As with any hardware design, it is natural to separate the design into datapath, memory interface, and control logic. For FDTD, the datapath implements all the electric and magnetic updating algorithms; the memory interface controls data reading, writing, and caching; and the control logic uses a finite-state machine (FSM) to control the progress of the whole design.
However, because of its complexity, an efficient hardware implementation of FDTD is not straightforward. The FDTD algorithm is data intensive. The electric and magnetic updating algorithms access the input and output memories heavily, which creates a heavy burden on the memory interface and data bandwidth. Also, the EM field dataset for the whole model space can be very large (a 100×100×100 model may require 60 MB of memory space), meaning that local FPGA memory is insufficient to contain the entire problem.
The FDTD algorithm is also computationally intensive. Every EM field has its own updating algorithms and boundary conditions. A special interlaced mechanism is used between the electric and magnetic updating algorithms, making them depend on each other. Many problems arise when considering the pipelining and parallelism of the datapaths. The FDTD algorithm is complex enough to reach the resource limits of the most advanced FPGAs available on the market. Consideration of fixed-point quantization and resource-performance trade-offs is very important for efficient hardware design.
One of the main purposes of a hardware implementation is to achieve better performance.
To implement the FDTD algorithm on an FPGA efficiently, we need to consider the following:
• Determining the right precision for fixed-point representation (Section 32.2.2)
• Determining the memory hierarchy and designing the memory interface and cache module (see the Memory hierarchy and memory interface subsection of Section 32.2.3)
• Determining the pipelining and parallelism by considering the trade-off between resources and performance (see the Pipelining and parallelism subsection of Section 32.2.3)
It is important to analyze the data structures, algorithm structure, hardware architecture, and resource limits before designing the hardware implementation. This section introduces a target reconfigurable platform, the WildStar-II Pro FPGA board, and lists its detailed specifications. We then choose a suitable fixed-point representation by analyzing the quantization error of a fixed-point FDTD algorithm and the hardware resource limits. Finally, we go through the problems in the FDTD hardware implementation and provide detailed solutions and analyses. By carefully considering the trade-offs between hardware resources and performance, we can design an FDTD accelerator whose memory interface, pipelining, and parallelism are optimized for the current FPGA computing board.
32.2.1
The WildStar-II Pro FPGA Computing Board
The FPGA board used here is a WildStar-II Pro/PCI reconfigurable FPGA computing board from Annapolis Micro Systems [12]. Its main features are summarized in Table 32.1; a block diagram of this board is shown in Figure 32.7. There are two Xilinx Virtex-II Pro FPGAs, each with 328 embedded 18×18 signed multipliers and 328 18-Kb BlockRAMs. The embedded multipliers are much faster than a multiplier component implemented with reconfigurable logic, so it is best to use them if possible. The BlockRAMs are the fastest memory the designer can use in an FPGA design, operating at 200+ MHz on the Virtex-II Pro chip. Critical data interchange and interfacing can be programmed using the BlockRAMs. A pair consisting of an embedded multiplier and a BlockRAM shares the same data and address buses in the Xilinx Virtex-II architecture, so once the embedded multiplier is used, we cannot use its corresponding BlockRAM, and vice versa. Thus, the total number of embedded multipliers and BlockRAMs used cannot exceed 328.
Each FPGA is connected to six independent onboard memories, which are 1-M×36-bit DDRII SRAMs that have a 72-bit data width and speeds up to 200 MHz. The size of each SRAM is 36 Mbits, or 4.5 MBytes, so the total SRAM attached to each FPGA is 27 MBytes. The WildStar-II Pro board is connected to the desktop computer via a PCI-X interface, with a DMA data transfer rate up to 1 GB/s between the host PC and the FPGA.
The WildStar-II Pro is a typical commercial off-the-shelf (COTS) FPGA computing board, which is widely available and easy to set up. These boards normally contain one or two FPGA chips. Each FPGA chip may be connected to several onboard memories consisting of SRAM or DRAM. The computing boards are often PCI boards for a desktop computer or PCMCIA cards for a laptop. Data and control signals can be transferred between the FPGA computing board and the host PC via either standard PCI transfer or fast DMA transfer. The FDTD hardware design is based on the WildStar-II Pro board but can be easily modified for other COTS FPGA boards.

TABLE 32.1 The main features of the WildStar-II Pro FPGA board

| Feature | Specification |
|---|---|
| FPGA chips | Two Xilinx Virtex-II Pro XC2VP70 FPGAs (33,088 slices, 328 embedded multipliers, and 5904 Kb BlockRAM) |
| Memory ports | Twelve DDRII SRAM ports totaling 54 MBytes (6×4.5 MBytes for each FPGA chip) |
| Memory bandwidth | Eleven GB/s memory bandwidth (6×72 bits for each FPGA chip) |
| PCI interface | 133 MHz/64-bit PCI-X up to 1.03 GB/s |

FIGURE 32.7 A block diagram of the WildStar-II Pro FPGA board. [Block diagram: PE 1 and PE 2 (Virtex-II Pro XC2VP 70, 100, 125), each with six DDRII/QDRII SRAM ports (36 bits), DDR DRAM, Rocket I/O, external I/O, and a 32/64-bit, 33/66/133-MHz PCI bus.]
32.2.2
Data Analysis and Fixed-point Quantization
Because of its limited data range and favorable algorithm properties, the FDTD method is suitable for fixed-point arithmetic (see Section 32.1.4). To use a fixed-point representation with the algorithm, we first need to decide on the representation and the right data precision. For simplicity, we use a 2’s complement fixed-point representation that has a fixed number of digits before and after the binary point. Because the EM field data in the FDTD algorithm fits in the range −1 to 1, and the results of the intermediate calculations (i.e., add, subtract, and multiply) fit in the range −2 to 2, we set the fixed-point data structure as one sign bit S, one integer bit I before the binary point, and N fractional bits F_i after the binary point, as shown in Figure 32.8. The fixed-point data value is

$$V = -2S + I + \frac{1}{2^{N}} \sum_{i=0}^{N-1} 2^{i} F_i$$

FIGURE 32.8 The data structure of the fixed-point representation: sign bit S (1 bit), integer bit I (1 bit), and fractional bits (N bits).

The data range given by this representation is between −2.0 and 1.999. The data precision depends on the smallest absolute value that can be represented. Because the binary point position is fixed, the smallest absolute value is $2^{-N}$, which depends solely on the bit width N of the fractional part.
To determine the right value for N, we need to consider the trade-off between quantization error and resource costs. To reduce quantization error, which is the difference between the fixed-point and corresponding floating-point data, a longer data bit width is preferable. However, longer data bit widths require larger and slower arithmetic components and put more burden on memory bandwidth and data storage. The problem is how to pick the optimal data bit width such that the fixed-point FDTD algorithm generates acceptable quantization error and consumes a reasonable amount of hardware resources.
To determine this, we wrote the FDTD algorithm in C code in both double-precision floating-point and fixed-point arithmetic and compared the results. Fixed-point representation is simulated by long integers in C, which have a 32-bit maximum bit width. We used two long integer variables to represent one fixed-point datum up to 64 bits. Based on this representation, we created add, subtract, and multiply components for each fixed-point bit width. The C code simulates the fixed-point arithmetic and produces results that are exactly the same as the hardware output. Thus, this C code also can be used for hardware results verification.
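The representation described above can be illustrated with a short software model. The following is a minimal Python sketch, not the authors' C simulator: the function names (`to_fixed`, `to_float`, `value_from_bits`, `fx_mul`) are hypothetical, rounding to nearest on conversion and truncation after multiplication are assumptions, and Python's unbounded integers stand in for the pair of long integers used in the C code.

```python
def to_fixed(x: float, n_frac: int) -> int:
    """Quantize x to a signed integer code with n_frac fractional bits."""
    code = round(x * (1 << n_frac))          # round to nearest (assumption)
    width = n_frac + 2                       # sign bit + integer bit + fraction
    lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
    if not lo <= code <= hi:
        raise OverflowError(f"{x} is outside [-2, 2) for {n_frac} fractional bits")
    return code


def to_float(code: int, n_frac: int) -> float:
    """Convert an integer code back to a real value."""
    return code / (1 << n_frac)


def value_from_bits(code: int, n_frac: int) -> float:
    """Evaluate V = -2S + I + (1/2^N) * sum(2^i * F_i) from the raw bit pattern."""
    raw = code & ((1 << (n_frac + 2)) - 1)   # two's complement bit pattern
    s = (raw >> (n_frac + 1)) & 1            # sign bit S
    i = (raw >> n_frac) & 1                  # integer bit I
    frac = raw & ((1 << n_frac) - 1)         # sum of 2^i * F_i
    return -2 * s + i + frac / (1 << n_frac)


def fx_mul(a: int, b: int, n_frac: int) -> int:
    """Multiply two codes and drop the extra n_frac fractional bits (truncation, an assumption)."""
    return (a * b) >> n_frac


N = 33                                       # fractional width discussed in this section
half, minus_quarter = to_fixed(0.5, N), to_fixed(-0.25, N)
product = fx_mul(half, minus_quarter, N)
print(to_float(product, N), value_from_bits(product, N))
```

Both printed values come from the same code word: one by scaling the integer, the other by decoding the sign, integer, and fractional bits with the value formula, so they agree whenever the word is in range.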
By comparing the floating-point and the corresponding fixed-point data results for the same model space, we can calculate the relative error, defined in equation 32.14, over the time period that the algorithm runs.

$$\text{Relative error} = \frac{|\text{floating-point data} - \text{fixed-point data}|}{|\text{floating-point data}|} \qquad (32.14)$$
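As a small illustration of equation 32.14, the sketch below computes the relative error of a single quantized value for several fractional bit widths. It is only a toy: the helper names and the reference value 0.1 are assumptions, and it measures one rounding step, whereas the errors reported for the FDTD models accumulate over many timesteps.

```python
def relative_error(float_value: float, fixed_value: float) -> float:
    """Equation 32.14: |float - fixed| / |float|."""
    if float_value == 0.0:
        raise ZeroDivisionError("relative error is undefined for a zero reference")
    return abs(float_value - fixed_value) / abs(float_value)


def quantize(x: float, n_frac: int) -> float:
    """Round x to the nearest multiple of 2**-n_frac (rounding mode is an assumption)."""
    scale = 1 << n_frac
    return round(x * scale) / scale


reference = 0.1  # hypothetical field value inside the normalized range (-1, 1)
errors = {n: relative_error(reference, quantize(reference, n)) for n in (29, 31, 33, 35)}
for n, err in errors.items():
    print(f"N = {n}: relative error = {err:.3e}")
```

For a single rounding step the error is bounded by 2^-(N+1) divided by the magnitude of the reference value, which is why small field values are the most sensitive to the choice of N.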
We studied the following six experimental FDTD models to investigate quantization errors:
• The two-dimensional and three-dimensional soil media-based GPR landmine detection models
• The two-dimensional and three-dimensional human tissue media-based tumor detection models
• The two-dimensional and three-dimensional spiral antenna models
TABLE 32.2 Detailed specifications of the experimental FDTD models

| | 2D landmine detection | 3D landmine detection | 2D breast detection | 3D breast detection | 2D spiral antenna | 3D spiral antenna |
|---|---|---|---|---|---|---|
| Size | 150×100 | 50×50×50 | 240×140 | 80×60×40 | 120×120 | 120×120×25 |
| Time duration | 2000 | 2000 | 2000 | 2000 | 2000 | 2000 |
| Source | Plane wave | Plane wave | Point source | Point source | Point source | Point source |
| Media | Soil, air, dielectric | Soil, air, dielectric | Human tissue, dielectric | Human tissue, dielectric | Metal, air, dielectric | Metal, air, dielectric |

TABLE 32.3 Relative error between fixed-point and floating-point representation

| Bit width | Field | Timestep 400 (%) | Timestep 600 (%) | Timestep 1000 (%) | Timestep 1400 (%) | Timestep 1600 (%) | Average across timestep (%) |
|---|---|---|---|---|---|---|---|
| 29 | Ex | 9.187 | 3.503 | 0.280 | 0.182 | 0.558 | 2.742 |
| 29 | Hy | 12.440 | 0.124 | 1.431 | 0.244 | 0.264 | 2.901 |
| 29 | Hz | 2.706 | 1.925 | 0.472 | 0.200 | 0.235 | 1.108 |
| 31 | Ex | 3.861 | 0.941 | 0.058 | 0.032 | 0.110 | 1.001 |
| 31 | Hy | 3.681 | 0.025 | 0.295 | 0.042 | 0.001 | 0.809 |
| 31 | Hz | 1.905 | 0.461 | 0.105 | 0.039 | 0.046 | 0.511 |
| 33 | Ex | 2.155 | 0.209 | 0.016 | 0.010 | 0.031 | 0.484 |
| 33 | Hy | 2.101 | 0.007 | 0.077 | 0.012 | 0.014 | 0.442 |
| 33 | Hz | 1.479 | 0.120 | 0.029 | 0.010 | 0.013 | 0.330 |
| 35 | Ex | 1.729 | 0.063 | 0.004 | 0.002 | 0.008 | 0.361 |
| 35 | Hy | 1.420 | 0.002 | 0.021 | 0.003 | 0.004 | 0.290 |
| 35 | Hz | 1.314 | 0.030 | 0.007 | 0.003 | 0.003 | 0.271 |
The specifications of these models are listed in Table 32.2. For all of them, we studied the average relative errors between the floating-point and the fixed-point results. This section analyzes the GPR model results. The other model spaces are similar.
Table 32.3 shows average relative errors for fractional data bit widths from 29 to 35 bits in the two-dimensional GPR landmine detection model. Ex, Hy, and Hz are electric and magnetic field data. The relative errors are plotted in Figure 32.9. The errors for both the electric and magnetic field data decrease as the bit width increases, but the rate of decrease slows as the bit width grows. Considering both the relative error and the bit-width cost, a 33-bit fractional part is a good choice for the trade-off between data precision and hardware resources. The average absolute error for this representation is on the order of $10^{-8}$ for magnetic field data and on the order of $10^{-6}$ for electric field data; the average relative error is about 0.3 to 0.5 percent. Thus, this representation satisfies the accuracy requirement that the relative error is less than 0.5 percent.

FIGURE 32.9 The relative error between fixed-point and floating-point arithmetic for different bit widths. [Plot: relative error in percent (0 to 3.5) versus bit width after the binary point (29, 31, 33, 35) for EX, HY, and HZ.]

In addition to quantization error analysis, we need to consider the resource limits of the real hardware device in determining the fixed-point data bit width. The FDTD model space will be stored in the onboard SRAMs on the WildStar-II Pro FPGA board. The SRAM memory chip we used has size 512K × 36 bit. The data is stored in the memory in units of 36 bits. Any data more than 36 bits wide will take two memory units. To keep the memory interface working efficiently, we want to set the data bit width less than or equal to 36 bits.
The embedded multiplier provided on the Xilinx Virtex-II Pro FPGA chip, an 18×18-bit 2’s complement signed multiplier, is much faster than a multiplier component implemented in normal reconfigurable logic. Four embedded multipliers can form a 35×35-bit signed multiplier. However, to construct a 36×36-bit signed multiplier, nine embedded multipliers are needed. Because the embedded multipliers are limited in number and very useful in the FDTD algorithm, it is uneconomical to use a 36×36 multiplier or 36-bit data. A data bit width of 35 bits is more efficient for the embedded multiplier. The quantization error analysis above recommends a 33-bit fractional part, which gives a 35-bit word once the sign and integer bits are included, so we choose 35-bit data as the fixed-point data structure based on both quantization error and resource limits.
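One way to see why 35 bits is the natural limit for four 18×18 signed multipliers is to split each operand into a signed upper part and an unsigned 17-bit lower part, so that every partial product fits a signed 18-bit multiplier. The Python sketch below models this decomposition; it is an illustrative assumption about how the multipliers are composed, not a description of the actual Xilinx or WildStar-II Pro implementation, and all function names are hypothetical.

```python
def fits_signed18(v: int) -> bool:
    """True if v is representable as an 18-bit 2's complement number."""
    return -(1 << 17) <= v < (1 << 17)


def mul18(a: int, b: int) -> int:
    """Model of one embedded 18x18 signed multiplier."""
    assert fits_signed18(a) and fits_signed18(b), "operand exceeds 18 signed bits"
    return a * b


def split_signed(x: int, low_bits: int = 17):
    """Return (high, low) with x == high * 2**low_bits + low and 0 <= low < 2**low_bits."""
    return x >> low_bits, x & ((1 << low_bits) - 1)


def mul35(a: int, b: int) -> int:
    """35x35 signed multiply built from four 18x18 signed multiplies."""
    ah, al = split_signed(a)
    bh, bl = split_signed(b)
    return ((mul18(ah, bh) << 34)
            + ((mul18(ah, bl) + mul18(al, bh)) << 17)
            + mul18(al, bl))


def multipliers_needed(width: int) -> int:
    """Count 18x18 multipliers when operands are cut into one signed 18-bit
    piece plus unsigned 17-bit pieces (assumed partitioning)."""
    pieces = 1 + max(0, -(-(width - 18) // 17))  # ceiling division
    return pieces * pieces


a, b = 12345678901, -9876543210                  # both fit in 35 signed bits
print(mul35(a, b) == a * b, multipliers_needed(35), multipliers_needed(36))
```

Under this partitioning a 36-bit operand needs three pieces, because an 18-bit unsigned lower part no longer fits a signed 18-bit input, which is consistent with the jump from four to nine multipliers noted above.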
32.2.3
Hardware Implementation
After choosing the fixed-point data representation, we then study two very important problems in the FDTD hardware implementation: memory interfacing, and pipelining and parallelism.
Memory hierarchy and memory interface
Because the amount of EM field data is proportional to the number of cells in the FDTD model space, the dataset can be very large. Every cell in the FDTD model space has 6 EM field data and 9 intermediate field data for the UPML computation, adding up to 15 field data. An FDTD model space may have millions of cells, require hundreds of megabytes of memory space, and easily exceed the limits of the memory available inside the FPGA chip. Therefore, the data must be stored
in larger memories, which are normally slower than the fast on-chip memories, outside the FPGA chip. The data stored in the slower memories needs to be transferred to the processing core in the FPGA. The processing core is composed of six electric and magnetic updating algorithms, which require very large amounts of input data. In the worst case, three electric updating algorithms require 36 input data and three magnetic ones require 18, adding up to 54 input data for each dispersive UPML FDTD cell. In other words, to make sure that the processing core works at full speed, we need to transfer 54 input data from off-chip memory to the FPGA for each cell. The data transfer puts a heavy burden on the interface between the off-chip memories and the FPGA design.
To provide the necessary data at the right time and to optimize the efficiency of the memory interface, we need to determine how to organize the memory resources efficiently by considering the size, speed, and interface bandwidth of each memory resource. There are three levels of memory hierarchy, based on the WildStar-II Pro/PCI FPGA computing board:
• The fast and wide data-width on-chip memory (BlockRAM) integrated on the FPGA chip
• The fast but limited data-width onboard memory located on the FPGA computing board
• The slow memory for the FPGA to access located in the host PC
BlockRAMs are programmable memories that are integrated inside modern FPGA chips. A Xilinx Virtex-II Pro XC2VP70 FPGA contains 328 BlockRAMs, 18 Kb each, with a maximum data width of 36 bits. They can be implemented as small memory blocks or cascaded to form large memory blocks. They also can be programmed to different depths and widths to fit the hardware design and data structures. They are fast memory units in terms of latency, with only one clock cycle of delay at clock rates up to 200 MHz. Although BlockRAMs are fast and flexible memory resources, there is much less BlockRAM available than off-chip memory, so normally we cannot fit the entire model space’s data into BlockRAMs. Instead, they are used to build cache modules that read from and write to off-chip memories continuously and feed data to the processing core. What’s more, the BlockRAMs are true dual-ported RAM units, and a group of BlockRAMs can provide a very wide data width to the processing core when aggregated together. For example, 54 BlockRAMs on the input side can provide a 54×36-bit data width every clock cycle, which allows the FDTD processing core to run at full speed. The data width is the number of bits that can be transferred in one clock cycle. Along with clock frequency, data width determines the data transfer speed (bandwidth) of the memory interface.
Onboard memories, which directly communicate with the FPGA chip, are slower than BlockRAMs in terms of latency, but they are usually much larger in size, varying from megabytes to hundreds of megabytes. The interface between the memory chips and the FPGA chips follows the read/write cycles of the specific memory chips, which normally provide single-ported data access with a limited data transfer width. Because of the heavy data access required by the FDTD algorithm, the onboard memory bandwidth is very important to the performance of the FDTD design. As discussed before, the six electric and magnetic updating algorithms need 54 input data for each FDTD cell, which is around 54 × 36 bits × 100 MHz = 194 Gb/s, far beyond the onboard memory bandwidth of typical FPGA boards. The input data of a single cell have to be transferred to the updating algorithms in several clock cycles, while the updating algorithms can calculate results with a throughput of one cell per clock cycle. So, the onboard memory data transfer bandwidth is the bottleneck of the FDTD design. Memory bandwidth is an important specification in choosing the FPGA computing board for a finite-difference time-domain implementation. To solve this bottleneck, we introduce the managed-cache module that is explained in the next subsection.
The memories in the host PC can be accessed by the FPGA via the PCI or other interfaces. These interfaces are normally slower than the two memory interfaces we have discussed, so we treat the memories in the host PC as the slowest memory, no matter what the actual speed. This memory can be used for data initialization at design startup and data retrieval at the end. At the start of processing, the model space data are loaded from the host PC to the onboard memory, and the results are loaded back to the memory in the PC at the end of the run. If the onboard memory is not big enough to hold the whole model space, the memory in the host PC will be the primary memory and the data need to be transferred to and from the onboard memory throughout the entire calculation, slowing down the whole design. The size of onboard memories is thus another critical specification in choosing an FPGA computing board.
The memory hierarchy and memory interface structure used in this design is shown in Figure 32.10. We use one FPGA and six onboard memories on the WildStar-II Pro FPGA board.
The FDTD field data stored in the onboard memories are sent to the electric and magnetic field-processing cores for calculation via the caching modules built using the BlockRAMs on the FPGA chip. The 3-level memory hierarchy formed from the host PC, the onboard memories, and the BlockRAM caching modules ensures that the electric and magnetic field updating algorithms work at optimal speed.

FIGURE 32.10 A structural diagram of the memory interface. [Diagram: memory in the PC host connects over the PCI bus to the COTS FPGA computing board; on the board, input onboard memories feed the input caching module (input BlockRAMs), the electric and magnetic field pipeline modules, and the output caching module (output BlockRAMs), which writes to the output onboard memories.]

As shown in Figure 32.10, the BlockRAM caching modules are split into two parts: input and output. The six onboard memories, which are used to store EM field data, are split into two parts also. The entire FDTD model space of the previous timestep is stored in the input onboard memories, and the calculation results, which comprise the data in the current timestep, will be stored in the output onboard memories. In the next timestep, the roles of the onboard memories are swapped. The original output onboard memories, which store the current timestep’s data, will be connected to the input caching module, and the original input onboard memories will be connected to the output module to store the next timestep’s result.
The separation of the input and output onboard memories eliminates the need for simultaneous read/write access to the same memory. Because the onboard memories are single ported, shifting between reading and writing to the same memory will create overhead and greatly reduce the speed of the design. By separating input and output memory, we can read from and write to the onboard memories in the same clock cycle, and continue reading and writing a group of data on every clock cycle. So, although the separation of the memory interface does not change the memory bandwidth, the data-transfer rate of the memory interface is increased. Also, the separation makes the structure of the memory interface clearer, and the swapping mechanism avoids the extra effort of transferring data from output memories to input memories at the end of every timestep. This swapping of input and output memories is a common hardware design technique to increase throughput.
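The input/output swapping described above is often called ping-pong (or double) buffering. The sketch below models it in Python under simplifying assumptions: each memory bank is a plain list, the `update` callback is a toy stand-in for the field-updating pipelines, and the class name `PingPongMemory` is hypothetical.

```python
from typing import Callable, List


class PingPongMemory:
    """Two memory banks whose input/output roles swap after every timestep."""

    def __init__(self, initial_fields: List[float]):
        self.banks = [list(initial_fields), [0.0] * len(initial_fields)]
        self.input_bank = 0                      # bank holding the previous timestep

    @property
    def current(self) -> List[float]:
        """Bank holding the most recently completed timestep."""
        return self.banks[self.input_bank]

    def step(self, update: Callable[[List[float], int], float]) -> None:
        src = self.banks[self.input_bank]        # read only
        dst = self.banks[1 - self.input_bank]    # write only
        for i in range(len(src)):
            dst[i] = update(src, i)
        self.input_bank = 1 - self.input_bank    # swap roles; no data is copied


def toy_update(src: List[float], i: int) -> float:
    """Toy 1D neighbor average, used only to exercise the buffer swap."""
    left = src[i - 1] if i > 0 else 0.0
    right = src[i + 1] if i < len(src) - 1 else 0.0
    return 0.5 * (left + right)


memory = PingPongMemory([0.0, 1.0, 0.0, 0.0])
for _ in range(2):
    memory.step(toy_update)
print(memory.current)
```

Within a timestep each bank is accessed in only one direction, which mirrors the reason given above for separating the single-ported input and output memories.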
Managed-cache module
As introduced in the previous section, onboard memory data bandwidth is limited on the FPGA computing board, so the EM field data cannot be transferred to the FPGA fast enough to allow the processing core to run at full speed. To solve this memory transfer bottleneck, we need to introduce the managed-cache module, which is an important part of the memory interface design.
Memory transfer bottleneck
Although the FDTD processing core requires a large amount of input data, the input data for each cell are the EM field data in its nearest-neighbor cells. For two cells located near each other in the FDTD model space, some of the nearest-neighbor cells are the same. The cache module between the onboard memories and the hardware processing cores is designed to avoid reading the same data multiple times from onboard memories. All of the input data for each cell come from its near neighbors, which means the data are located in a small cubic window around the current cell. If the managed-cache module is designed to be larger than this cubic window, when we calculate the fields of the next cell, the processing core can get all the necessary input data from the cache module. Among the input data, only a little is new, so we only need to fetch the new data from the onboard memories every clock cycle, which greatly reduces the data-transfer burden. At the same time, some of the old data becomes obsolete. In the managed-cache module, we can replace the obsolete data with the new data fetched from onboard memory.
Ideally, we keep the processing core running at full speed so that it calculates one cell’s EM data per clock cycle. The managed-cache module needs to be designed to provide all the necessary input data for the processing core, while fetching only one new cell’s data from onboard memory every clock cycle. Since every UPML FDTD cell has 15 field data and the processing core needs up to 54 field data inputs, an ideal managed-cache module will fetch 15 field data from onboard memory every clock cycle and provide a data width of 54 field data to the processing core. This solves the memory bandwidth bottleneck by reducing the number of fetches to 15 every clock cycle, which is 15 × 36 bits × 100 MHz = 54 Gb/s. This rate can be supported by the WildStar-II Pro FPGA computing board. We explain how to realize this ideal cache module in the next two subsections.
Dataflow and processing core optimization
To simplify the explanation of how to optimize the dataflow and the processing core, we start from a two-dimensional FDTD algorithm, which can be directly reduced from the three-dimensional FDTD algorithm by considering only one plane in the three-dimensional model. The two-dimensional algorithm updates three EM field data instead of six, handling much less data transfer and calculation, but it keeps the same algorithm structure and datapath. For a two-dimensional model plane of size N×N, we assume that each N-cell row is a basic processing unit. Calculating one row of data means updating all EM field data for this row.
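The bandwidth figures used in the managed-cache discussion (about 194 Gb/s without caching and 54 Gb/s with an ideal managed cache) follow from simple arithmetic. The sketch below reproduces them; the function name is hypothetical, and the 36-bit storage word and 100-MHz clock are taken from the figures quoted in the text.

```python
def required_bandwidth_gbps(fields_per_cycle: int,
                            bits_per_field: int = 36,
                            clock_mhz: int = 100) -> float:
    """Off-chip bandwidth needed to deliver fields_per_cycle words every clock cycle."""
    bits_per_second = fields_per_cycle * bits_per_field * clock_mhz * 1_000_000
    return bits_per_second / 1e9


without_cache = required_bandwidth_gbps(54)   # all 54 inputs fetched for every cell
with_cache = required_bandwidth_gbps(15)      # only one new cell (15 fields) per cycle
print(f"{without_cache:.1f} Gb/s without caching, {with_cache:.1f} Gb/s with an ideal cache")
print(f"reduction factor: {without_cache / with_cache:.1f}")
```

The reduction factor is simply 54/15, since the clock rate and word width are the same in both cases.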
The cache modules separate the whole dataflow of the FDTD design into three processes: (1) READ from the input onboard memory and store to the input cache module; (2) read from the input cache module, CALCULATE, and write the result to the output cache module; (3) read from the output cache module and WRITE to the output onboard memory. These three processes can be run in parallel since the cache module can be read from and written to at the same time (i.e., because the cache modules are built from dual-ported BlockRAMs). The parallelism of READ, CALCULATE, and WRITE means that the FDTD design can, at the same time, READ one row of data, CALCULATE the previously loaded row, and WRITE out the results of the row before that. We can understand this as systemwide pipelining in the dataflow. Each process is a pipeline stage. Rows of data are pushed into this 3-stage pipeline, one at a time. Compared to running the three processes serially, this optimized dataflow structure increases the throughput by a factor of 3.
For a two-dimensional plane of size N×N, a simple 2-row cache module (size 2×N) realizes the READ/CALCULATE/WRITE pipelining. As shown in Figure 32.11, the data can be READ from input onboard memory and stored in the second input cache row while the CALCULATE process works on the previously loaded data in the first input cache row. The result is stored in the first output cache row while the previous row’s result is read from the second output cache row and WRITTEN to output onboard memory. This cache module structure can be applied to other CA designs.

FIGURE 32.11 A structural diagram of the simple 2-row cache module. [Diagram: READ from the input onboard memory into the input BlockRAM cache module (1st and 2nd cache rows); CALCULATE in the processing core; WRITE from the output BlockRAM cache module (1st and 2nd cache rows) to the output onboard memory.]

Furthermore, for the FDTD implementation the managed-cache module enables parallel implementation of the electric and magnetic updating algorithms in the processing core. Because of the data dependency of the electric updating algorithm on the magnetic updating algorithm (the former needs the current result of the latter), we cannot directly update the M-field and E-field in parallel until we introduce two extra rows in the managed-cache module (see Section 32.1.2). Why two extra rows? The electric updating algorithm needs to have newly updated magnetic data in the current cell and newly updated magnetic data in the cell below as inputs. So, the electric updating algorithm needs to wait until the magnetic updating algorithm finishes two rows of computation. As long as the cache has two extra rows to save the newly calculated magnetic data, we can run the magnetic updating algorithms two rows ahead of the electric updating algorithms and partially overlap their computation. This is illustrated in Figure 32.12.

FIGURE 32.12 A structural diagram of the two-dimensional managed-cache module. [Diagram: READ from the input onboard memory into the input BlockRAM cache module; CALCULATE in the electric field and magnetic field pipeline modules; WRITE from the output BlockRAM cache module to the output onboard memory.]

For a two-dimensional model space of size N×N, the managed-cache module stores four rows (4×N) of field data. While the READ process is working on the fourth cache row, the magnetic updating algorithm can work on the data in the third row, which was just read from the memories by the last READ. At the same time, the electric updating algorithm can work on the first cache row, which is two rows behind the magnetic algorithm. Finally, WRITE also works on the fourth row, sending out both calculation results from the electric and magnetic updating algorithms. The four rows of field data roll over in the cache modules until the entire model space is calculated. By partially parallelizing the electric and magnetic updating implementations, this 4-row cache module reduces the total computation time to (N+2)/(2N+2) of the original, a speedup of almost 2. Thus, the managed-cache module optimizes the design here in two ways: (1) systemwide pipelining of the design dataflow, and (2) processing-level parallelism of the electric and magnetic updating algorithms.
Expansion to three dimensions
Here we expand the two-dimensional cache module design to three dimensions. The memory interface and the cache modules are more complex in the three-dimensional FDTD hardware implementation, which handles many more data transfers and calculations. There are two possible approaches for upgrading the cache module to three dimensions.
The first is a direct upgrade of the two-dimensional memory interface, as shown in Figure 32.13. Instead of a 4-row cache module, we need to build a 4-slice cache module. Here we READ one slice, CALCULATE one slice, and WRITE out one slice of data at each time interval. However, a 4×100×100 cache module consumes more than 1200 18-Kb BlockRAMs, which is over three times all the BlockRAMs on the targeted Virtex-II Pro XC2VP70. This approach is not feasible for large three-dimensional model spaces.
The second approach reduces the size of the cache module to 4×3 rows of field data by cutting the model space into slices and then into rows. As shown in Figure 32.14, the cache module reads three rows of field data at each time interval, goes through the current vertical slice until it finishes, and then goes to the next vertical slice in the model space.
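The BlockRAM budget behind the two approaches can be estimated with a capacity-only calculation. The sketch below is a rough lower bound under stated assumptions: 15 field data per cell stored as 36-bit words, each BlockRAM counted as 18 × 1024 bits, and a 100-cell row length; it ignores the extra BlockRAMs a real design needs to provide wide parallel access, so actual counts are higher. The constant and function names are hypothetical.

```python
FIELDS_PER_CELL = 15          # 6 EM fields + 9 UPML intermediate fields
BITS_PER_FIELD = 36           # storage word width
BLOCKRAM_BITS = 18 * 1024     # one 18-Kb BlockRAM (assumption: all bits usable)
BLOCKRAMS_ON_CHIP = 328       # BlockRAM count quoted for the target FPGA


def blockrams_needed(cells: int) -> int:
    """Lower bound on BlockRAMs needed to hold the given number of cells."""
    bits = cells * FIELDS_PER_CELL * BITS_PER_FIELD
    return -(-bits // BLOCKRAM_BITS)   # ceiling division


four_slice = blockrams_needed(4 * 100 * 100)      # 4 slices of 100x100 cells
four_by_three_rows = blockrams_needed(4 * 3 * 100)  # 4x3 rows of 100 cells

print(f"4-slice cache:   at least {four_slice} BlockRAMs")
print(f"4x3-row cache:   at least {four_by_three_rows} BlockRAMs")
print(f"available:       {BLOCKRAMS_ON_CHIP} BlockRAMs")
```

Even as a lower bound, the 4-slice cache exceeds three times the on-chip BlockRAM count for a 100×100×100 model, while the 4×3-row cache needs only a small fraction of it, and its size grows with the row length rather than with the slice area.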
Instead of a 4-slice cache module, we only need to build a 4×3 row cache module. This method minimizes BlockRAM consumption; however, it sacrifices overall design speed to achieve larger model space compatibility. We READ three rows of data at each time interval to CALCULATE only one row of results. This is because we need the current row and the two adjacent rows of data to calculate the current row’s results. Because only one row of results is calculated from the field-updating pipelines, the READ process is longer than the CALCULATE and WRITE processes, so the other two processes need to wait for the READ process. This waiting slows down the hardware design. Fortunately, we do not need to READ all three rows (45 data per cell) to start processing since the field-updating algorithm only needs part of the data in adjacent rows. We only need to READ approximately two rows of data (36 data per cell), CALCULATE one row, and WRITE one row at each time interval. Due to the limited number of BlockRAMs, the second approach is more practical.

FIGURE 32.13 A structural diagram of the 4-slice caching design. [Diagram: input onboard memory and input BlockRAM cache module holding four slices of data: READ one slice of data while CALCULATE one slice; WRITE one slice of data at the same time.]

FIGURE 32.14 A structural diagram of the 4×3 row caching module. [Diagram: input onboard memory and input BlockRAM cache module holding 4×3 rows of data: READ approximately two rows while CALCULATE one row.]

From the preceding analysis of the managed-cache modules, we conclude that the efficiency of the memory interface plays a key role in the performance of the complete FPGA design. The speed and manner in which the memory interface handles the input data often limits the speed of the entire design.
Pipelining and parallelism
Given an efficient memory interface and proper fixed-point data representations, the designer next needs to adjust the architecture and optimize design performance by considering pipelining and parallelism. As discussed before, we can implement the electric and magnetic updating algorithms in parallel with the correct cache structure. We can also implement the three key processes (READ, CALCULATE, and WRITE) in parallel by separating the input and output memory interfaces and building dual-ported cache modules. In hardware design, parallelism translates to faster speed; however, it also “costs” more in hardware resources. The FDTD algorithm is large enough to reach the resource limits of the most advanced FPGAs on the market. One of the important problems in FDTD hardware design is determining the design architecture by considering the trade-offs between resources and performance.
The hardware resource limit of each FPGA chip and computing board is different. The resource–performance trade-off analysis here is based on the targeted WildStar-II Pro FPGA computing board.

Pipelining

The FDTD algorithm repeats the same electric and magnetic updating algorithms, which are independent of each other, on every cell of the model
space. The algorithms can be implemented with complex combinational logic with long delay. Building them with deep pipelining helps shorten the clock cycle and increase the throughput of the hardware design. Because of these advantages, we pipeline all the updating algorithms. The embedded multipliers, which are the slowest components in the datapath, can also be pipelined to several stages to reduce delay. Because the lengths of the electric and magnetic updating pipelines are different, state machines are used to control the start and end of the pipelines and to synchronize them.

Parallelism

Because the updating calculations on every cell in the FDTD model space are independent of each other, as long as there are adequate hardware resources, the computation of two or more FDTD cells can be implemented in parallel. However (see Section 32.2.3), the bandwidth of the memory interface is the bottleneck of the FDTD hardware design. The memory data width here is 3×72 bits, which can transfer six 36-bit field data inputs at each clock cycle. At this bandwidth, 6 clock cycles are needed to prepare one cell’s 36 input data when using the 4×3 row cache module. Can this memory interface handle the increased parallelism? Running two cells in parallel actually saves memory bandwidth per cell. As shown in Figure 32.15, two adjacent FDTD cells share a portion of their nearest-neighbor cells. For a single cell, we need to read three rows of data (36 field data per cell) from the onboard memories; when running two cells in parallel, we only need to read four rows of data, or 24 data per cell. Because the bottleneck of the design is the memory bandwidth, the 2-cell parallelism mechanism improves the performance of the whole design. We can use the ratio between input data and result data as a metric to measure the efficiency of the memory interface. After implementing 2-cell parallelism, the input–result ratio decreases from 6:1 to 4:1.
Running two cells in parallel places an extra burden on the cache size and the calculation pipelines, however. The cache module needs to hold 4×4 rows of data at the same time instead of 3×4 rows. Fortunately, the Virtex-II Pro XC2VP70 FPGA has adequate BlockRAMs for the 4×4 row cache, but there is no space for increasing the cache beyond this, which is why we choose not to run three cells in parallel, even though this would further save memory bandwidth per cell and improve the input–result ratio.
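The bandwidth bookkeeping above can be checked with a short calculation. The following Python sketch is illustrative only: the constants are inferred from the figures quoted in the text (12 field data per row follows from 36 data over three rows, and six results per cell follows from the 6:1 ratio), and the function names are hypothetical rather than part of the authors' design.

```python
# Sketch of the input-result ratio arithmetic. All constants are inferred
# from the numbers quoted in the text, not taken from the actual design.

DATA_PER_ROW_UPML = 12  # inferred: 36 field data per cell / 3 rows
RESULTS_PER_CELL = 6    # inferred: 36 inputs at a 6:1 input-result ratio
DATA_PER_CYCLE = 6      # six field data per clock from the 3x72-bit memory


def inputs_per_cell(cells_in_parallel, data_per_row=DATA_PER_ROW_UPML):
    """Field data read per cell when adjacent cells share neighbor rows."""
    rows_read = cells_in_parallel + 2  # each additional cell adds one row
    return rows_read * data_per_row / cells_in_parallel


def input_result_ratio(cells_in_parallel, data_per_row=DATA_PER_ROW_UPML):
    """Input data read per result produced."""
    return inputs_per_cell(cells_in_parallel, data_per_row) / RESULTS_PER_CELL


def read_cycles_per_cell(cells_in_parallel, data_per_row=DATA_PER_ROW_UPML):
    """Clock cycles the memory interface needs to supply one cell."""
    return inputs_per_cell(cells_in_parallel, data_per_row) / DATA_PER_CYCLE


for n in (1, 2, 3):
    print(n, inputs_per_cell(n), input_result_ratio(n), read_cycles_per_cell(n))
```

Under these assumptions, one cell needs 36 inputs (6:1, six cycles) and two cells need 24 inputs each (4:1, four cycles). A third cell would lower the figure further, which matches the remark that 3-cell parallelism is ruled out by cache capacity rather than by bandwidth.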
FIGURE 32.15 Running two cells in parallel.
Also, the Virtex-II Pro XC2VP70 FPGA does not have enough reconfigurable logic to implement all the updating pipelines in parallel. However, because the memory interface takes four clock cycles to transfer enough input data for one cell’s calculation, the number of parallel updating pipelines can be reduced. The calculation core can run several updating algorithms serially in one updating pipeline, taking more than one clock cycle to finish the calculation for one cell. The serial calculation reduces the level of parallelism, saves reconfigurable logic, and still maintains the performance of the hardware design.

Two hardware implementations

The preceding input–result ratio is calculated based on the input data needed for the uniaxial PML FDTD algorithm. This algorithm treats the whole model space as UPML cells and provides a uniform structure for both the UPML cells and the non-UPML center cells, as shown in Figure 32.16. However, the UPML FDTD algorithm requires nine extra field data for each cell in the model space, which adds overhead to the memory interface. The cells in the center of the model space that are not located in the UPML layer can be calculated by the normal FDTD algorithm, which has only six field data for each cell. Small modifications to the UPML updating pipelines can make the new updating pipelines work on both the UPML cells and non-UPML center cells. Therefore, we can save memory bandwidth and memory space on the center cells by combining the UPML and center cell algorithms in the hardware design. The input–result ratio of a center cell is 3:1 and will be 2:1 after applying 2-cell parallelism. For the normal model space, where half the cells are center cells and the other half are UPML cells, the overall input–result ratio will decrease to approximately (4:1 + 2:1)/2 = 3:1, raising the performance of the hardware design.
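The averaging in the last sentence generalizes to any mix of cell types. As a minimal sketch, assuming every cell produces the same number of results so that a cell-count-weighted average is valid, the overall ratio can be written as follows; the function name and default arguments are hypothetical.

```python
def overall_input_result_ratio(center_fraction, upml_ratio=4.0, center_ratio=2.0):
    """Cell-count-weighted input-result ratio with 2-cell parallelism.

    center_fraction is the share of cells outside the UPML layer (0.0 to 1.0).
    The default ratios are the 4:1 and 2:1 figures quoted in the text.
    """
    if not 0.0 <= center_fraction <= 1.0:
        raise ValueError("center_fraction must lie between 0 and 1")
    return center_fraction * center_ratio + (1.0 - center_fraction) * upml_ratio


print(overall_input_result_ratio(0.5))  # half center cells, half UPML cells
```

With half the cells in the center region this reproduces the 3:1 figure; model spaces with a thinner UPML layer move the ratio toward 2:1.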
FIGURE 32.16 Uniaxial PML boundary condition cells and non-uniaxial PML center cells in the model space. (Figure labels: UPML cells, center cells.)
We have two hardware implementations for the uniaxial PML FDTD algorithm. The first implementation treats the whole model space as UPML cells, with a simpler design structure and an input–result ratio of 4:1. The second implementation, which includes center cell and UPML cell calculations, has a more complex memory interface and better performance (the input–result ratio depends on the number of center cells and UPML cells). The analysis of resources and performance trade-offs here is based on the WildStar-II Pro FPGA computing board. For other FPGA devices, the analysis is similar. A wider onboard memory data width, which can ease the memory bottleneck, will raise the design performance proportionally. A bigger FPGA chip, which can hold larger cache modules and more updating pipelines, will speed up the hardware design by calculating more cells in parallel.
32.2.4 Performance Results
A comparison of performance results for three-dimensional FDTD software and hardware implementations is shown in Table 32.4. The sample model is a 50×50×50 three-dimensional uniaxial PML FDTD algorithm model with 500 timesteps of FDTD iteration. The fixed-point FDTD hardware design, which treats all cells as UPML boundary cells, runs at 90 MHz on the WildStar-II Pro FPGA board. The UPML FDTD FPGA implementation is 16 times faster than the floating-point Fortran software implementation running on a 3.0-GHz PC. Hardware times are measured on the board and include the time to transfer data between the FPGA board and the host PC at the start and end of computation. The hardware design speedup can increase to 25 times with the implementation that combines the center and UPML region.

The Virtex-II Pro XC2VP70 FPGA chip is almost fully utilized because the FDTD hardware design occupies 99 percent of the reconfigurable slices, 51 percent of the BlockRAMs, and 46 percent of the embedded multipliers. There are two Xilinx Virtex-II Pro FPGAs on a WildStar-II Pro FPGA board. Dual-FPGA parallel implementations of the FDTD algorithm are expected to double the speedup.

TABLE 32.4 Three-dimensional FDTD hardware implementation performance results

Implementation                                                            Runtime (sec)   Million nodes/sec   Speedup
Software floating-point Fortran code on 3.0-GHz PC                        49              1.27                1
Hardware fixed-point design at 90 MHz, all cells as center cells          1.59            39.31               30.9
Hardware fixed-point design at 90 MHz, all cells as UPML boundary cells   2.985           20.93               16.5
Hardware fixed-point design at 90 MHz, combined center and UPML region    1.89            33.07               25.9
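The throughput column of Table 32.4 follows from the model size and the measured runtimes. The short Python sketch below recomputes it as a consistency check; the dictionary keys and function names are illustrative, and the speedups it yields may differ from the table in the last digit because the published runtimes are rounded.

```python
NODES = 50 * 50 * 50  # cells in the sample model
TIMESTEPS = 500       # FDTD iterations

# Runtimes in seconds, as listed in Table 32.4.
RUNTIMES_SEC = {
    "software": 49.0,
    "hardware_center": 1.59,
    "hardware_upml": 2.985,
    "hardware_combined": 1.89,
}


def million_nodes_per_sec(runtime_sec):
    """Node updates per second, in millions."""
    return NODES * TIMESTEPS / runtime_sec / 1e6


def speedup(runtime_sec, baseline_sec=RUNTIMES_SEC["software"]):
    """Speedup relative to the software baseline."""
    return baseline_sec / runtime_sec


for name, runtime in RUNTIMES_SEC.items():
    print(name, round(million_nodes_per_sec(runtime), 2), round(speedup(runtime), 1))
```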
32.3 SUMMARY

Implementing the FDTD algorithm in hardware greatly increases its computational speed. The speedup is due to three major factors: fixed-point representation, custom memory interface design, and pipelining and parallelism. FDTD is a data-intensive algorithm; the bottleneck of the hardware design is its memory interface. With the limited bandwidth between the FPGA and data memories, a carefully designed custom memory interface allows for full utilization of the memory bandwidth and greatly improves performance. The FDTD algorithm is also computationally intensive; by considering the trade-offs between resources and performance, we implement as much pipelining and parallelism as possible to speed up the design.

The FDTD algorithm is also a cellular automaton (CA), sharing a similar algorithmic structure with many other CA problems. The hardware design techniques and memory interface architecture presented in this chapter can be applied to a wide range of other CA problems to achieve speedup on an FPGA and to provide fast, small, low-power, and inexpensive implementations.
References

[1] K. S. Kunz, R. J. Luebbers. The Finite Difference Time Domain Method for Electromagnetics, CRC Press, 1993.
[2] A. Taflove, S. C. Hagness. Computational Electrodynamics: The Finite-Difference Time-Domain Method, 2nd ed., Artech House, 2000.
[3] A. Taflove. Advances in Computational Electrodynamics: The Finite-Difference Time-Domain Method, Artech House, 1998.
[4] K. Yee. Numerical solution of initial boundary value problems involving Maxwell’s equations in isotropic media. IEEE Transactions on Antennas and Propagation 16, 1966.
[5] J. P. Berenger. Three-dimensional perfectly matched layer for the absorption of electromagnetic waves. Journal of Computational Physics 127, 1996.
[6] A. Taflove. Reinventing electromagnetics: Emerging applications for FD–TD computation. IEEE Computational Science and Engineering 2(4), 1995.
[7] B. Yang, C. Rappaport. Response of realistic soil for GPR applications with two-dimensional FDTD. IEEE Transactions on Geoscience and Remote Sensing, June 2001.
[8] P. Kosmas, Y. Wang, C. Rappaport. Three-dimensional FDTD model for GPR detection of objects buried in realistic dispersive soil. SPIE Proceedings 4742, April 2002.
[9] P. Kosmas, C. Rappaport. Modeling with the FDTD method for microwave breast cancer detection. IEEE Transactions on Microwave Theory and Techniques 52(8), 2004.
[10] P. Kosmas, C. Rappaport. Use of the FDTD method for time reversal: Application to microwave breast cancer detection. SPIE Proceedings Computational Imaging 5299, 2004.
[11] Xilinx, Inc. Virtex-II Pro and Virtex-II Pro X Platform FPGAs: Complete Data Sheet, 2004.
[12] Annapolis Micro Systems. WildStar-II Hardware Reference Manual, 2004.
CHAPTER 33

EVOLVABLE FPGAs

Andres Upegui, Eduardo Sanchez
School of Computer and Communication Sciences, Ecole Polytechnique Fédérale de Lausanne
Reconfigurable and Embedded Digital Systems Institute, Haute Ecole d’Ingénierie et de Gestion du Canton de Vaud
One of the main advantages of living beings over engineered computing systems is their capacity to adapt. While computers are tied to a fixed architecture predefined at design time, the human brain exhibits an impressive structural plasticity whereby interconnections are constantly being reinforced or destroyed according to environmental interactions. This and other comparisons between computers and living beings have given rise to what we know today as bioinspired hardware design. Evolvable hardware is a bioinspired technique that has enjoyed impressive growth during the last decade. In 1993 Higuchi et al. and de Garis proposed an analogy between living beings and programmable hardware devices [1, 2]: In both cases specification of the system is by means of a finite string of symbols. In the case of living beings, DNA determines how the organism develops into its final phenotypic representation; in programmable hardware devices, a configuration bitstream drives behavior. This parallel suggests the utilization of so-called evolutionary algorithms in the design of hardware systems.
33.1 THE POE MODEL OF BIOINSPIRED DESIGN METHODOLOGIES

Living organisms, from microscopic bacteria to giant sequoias, including animals such as butterflies and humans, have successfully survived on Earth for millions of years. If we had to propose but one key to explain this success, it certainly would be adaptation. In contrast with nature, adaptation has been very elusive to human technology. The model examples of adaptive systems are not among human creations but among nature’s: natural organisms show a striking capacity to adapt to changing circumstances, thus ensuring their continued functionality. During the last few years, computer scientists, inspired by certain biological processes, have given birth to domains such as artificial neural networks and evolutionary computation.
Living organisms are complex systems exhibiting a range of desirable characteristics, such as evolution, adaptation, and fault tolerance, which have proved difficult to realize using traditional engineering methodologies. Such systems are characterized by a genetic program, the genome, that guides their development, their functioning, and their death. If one considers life on Earth from its very beginning, the following three levels of organization can be distinguished [3].

Phylogeny: The first level is the temporal evolution of the genetic program, the hallmark of which is the evolution of species, or phylogeny. The multiplication of living organisms is based on the reproduction of the program, subject to an extremely low error rate at the individual level to ensure that the species of the offspring remains unchanged. Mutation (asexual reproduction) or mutation with recombination (sexual reproduction) gives rise to new organisms. The phylogenetic mechanisms are fundamentally nondeterministic, with the mutation and recombination rate providing a major source of diversity. This diversity is indispensable for the survival of living species, for their continuous adaptation to a changing environment, and for the appearance of new species.

Ontogeny: This level constitutes the developmental process of multicellular organisms. The successive divisions of the mother cell, the zygote, into newly formed cells, each possessing a copy of the original genome, are followed by a specialization of the daughter cells in accordance with their surroundings (i.e., their position within the ensemble). This latter phase is known as cellular differentiation. The ontogenetic process is essentially deterministic: An error in a single base within the genome can provoke an ontogenetic sequence that results in notable, possibly lethal, malformations.

Epigenesis: The ontogenetic program is limited in the amount of information it can store, rendering the complete specification of the organism impossible. A well-known example is the human brain, whose approximately 10^10 neurons and 10^14 connections are far too many to be completely specified in the 4-character genome with a length of approximately 3 × 10^9. Therefore, when a certain level of complexity is reached, there must emerge a different process that permits the individual to integrate its vast quantity of interactions with the outside world. This is known as epigenesis, which primarily includes the nervous, immune, and endocrine systems. These systems are characterized by a basic structure that is entirely defined by the genome (the innate part), which is then subjected to modification through the individual’s lifetime interactions with the environment (the acquired part). The epigenetic processes can be grouped under the heading of learning systems.

Analogous to nature, the space of bio-inspired hardware systems can be partitioned along the phylogenetic, ontogenetic, and epigenetic axes; we refer to this as the POE model [3, 4]. The distinction between the axes cannot be easily drawn
where nature is concerned. We therefore define each axis within the model’s framework as follows:

• The phylogenetic axis involves evolution.
• The ontogenetic axis involves the development of a single individual from its own genetic material, essentially without environmental interactions.
• The epigenetic axis involves learning through environmental interactions that take place after the individual is formed.
As an example, consider the following three paradigms, whose hardware implementations can be positioned along the POE axes:

• P: Evolutionary algorithms are the simplified artificial counterpart of phylogeny.
• O: Self-replicating and self-repairing cellular automata are based on the concept of ontogeny, where a single mother cell gives rise through multiple divisions to a multicellular organism.
• E: Artificial neural networks embody the epigenetic process, where the system’s synaptic weights and perhaps topological structure change through interactions with the environment.
The domains collectively referred to as soft computing [5] often involve the solution of ill-defined problems coupled with the need for continual adaptation or evolution. The paradigms listed yield impressive results, frequently rivaling those of traditional methods. This chapter addresses the phylogenetic axis of bio-inspired hardware systems, best known as evolvable hardware (EHW). The scope of EHW covers diverse areas ranging from analog circuits to antenna design, but this chapter focuses on the evolution of digital circuits using reconfigurable computing devices, more precisely, field-programmable gate arrays (FPGAs).
33.2 ARTIFICIAL EVOLUTION

The idea of applying the biological principle of natural evolution to artificial systems, introduced more than three decades ago, has seen impressive growth in the past few years. Usually grouped under the term evolutionary algorithms (EAs) or evolutionary computation, we find the domains of genetic algorithms, evolution strategies, evolutionary programming, and genetic programming [6–9].
33.2.1 Genetic Algorithms
As a generic example of artificial evolution, we consider genetic algorithms (GAs) [10]. As illustrated in Figure 33.1, a GA is an iterative procedure applied to a constant-size population of individuals. Each individual represents a possible solution to the given problem, and one is eventually chosen as the solution sought. Each individual is coded by a finite string of symbols from a given alphabet, known as the genome. Each genome gives rise to the individual’s phenotype, which constitutes the actual solution (a program or a circuit) to the problem at hand (e.g., a robot controller for the example in Figure 33.1). The individual receives a score (better known as fitness) depending on the performance exhibited during its evaluation. The process from the genome to a fitness value can be seen as an n-dimensional function (where n is the genome size), and the set of all possible solutions can be seen as an n-dimensional search space.

FIGURE 33.1 A genetic algorithm. (Stages shown: initialize random population, decoding, evaluation, selection, crossover, mutation, and new generation; the example population consists of six 11-bit genomes.)

A GA can be summarized in the following steps:

1. Initialization: Create an initial population of individuals by defining a set of genomes in a random or heuristic manner.
2. Decoding: Generate the phenotypes for the individuals in the current population by decoding (mapping) the genotypes.
3. Fitness evaluation: Evaluate individuals according to some predefined quality criterion, referred to as fitness or fitness function.
4. Genetic operators: Apply genetically inspired operators to the current population:
   (a) Selection: Individuals are selected into a mating pool for reproduction according to their fitness. With stochastic or deterministic selection mechanisms, the fittest individuals have more chances to transmit their genetic material to the next generation.
   (b) Mutation: The genome is randomly changed.
   (c) Crossover: Two genomes are selected to be split and swapped at a random position.
5. If a predefined convergence condition has not been met, go back to step 2 to evaluate a new generation. Otherwise, deliver the best individual evaluated.

The basic components of GAs are always the same: a population of individuals, a decoding mechanism from a genotype to a phenotype, a fitness evaluation, genetic operators, and an iterative process. However, GAs allow variants: There exist several methods for defining each of the steps just listed. By running a large enough number of generations, the GA should eventually find an acceptable solution (i.e., one with high fitness).

EAs can be considered as a family of stochastic global optimization algorithms, mainly differing from their deterministic counterparts [11] by the smaller amount of problem knowledge they require and by the absence of mathematical proofs of convergence due to their stochastic nature. For highly nonlinear search spaces, EAs have exhibited faster convergence than deterministic methods, given their population-based approach. In most cases, the applications solved by EAs can also be tackled with deterministic optimization methods. EAs are very common, having been successfully applied to numerous problems from domains as diverse as optimization, circuit design, disease diagnosis assistance, precision agriculture, self-organizing systems, automatic programming, machine learning, economics, immune systems, ecology, population genetics, studies of evolution and learning, and social systems [9].
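The five GA steps listed in this section can be made concrete with a small program. The Python sketch below is one possible instantiation, not a reference implementation: the fitness criterion (counting 1 bits, often called OneMax), the tournament selection scheme, the mutation rate, and all function names are assumptions chosen for illustration, and the population of six 11-bit genomes merely echoes the sizes drawn in Figure 33.1.

```python
import random


def decode(genome):
    """Step 2: map genotype to phenotype (here the bit list itself)."""
    return genome


def fitness(phenotype):
    """Step 3: toy quality criterion (assumed): the number of 1 bits."""
    return sum(phenotype)


def select(population, scores, rng, k=2):
    """Step 4a: tournament selection (assumed); fitter individuals win."""
    contenders = rng.sample(range(len(population)), k)
    return population[max(contenders, key=lambda i: scores[i])]


def mutate(genome, rng, rate):
    """Step 4b: flip each bit independently with probability `rate`."""
    return [bit ^ 1 if rng.random() < rate else bit for bit in genome]


def crossover(a, b, rng):
    """Step 4c: split two genomes at a random position and swap the tails."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:], b[:point] + a[point:]


def run_ga(genome_len=11, pop_size=6, generations=50, rate=0.05, seed=0):
    rng = random.Random(seed)
    # Step 1: random initial population of constant size.
    population = [[rng.randint(0, 1) for _ in range(genome_len)]
                  for _ in range(pop_size)]
    best, best_score, history = None, -1, []
    for _ in range(generations):
        scores = [fitness(decode(g)) for g in population]  # steps 2 and 3
        for genome, score in zip(population, scores):
            if score > best_score:
                best, best_score = genome, score
        history.append(best_score)
        if best_score == genome_len:  # step 5: convergence condition
            break
        next_gen = []
        while len(next_gen) < pop_size:  # step 4: build the new generation
            child_a, child_b = crossover(select(population, scores, rng),
                                         select(population, scores, rng), rng)
            next_gen += [mutate(child_a, rng, rate), mutate(child_b, rng, rate)]
        population = next_gen[:pop_size]
    return best, best_score, history


best, best_score, history = run_ga()
print(best, best_score, len(history))
```

Replacing `decode` and `fitness` is all that is needed to retarget the loop at another problem, which reflects the observation that the basic components of a GA stay the same while each step admits many variants.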
33.3 EVOLVABLE HARDWARE

In the case of humans, adaptation due to evolution comes about through modifications in our DNA (deoxyribonucleic acid), which constitutes the encoding of every living being on Earth. DNA is a double-stranded molecule composed of two sugar-phosphate chains linked together by pairs of the bases adenine, cytosine, guanine, and thymine, constituting a string of symbols from a quaternary alphabet (A, C, G, T). Similarly, reconfigurable logic devices are configured by a string of symbols (the configuration bitstream) from a binary alphabet (0, 1). This string determines the function implemented by each of the programmable components and the connectivity of each of the switch matrices. With this description, a rough analogy arises naturally between DNA and a configuration bitstream and between a living being and a circuit (Figure 33.2). In both cases there is a mapping from a string representation to an entity that will perform one or more actions: growing, moving, reproducing, and so forth, for living beings; computing a function for circuits.
FIGURE 33.2 The analogy between living beings and digital circuits. (Figure labels: a DNA genotype and its phenotype; a bitstream genotype, 101010001100101101, and its phenotype, a circuit on an FPGA.)
FIGURE 33.3 The evolutionary design of digital circuits: (a) initial random circuit, (b) intermediate circuit, and (c) final circuit. (Each panel shows a genotype bitstream and its phenotype; successive panels are separated by several generations of the evolutionary process.)
This analogy between living beings and digital circuits suggests the possibility of applying the principles of artificial evolution to circuit design (Figure 33.3). Designing analog and digital electrical circuits is, by tradition, a hard engineering task vulnerable to human error, and for large circuits the optimality of a solution cannot be guaranteed. Design automation has become a challenge for tool designers, and given the increasing complexity of circuits, higher abstraction levels are needed. Evolvable hardware arises as a promising solution to this
problem: From a given behavior specification of a circuit, an EA will search for a bitstream describing a circuit that satisfies it.

If we carefully examine the EHW work carried out to date, it becomes evident that it mostly involves the application of EAs to the synthesis of digital systems [12–23]. From this perspective, EHW is simply a subdomain of artificial evolution, where the final goal is the synthesis of an electronic circuit. The work of Koza [8], which includes the application of genetic programming to the evolution of a 3-variable multiplexer and a 2-bit adder, may be considered an early precursor along this line. It should be noted that, at that time, the main goal was to demonstrate the capabilities of the genetic programming methodology rather than to design actual circuits. We argue that the term evolutionary circuit design would be more descriptive of such work than the term evolvable hardware [24]. For now, we will stay with the latter (popular) term; however, we will return to the issue of definitions in Section 33.4.

Taken as a design methodology, EHW offers a major advantage over classical methods. The designer’s job is reduced to constructing the evolutionary setup, which involves specifying the circuit requirements, the basic elements, a decoding mechanism, and the testing scheme used to assign fitness (this last phase is often the most difficult). If the setup has been well designed, evolution may then (automatically) generate the desired circuit. Currently, most evolved digital designs are suboptimal with respect to traditional methodologies; however, improved results are regularly demonstrated.

There are two critical questions to ask when setting up a system to be evolved: how to map a phenotype from a genotype and how to compute the fitness of a circuit. The answers to these questions can make the difference between a successful and an unsuccessful evolution.
33.3.1 Genome Encoding
In examining the EHW work carried out to date, we can derive a classification of current EHW in accordance with genome encoding (i.e., the circuit description) and the calculation of a circuit’s fitness.

High-level languages

Using a high-level functional language to encode the evolving population implies an additional step to obtain the final circuit implementation: The chosen individual must be synthesized. Koza’s evolved solution [8] was a program that described the (desired) multiplexer or adder rather than an interconnection diagram of logic elements (the actual hardware representation). Mermoud et al. [25] used fuzzy rules as evolvable components, and Murakawa et al. [26] and Upegui et al. [27] proposed the evolution of artificial neural network topologies at the neuron and layer levels. Hemmi et al. [28] used a high-level HDL to represent the genomes. Koza et al. [29] used the rewriting operator, in addition to crossover and mutation, to form a hierarchical structure.
Low-level languages

The idea of directly incorporating the bit string representing the configuration of a programmable circuit within the genome was presented early on by Atmar [30] and more recently by Higuchi et al. [1] and de Garis [2]. As a first step, a set of basic logic gates must be chosen (e.g., AND, OR, and NOT) and suitably codified, along with the interconnections between gates, to produce the genome encoding. For example, Higuchi et al. [31] used a low-level bit-string representation of the system’s logic diagram to describe small-scale programmable array logics (PALs), where the circuit is restricted to a logic sum of products. The limitations of PAL circuits have been overcome to a large extent by the introduction of FPGAs, as used initially by Thompson [32, 33] and later by a number of research groups.

The use of a low-level circuit description that requires no further transformation is an important step forward because it potentially enabled the placing of the genome directly into the actual circuit and thus paved the way toward true EHW (we will elaborate on this in Section 33.4). However, FPGAs presented two major problems: (1) The genome’s length was on the order of tens of thousands of bits, rendering evolution practically impossible using current technology, and (2) within the circuit space, consisting of all representable circuits, many circuits were invalid.

With the introduction of the Xilinx XC6200 [34] family of FPGAs, these problems were reduced. As with previous FPGA families, there was a direct correspondence between the bit string of a cell and the actual logic circuit; however, because the XC6200 was completely multiplexer based, the result was always a viable system with no short circuits. Moreover, as opposed to previous FPGAs where the entire system had to be configured, the XC6200 family permitted the separate configuration of each cell, which was markedly faster and more flexible.
Thompson [32] employed this feature to reduce the genome’s size, although he did not introduce real-time, partial system reconfigurations. Unfortunately, the XC6200 was discontinued after a few years; however, the results achieved by directly evolving its bitstream led to increased visibility for the EHW community and made possible the growth of this research field.

Fitness calculation

Note the following with regard to calculations for fitness with evolvable hardware:

• Off-chip. The use of a high-level language for genome representation means that we have to transform the encoded system to evaluate its fitness. This is usually carried out by simulation, and only the final solution found by evolution is actually implemented in hardware.
• On-chip. As noted previously, the low-level genome representation enables a direct configuration (and reconfiguration) of the circuit, which leads to the possibility of using real hardware during the evolutionary process. An example of on-chip fitness calculation is presented in the next section in the form of an intrinsic evolvable system.
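To illustrate how a low-level genome might codify gates and their interconnections, and how such a genome can be scored off-chip by simulation, the following Python sketch decodes a bit string into a feed-forward gate list and compares its truth table with a target function. The encoding is entirely hypothetical (a 2-bit opcode plus two 3-bit source selectors per gate, with XOR added as a fourth opcode) and does not correspond to any real FPGA bitstream format. Taking the selectors modulo the number of available signals is a design choice made here so that every bit string decodes to a valid circuit.

```python
from itertools import product

# Hypothetical gate set; opcode 3 (XOR) is added only to fill the 2-bit field.
GATES = {
    0: lambda a, b: a & b,  # AND
    1: lambda a, b: a | b,  # OR
    2: lambda a, b: 1 - a,  # NOT (second input ignored)
    3: lambda a, b: a ^ b,  # XOR
}
OPCODE_BITS, SEL_BITS = 2, 3
GENE_BITS = OPCODE_BITS + 2 * SEL_BITS  # bits per gate


def bits_to_int(bits):
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value


def decode(genome, n_inputs):
    """Map a bit-list genome to a list of (opcode, source_a, source_b) gates."""
    gates = []
    for g in range(len(genome) // GENE_BITS):
        gene = genome[g * GENE_BITS:(g + 1) * GENE_BITS]
        available = n_inputs + g  # primary inputs plus earlier gate outputs
        opcode = bits_to_int(gene[:OPCODE_BITS])
        src_a = bits_to_int(gene[OPCODE_BITS:OPCODE_BITS + SEL_BITS]) % available
        src_b = bits_to_int(gene[OPCODE_BITS + SEL_BITS:]) % available
        gates.append((opcode, src_a, src_b))
    return gates


def simulate(gates, inputs):
    """Evaluate the circuit; the output is the last gate's value."""
    signals = list(inputs)
    for opcode, src_a, src_b in gates:
        signals.append(GATES[opcode](signals[src_a], signals[src_b]))
    return signals[-1]


def fitness(genome, n_inputs, target):
    """Off-chip fitness: fraction of truth-table rows matching `target`."""
    gates = decode(genome, n_inputs)
    rows = list(product((0, 1), repeat=n_inputs))
    hits = sum(simulate(gates, row) == target(*row) for row in rows)
    return hits / len(rows)


# Gate 0: AND(in0, in1); gate 1: NOT(gate 0), which together form a NAND.
nand_genome = [0, 0, 0, 0, 0, 0, 0, 1] + [1, 0, 0, 1, 0, 0, 0, 0]
print(fitness(nand_genome, 2, lambda a, b: 1 - (a & b)))
```

A function like `fitness` could serve as the evaluation step of a GA; in an on-chip setting the call to `simulate` would be replaced by configuring the device and measuring its response.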
33.4 EVOLVABLE HARDWARE: A TAXONOMY

In EHW, the phylogenetic axis admits four qualitative subdivisions of evolution (Figure 33.4) according to the level of bio-inspiration: extrinsic, intrinsic, complete, and open ended.

33.4.1 Extrinsic Evolution

At the bottom of this axis, we find what is in essence evolutionary circuit design, where all operations are carried out in software, and the resulting solution may be loaded onto a real circuit. Though a potentially useful design methodology, this falls completely within the realm of traditional evolutionary techniques. This category is also widely known as extrinsic EHW. Extrinsic EHW has typically targeted the synthesis of circuits; that is, from a desired behavior specification, an EA finds a schematic of a circuit implementing a function that satisfies the specification [29]. This category supports different levels of abstraction, allowing the evolution of logic gates, arithmetic operations, more complex functional blocks, or HDL code; however, it is not suited for evolving circuits at the bitstream level. Evolution has also been used in other extrinsic aspects of circuit design such as placement and routing [35, 36] and scheduling and allocation [37].

FIGURE 33.4 The divisions of phylogenetic hardware. (The phylogeny axis is subdivided, from bottom to top, into extrinsic, intrinsic, complete, and open-ended evolution; the other two axes are ontogeny and epigenesis.)
33.4.2 Intrinsic Evolution
Moving upward along the axis, we find research in which a real circuit is used during the evolutionary process for fitness computation, although most operations are still carried out offline, in software, as depicted in Figure 33.5. The very first intrinsic evolution was reported by Thompson [32]. He evolved a section of an XC6216 FPGA, consisting of 10 × 10 cells (the full array size was 64 × 64), to discriminate between square waves of 1 kHz and 10 kHz presented as inputs. His complete system setup is depicted in Figure 33.6 (see Thompson [33]). From a PC, he configured the FPGA with a configuration bitstream generated by a GA, which used a genome of 1800 bits (18 configuration bits per cell) to represent a possible circuit. Then the individual’s fitness was automatically evaluated as follows:

1. The tone generator, driven by the PC, presented five bursts each of both waves (1 kHz and 10 kHz) to the circuit. The analog integrator was reset before the generation of each burst, and it then integrated the circuit’s output during the presentation of the burst.
2. Back in the PC, the individual’s fitness was computed by a function aiming to maximize the difference between the average output voltages when presenting both waves.
3. After running the experiment for 2 to 3 weeks, during which 5000 generations of 50 individuals were evaluated, the resulting circuit achieved successful discrimination of the waves. The perfect desired behavior, however, was already obtained around generation 4100.

FIGURE 33.5 Intrinsic evolution. (Diagram labels: EA execution + fitness computation; genotype = configuration bitstream; phenotype = FPGA circuit; results.)

FIGURE 33.6 Adrian Thompson’s intrinsic evolvable system setup. (Diagram labels: desktop PC, configuration, tone generator, XC6216 FPGA, analog integrator, output to oscilloscope.)

In another interesting project, Thompson et al. [38] evolved a hardware controller for a two-wheeled autonomous mobile robot that was required to display simple wall avoidance behavior in an empty rectangular arena.

A very important aspect of Thompson’s work is the unconstrained use of hardware. Conventional (human) design requires that constraints be applied to the circuit’s spatial structure and dynamic behavior, but evolution can do away with these. The circuits evolved by Thompson [33, 38] and Ly and Mowchenko [37] had no enforced spatial structure (e.g., limitations on recurrent connections), no impositions upon modularity, and no dynamic constraints such as a synchronizing clock or handshaking between modules. Unconstrained circuit design can better exploit the dynamics of the circuit supporting it; however, such circuits exhibit two main drawbacks. One is the impossibility of reproducing a solution: The same bitstream does not behave in the same manner in two different devices. The other is the circuit’s high sensitivity to external conditions: Slight temperature changes can modify its behavior.

Two more examples from this subdivision of the phylogenetic axis are the works of Murakawa et al. [39] and Iwata et al. [40]. One of the major obstacles these researchers hoped to overcome was large genome size (defining the FPGA’s full configuration). They suggested two solutions:

1. Variable-length chromosome GAs (VGA), where the genome does not directly represent the configuration bit string but rather codifies the possible logical operations and interconnections [40].
2. Evolution at the function level, where the basic units are not elementary logic gates (e.g., AND, OR, and NOT) but rather higher-level functions (e.g., sine-wave generator, multiplier) [39].

Because no such commercial FPGA currently exists, Murakawa and Iwata and their colleagues proposed a novel architecture, dubbed F²PGA (function-based FPGA).

It is important to note that while experiments of the above type have been referred to by some as intrinsic evolution, they have a prominent extrinsic aspect because the population is stored in an external computer, which also controls the evolutionary process.
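The fitness evaluation described above for Thompson’s tone discriminator can be sketched in a few lines. This is a simplified illustration, not the fitness function actually used in [32]: it assumes fitness is simply the absolute difference between the mean integrator readings for the two tones, and the voltage values are invented for the example. Only the cell count, the bits per cell, and the number of bursts come from the text.

```python
from statistics import mean

# Figures taken from the text: a 10 x 10 section, 18 configuration bits per cell.
CELLS = 10 * 10
BITS_PER_CELL = 18
GENOME_BITS = CELLS * BITS_PER_CELL  # 1800-bit genome
BURSTS_PER_TONE = 5


def discrimination_fitness(readings_1khz, readings_10khz):
    """Simplified fitness: gap between the mean integrator outputs for each tone.

    Assumption: the function used in the original experiment may weight or
    scale its terms differently; this only captures "maximize the difference".
    """
    if len(readings_1khz) != BURSTS_PER_TONE or len(readings_10khz) != BURSTS_PER_TONE:
        raise ValueError("expected five bursts per tone")
    return abs(mean(readings_1khz) - mean(readings_10khz))


# Hypothetical integrator readings (volts), one value per burst.
discriminating = discrimination_fitness([4.8, 4.9, 5.0, 4.7, 4.9], [0.2, 0.1, 0.3, 0.2, 0.1])
indifferent = discrimination_fitness([2.4, 2.6, 2.5, 2.5, 2.4], [2.5, 2.4, 2.6, 2.5, 2.5])
print(GENOME_BITS, discriminating > indifferent)
```

A circuit whose integrated output separates the two tones scores higher than one that responds identically to both, which is all the GA needs in order to rank individuals.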
33.4.3 Complete Evolution
Still further along the phylogenetic axis, we find systems in which all operations (selection, crossover, mutation), as well as fitness evaluation, are carried out intrinsically, in hardware (Figure 33.7). This category, called complete evolution by Haddow and Tufte [41], is motivated mainly by the goal of attaining adaptive systems that are able to accomplish difficult tasks, possibly involving real-time behavior in a complex, dynamic environment. The major aspect missing here, compared with biological evolution, is that the evolution is not open ended (i.e., there is a predefined goal and no dynamic environment to speak of). Within the category of complete evolution, we find two subdivisions: centralized and population oriented.

Centralized evolution
The main characteristic of centralized evolution is the existence of a single evolvable circuit and a single evolutionary algorithm computation (Figure 33.7(a)). With this approach an on-chip genetic machine, a hardwired EA, is implemented. The approach also comprises implementations where the EA is executed in an on-chip processor. Centralized evolution holds special interest because it greatly enhances the autonomy of the circuit, allowing the EHW to adapt to a changing environment during its lifetime. Implementations of EAs in general-purpose processors, in spite of their lower performance compared to their fully hardwired counterparts, exhibit several important advantages that permit them to benefit from a more general framework: They provide a more user-friendly interface for implementing chromosome manipulations, fitness evaluations, and memory access; they support easier algorithm upgrades; and they enhance the possibilities of immediately using the evolving circuit for useful computations.
FIGURE 33.7 Complete evolution: centralized (a) and population oriented (b). (Diagram labels: FPGA; EA + fitness computation on a specialized or general-purpose processor; genotype (G); results (R); phenotype (P). In (b), each individual is shown with its own EA + fitness unit.)
One example of a self-reconfigurable platform that performs online and on-chip evolution is that of Upegui and Sanchez [42, 43]. Their standalone platform consists of a MicroBlaze processor with memory access control, ICAP (internal configuration access port) access, and a reconfigurable evolvable section, as depicted in Figure 33.8. The full system, implemented in a Virtex-II FPGA, runs an EA on the MicroBlaze processor, reads a section of the configuration bitstream through the ICAP, modifies the bitstream according to the genome currently evaluated in the MicroBlaze, sends the bitstream back through the ICAP to partially reconfigure the FPGA, and evaluates the fitness of the current individual by interacting with the reconfigurable evolvable section through the standard OPB bus. Upegui and Sanchez [42] evolve nonuniform cellular rules and FPGA lookup table (LUT) configurations with fixed interconnectivity. In Upegui and Sanchez [43], Boolean networks are evolved as well, but in this case the interconnectivity is not fixed, so the system topology is also driven by the evolutionary algorithm.

Other interesting experiments were carried out by Haddow and Tufte [41] in which a hardware implementation of a GA, the “GA pipeline,” evolves a robot controller. Glette and Torresen [44] report the implementation of a GA on an embedded PowerPC processor in a Virtex-II Pro FPGA that evolves a circuit in the same FPGA.
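The read, modify, write back, and evaluate cycle described above can be illustrated with a small software model. Everything named here is hypothetical: `FakeConfigPort` merely stands in for a configuration port and is not the ICAP driver interface, the bit positions are invented, and the (1+1) loop with a toy fitness is a stand-in for whatever EA and measurement a real platform would use.

```python
import random


class FakeConfigPort:
    """Hypothetical stand-in for a configuration port: bitstream held in memory."""

    def __init__(self, size, fill=0x00):
        self.memory = bytearray([fill] * size)

    def read(self, offset, length):
        return bytes(self.memory[offset:offset + length])

    def write(self, offset, data):
        self.memory[offset:offset + len(data)] = data


def apply_genome(port, offset, length, bit_positions, genome):
    """Read a bitstream section, overwrite only the evolvable bits, write it back."""
    section = bytearray(port.read(offset, length))
    for position, bit in zip(bit_positions, genome):
        byte_index, shift = divmod(position, 8)
        if bit:
            section[byte_index] |= 1 << shift
        else:
            section[byte_index] &= ~(1 << shift) & 0xFF
    port.write(offset, bytes(section))


def evolve(port, offset, length, bit_positions, fitness, generations, rng):
    """(1+1) loop: flip one genome bit, reconfigure, keep the child if no worse."""
    parent = [rng.randint(0, 1) for _ in bit_positions]
    apply_genome(port, offset, length, bit_positions, parent)
    best = fitness(port)
    for _ in range(generations):
        child = parent[:]
        child[rng.randrange(len(child))] ^= 1
        apply_genome(port, offset, length, bit_positions, child)
        score = fitness(port)
        if score >= best:
            parent, best = child, score
    apply_genome(port, offset, length, bit_positions, parent)  # leave the best loaded
    return parent, best


# Toy setup: a 16-byte "bitstream", a 4-byte evolvable section, five evolvable bits.
OFFSET, LENGTH, POSITIONS = 4, 4, [0, 3, 9, 17, 30]


def count_ones(port):
    """Toy fitness standing in for a measurement of the reconfigured circuit."""
    section = port.read(OFFSET, LENGTH)
    return sum((section[p // 8] >> (p % 8)) & 1 for p in POSITIONS)


port = FakeConfigPort(16, fill=0xAA)
genome, score = evolve(port, OFFSET, LENGTH, POSITIONS, count_ones, 200, random.Random(0))
print(genome, score)
```

The point of the sketch is that only the designated evolvable bits are ever touched; the rest of the section, and everything outside it, is written back exactly as it was read.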
FIGURE 33.8 The setup of a complete and centralized self-reconfigurable evolvable platform. (Diagram labels: MicroBlaze core, LMB bus, BRAM, OPB bus, UART, HW_ICAP, SRAM controller, SRAM, RBN cell array with reading and writing interfaces.)
Population-oriented evolution
A hardware implementation of the full population, not only of one individual (as was the case in previous categories), is the distinctive feature of the population-oriented approach (Figure 33.7(b)). A significant example is the work of Goeke et al. [45], where an evolving cellular system was implemented in which evolution takes place completely on-chip. This system is based on the cellular automata model, a discrete dynamic system that performs computations in a distributed fashion on a spatially extended grid. A cellular automaton consists of an array of cells, each of which can be in one of a finite number of possible states, updated synchronously in discrete timesteps according to a local, identical interaction rule [46]. The state of a cell at the next timestep is determined by the current state of a surrounding neighborhood of cells. This transition is usually specified in the form of a rule table, which delineates the cell’s next state for each possible neighborhood configuration. The cellular array (grid) is n-dimensional, where typically n = 1, 2, or 3. Nonuniform cellular automata have also been considered in which the local update rule need not be identical for all grid cells [47].

Based on the cellular programming EA of Sipper [47], Goeke et al. [45] implemented an evolving, one-dimensional, nonuniform cellular automaton. The main feature of the cellular programming algorithm is the fact that genetic operators are computed in a distributed way: Each automaton modifies its own rule based on its own and its neighbors’ fitness. Each of the system’s 56 binary-state cells contains a genome that represents its rule table. These genomes are initialized at random and then are subjected to evolution. The environment imposed on the system specifies the resolution of a global synchronization task: On presentation of a random initial configuration of cellular states, the system must reach, after a bounded number of timesteps, a configuration for which the states of the cells oscillate between all zeros and all ones on successive timesteps. This may be compared to a swarm of fireflies, thousands of which may flash on and off in unison, having started from totally uncoordinated flickerings. Each insect has its own rhythm, which changes only through local interactions with its neighbors. Because of the local connectivity of the system, this global behavior, which involves the entire grid, makes for a difficult task. Nonetheless, applying the evolutionary process of Sipper [47], the system evolves (i.e., the genomes change) such that the task is completed. The evolving cellular system described here exhibits complete on-chip evolution in that all operations are performed in hardware in a distributed population-based manner with no reference to an external computer.
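The nonuniform cellular automaton and the synchronization task described above can be modeled as follows. This is a sketch under stated assumptions: a radius-1 neighborhood (so each cell’s rule table has eight entries), a ring-shaped grid, and an arbitrary number of settling steps. It only checks whether a given set of rule tables solves the task for one initial configuration; it does not implement the cellular programming algorithm of [47].

```python
N_CELLS = 56  # number of cells reported for the system in the text


def step(states, rule_tables):
    """One synchronous update of a 1-D, radius-1 cellular automaton on a ring.

    rule_tables[i] is cell i's own 8-entry table, indexed by the 3-bit
    neighborhood (left, self, right); this per-cell table is the "genome".
    """
    n = len(states)
    return [
        rule_tables[i][(states[i - 1] << 2) | (states[i] << 1) | states[(i + 1) % n]]
        for i in range(n)
    ]


def is_synchronized(states, rule_tables, settle_steps, check_steps=4):
    """True if, after settling, the grid alternates between all 0s and all 1s."""
    for _ in range(settle_steps):
        states = step(states, rule_tables)
    for _ in range(check_steps):
        following = step(states, rule_tables)
        uniform = len(set(states)) == 1 and len(set(following)) == 1
        if not uniform or states[0] == following[0]:
            return False
        states = following
    return True


# Every cell uses the same table here: output 0 only for neighborhood 111.
flip_rules = [[1, 1, 1, 1, 1, 1, 1, 0] for _ in range(N_CELLS)]
# Identity tables: each cell simply keeps its own state, so nothing oscillates.
identity_rules = [[0, 0, 1, 1, 0, 0, 1, 1] for _ in range(N_CELLS)]

alternating = [0, 1] * (N_CELLS // 2)
print(is_synchronized(alternating, flip_rules, settle_steps=1))
print(is_synchronized([0] * N_CELLS, identity_rules, settle_steps=1))
```

The hand-picked table works for this particular starting pattern only; the difficulty of the task, and the reason evolution is applied, is that the rule tables must work for random initial configurations.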
33.4.4 Open-ended Evolution
The last subdivision, situated at the top of the phylogenetic axis, involves a population of hardware entities evolving in an open-ended environment. When the fitness criterion is imposed by the user in accordance with the task to be performed (currently the rule with artificial evolution techniques), we attain a form of guided, or directed, evolution. This is to be contrasted with the open-ended evolution that occurs in nature, which admits no externally imposed fitness criterion but rather an implicit, emergent, dynamic one (which can arguably be summed up as reproducibility). Open-ended undirected evolution is the only form of evolution known to produce such devices as eyes, wings, and nervous systems and to give rise to the formation of species. Undirectedness may have to be applied to artificial evolution if we want to observe the emergence of completely novel systems.
We argue that only open-ended evolution can be truly considered EHW, which is still an elusive goal at present. We point out that a more correct term would probably be evolving hardware. A natural application area for such systems is the field of autonomous robots, that is, machines capable of operating in unknown environments without human intervention [48]. Specifically, collective robotics exhibits a population of individuals interacting in a common environment, in which they can learn to cooperate or to compete to achieve their goals [49]. In their interactions the individuals exhibit a high level of emergence as a first step to open-endedness. Modular robotics, a subtype of collective robotics, also offers a promising open-ended real environment. A modular robotic platform well suited for evolving distributed hardware is YaMoR. This is a modular robot composed of mechanically homogeneous modules [50], each of which contains an FPGA-based system that allows wireless FPGA configuration and on-board self-reconfiguration. Another interesting example is what we call Hard-Tierra. This involves the hardware implementation (e.g., FPGA circuits) of the Tierra “world,” which consists of an open-ended environment of evolving computer programs [51]. Hard-Tierra is important because it demonstrates that open-endedness does not necessarily imply a real, biological environment.
33.5 EVOLVABLE HARDWARE DIGITAL PLATFORMS
The hardware substrate that supports evolution is one of the most important initial decisions to make when evolving hardware. The hardware architecture is closely related to the type of solution being evolved. Hardware platforms usually have a cellular structure composed of uniform or nonuniform components. In some cases, we can evolve the components’ functionality; in others, the connectivity; and in the most powerful ones, both. FPGAs fit well into this third category because they are composed of configurable logic elements interconnected by configurable switch matrices. FPGA configuration is contained in a configuration bitstream, which holds every function and switch position to be configured for implementing a given design. Current FPGAs allow the processing of partial bitstreams, reconfiguring just a sector of the FPGA while the remaining logic stays the same.

When evolving a circuit on an FPGA, we consider the logic cell as the basic element. The logic cells’ configuration and their interconnectivity are defined by the evolution. However, this implies a huge search space to explore and can prevent the EA from finding a solution. A common technique to constrain the search space is to define a basic block as a set of logic cells. In this way each basic block can be an artificial neuron, a fuzzy rule, or a more complex cell in general. Another option is to constrain the connectivity: by using layered architectures, by restricting connections to a certain neighborhood, or by simply defining the connectivity as fixed.

The most basic requirement when evolving hardware is to have a set of high- or low-level evolvable components and a hardware substrate supporting them.
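A quick calculation shows why constraining the search space matters. The direct encoding below reuses the 10 × 10 cells at 18 bits per cell quoted earlier in the chapter for Thompson’s experiment; the coarser encoding (25 basic blocks with 8 parameter bits each) is purely hypothetical and chosen only for contrast.

```python
def search_space_digits(genome_bits):
    """Number of decimal digits in the count of distinct genomes, 2**genome_bits."""
    return len(str(2 ** genome_bits))


direct_bits = 10 * 10 * 18   # one bit per configuration bit: 1800-bit genome
block_bits = 25 * 8          # hypothetical: 25 basic blocks, 8 parameter bits each

print(direct_bits, search_space_digits(direct_bits))  # 1800 542
print(block_bits, search_space_digits(block_bits))    # 200 61
```

Under these assumptions the direct encoding spans a space whose size has 542 decimal digits, against 61 digits for the block-level encoding, which is the kind of reduction that grouping logic cells into basic blocks is meant to achieve.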
These evolvable components are the basic elements from which the evolved circuits will be built (transistors, logic gates, arithmetic functions, functional cells, etc.), and the evolvable substrate must be a flexible hardware platform that allows arbitrary configurations mapped from a genome. FPGAs constitute the perfect hardware substrate, given their connectivity and functional flexibility.

The evolvable substrate can be implemented using one of two main techniques: (1) exploiting the flexibility provided by the FPGA’s configuration logic and (2) building a virtual flexible substrate on top of the logic. In the first approach the configuration bitstream of the FPGA is directly generated. In this way, we can make better use of FPGA resources (logic functions are directly mapped into the FPGA’s LUTs, and connections are directly mapped to routing switch matrices and multiplexers), but the penalty is very low-level circuit descriptions [33, 38, 52]. In the second approach a virtual reconfigurable circuit is built on top of the actual circuit [53]. In this way the designer can also define the configuration bitstream and determine which features of the circuit to evolve. This approach has been widely used by several groups, as it produces enhanced flexibility and ease of implementation. The penalty here is the cost of an inefficient use of logic resources [25, 27, 42, 45, 53–60].

Different custom chips have been proposed for this purpose, with very interesting results. The main motivation for proposing a custom architecture is that commercial FPGAs are designed for general-purpose applications, so they do not necessarily fit the requirements of evolvable architectures. For example, commercial devices may have illegal configurations that cause short circuits; this is reasonable for standard FPGA users who rely on the CAD flow to create the design, but it can be disastrous for genetically evolved bitstreams. Custom evolvable chips generally provide dynamic and partial reconfiguration, contain multi-context configuration memories, and can be configured with arbitrary bitstreams. However, although the custom chips are better suited to EHW applications, the commodity devices benefit from economies of scale and access to more advanced fabrication processes.

Different chips and platforms have been developed to provide the flexibility necessary for evolving analog, digital, and mixed circuits; some of them have been designed specifically for EHW, while for others EHW is just another application field. Among them we find different levels of granularity, different types of reconfiguration (including dynamic and static reconfiguration), the possibility of loading partial configuration bitstreams, and the use of context memories.
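To make the second approach (a virtual reconfigurable circuit) more concrete, the sketch below models a tiny virtual fabric in software. The cell model is an assumption made for illustration and does not reproduce the architecture of any cited work: each virtual cell is described by three integers (a function selector and two input selectors), cells form a feed-forward chain, and selectors are reduced modulo the number of available choices so that every possible genome decodes to a legal configuration.

```python
# Hypothetical function set available to each virtual cell.
FUNCTIONS = [
    lambda a, b: a & b,   # AND
    lambda a, b: a | b,   # OR
    lambda a, b: a ^ b,   # XOR
    lambda a, b: 1 - a,   # NOT of the first input
]


def decode(genome, n_cells):
    """Split a flat list of integers into (function, input_a, input_b) triples."""
    return [tuple(genome[3 * i:3 * i + 3]) for i in range(n_cells)]


def evaluate(cells, inputs):
    """Feed-forward evaluation of the virtual circuit.

    Cell k may read any primary input or the output of any earlier cell.
    Selectors wrap around, so no genome can describe an illegal connection.
    """
    signals = list(inputs)
    for func, sel_a, sel_b in cells:
        a = signals[sel_a % len(signals)]
        b = signals[sel_b % len(signals)]
        signals.append(FUNCTIONS[func % len(FUNCTIONS)](a, b))
    return signals[-1]


# Two cells: AND of both inputs, followed by NOT of that result (a NAND gate).
nand_genome = [0, 0, 1, 3, 2, 0]
nand_cells = decode(nand_genome, 2)
print([evaluate(nand_cells, [a, b]) for a in (0, 1) for b in (0, 1)])
```

Because the genome configures the virtual cells rather than the FPGA itself, the underlying device is never exposed to an arbitrary bitstream; the price, as noted above, is the logic spent on implementing the virtual layer.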
33.5.1 Xilinx XC6200 Family
The obsolete Xilinx XC6200 family [61] deserves a special mention in a discussion of EHW platforms. For several years, the XC6200 family constituted the perfect platform for intrinsic EHW, because its multiplexer-based connection architecture made it possible to download any arbitrary bitstream without risking contention. It also allowed dynamic reconfiguration, making it more flexible for adaptive algorithms in a general sense. The results reported by Thompson [32, 33, 38, 62], discussed previously, are a very good example of the XC6200’s potential for evolving circuits. The XC6200 represents an important initial stepping-stone in the EHW field. It has also been used for implementing several types of applications, among them cooperative robot controllers [63], sorting networks [64], and image-processing algorithms [65].
33.5.2 Evolution on Commercial FPGAs
After the XC6200 disappeared, many research groups turned to the Xilinx XC4000 family. However, these FPGAs had an important drawback for evolving hardware: They were not partially reconfigurable, and no arbitrary bitstreams were allowed. When the Virtex FPGAs appeared, they exhibited two features well appreciated by the EHW community: partial and dynamic reconfiguration. However, not all the evolution-friendly features of the XC6200 were kept. Specifically, the connection mechanism does not support arbitrary bitstreams, making these FPGAs susceptible to damage by internal short circuits.

Recent work on evolvable circuits in commercial FPGAs has focused on the Virtex and Virtex-II architectures from Xilinx [66] and will extend its focus to Virtex-4 in the near future. Two main approaches have been used for evolving Virtex circuits: using virtual reconfigurable circuits [67] and partially reconfiguring the FPGA.

Virtual reconfiguration
Two solutions were used to replace the obsolete XC6200 family: implementing an ASIC evolvable circuit (only achievable by some privileged groups, summarized in Section 33.5.3) and building a reconfigurable circuit on top of another reconfigurable circuit (i.e., a virtual reconfigurable device [53]). The concept of a virtual reconfigurable circuit is depicted in Figure 33.9, where a reconfigurable neuron cell constitutes the device’s basic logic cell. In the beginning, the most intuitive method was to reconstruct the XC6200 architecture. At the University of York, a virtual XC6200 CLB was implemented in Virtex FPGAs [68, 69]. Slorach and Sharman [54] also used virtual XC6200 cells in the Xilinx XC4010 and Altera EPF6010A, evolving configuration bitstreams that configured not the FPGA itself but the virtual XC6200 CLBs. Afterward, other research groups developed different reconfigurable architectures with enhanced features, several of which had the goals of flexibility and easy reconfiguration [54–59, 70–72].
For example, Sekanina and Drabek [70] developed a virtual reconfigurable cell called a functional block (FB) and used an array of FBs for image compression. Durbeck and Macias [71] implemented an 8 × 8 cell matrix using a Xilinx Spartan-2 FPGA. With this approach came the possibility of designing any desired reconfigurable fabric. In most cases the architecture consists of a fine-grained cellular array in which a general-purpose evolvable architecture is proposed. However, problem-oriented reconfigurable fabrics can use coarser-grained architectures, where a reduced set of features is evolved.

FIGURE 33.9 A virtual reconfigurable circuit with a reconfigurable neuron. (Diagram labels: FPGA logic cells, each with a LUT and a D flip-flop with set, clear, and clock inputs; virtual reconfigurable cell with weights W1 to W4.)

Dynamic partial reconfiguration
In addition to the Xilinx XC6200, other commercial platforms have been partially reconfigured for evolving circuits, with the main focus on the Xilinx Virtex families. However, there are two main issues in evolving circuits by partially reconfiguring Virtex architectures. The first is the size of their configuration bitstreams, which implies a huge search space for the EA. The second is the generation of invalid bitstreams (that is, bitstreams that cause internal contentions). Different solutions to these problems have been suggested. Haddow and Tufte proposed a two-dimensional array of Sblocks [72], each containing a flip-flop, a 5-input LUT, and some routing resources. Sblocks provide a reduced configurability compared to Virtex cells in order to reduce the search space size and to guarantee contention-free configurations. Even though the Sblock array is virtually reconfigurable, the functionality is reconfigured by partially reconfiguring a Virtex FPGA. Haddow and Tufte used a partial bitstream for reconfiguring only the LUT contents.
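Restricting evolution to LUT contents, as in the Sblock work just described, is attractive because a LUT is only a truth table: any bit pattern is a valid function and cannot cause contention. The sketch below models a 5-input LUT as a 32-bit integer. The function name and the bit ordering (the first input is the least significant address bit) are assumptions for illustration and do not reflect the Virtex bitstream layout.

```python
def lut5(contents, inputs):
    """Evaluate a 5-input LUT.

    contents: 32-bit integer truth table (bit k is the output for address k).
    inputs:   five bits; inputs[0] is taken as the least significant address bit.
    """
    if len(inputs) != 5:
        raise ValueError("a 5-input LUT needs exactly five input bits")
    address = sum(bit << i for i, bit in enumerate(inputs))
    return (contents >> address) & 1


# Truth table of 5-input odd parity, built entry by entry.
parity_contents = sum((bin(address).count("1") & 1) << address for address in range(32))

print(hex(parity_contents))                    # 0x96696996
print(lut5(parity_contents, [1, 0, 1, 1, 0]))  # three ones -> 1
```

With this encoding a genome of 32 bits per LUT fully determines the logic function, so mutation and crossover can operate on the LUT contents directly without any legality check.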
At the University of York, JBits [73] has been used for evolving circuits. JBits is a Java API for describing circuits and manipulating configuration bitstreams. It allows safe generation of partial bitstreams, permitting the modification of internal modules in the FPGA design. At York, LUT contents have been mapped from a genome for evolving simple combinatorial functions [74], fault tolerance circuits [69], and robot controllers for obstacle avoidance [75]. Also using JBits, Levi and Guccione from Xilinx developed a tool called GeneticFPGA [76], which translates a chromosome into a configuration bitstream, making it easy to generate legal bitstreams.

Even though JBits provides interesting features for EHW, it has several limitations, such as the impossibility of running on an embedded platform (for on-chip evolution), dependence on supported FPGA families and supported boards, incompatibility with other hardware description languages (HDLs), and limited support from Xilinx, mainly reflected in insufficient documentation.

Several ways to overcome these limitations have been proposed at the EPFL. Upegui and Sanchez [52] summarize three techniques for EHW by partially reconfiguring Virtex and Virtex-II families dynamically, without using JBits. The first is a coarse-grained high-level solution based on the modular partial reconfiguration flow proposed by Xilinx [77]. It defines large evolvable functions, implemented as modules, that are well suited for architecture exploration [27]. The second and third techniques are fine-grained low-level solutions. In both cases, hard macros are used to define an evolvable component. Then, with the hard macros placed at known locations, the bitstream is modified to partially reconfigure components of the hard macros. The second technique uses the difference-based partial reconfiguration flow proposed by Xilinx [77]. The third technique directly manipulates the bitstream in a manner similar to the XC6200, by adding some constraints (only LUT and multiplexer configuration modifications are allowed). These techniques are well suited for fine-tuning. With the difference-based approach, Mermoud et al. [25] report the intrinsic evolution of a fuzzy classifier; with bitstream manipulation, Upegui and Sanchez report a complete evolution of cellular automata [42] and Boolean networks [43].
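The constraint used by the third technique (only certain configuration bits may change) can be expressed as a mask over the bitstream. The sketch below is a generic illustration of that idea and is not the tool flow of [52]: the mask values are invented, and in a real design they would have to be derived from the known locations of the LUT and multiplexer bits of the placed hard macros.

```python
import random


def constrained_mutation(bitstream, evolvable_mask, rate, rng):
    """Flip bits of the bitstream with probability `rate`, but only where the
    mask has a 1; bits outside the mask are never modified."""
    if len(bitstream) != len(evolvable_mask):
        raise ValueError("bitstream and mask must have the same length")
    mutated = bytearray(bitstream)
    for index, mask_byte in enumerate(evolvable_mask):
        for bit in range(8):
            if (mask_byte >> bit) & 1 and rng.random() < rate:
                mutated[index] ^= 1 << bit
    return bytes(mutated)


# Hypothetical 3-byte fragment: byte 0 is protected, the low nibble of byte 1
# and the high nibble of byte 2 are declared evolvable.
fragment = bytes([0xFF, 0x00, 0xA5])
mask = bytes([0x00, 0x0F, 0xF0])

always = constrained_mutation(fragment, mask, rate=1.0, rng=random.Random(1))
never = constrained_mutation(fragment, mask, rate=0.0, rng=random.Random(1))
print(always.hex(), never.hex())  # ff0f55 ff00a5
```

Keeping the protected bits outside the reach of the genetic operators is what prevents an evolved individual from altering routing that could create contention.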
33.5.3 Custom Evolvable FPGAs
One of the more recent evolvable chips is the POEtic tissue [78, 79], a computational substrate optimized for the implementation of digital systems inspired by the POE model presented in the introduction to this chapter. The POEtic tissue is a self-contained, flexible physical substrate designed (1) to interact with the environment through spatially distributed sensors and actuators; (2) to develop and adapt its functionality through a process of evolution, growth, and learning to a dynamic and partially unpredictable environment; and (3) to self-repair parts damaged by aging or environmental factors in order to remain viable and retain the same functionality.
The POEtic tissue is composed of a two-dimensional array of POEtic cells, each designed as a 3-layer structure following the three axes of bio-inspiration (Figure 33.10):

• The phylogenetic layer acts on a cell’s genetic material. It can be used to find and select the genes of the cells for the genotype layer, which is conceptually the simplest of the three tissue layers as it is mainly a memory containing the genetic information of the organism.
• Ontogeny concerns the development of the individual and thus the mapping or configuration layer of the cell, which implements cellular differentiation and growth. In addition, it has an impact on the system as a whole for self-repair. The configuration layer selects which gene will be expressed depending on a user-defined differentiation algorithm.
• The epigenetic axis modifies the behavior of the organism during its operation and is therefore best applied to the phenotype, which is probably the most application-dependent layer. If the final application is a neural network, the phenotype layer will consist of an artificial neuron.
A key aspect of the applicability of the POEtic tissue, in addition to its architecture, is its reconfigurability. A molecule can be partially reconfigured by an on-chip microprocessor or by neighbor molecules. For EHW, this feature is very important in terms of execution time. Because only two clock cycles are needed for a write, and three words of 32 bits define a complete molecule, the configuration of the entire array (or a part of it) is very fast. Compared with commercial FPGAs such as the Virtex-II, in which at least a full configuration frame must be sent each time, reconfiguration of the POEtic tissue takes place in parallel, allowing a huge speedup.

FIGURE 33.10 The organizational layers of the POEtic cell. (Diagram labels: phylogenesis, genotype layer; ontogenesis, mapping layer; epigenesis, phenotype layer; execution unit; communication unit.)

A distinctive feature of the POEtic tissue is its two-dimensional array of routing units that implement a dynamic routing algorithm [80]. It is used for intercellular communication, allowing the tissue to dynamically create paths between cells. The dynamic routing can be performed by a distributed algorithm [80] or by the on-chip processor.

Another very important circuit is the evolvable LSI chip developed by Higuchi’s group [81]. It includes a GA unit and has the ability to process two chromosomes in parallel. Higuchi’s group is famous for the large number of applications implemented in their chips [82, 83]. They have implemented an adaptive prosthetic hand controller [84, 85] that can adapt to the user’s electromyographic signals in less than 10 minutes with a much more compact circuit than required with a neural network (before that, the user had to adapt to the hand instead of the hand to the user, requiring more than a month of training). They have also evolved data compressors for electrophotographic printing [86, 87], often attaining compression ratios twice those obtained with international standard compression algorithms such as Lempel-Ziv, JBIG, and JBIG2. It must be noted that Higuchi’s applications often end up as part of a commercial product. Other interesting applications implemented by the same group include robot navigation controllers [88] and low-power integrated circuits [89].

This chapter focused primarily on evolution for digital devices; however, several platforms have been proposed for analog and mixed-signal circuit evolution. At the Jet Propulsion Laboratory of the California Institute of Technology, a field-programmable transistor array (FPTA) [90] has been developed that is the basis of the Standalone Board-level Evolvable System (SABLES) [91]. Layzell [92] proposed the evolvable motherboard: a diagonal matrix of analog switches connected to up to six plug-in daughter boards, which contain the desired basic elements for evolution.
33.6 CONCLUSIONS AND FUTURE DIRECTIONS
EHW has been shown to be effective at finding solutions [82, 83] for real-world applications. Additionally, some solutions have proven to perform better than their engineered counterparts [83, 89, 93]. On the other hand, EHW generally performs poorly as a system-level solution: Microprocessor architectures, for example, are not among the results of evolution. As a matter of fact, evolution works better when the target is a complex cellular architecture: cellular automata, neural networks, or gate arrays.
If we look at the EHW work carried out so far, we find many common characteristics spanning most current systems that often differ from biological evolution (this difference is not necessarily disparaging):

• Evolution pursues a predefined goal: The design of an electronic circuit is subject to precise specifications. On finding the desired circuit, the evolutionary process terminates.
• The population has no material existence. At best, in what has been called intrinsic and complete evolution, there is one circuit available onto which individuals from the population are loaded one at a time to evaluate their fitness.
• The absence of a real population in which individuals coexist simultaneously entails notable difficulties in the realization of interactions between “organisms.” This usually results in a completely independent fitness calculation, contrary to nature, which exhibits a coevolutionary scenario.
• The different phases of evolution are carried out sequentially, controlled by a central unit.
These limitations suggest that the simple application of EAs to hardware design is not enough and that future research in EHW must not be limited to exploration of architectures and substrates; there is also much to do at the algorithmic level. Human-made adaptable systems are still far from exhibiting an adaptation comparable to that of living beings, and even though we have yet to attain circuits of equivalent complexity, limitations are not just a matter of magnitude. Only by modeling together the three axes of life (phylogeny, ontogeny, and epigenesis) will we be able to build systems featuring nature-like adaptation.

Future trends in nanotechnology are also guiding us toward “Avogadro computers,” that is, massively parallel devices with 10²³ transistors. What to do with such a huge number of transistors, and how to use, interconnect, and program them, goes beyond present engineering knowledge; however, EHW architectures and algorithms arise as a promising solution for dealing with the design complexity of these machines.

In this chapter we focused on evolving silicon circuits, which constitute the main developments achieved by the EHW community. However, other types of substrates have been evolved that extend the domain and represent new directions for evolvable hardware. For example, NASA researchers have been working on evolving antennas for space missions [94, 95]. Miller and Downing are currently working on evolving liquid crystals (LC) [96]: by applying electric fields mapped from a genome, they modify the LC molecular alignment to implement a desired function. Molecular circuit design is another promising evolvable substrate. Masiero et al. [97] report the use of a GA for tuning component parameters in a molecular circuit. Quantum circuit synthesis, too, is a potential field for EHW [98], given that designing circuits in such a substrate will require new design paradigms.
References
[1] T. Higuchi, T. Niwa, T. Tanaka, H. Iba, H. de Garis, T. Furuya. Evolving hardware with genetic learning: A first step towards building a Darwin Machine. From Animals to Animats 2. Proceedings of the International Conference on Simulation of Adaptive Behavior, 1993.
[2] H. de Garis. Evolvable hardware: Genetic programming of a Darwin Machine. Proceedings of the International Conference on Artificial Neural Nets and Genetic Algorithms, 1993.
[3] E. Sanchez, D. Mange, M. Sipper, M. Tomassini, A. Perez-Uribe, A. Stauffer. Phylogeny, ontogeny, and epigenesis: Three sources of biological inspiration for softening hardware. Evolvable Systems: From Biology to Hardware, LNCS 1259, 1997.
[4] M. Sipper, E. Sanchez, D. Mange, M. Tomassini, A. Perez-Uribe, A. Stauffer. A phylogenetic, ontogenetic, and epigenetic view of bio-inspired hardware systems. IEEE Transactions on Evolutionary Computation 1(1), 1997.
[5] S. Mitra, Y. Hayashi. Neuro-fuzzy rule generation: Survey in soft computing framework. IEEE Transactions on Neural Networks 11(3), 2000.
[6] T. Bäck. Evolutionary Algorithms in Theory and Practice: Evolution Strategies, Evolutionary Programming, Genetic Algorithms, Oxford University Press, 1996.
[7] D. B. Fogel. Evolutionary Computation: Toward a New Philosophy of Machine Intelligence, 2nd ed., IEEE Press, 2000.
[8] J. R. Koza. Genetic Programming: On the Programming of Computers by Means of Natural Selection, MIT Press, 1992.
[9] M. Mitchell. An Introduction to Genetic Algorithms, MIT Press, 1996.
[10] M. D. Vose. The Simple Genetic Algorithm: Foundations and Theory, MIT Press, 1999.
[11] J. Pinter. Global Optimization in Action (Continuous and Lipschitz Optimization: Algorithms, Implementations and Applications), Kluwer Academic Press, 1996.
[12] E. Sanchez, M. Tomassini. Towards evolvable hardware. LNCS 1062. Springer-Verlag, 1996.
[13] Y. Liu. Evolvable systems: From biology to hardware. Proceedings of the Fourth International Conference, ICES, October 2001.
[14] A. M. Tyrrell, P. C. Haddow, J. Torresen. Evolvable systems: From biology to hardware. Proceedings of the 5th International Conference, LNCS, March 2003.
[15] J. M. Moreno, J. Madrenas, J. Cosp. Evolvable systems: From biology to hardware. Proceedings of the Sixth International Conference, ICES 2005, September 2005.
[16] T. Higuchi, M. Iwata, W. Liu. Evolvable systems: From biology to hardware. Proceedings of the First International Conference, October 7–8, 1996. LNCS 1259, Heidelberg: Springer-Verlag, 1997.
[17] M. Sipper, D. Mange, A. Pérez-Uribe. Evolvable systems: From biology to hardware. Proceedings of the Second International Conference, September, LNCS 1478, Heidelberg: Springer, 1998.
[18] J. Miller. Evolvable systems: From biology to hardware. Proceedings of the Third International Conference, ICES 2000, April 17–19, 2000. LNCS 1801, Heidelberg: Springer, 2000.
[19] A. Stoica, D. Keymeulen, J. D. Lohn. Proceedings of the First NASA/DOD Workshop on Evolvable Hardware, July. IEEE Computer Society, 1999.
[20] A. Stoica, J. D. Lohn, R. Katz, D. Keymeulen, R. Zebulum. Proceedings of the 2002 NASA/DOD Conference on Evolvable Hardware, July. IEEE Computer Society, 2002.
[21] J. D. Lohn, R. Zebulum, J. Steincamp, D. Keymeulen, A. Stoica, M. Ferguson. Proceedings of the 2003 NASA/DOD Conference on Evolvable Hardware, July. IEEE Computer Society, 2003.
[22] R. Zebulum, D. Gwaltney, G. Hornby, D. Keymeulen, J. D. Lohn, A. Stoica. Proceedings of the 2004 NASA/DOD Conference on Evolvable Hardware, July 2004. IEEE Computer Society.
[23] J. D. Lohn, D. Gwaltney, G. Hornby, R. Zebulum, D. Keymeulen, A. Stoica. Proceedings of the 2005 NASA/DOD Conference on Evolvable Hardware, June 2005. IEEE Computer Society.
[24] X. Yao, T. Higuchi. Promises and challenges of evolvable hardware. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews 29(1), 1999.
[25] G. Mermoud, A. Upegui, C. A. Peña, E. Sanchez. A dynamically-reconfigurable FPGA platform for evolving fuzzy systems. Computational Intelligence and Bioinspired Systems, LNCS 3512, 2005.
[26] M. Murakawa, S. Yoshizawa, I. Kajitani, X. Yao, N. Kajihara, M. Iwata, T. Higuchi. The GRD chip: Genetic reconfiguration of DSPs for neural network processing. IEEE Transactions on Computers 48(6), 1999.
[27] A. Upegui, C. A. Peña-Reyes, E. Sanchez. An FPGA platform for on-line topology exploration of spiking neural networks. Microprocessors and Microsystems 29(5), 2005.
[28] H. Hemmi, J. Mizoguchi, K. Shimohara. Development and evolution of hardware behaviors. Towards Evolvable Hardware, LNCS 1062, 1996.
[29] J. R. Koza, F. H. Bennett, D. Andre, M. A. Keane. Synthesis of topology and sizing of analog electrical circuits by means of genetic programming. Computer Methods in Applied Mechanics and Engineering 186(2), 2000.
[30] J. W. Atmar. Speculation on the Evolution of Intelligence and Its Possible Realization in Machine Form, Ph.D. dissertation, New Mexico State University, Las Cruces, 1976.
[31] T. Higuchi, M. Iwata, I. Kajitani, H. Iba, Y. Hirao, F. T. Furuya, B. Manderick. Evolvable hardware and its application to pattern recognition and fault-tolerant systems. Towards Evolvable Hardware, LNCS 1062, 1996.
[32] A. Thompson. Silicon evolution. Proceedings of Genetic Programming, J. R. Koza et al. (eds.), MIT Press, 1996.
[33] A. Thompson. An evolved circuit, intrinsic in silicon, entwined with physics. Evolvable Systems: From Biology to Hardware, LNCS 1259, 1997.
[34] Xilinx, Inc. The Programmable Logic Data Book, 1996.
[35] G. K. Venayagamoorthy, V. G. Gudise. Swarm intelligence for digital circuits implementation on field-programmable gate array platforms. Proceedings of the 2004 NASA/DOD Conference on Evolvable Hardware, July 2004.
[36] B. C. Kahne. A Genetic Algorithm-Based Place-and-Route Compiler for a Run-time Reconfigurable Computing System, Master’s thesis, Virginia Polytechnic Institute and State University, Blacksburg, VA, 1997.
[37] T. A. Ly, J. T. Mowchenko. Applying simulated evolution to high-level synthesis. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 12(3), 1993.
[38] A. Thompson, I. Harvey, P. Husbands. Unconstrained evolution and hard consequences. Towards Evolvable Hardware, LNCS, 1996.
[39] M. Murakawa, S. Yoshizawa, I. Kajitani, T. Furuya, M. Iwata, T. Higuchi. Hardware evolution at function level. Parallel Problem Solving from Nature (PPSN IV), LNCS 1141, 1996.
[40] M. Iwata, I. Kajitani, H. Yamada, H. Iba, T. Higuchi. A pattern recognition system using evolvable hardware. Parallel Problem Solving from Nature (PPSN IV), LNCS 1141, 1996.
[41] P. Haddow, G. Tufte. Evolving a robot controller in hardware. Proceedings of the Norwegian Computer Science Conference, 1999.
[42] A. Upegui, E. Sanchez. On-chip and on-line self-reconfigurable adaptable platform: The non-uniform cellular automata case. Proceedings of the 20th IEEE International Parallel and Distributed Processing Symposium, 2006.
[43] A. Upegui, E. Sanchez. Evolving hardware with self-reconfigurable connectivity in Xilinx FPGAs. Proceedings of the First NASA/ESA Conference on Adaptive Hardware and Systems, 2006.
[44] K. Glette, J. Torresen. A flexible on-chip evolution system implemented on a Xilinx Virtex-II Pro device. Evolvable Systems: From Biology to Hardware, LNCS 3637, 2005.
[45] M. Goeke, M. Sipper, D. Mange, A. Stauffer, E. Sanchez, M. Tomassini. Online autonomous evolware. Evolvable Systems: From Biology to Hardware, LNCS 1259, 1997.
[46] T. Toffoli, N. Margolus. Cellular Automata Machines: A New Environment for Modeling. MIT Press Series in Scientific Computation, 1987.
[47] M. Sipper. Evolution of Parallel Cellular Machines: The Cellular Programming Approach, Springer, 1997.
[48] R. A. Brooks. New approaches to robotics. Science 253, 1991.
[49] Y. U. Cao, A. S. Fukunaga, A. B. Kahng. Cooperative mobile robotics: Antecedents and directions. Autonomous Robots 4(1), 1997.
[50] R. Moeckel, C. Jaquier, K. Drapel, E. Dittrich, A. Upegui, A. Ijspeert. YaMoR and Bluemove: An autonomous modular robot with Bluetooth interface for exploring adaptive locomotion. Proceedings of the 8th International Conference on Climbing and Walking Robots (CLAWAR), 2005.
[51] T. S. Ray. An approach to the synthesis of life. Artificial Life II, SFI Studies in the Sciences of Complexity 10, 1992.
[52] A. Upegui, E. Sanchez. Evolving hardware by dynamically reconfiguring Xilinx FPGAs. Evolvable Systems: From Biology to Hardware, LNCS 3637, 2005.
[53] L. Sekanina. Evolvable Components: From Theory to Hardware Implementations, Springer, 2004.
[54] C. Slorach, K. Sharman. The design and implementation of custom architectures for evolvable hardware using off-the-shelf programmable devices. Evolvable Systems: From Biology to Hardware, LNCS, 2000.
[55] Y. Zhang, S. Smith, A. Tyrrell. Digital circuit design using intrinsic evolvable hardware. Proceedings of the 2004 NASA/DOD Conference on Evolvable Hardware, July 2004.
[56] L. Sekanina, S. Friedl. On routine implementation of virtual evolvable devices using COMBO6. Proceedings of the 2004 NASA/DOD Conference on Evolvable Hardware, July 2004.
[57] K. Vinger, J. Torresen. Implementing evolution of FIR-filters efficiently in an FPGA. Proceedings of the 2003 NASA/DOD Conference on Evolvable Hardware, July 2003.
[58] L. Sekanina. Towards evolvable IP cores for FPGAs. Proceedings of the 2003 NASA/DOD Conference on Evolvable Hardware, July 2003.
[59] P. C. Haddow, G. Tufte. An evolvable hardware FPGA for adaptive hardware. Proceedings of the 2000 Congress on Evolutionary Computation, 2000.
[60] M. Sipper, M. Goeke, D. Mange, A. Stauffer, E. Sanchez, M. Tomassini. The firefly machine: Online evolware. Proceedings of the IEEE International Conference on Evolutionary Computation, 1997.
[61] Xilinx, Inc. The XC6200 Data Sheet v.1.7, 1996.
[62] A. Thompson, P. Layzell. Evolution of robustness in an electronics design. Evolvable Systems: From Biology to Hardware, LNCS 1801, 2000.
[63] D.-W. Lee, C.-B. Ban, K.-B. Sim, H.-S. Seok, L. Kwang-Ju, B.-T. Zhang. Behavior evolution of autonomous mobile robot using genetic programming based on evolvable hardware. Proceedings of the 2000 IEEE International Conference on Systems, Man, Cybernetics, 2000.
[64] J. R. Koza, F. H. Bennett, J. Hutchings, S. L. Bade, M. A. Keane, D. Andre. Evolving sorting networks using genetic programming and rapidly reconfigurable field-programmable gate arrays. Workshop on Evolvable Systems. International Joint Conference on Artificial Intelligence, 1997.
[65] J. Dumoulin, J. A. Foster, J. F. Frenzel, S. McGrew. Special purpose image convolution with evolvable hardware. Real-World Applications of Evolutionary Computing, EvoWorkshops 2000, LNCS, 2000.
[66] Xilinx, Inc. Virtex-II Platform FPGA User Guide (www.xilinx.com), March 2005.
[67] L. Sekanina. Virtual reconfigurable circuits for real-world applications of evolvable hardware. Evolvable Systems: From Biology to Hardware, LNCS 2606, 2003.
[68] G. Hollingworth, S. Smith, A. Tyrrell. Safe intrinsic evolution of Virtex devices. Proceedings of the Second NASA/DoD Workshop on Evolvable Hardware, 2000.
[69] R. O. Canham, A. Tyrrell. Evolved fault tolerance in evolvable hardware. Proceedings of the Congress on Evolutionary Computation, 2002.
[70] L. Sekanina, V. Drabek. The concept of pseudo evolvable hardware. Proceedings of the IFAC Workshop on Programmable Devices and Systems, 2000.
[71] L. Durbeck, N. J. Macias. Defect-tolerant, fine-grained parallel testing of a cell matrix. Proceedings of SPIE ITCom 4867, 2002.
[72] P. Haddow, G. Tufte. Bridging the genotype-phenotype mapping for digital FPGAs. Proceedings of the Third NASA/DoD Workshop on Evolvable Hardware, 2001.
[73] S. A. Guccione, D. Levi, P. Sundararajan. JBits: A Java-based interface for reconfigurable computing. Proceedings of the Second Annual Military and Aerospace Applications of Programmable Devices and Technologies Conference, 1999.
[74] G. Hollingworth, S. Smith, A. Tyrrell. The intrinsic evolution of Virtex devices through Internet reconfigurable logic. Evolvable Systems: From Biology to Hardware, LNCS 1801, 2000.
[75] A. M. Tyrrell, R. A. Krohling, Y. Zhou. Evolutionary algorithm for the promotion of evolvable hardware. IEE Proceedings - Computers and Digital Techniques 151(4), 2004.
[76] D. Levi, S. A. Guccione. Genetic FPGA: Evolving stable circuits on mainstream FPGA devices. Proceedings of the First NASA/DOD Workshop on Evolvable Hardware, 1999.
[77] Xilinx, Inc. XAPP 290: Two Flows for Partial Reconfiguration: Module Based or Difference Based (www.xilinx.com), September 2004.
[78] Y. Thoma, E. Sanchez. A reconfigurable chip for evolvable hardware. Proceedings of the Genetic and Evolutionary Computation Conference, 2004.
[79] Y. Thoma, G. Tempesti, E. Sanchez, J.M.M. Arostegui. POEtic: An electronic tissue for bio-inspired cellular applications. Biosystems 76(1–3), 2004.
[80] Y. Thoma, E. Sanchez, J.M.M. Arostegui, G. Tempesti. A dynamic routing algorithm for a bio-inspired reconfigurable circuit. Proceedings of the International Conference on Field-Programmable Logic and Applications 2778, 2003.
[81] M. Iwata, I. Kajitani, Y. Liu, N. Kajihara, T. Higuchi. Implementation of a gate-level evolvable hardware chip. Evolvable Systems: From Biology to Hardware, LNCS 2210, 2001.
[82] T. Higuchi, M. Iwata, H. Sakanashi, E. Takahashi, M. Murakawa, I. Kajitani. Dynamic adaptive devices and their applications. Bulletin of the Electrotechnical Laboratory, Special Issue: RWC Research Toward Realization of Real World Intelligence 64(4/5), 2000.
[83] T. Higuchi, M. Iwata, D. Keymeulen, H. Sakanashi, M. Murakawa, I. Kajitani, E. Takahashi, K. Toda, M. Salami, N. Kajihara, N. Otsu. Real-world applications of analog and digital evolvable hardware. IEEE Transactions on Evolutionary Computation 3(3), 1999.
[84] I. Kajitani, M. Iwata, M. Harada, T. Higuchi. A myoelectric controlled prosthetic hand with an evolvable hardware LSI chip. Technology and Disability, Special Issue: Advances in the Control of Prosthetic Arms 15(2), 2003.
[85] I. Kajitani, T. Hoshino, N. Kajihara, M. Iwata, T. Higuchi. An evolvable hardware chip and its application as a multi-function prosthetic hand controller. Proceedings of the 16th National Conference on Artificial Intelligence, 1999.
[86] H. Sakanashi, M. Iwata, T. Higuchi. Evolvable hardware for lossless compression of very high resolution bi-level images. IEE Proceedings - Computers and Digital Techniques 151(4), 2004.
[87] H. Sakanashi, M. Iwata, D. Keymeulen, M. Murakawa, I. Kajitani, M. Tanaka, T. Higuchi. Evolvable hardware chips and their applications. Proceedings of the International Conference on Systems, Man, and Cybernetics, 1999.
[88] D. Keymeulen, M. Iwata, Y. Kuniyoshi, T. Higuchi. Online evolution for a self-adapting robotic navigation system using evolvable hardware. Artificial Life 4, 1998.
[89] E. Takahashi, M. Murakawa, Y. Kasai, T. Higuchi. Power dissipation reductions with genetic algorithms. Proceedings of the 2003 NASA/DoD Conference on Evolvable Hardware, 2003.
[90] A. Stoica, R. Zebulum, D. Keymeulen, R. Tawel, T. Daud, A. Thakoor. Reconfigurable VLSI architectures for evolvable hardware: From experimental field-programmable transistor arrays to evolution-oriented chips. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 9(1), 2001.
[91] A. Stoica, R. Zebulum, M. Ferguson, D. Keymeulen, V. Duong. Evolving circuits in seconds: Experiments with a stand-alone board-level evolvable system. Proceedings of the 2002 NASA/DOD Conference on Evolvable Hardware, July 2002.
[92] P. Layzell. A new research tool for intrinsic hardware evolution. Evolvable Systems: From Biology to Hardware, LNCS, 1998.
[93] L. Sekanina, R. Ruzicka. Easily testable image operators: The class of circuits where evolution beats engineers. Proceedings of the 2003 NASA/DOD Conference on Evolvable Hardware, July 2003.
[94] J. Lohn, J. Crawford, A. Globus, G. Hornby, W. Kraus, G. Larchev, A. Pryor, D. Srivastava. Evolvable systems for space applications. Proceedings of the International Conference on Space Mission Challenges for Information Technology, 2003.
[95] J. Lohn, D. Linden, G. Hornby, W. Kraus, A. Rodriguez-Arroyo. Evolutionary design of an X-band antenna for NASA’s Space Technology 5 mission. Proceedings of the 2003 NASA/DoD Conference on Evolvable Hardware, 2003.
[96] J. F. Miller, K. Downing. Evolution in materio: Looking beyond the silicon box. Proceedings of the 2002 NASA/DoD Conference on Evolvable Hardware, 2002.
[97] L. P. Masiero, M. Pacheco, C. R. Hall, C. Santini. Molecular circuit design. Proceedings of the 2005 NASA/DOD Conference on Evolvable Hardware, June–July 2005.
[98] L. Spector, H. Barnum, H. J. Bernstein, N. Swamy. Quantum computing applications of genetic programming. Advances in Genetic Programming, MIT Press, 1999.
CHAPTER 34
NETWORK PACKET PROCESSING IN RECONFIGURABLE HARDWARE
John W. Lockwood
Washington University in St. Louis and Stanford University
This chapter will show, through an example, how networking systems have been built with reconfigurable hardware. It will describe how data can be switched, routed, buffered, processed, scanned, and filtered over networks using field-programmable gate arrays (FPGAs). The chapter begins by describing the mechanisms by which Internet packets are segmented into frames and cells for transmission across a network. Internet Protocol (IP) wrappers are introduced, and it is shown how they simplify the implementation of large packet-processing systems. Next, a framework for building modular systems that implement Internet firewalls and intrusion prevention systems is presented. The chapter continues with a detailed explanation of how Bloom filters can scan streams of data for fixed strings and how finite automata can be used to scan for regular expressions. Case studies are provided that show how deep packet inspection systems are implemented in reconfigurable hardware. One circuit detects the spread of worms and viruses across an Internet link. Another circuit analyzes the semantics of the text in traffic flows to determine which language is used within attached documents. A hardware-accelerated version of the popular SNORT intrusion detection system is illustrated, and it is shown how the FPGA hardware works with the software on a host to analyze packets.
34.1 NETWORKING WITH RECONFIGURABLE HARDWARE
34.1.1 The Motivation for Building Networks with Reconfigurable Hardware
Although modern microprocessors continue to improve their performance, they are not improving as fast as the rate at which data flows over Internet connections. As the limits of Moore’s Law are reached, alternative computational methods are needed to route, process, filter, and transform Internet datastreams. Networking systems created with reconfigurable hardware are flexible and easily modified to provide new functionality. Reconfigurable hardware enables features on networking platforms to be implemented in ways that are quite different from current platform implementations. It allows new modular components to be created and then dynamically installed in remote network systems.
By processing network packets in hardware rather than in software, networking applications do not suffer the performance penalty caused by sequential data processing. The Internet evolves as new protocols, features, and capabilities are added to the routers that implement the underlying network. Protocols, such as IP version 6 (IPv6), allow more devices to be individually addressed. Added features, such as per-flow queuing, allow voice and video to be reliably delivered in real time. Firewalls and intrusion prevention systems (IPSs) enhance Internet security. Network platforms have been built to route network traffic, filter packets, and queue data in reprogrammable hardware. With reconfigurable hardware, networking platform operation can change over time as packet-processing algorithms and protocols evolve. With FPGAs, all features of the packet-processing system are configurable down to the logic gates. These systems enable new services to deploy and operate at the rate of the highest-speed backbone links.
34.1.2 Hardware and Software for Packet Processing
For their packet-processing operations, today’s fastest routers use network processing elements implemented in custom silicon or in application-specific integrated circuits (ASICs). As shown in Figure 34.1, network processing elements reside between the line card, where packets are transmitted and received, and the Gigabit/second-rate switch fabric that interconnects ports. They contain hundreds to thousands of parallel logic circuits and finite-state machines that are optimized to route, filter, queue, and/or process Internet datagrams in hardware. Several platform types have been developed, many of which use standard microprocessors such as the Intel Pentium, AMD Athlon, or Motorola/IBM PowerPC. Others use ASICs from vendors such as Agere, Intel, Motorola, Cavium, Broadcom, and Vitesse. Although software-based systems have outstanding flexibility, their packet-processing performance is limited by the sequential nature of their instruction execution. ASICs and custom silicon networking chips have high performance, but they offer little flexibility because they cannot be reprogrammed. Figure 34.2 illustrates the trade-offs between flexibility and performance.
FIGURE 34.1 A reconfigurable network processing element located between a line card and switch fabric. (Diagram labels: line cards carrying network packets, network processing elements, gigabit switch fabric.)
FIGURE 34.2 Flexibility and performance trade-offs for networking systems that use microprocessors, network processors, ASICs, and reprogrammable hardware. (Axes: flexibility versus performance. Plotted points: microprocessor, network processor, ASIC, and reprogrammable hardware.)
34.1.3 Network Data Processing with FPGAs
Reconfigurable hardware devices share the performance advantage of ASICs because they can implement parallel logic functions in hardware. However, they also share the flexibility of microprocessors and network processors because they can be dynamically reconfigured. Using FPGAs for high-performance asynchronous transfer mode (ATM) networking was explored during the development of the Illinois Pulsar-based Optical Interconnect (iPOINT) testbed. In this project, an ATM switch with FPGAs [2] was developed and an advanced queuing module was implemented that provided per-flow queuing functionality in FPGA hardware. The FPGAs were used to implement the datapath of the switch and to control the state machines that buffered the ATM cells as they arrived on each port of the switch. The lookup tables (LUTs) in the FPGA fabric were used to build the multiplexers that switched the data between the ports. Finally, combinational logic was used to implement the state machines that controlled how packets were written to and read from SRAM [3].
FPGAs have also proven effective for implementing bit-intensive networking functions, such as forward error correction (FEC), and for boosting the performance of networking protocols [4]. Bitwise processing functions map well onto the fine-grained logic of an FPGA. On-chip LUTs are used to encode data patterns as symbols with redundant bits of information. When the symbols are decoded, the redundant bits allow the receiver to reconstruct the data even with a few bits in error. Reconfigurable logic allows algorithms that use varying amounts and types of error correction to be programmed on-chip.
Through the development of the Field-Programmable Port Extender (FPX) platform [1], it was demonstrated that high-performance network packet-processing systems implemented with FPGAs are both useful and practical. The
FPX platform used two multi-Gigabit/second network interfaces, four banks of off-chip memory, and two FPGAs to implement over 30 networking applications. Applications developed for the FPX platform included modules that performed Internet Protocol (IP) address lookup for routing [7]; payload scanning for detection of fixed strings and regular expressions within the body of a packet; data queuing to provide quality of service (QoS); intrusion detection to determine when a network may be under attack; intrusion prevention to halt such attacks; and semantic processing of network data.
34.1.4 Network Processing System Modularity
Modularity is a key feature of networking systems. Network developers need standard interfaces to connect high-level network processing components to the underlying network infrastructure. In systems with reconfigurable hardware, modules can be implemented in regions of an FPGA and bound by a well-defined interface to the datapath and to external memory. Multiple modular data-processing components can be integrated to compose systems. Memory interfaces can connect logic to off-chip memory in order to buffer data and hold large lookup tables.
For the FPX platform modules, data was received and transmitted via a series of ATM cells carried over a 32-bit-wide Utopia interface. ATM cells contained 48 bytes of payload data and 4 bytes of a header that included a virtual path identifier (VPI) and a virtual circuit identifier (VCI). Each ATM cell also included an 8-bit checksum that covered the ATM cell header. Larger IP datagrams were sent between modules using layered protocol wrappers that segmented and reassembled multiple cells into ATM adaptation layer 5 (AAL5) frames. These frames contained data from a series of ATM cells and a 32-bit checksum at the end that covered all bytes of the payload. Segmentation and reassembly of cells into frames were performed to transfer packets over the network.
The FPX platform (Figure 34.3) stored and loaded data from two types of off-chip memory. Two interfaces supported transfer of 36-bit-wide data to and from off-chip SRAM. SDRAM interfaces provided 64-bit-wide interfaces to multiple banks of high-capacity, off-chip memory. In the implementation of the IP lookup module, the off-chip SRAM was used to store data structures for IP lookup, while the SDRAM was used to buffer packets.
The lower latency of SRAM access was important for the implementation of lookup functions where there was a data dependency for the result; the larger capacity of the SDRAM was beneficial for reducing the cost of storing bulk data, including buffering dataflows.
A switch was implemented using the reprogrammable application device (RAD) FPGA logic that allowed traffic to be routed to extensible modules. Layered protocol wrappers performed the segmentation and reassembly of AAL5 frames so that full packets could be processed by the FPGA hardware. To reprogram the RAD FPGA that contained the extensible modules, configuration and control logic was implemented on the network interface device (NID) FPGA.
The FPX platform was integrated into the Washington University Gigabit Switch (WUGS) to process packets as they passed into and out of the networking ports of a scalable network switch. The WUGS switching platform provided a backplane for transferring ATM cells between ports. By adding the FPX between the line cards and the switch fabric, the system was able to analyze, process, route, and filter IP packets as they flowed through the system. OC-3 to OC-48 line cards were used to directly send and receive ATM cells, while Gigabit Ethernet line cards were used to segment frames into multiple ATM cells and reassemble them. After data passed through the FPX, it was forwarded to the switch fabric, where cells were forwarded to other FPX modules in the chassis based on their VPI and VCI values.
FIGURE 34.3 (a) A block diagram and (b) a physical implementation of the FPX platform. (Block diagram labels: RAD FPGA with extensible modules, layered protocol wrappers, and SRAM and SDRAM memory; NID FPGA with switch, network interface, program cache, bitfile program logic, and PROM.)
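The segmentation and reassembly described in this section can be sketched in software. The following Python fragment is a minimal illustration, not the FPX wrapper logic: it assumes an 8-byte AAL5 trailer layout (two unused bytes, a 16-bit length, and the 32-bit checksum) and a most-significant-bit-first CRC-32 with polynomial 0x04C11DB7. The text states only that frames end with a 32-bit checksum and that each cell carries 48 bytes of payload, so the other parameters and all function names are assumptions.

```python
import struct

CELL_PAYLOAD = 48   # payload bytes per ATM cell (from the text)
TRAILER_LEN = 8     # assumed trailer: 2 unused bytes, 16-bit length, 32-bit CRC


def crc32_msb_first(data: bytes) -> int:
    """Bitwise CRC-32, MSB first, polynomial 0x04C11DB7 (assumed parameters)."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte << 24
        for _ in range(8):
            if crc & 0x80000000:
                crc = ((crc << 1) ^ 0x04C11DB7) & 0xFFFFFFFF
            else:
                crc = (crc << 1) & 0xFFFFFFFF
    return crc ^ 0xFFFFFFFF


def segment(packet: bytes) -> list:
    """Pad the packet, append the trailer, and cut the frame into 48-byte cell payloads."""
    pad = -(len(packet) + TRAILER_LEN) % CELL_PAYLOAD
    body = packet + bytes(pad) + struct.pack("!BBH", 0, 0, len(packet))
    frame = body + struct.pack("!I", crc32_msb_first(body))
    return [frame[i:i + CELL_PAYLOAD] for i in range(0, len(frame), CELL_PAYLOAD)]


def reassemble(cells: list) -> bytes:
    """Join cell payloads, verify the frame checksum, and strip padding and trailer."""
    frame = b"".join(cells)
    body = frame[:-4]
    (crc,) = struct.unpack("!I", frame[-4:])
    if crc32_msb_first(body) != crc:
        raise ValueError("frame checksum mismatch")
    (length,) = struct.unpack("!H", body[-2:])
    return frame[:length]
```

A 100-byte packet plus the 8-byte trailer is padded to 144 bytes and therefore occupies three cells; `reassemble(segment(p))` returns `p` for any packet shorter than 65,536 bytes. The per-cell header with VPI, VCI, and the 8-bit header checksum is left out of the sketch.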
34.2 NETWORK PROTOCOL PROCESSING
The Open Systems Interconnection (OSI) Reference Model defines how multiple layers can be used to transport data over a computer network. OSI divides the functions of a protocol into a series of layers, each of which has two properties: (1) It uses only the functions of the layer below, and (2) it exports functionality only to the layer above. A system that implements protocol behavior consisting of a series of these layers is known as a protocol stack. Protocol stacks can be implemented in hardware, in software, or in a mixture of the two (typically, only the lower layers are implemented in hardware; the higher layers, in software). This logical separation makes reasoning about the behavior of protocol stacks much easier and allows their design to be elaborate but highly reliable. Each layer performs services for the next-higher layer and makes requests of the next-lower layer [5].
For real systems that process Internet data, the OSI model is not directly implemented but instead serves as a reference for implementation of the real protocols. Layers are important for processing IP data, however, because they permit application-processing modules to abstract details of the lower-layer network protocols. At the lowest layer, networks modify raw cells of data that move between interfaces. At higher layers, the applications process variable-length frames or IP packets. To send and receive data at the user level, a network application may directly transmit or receive User Datagram Protocol (UDP) messages by instantiating all of the wrappers and passing data down through them [6] (see Figure 34.4).
FIGURE 34.4 Integration of a network application within one or more wrappers. (Diagram labels: net app, wrapper, low-level protocol wrapper.)
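The nesting of an application inside one or more wrappers can be sketched as a chain of objects, each of which strips its own header before handing the payload upward and restores it on the way back down. The Python fragment below is only an illustration of that layering idea: the class names, the fixed header lengths (20 bytes for an IPv4 header without options, 8 bytes for UDP), and the toy application are assumptions, and a real wrapper would also update length and checksum fields.

```python
class Wrapper:
    """One protocol layer: strips its header on ingress and restores it on egress."""

    def __init__(self, name, header_len, inner):
        self.name = name
        self.header_len = header_len   # assumed fixed header size in bytes
        self.inner = inner             # next wrapper up the stack, or the application

    def process(self, data: bytes) -> bytes:
        header, payload = data[:self.header_len], data[self.header_len:]
        result = self.inner.process(payload)   # the layer above sees only its payload
        return header + result                 # re-wrap the (possibly modified) payload


class UppercaseApp:
    """Hypothetical network application that rewrites the message body."""

    def process(self, data: bytes) -> bytes:
        return data.upper()


# The application sits inside a UDP wrapper, which sits inside an IP wrapper.
stack = Wrapper("ip", 20, Wrapper("udp", 8, UppercaseApp()))
```

Calling `stack.process(packet)` on a 28-byte header region followed by a message returns the same headers followed by the transformed message; the application never sees the lower-layer bytes.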
34.2.1 Internet Protocol Wrappers
Hundreds of millions of computers deployed throughout the world communicate over the Internet. Traffic from these machines is concentrated to flow over a smaller number of routers that forward traffic through the Internet core. Currently, Internet backbones operate over communication links ranging in speed from OC-3 (155 Mbps) to OC-768 (40 Gbps). Fast links that carry small packets can deliver millions of IP packets per second.
A library of layered protocol wrappers (see Figure 34.5) was developed to process Internet packets in reconfigurable hardware. Collectively, the wrappers simplified and streamlined the implementation of high-level networking functions by abstracting the operation of lower-level packet-processing functions. The library infrastructure was synthesized into FPGA logic and integrated into an FPX network platform. At the lowest levels, the library processes ATM cells. Complete frames of data are segmented and reassembled using ATM adaptation layer 5 (AAL5), over which IP messages are then transported.
When only a single message needs to be transmitted, UDP can send one packet over the Internet. UDP encapsulates a variable-length message into an IP packet and allows the system to specify source and destination port numbers that identify from which application on a machine the data was sent and to which application it should be delivered. UDP/IP also provides a checksum to ensure the integrity of the data. Using the FPX protocol-processing library, this checksum is automatically computed, using FPGA hardware, as the sum over the payload bytes of the message.
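For reference, the arithmetic behind the UDP checksum can be written in a few lines of software. As standardized for IPv4 (RFC 768 and RFC 1071), the checksum is a 16-bit ones'-complement sum that covers a pseudo-header and the UDP header in addition to the payload. The Python sketch below shows that calculation; the function names are illustrative, and how the FPX wrapper organizes the same sum in hardware is not described in the text.

```python
import struct


def internet_checksum(data: bytes) -> int:
    """16-bit ones'-complement of the ones'-complement sum of the data."""
    if len(data) % 2:
        data += b"\x00"                      # pad odd-length data with a zero byte
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:                       # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def udp_checksum(src_ip: bytes, dst_ip: bytes,
                 src_port: int, dst_port: int, payload: bytes) -> int:
    """UDP checksum over the IPv4 pseudo-header, the UDP header, and the payload."""
    length = 8 + len(payload)                # UDP header is 8 bytes
    pseudo = src_ip + dst_ip + struct.pack("!BBH", 0, 17, length)   # protocol 17 = UDP
    header = struct.pack("!HHHH", src_port, dst_port, length, 0)    # checksum field zeroed
    checksum = internet_checksum(pseudo + header + payload)
    return checksum or 0xFFFF                # zero is reserved for "no checksum"
```

A receiver can verify a datagram by running `internet_checksum` over the pseudo-header and the received UDP segment with the checksum field in place; an intact datagram yields zero. Because the sum is a simple accumulation over 16-bit words, it can be computed on the fly as the words of a packet stream through a datapath.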
34.2.2 TCP Wrappers
Over 85 percent of all traffic on the Internet today uses the Transmission Control Protocol (TCP). TCP is stream oriented and guarantees delivery of data with an ordered byte flow. Processing TCP dataflows in the middle of the network is extremely difficult because network packets can be dropped, duplicated, and reordered. Packet sequences observed within the interior of the network may be different from packets received and processed at the connection endpoints. The complexities associated with tracking the state of end systems and reconstructing byte sequences based on observed traffic are significant.
FIGURE 34.5 Implementation of layered protocol wrappers on the FPX platform. (Diagram labels: data input and data output; cell processor, frame processor, IP packet processor, TCP/UDP processor, application module; external memory interfaces.)
A TCP processing circuit was developed that handles the complexities associated with flow classification and TCP stream reassembly. It provided the FPGA logic with a view of network traffic flow data through a simple client interface. The TCP wrapper enabled other high-performance data-processing subsystems to operate on TCP network content without needing to implement their own state-tracking operations. The TCP module used a state store to track the status of each TCP/IP flow and, using a hash function, assigned a unique flow number to each session [8].
Figure 34.6 is a block diagram of the TCP processor. Internet packets arrive as frames of data at the input state machine of the TCP processing engine. The input state machine forwards the frames to three units: a first-in, first-out (FIFO) buffer that holds the packet; a checksum engine that computes the TCP checksum and verifies its correctness; and a flow classifier that computes a flow identifier (flow ID) using a hash over fields in the packet header. The flow ID is passed to the state store manager, which retrieves the state associated with the particular flow. Results are written to the control and state FIFO, and the state store is updated with the current flow state. The output state machine reads data from the frame and control FIFO buffers and passes data to the packet-routing engine. Most traffic flows through the content-scanning engines, which scan the data. Packet retransmissions bypass these engines and go directly to the flow-blocking module.
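The flow classifier and state store described above can be sketched in Python. The choice of header fields, the use of `zlib.crc32` as the hash, the table size, and the record layout are all assumptions made for illustration; the text states only that a hash over fields in the packet header yields the flow ID and that per-flow state is retrieved and updated. A hash can map two sessions to the same ID, and this sketch makes no attempt to detect such collisions.

```python
import struct
import zlib

TABLE_BITS = 20   # assumed state-store size: 2**20 flow records


def flow_id(src_ip: bytes, dst_ip: bytes, src_port: int, dst_port: int) -> int:
    """Hash the addressing fields of a packet into a state-store index."""
    key = struct.pack("!4s4sHH", src_ip, dst_ip, src_port, dst_port)
    return zlib.crc32(key) & ((1 << TABLE_BITS) - 1)


class StateStore:
    """Per-flow records keyed by flow ID (a dictionary stands in for SDRAM)."""

    def __init__(self):
        self.records = {}

    def retrieve(self, fid: int) -> dict:
        return self.records.get(fid, {"next_seq": None})

    def update(self, fid: int, state: dict) -> None:
        self.records[fid] = state


def process_segment(store, src_ip, dst_ip, src_port, dst_port, seq, payload):
    """Classify one TCP segment and report whether it extends the byte stream in order."""
    fid = flow_id(src_ip, dst_ip, src_port, dst_port)
    state = store.retrieve(fid)
    in_order = state["next_seq"] in (None, seq)   # first segment seen, or the expected one
    if in_order:
        store.update(fid, {"next_seq": (seq + len(payload)) & 0xFFFFFFFF})
    return fid, in_order
```

In this sketch a segment whose sequence number does not match the stored value, such as a retransmission, is reported as out of order and leaves the flow state unchanged, which loosely mirrors the way retransmitted packets bypass the content-scanning engines.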
Data returning from the content-scanning engines also goes to the flow-blocking module. This stage updates the per-flow state store with application-specific state information. If a content-scanning engine indicates that it has a need to block a flow, the flow-blocking module can enforce this rule by comparing the packet’s sequence number with the sequence numbers for which flow blocking should take place. If the packet meets the blocking criteria, the
Chapter 34
Network Packet Processing in Reconfigurable Hardware
FIGURE 34.6 A block diagram of the TCP processor.
flow-blocking module drops it from the network. Any remaining packets go to the outbound protocol wrapper. The state store manager processes requests to read and write flow state records. It also handles all interactions with SDRAM memory and caches recently accessed flow state information. The SDRAM controller exposes three memory access interfaces: a read/write, a write-only, and a read-only. The controller prioritizes requests in that order, with the read/write interface having the highest priority.
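The flow classifier and state store described above can be illustrated in software. The following is a minimal sketch, not the FPX circuit: the names `flow_id`, `FlowState`, `StateStore`, and `TABLE_BITS`, the choice of SHA-256 as the hash, and the three-way packet classification are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass

TABLE_BITS = 20          # hypothetical state-store size: 2**20 flow records
SEQ_SPACE = 1 << 32      # TCP sequence numbers wrap modulo 2**32


@dataclass
class FlowState:
    next_seq: int        # next in-order sequence number expected for the flow


def flow_id(src_ip: str, dst_ip: str, src_port: int, dst_port: int) -> int:
    """Hash packet header fields to an index into the state store."""
    key = f"{src_ip}:{src_port}>{dst_ip}:{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % (1 << TABLE_BITS)


class StateStore:
    """Per-flow records, keyed by flow ID (a dict stands in for SDRAM)."""

    def __init__(self) -> None:
        self._records: dict[int, FlowState] = {}

    def classify(self, fid: int, seq: int, payload_len: int) -> str:
        state = self._records.get(fid)
        if state is None:
            # First packet seen for this flow: create its record.
            self._records[fid] = FlowState((seq + payload_len) % SEQ_SPACE)
            return "in-order"
        diff = (seq - state.next_seq) % SEQ_SPACE
        if diff == 0:
            state.next_seq = (seq + payload_len) % SEQ_SPACE
            return "in-order"
        # A sequence number behind the expected one is treated as a retransmission.
        return "retransmission" if diff >= SEQ_SPACE // 2 else "out-of-order"
```

Because the flow ID is a hash, two distinct flows can map to the same record; this sketch ignores such collisions.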
34.2.3
Payload-processing Modules
Many network applications have a common requirement for string matching in the payload of packets or flows. Once the data being transported over the network has been reconstructed using the IP and TCP modules, it can be examined in the payload. For example, the presence of a string of bytes (or a signature) can identify the presence of a media file, an attachment, or a security exploit. Well-known Internet worms, such as Nimda, Code Red, and Slammer, propagate by sending malicious executable programs identifiable by certain byte sequences in payloads [14]. Because the location (or offset) of such strings and
34.2 Network Protocol Processing
their length are unknown, such applications must be able to detect strings of different lengths starting at arbitrary packet payload locations. Packet inspection applications, when deployed at router ports, must operate at wire speeds. As network rates increase, the implementation of packet monitors that process data at Gigabit/second line rates has become increasingly difficult. Thus, the growth in network traffic has motivated specialized packet- and payload-processing modules in hardware.
34.2.4
Payload Processing with Regular Expression Scanning
A regular expression (RE) is a pattern that describes a set of strings. The basic building blocks for these patterns consist of individual characters, such as a, b, and c. These characters can be combined with meta-characters, such as *, |, and ?, to form regular expressions with wildcards. For two regular expressions, r1 and r2, rules define that r1* matches any string composed of zero or more occurrences of r1; r1? matches any string composed of zero or one occurrence of r1; r1|r2 matches any string composed of r1 or r2; and r1r2 matches any string composed of r1 concatenated with r2. For instance, a is an RE that denotes the singleton set {a}, while a|b denotes the set {a, b} and a* denotes the infinite set {null, a, aa, aaa, ...}. REs can be identified using nondeterministic finite automata (NFA). Research on RE matching in hardware has been performed by Sidhu and Prasanna [16] and Franklin et al. [17]. Sidhu and Prasanna were primarily concerned with minimizing the time and space required to construct NFAs. They ran their NFA construction algorithm in hardware as opposed to software. Franklin et al. followed with an analysis of this approach for the large set of expressions found in a SNORT database [18]. The search function FPgrep was implemented by Moscola et al. to search packet payloads for substrings that belong to the language defined by the RE [15]. When FPgrep matched a substring in a packet, it transmitted information about the packet to a monitoring host system. The information sent for network intrusion detection functions specified the content found and the sender’s and receiver’s IP addresses. The search ran in linear time (proportional to packet size), O(n) (where n was the number of bytes in a packet), and in constant space. That is, there was never a need to examine a character more than once and the amount of hardware was proportional to the size of the RE. Approximately one flip-flop was required per character.
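The one-state-per-character structure can be imitated in software. The sketch below is not FPgrep: it is restricted to a literal byte string (no meta-characters), and the function name `scan` is hypothetical. It keeps one Boolean per pattern character, standing in for a flip-flop, and reads each input byte exactly once.

```python
def scan(pattern: bytes, data: bytes) -> list[int]:
    """Return the positions in data where an occurrence of pattern ends."""
    state = [False] * len(pattern)        # one "flip-flop" per pattern character
    hits = []
    for pos, byte in enumerate(data):
        nxt = [False] * len(pattern)
        for i, ch in enumerate(pattern):
            # State i becomes active if the previous state was active
            # (the start state is always active) and the byte matches.
            prev_active = True if i == 0 else state[i - 1]
            nxt[i] = prev_active and byte == ch
        state = nxt
        if state[-1]:
            hits.append(pos)
    return hits
```

In hardware all states update in the same clock cycle, so the inner loop here corresponds to parallel logic rather than to extra time per byte.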
A streaming content editor, FPsed, was implemented as a module on the FPX platform. The FPsed module selectively replaced content in packet payloads. String replacement for an RE is not as straightforward or efficient as searching. It requires that the machine do more than simply determine the presence of matching substrings in a record—it must also determine the position of the first and last character of all complete substrings that are matched by it. It is this requirement that makes RE search and replace more complicated and less efficient than a simple search. Searching for the complete substring is logical when the goal is to replace it.
Consider the replacement of every occurrence of a certain hexadecimal string associated with a computer virus, 3n*4n*5n*B, with the text Virus Pattern Detected. For the sake of brevity, the previous expression uses n as shorthand for any hexadecimal character (i.e., 0|1|2|3|4|5|6|7|8|9|A|B|C|D|E|F). For the input string 3172F34435B6B7B8, the substring can be replaced from the point where the machine starts running, 34, to the point where the substring is accepted, just before B6 (i.e., substring 34435B). However, this would allow a portion of the virus to remain in the content stream. In most situations, it is preferable to replace complete substrings; here the complete substring match starts with 31 and includes everything to just before B8 (i.e., the substring 3172F34435B6B7B).
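The example above can be reproduced with a software regular expression engine. This sketch uses Python's `re` module rather than FPsed; the names `HEX`, `VIRUS`, and `data` are illustrative. With greedy quantifiers, this pattern and input yield the complete substring that the text recommends replacing.

```python
import re

HEX = "[0-9A-F]"                          # the text's shorthand n
VIRUS = re.compile(f"3{HEX}*4{HEX}*5{HEX}*B")

data = "3172F34435B6B7B8"

# Greedy search from the left finds the complete substring.
longest = VIRUS.search(data).group(0)

# Replacing it leaves only the trailing character that is not part of the match.
cleaned = VIRUS.sub("Virus Pattern Detected", data)

# The shorter substring from the text is also a valid match of the pattern,
# but replacing only it would leave part of the virus in the stream.
shorter_is_match = VIRUS.fullmatch("34435B") is not None
```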
34.2.5
Payload Scanning with Bloom Filters
A hash table is one of the most attractive choices for quick lookups. Hash tables require only constant time, O(1), average memory accesses per lookup. Because of their versatile applicability in network packet processing, it is useful to implement these hashing functions in hardware [19, 20]. Bloom filters can detect strings of characters that appear in streaming data moving at very high data rates. A Bloom filter is a data structure that stores a set of signatures compactly by computing multiple hash functions on each member of the set. It queries a database of strings to check for the membership of a particular string. The answer to this query can be false positive but never false negative. The average computation time to perform a query remains constant so long as the sizes of the hash tables scale linearly with the number of strings they store. Because each table entry stores only a hashed version of the content, the amount of storage required by the Bloom filter for each string is independent of its length.
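A Bloom filter of the kind described above can be sketched in a few lines. This is an illustrative software version, not the hardware design: the class name, the default sizes, and the use of salted SHA-256 digests as the multiple hash functions are assumptions.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits: int = 1024, num_hashes: int = 4) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)   # one byte per bit, for clarity

    def _positions(self, item: bytes):
        # Derive several hash functions by prefixing a one-byte salt.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item: bytes) -> bool:
        # True may be a false positive; False is always correct.
        return all(self.bits[pos] for pos in self._positions(item))
```

Storage is fixed when the filter is created, so a long signature costs no more memory than a short one. A string reported as present should still be confirmed by an exact comparison, because of possible false positives.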
34.3
INTRUSION DETECTION AND PREVENTION
Existing firewalls that examine only the packet headers do little to protect against many types of attack. Multiple new worms transport their malicious software, or malware, over trusted services and cannot be detected without examining the payload. Intrusion detection systems (IDSs) perform deep scanning of the payload to detect malware, but do nothing to impede the attack because they only operate passively. An intrusion prevention system (IPS), on the other hand, can intervene and stop malware from spreading. The configuration of a network intrusion prevention system is shown in Figure 34.7. One problem with software-based IDSs is that they cannot keep pace with the high volume of traffic that transits high-speed networks. Existing systems that implement IPS functions in software limit the bandwidth of the network and delay the end-to-end connection. A reconfigurable system that can keep pace with high-speed network traffic has been developed. It scans data quickly, reconfigures to search for new attack
FIGURE 34.7 Configuration of an in-line network IPS situated between two hosts attached to a router and to the Internet.
patterns, and takes immediate action when attacks occur. By processing the content of Internet traffic in real time within an extensible network, data that contains computer viruses or Internet worms can be detected and prevented. By adding only a few filtering devices at key network aggregation points, Internet worms and computer viruses can be quarantined to the subnets where they were introduced. A complete system has been designed and implemented that scans the full payload of packets to route, block, and track the packets in the flow based on their content. The result is an intelligent gateway that provides Internet worm and virus protection in both local and wide area networks. Network intrusion detection and prevention systems search for predefined virus or worm signatures in network traffic flows (see Section 34.2.3). Such signatures can be loaded into the system manually by an operator or automatically by a signature detection system. (Note that string is synonymous with signature throughout the chapter.) Once a signature is found, an intrusion detection and prevention system (IDPS) can use it to block traffic containing infected data from spreading throughout a network. To perform this operation on a high-speed network, the signature scanning and data blocking must operate quickly. Comparing a variety of systems running the SNORT rule-based NID sensor reveals that most general-purpose computer systems are inadequate as NID sensor platforms even for moderate-speed networks. Factors such as microprocessor, operating system, main memory bandwidth, and latency limit the performance that an NIDS sensor platform can achieve [22].
34.3.1
Worm and Virus Protection
Computer virus and Internet worm attacks are pervasive, aggravating, and expensive, both in terms of lost productivity and consumption of network bandwidth. Attacks by Nimda, Code Red, Slammer, SoBig.F, and MSBlast have infected computers globally, clogged large computer networks, and degraded corporate productivity. It can take weeks to months for information technology professionals to sanitize infected computers in a network after an outbreak [24]. In the same way that a human virus spreads among people coming in contact with each other, computer viruses and Internet worms spread when computers communicate electronically [25]. Once a few systems are compromised, they infect other machines, which in turn quickly spread the infection throughout a network. As is the case with the spread of a contagious disease, the number
of infected computers grows exponentially unless contained. Computer systems spread contagion much more quickly than humans do because they can communicate instantaneously over large geographical distances. The Blaster worm, for example, infected over 400,000 computers in less than five days. In fact, about one in three Internet users are infected with some type of virus or worm every year. Malware can propagate as a computer virus, an Internet worm, or a hybrid of both. Viruses spread when a computer user downloads unsafe software, opens a malicious attachment, or exchanges infected computer programs over a network. An Internet worm spreads over the network automatically when malware exploits one or more vulnerabilities in an operating system, a web server, a database application, or an email exchange system. Malware can appear as a virus embedded in software that a user has downloaded. It can also take the form of a Trojan that is embedded in what appears to be benign freeware. Alternatively, it can spread as content attached to an email message, as content downloadable from a web site, or in files transferred over peer-to-peer systems. Modern attacks typically use multiple mechanisms to execute. Malware, for example, can spoof messages that lure users to submit personal financial information to cloaked servers. In the future, malware is likely to spread much faster and cause much more damage. Today, most anti-virus solutions run in software on end systems. To ensure that an entire network is secure from known attacks, integrated systems were developed that can perform multiple network processing functions.
34.3.2
An Integrated Header, Payload, and Queuing System
An integrated system that incorporated the payload-scanning function, a ternary content addressable memory (TCAM) for header matching, and a flow buffer and queue manager for packet storage was implemented [13]. It is shown as a block diagram in Figure 34.8.
FIGURE 34.8 Complete on-chip networking header and payload processing integrated with a flow buffer and a queue manager. Blocks labeled on the Xilinx XCV2000E FPGA: layered protocol wrappers, payload scanner, TCAM filter, extensible module(s), flow buffer, free list manager, queue manager, packet scheduler, and SDRAM and SRAM controllers that interface to off-chip memories.
SNORT is a lightweight NID sensor that can filter packets based on predefined rules over packet headers and payloads [18]. With the TCP option enabled, SNORT matches strings that appear anywhere within traffic flows. Each SNORT rule operates first on the packet header to verify that the packet is from a source or to a destination network address and/or port of interest. If the packet matches a certain header rule, its payload is scanned against a set of predefined patterns associated with that rule. Matching of one or multiple patterns implies a complete match of a rule, and further action can be taken on either the packet or the TCP flow. To provide complete detection of all known attacks, an intrusion system must process all packets. Several thousand patterns appeared in the version 2.2 rule set for SNORT. SNORT’s rule database continually expands as new threats are observed. As the number of headers and signatures to match increases, the CPU on a PC running SNORT becomes overloaded and not all packets are processed. A SNORT intrusion filter for TCP (SIFT) was implemented in reconfigurable hardware and is illustrated in Figure 34.9. SIFT data entered the system via the TCP de-serialize wrapper. Control signals marked specific locations in the
FIGURE 34.9 A block diagram of SIFT. Memories labeled in the figure: on-chip Xilinx BlockRAMs; off-chip SDRAM, 64–512 MBytes; off-chip ZBT SRAM, 2 MBytes.
packet that included the starts of the IP header, the TCP header, and the payload. The value of the header was sent to a header check component to determine if the packet matched a header-only rule. The payload was sent through an 8-stage pipeline where each byte offset was searched for signatures by Bloom filters. If a match was detected, the match decoder determined the string identifier (ID), which was next sent to the action retriever to determine what to do with the packet. Suspect packets were forwarded to software for further inspection. Those that had no match were not inspected further; those that did need additional processing were sent to the outgoing side of the TCP de-serialize wrapper. To match payloads, SIFT used Bloom filters to allow signatures to be incrementally programmed into hardware. Signatures could be added or deleted via messages embedded in UDP control packets. These packets were sent through the communication wrapper to a control finite-state machine (FSM). In turn, the FSM set the appropriate bits in BlockRAM memories on the FPGA to add the signature to the Bloom filter. To achieve high throughput, four engines ran in parallel [21].
34.3.3
Automated Worm Detection
Outbreaks of new worms constitute a major threat to Internet security. IDPSs described previously only filter traffic that contains known worms. Systems that automatically detect new worms in real time by monitoring traffic on a network allow detection and protection from new outbreaks. Internet worms spread by exploiting vulnerabilities in operating systems and application software that run on end systems. Once they infect a machine, they use it to attack other hosts; these attacks compromise security and degrade network performance, causing large economic losses for businesses resulting from system downtime and lowered worker productivity. The Susceptible/Infective (SI) model illustrates the spread of Internet worms [25]. With this model, a well-known equation can be used to estimate how fast a worm will infect vulnerable machines. Worms can be prevented by writing code that has no vulnerabilities, and the computer security community has made great strides toward this goal. Programmers analyze the vulnerability that the worm exploits and release a “patch” to fix it. However, it takes time to analyze and patch software. In addition, many end users may never apply the patch, and as a result a significant number of machines in the network remain vulnerable. Another way to prevent the spread of worms is to have the network contain them. When intrusion prevention systems scan traffic for a predetermined signature and filter the flows that match, the spread of a known worm can be blocked. The EarlyBird System [26, 27] detects the signatures for unknown worms in real time, identifying them by their repeating content. Because worms consist of malicious code, frequently repeated content on the network can be a useful warning of worm activity. Large flows are identified by computing a hash of packet content in combination with a destination port.
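The chapter does not state the SI equation it refers to. As an assumption, the sketch below uses the standard logistic form of the Susceptible/Infective model, di/dt = beta * i * (1 - i), where i is the fraction of vulnerable machines already infected. The names `infected_fraction`, `beta`, and `i0` are chosen here for illustration.

```python
import math


def infected_fraction(t: float, beta: float, i0: float) -> float:
    """Closed-form solution of di/dt = beta * i * (1 - i) with i(0) = i0.

    beta is the contact rate and i0 the initially infected fraction
    (both hypothetical parameters for this sketch).
    """
    growth = i0 * math.exp(beta * t)
    return growth / (1.0 - i0 + growth)
```

Early in an outbreak the curve grows almost exponentially, which matches the observation that the number of infected machines grows exponentially unless contained; it then flattens as few vulnerable machines remain.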
A hardware-accelerated worm detection circuit implemented in reconfigurable hardware draws from two ideas presented in the EarlyBird system [23]. To detect commonly occurring content, a hash is computed over 10-byte windows of streaming data. The hash value is used to identify a counter in a vector that is instructed to increment by one. At periodic intervals (called timeouts), the counts in each of the vectors are decremented by the average number of arrivals due to normal traffic. When a counter reaches a predetermined threshold, an alert is generated and its value is reset to zero. For the implementation of the circuit on an FPGA, the count vector was implemented by configuring dual-ported, on-chip BlockRAMs as an array of memory locations. Each memory afforded one read operation and one write operation every clock cycle, which allowed a 3-stage pipeline to be implemented that reads, increments, and writes memory every clock cycle. Because the signature changes every clock cycle and because every occurrence of every signature must be counted, the dual-ported memories allow the occurrence count to be written back while another count is being read. When an on-chip counter crosses the threshold, the corresponding signature is hashed to a table in off-chip SRAM. The next time the same string causes the counter to exceed the threshold, it is hashed to the same location in SRAM and the two strings are compared. If they are the same, it is determined that the match is not a false positive and the counter is incremented. If they are different, the contents of the string stored in SRAM are overwritten with the value of the new string and the count is reset. On receiving confirmation from the SRAM analyzer that a signature frequently occurs, a UDP control packet is sent to an external computer. The packet contains the offending signature, which is the string of bytes by which the hash was computed. The computer, in turn, programs other IDS/IDP systems to filter traffic that contains this signature.
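The count-vector stage described above can be sketched in software. This is an illustration only: the vector size, the threshold, the use of CRC-32 as the hash, and the names `CountVector`, `scan`, and `timeout` are assumptions, and the off-chip SRAM confirmation stage is omitted.

```python
import zlib

WINDOW = 10            # bytes per window, as in the text
NUM_COUNTERS = 4096    # hypothetical count-vector size
THRESHOLD = 3          # hypothetical alert threshold


class CountVector:
    def __init__(self) -> None:
        self.counts = [0] * NUM_COUNTERS

    def scan(self, stream: bytes) -> list[bytes]:
        """Count every 10-byte window; return windows that raised an alert."""
        alerts = []
        for start in range(len(stream) - WINDOW + 1):
            window = stream[start:start + WINDOW]
            index = zlib.crc32(window) % NUM_COUNTERS
            self.counts[index] += 1
            if self.counts[index] >= THRESHOLD:
                alerts.append(window)
                self.counts[index] = 0    # reset after the alert
        return alerts

    def timeout(self, average_arrivals: int) -> None:
        """Periodic decrement that discounts normal background traffic."""
        self.counts = [max(0, c - average_arrivals) for c in self.counts]
```

Different windows can hash to the same counter, which is why the design described in the text confirms an alert by comparing the actual strings in SRAM before reporting a signature.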
34.4
SEMANTIC PROCESSING
Next-generation networks route and forward data based on the semantics of the data within documents. Rather than assigning arbitrary headers to packets, routers use the meaning of the text itself to determine the packet routing.
34.4.1
Language Identification
As of 2004, nearly two-thirds of the world’s Internet users spoke a non-English native language [29], and nearly one-third of the pages available on the Internet were written in a non-English language [29, 30]. As the rate at which data is transferred over the Internet increases, the rapid identification of languages becomes an increasingly difficult problem. A system capable of quickly identifying the primary language or languages used in documents can be useful as a preprocessor for document classification and translation services. It can also be used as a mechanism for language-based document routing.
A hardware-accelerated algorithm was designed to automatically identify the primary languages used in documents transferred over the Internet [28]. The module was implemented in hardware on the FPX platform. Referred to as Hardware-Accelerated Identification of Languages (HAIL), this complete system identified the primary languages used in content transferred over TCP/IP networks. It operated on streaming data at a rate of 2.4 Gigabits/second using FPGA hardware. This level of performance far outstripped software algorithms running on microprocessors. Several methods have been shown to be effective for the classification of document characteristics based on principles from linguistics and artificial intelligence. Some methods used dictionary-building techniques [31], while others used Markov Models, trigram frequency vectors [32], and/or n-gram–based text categorization [33, 34]. Although these methods are capable of achieving high degrees of accuracy, most require floating-point mathematics, large amounts of memory, and/or generous amounts of processing time. HAIL uses n-grams to determine the language of a document. These are sequential patterns of exactly n characters that are found in written documents, and when they are used as indicators of language, the primary language or languages of a document can be reliably determined. HAIL can use any n-gram length, although experiments have shown that n-grams of length 3 (trigrams) and length 4 (tetragrams) provide the most accurate results. Before processing data with HAIL, the target system is trained with information on languages. Training is performed by scanning a set of documents in the languages of interest. When an n-gram appears significantly more frequently in the documents of one language than in any other, it is associated with that language. After training has established which n-grams best correspond to particular languages, memory modules on the hardware platform implementing HAIL have to be programmed. 
Memory is populated by using a hash to map each n-gram to a particular memory location. The memory location that corresponds to a particular n-gram is labeled with the associated language. Once data processing begins, the n-grams are sampled from the datastream and used as addresses into memory to discern the language associated with the n-gram. The final language is determined by the statistics of the words that appear in each language.
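The train-then-look-up flow described above can be sketched as follows. This is not the HAIL implementation: the table size, the CRC-32 hash, the function names, and the simplified training rule (the language whose sample uses an n-gram most often claims its memory slot) are assumptions for illustration.

```python
import zlib
from collections import Counter
from typing import Optional

N = 4                  # tetragrams, one of the lengths the text reports as accurate
TABLE_SIZE = 1 << 16   # hypothetical number of memory locations


def ngrams(text: str, n: int = N):
    return (text[i:i + n] for i in range(len(text) - n + 1))


def slot(gram: str) -> int:
    """Hash an n-gram to a memory location."""
    return zlib.crc32(gram.encode("utf-8")) % TABLE_SIZE


def train(samples: dict[str, str]) -> dict[int, str]:
    """Label each slot with the language whose sample uses the n-gram most."""
    best: dict[int, tuple[int, str]] = {}
    for language, text in samples.items():
        for gram, count in Counter(ngrams(text)).items():
            s = slot(gram)
            if s not in best or count > best[s][0]:
                best[s] = (count, language)
    return {s: language for s, (_, language) in best.items()}


def identify(table: dict[int, str], text: str) -> Optional[str]:
    """Look up every n-gram of the text and return the most frequent label."""
    votes: Counter = Counter()
    for gram in ngrams(text):
        language = table.get(slot(gram))
        if language is not None:
            votes[language] += 1
    return votes.most_common(1)[0][0] if votes else None
```

Each lookup is a single hashed memory read with integer counting, which avoids the floating-point arithmetic that the text notes other methods require.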
34.4.2
Semantic Processing of TCP Data
Within the intelligence community, there is a need to search through massive amounts of multilingual documents that are encoded using different character sets. It has been shown that computational linguistics and text-processing techniques are effective for sorting through large information sets, extracting relevant documents, and discovering new concepts [33]. There is a problem, however, in that the computational complexity of the text-processing algorithms is such that the document ingest rate is too slow to keep up with the high rate of information flow [34]. To overcome this problem, a system using FPGA hardware was developed for accelerated concept discovery and classification algorithms [35, 36].
Circuits were implemented as reconfigurable hardware modules that dramatically increased data ingest rates. It was found that text analysis algorithms that perform “bag of words” processing were widely used and appropriate for many types of computational linguistics tasks. To investigate the utility of hardware-accelerated text analysis algorithms, a reconfigurable FPGA-based semantic-processing system was developed. The hardware tested a variety of target problems involving concept classification, concept discovery, and language identification [36]. A blend of high-speed network devices and reconfigurable hardware was used to rapidly ingest and process data [35]. Data were received from the network as text or HTML documents and carried over standard TCP/IP packets. The TCP processor decoded the packets that contained the document in one or more TCP/IP input flows. Every word (baseword) in the document was analyzed for its semantic meaning. All words in each document were then counted to determine their frequency of occurrence. A document vector was generated that characterized the document content. It was then scored against a set of vectors that represented known or emerging concepts. Thresholds were used to determine if content could be classified as existing or if a new cluster should be formed. Figure 34.10 diagrams the dataflow of the semantic-processing system. The FPGAs enabled streaming, computationally intensive semantic-processing functions to be performed in constant time. They performed all of the
FIGURE 34.10 Dataflow for the semantic processing system. Stages shown: receive large volume of input content over network (e.g., HTML documents); decode input TCP datastreams and interpret content; map basewords to semantic meaning; count word frequencies in each document; score documents against known and emerging concepts; automatically threshold, classify, and cluster content into groups for analyst.
data-processing functions for the system shown in the figure except for threshold and classification (which were performed and displayed on a computer console). By using FPGAs to implement all parts of the text processing, the entire system could be dynamically reconfigured to allow variations of algorithms to be evaluated for their content classification or concept-clustering ability. Massive volumes of data were streamed through the system, and the system’s precision, recall, throughput, and latency were measured [36]. The RAD circuits on the FPX (shown in Figure 34.3) were used to implement the TCP processor, the baseword module, the count module, the score module, and the report module. All were implemented as modular hardware components on individual FPX platforms connected in a vertical stack. The high-speed network interfaces allowed the FPX platforms to communicate intermediate results of processing to other modules in the system and to send reports to software running on a computer outside the system using standard IP datagrams. Multiple copies of the FPX platform were stacked on each other to implement network intrusion detection and network intrusion prevention. Figure 34.11 is a photograph displaying how five FPX cards were stacked to implement the semantic processing system. Additional modules were added to tag tokens in a context-free grammar [37].
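The count, score, and threshold stages of the semantic-processing system can be sketched as "bag of words" processing. The chapter does not say which scoring function was used, so cosine similarity is an assumption here, as are the function names and the default threshold; the baseword-mapping stage is omitted.

```python
import math
from collections import Counter
from typing import Optional


def document_vector(text: str) -> Counter:
    """Count how often each word occurs in the document."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def classify(doc: str, concepts: dict[str, Counter],
             threshold: float = 0.5) -> Optional[str]:
    """Return the best-scoring concept, or None if no score reaches the threshold.

    None marks a document that could seed a new cluster.
    """
    vec = document_vector(doc)
    best_name, best_score = None, 0.0
    for name, concept in concepts.items():
        score = cosine(vec, concept)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None
```

Word order is discarded when the vector is built, so each stage needs only a single pass over the stream.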
34.5
COMPLETE NETWORKING SYSTEM ISSUES
To deploy complete network systems, additional issues must be considered. First, the hardware must be placed in a form factor appropriate for use in remote network closets. Second, the control and configuration of the hardware must be secure. And third, reconfiguration mechanisms are needed so that entire FPGAs, or (as needed) only parts, can be reconfigured over the network. With dynamic hardware plug-ins, most of the system can remain operational while parts of it are reconfigured. Partial bitfile reconfiguration allows the system itself to remain operational 24 hours a day (which is necessary to maintain a good network uptime) while individual components can still be modified quickly and efficiently. The PARBIT tool allows precompiled partial bitfile configurations to be generated and then quickly deployed into regions of FPGA networking hardware.
34.5.1
The Rack-mount Chassis Form Factor
Networking equipment is typically deployed in the form factor of a chassis that can be mounted into a 19-inch rack. Each unit (U) of a rack is 1.75 inches tall. In a 3U rack-mount chassis, up to four FPX modules could be stacked on each of two ports in the system. Data entered and left the system through the Gigabit Ethernet ports on the front panel. Figure 34.12 is a photograph of FPX modules integrated in a rack-mount chassis.
FIGURE 34.11 A stack of the FPX modules implemented the semantic processing system. Labels in the photograph: incoming network traffic; TCP processor; word mapping module; count module; score module; reporting module; outgoing scored document vectors.
FIGURE 34.12 FPX modules integrated in a rack-mount chassis.
34.5.2
Network Control and Configuration
Reconfigurable hardware circuits perform a variety of functions in the networking system. Some parts of the system implemented the infrastructure while others implemented the dynamically reconfigurable logic. Static circuits are
used to switch cells between modules. The extensible modules implemented as plug-ins perform the reconfigurable features. The FPX used a combination of statically configured and dynamically configurable logic to implement the complete platform. On the FPX, the NID was statically configured using a bitfile stored in a PROM. It controlled how data was routed between network modules and included switching modules that forwarded traffic flows based on virtual paths and circuits found in the ATM cell headers. The NID also contained the logic that enabled other hardware modules to be dynamically loaded over the network. This logic implemented a circuit that used a reliable network protocol to receive full and partial bitfiles over the network. The NID, in turn, buffered this data in a configuration cache and streamed the bitstream into the programming port of the attached FPGA. The RAD on the FPX was a Xilinx VirtexE-2000E FPGA that received the configuration data and performed application-specific functions implemented as dynamic hardware plug-in (DHP) modules. A DHP consisted of a region of FPGA gates and internal memory bounded by a well-defined interface. For bitfiles that used all of the logic on the RAD, the interface was defined by user constraints file (UCF) pins. For partial bitfiles that used less than the entire FPGA, a standard on-chip interface was developed to transmit and receive packet data between modules. A full or partial bitfile was built using standard CAD tools [11].
34.5.3
A Reconfiguration Mechanism
The NID allowed modules created for the FPX platform to be remotely and dynamically loaded into the RAD. This bitstream was sent over the network into the configuration cache, which was implemented by a circuit that controlled an off-chip SRAM. Once a full or partial bitfile was received, a command was sent to the NID to initiate the RAD reconfiguration. On a Xilinx Virtex, the SelectMAP interface loaded a new bitstream into the FPGA. To reprogram the RAD, the NID read the configuration memory and wrote a preprogrammable number of configuration bytes into the RAD FPGA’s SelectMAP interface. Figure 34.13 illustrates this process. The NCHARGE API [9] was developed for debugging, programming, and configuring an FPX. Specifically, it included commands to check the status of an FPX, configure routing on the NID, and perform memory updates and full and partial RAD reprogramming. NCHARGE provided a mechanism for applications to define their own custom control interface. Control cells were transmitted by NCHARGE and processed by control cell processors (CCPs) on the RAD or NID. To configure routes for the traffic flowing through the system, NCHARGE sent control cells with commands that modified routing tables on the Gigabit switch or on the NID. To check the status of the FPX, NCHARGE sent a control cell to the NID on the FPX, the NID updated fields in the cell, and the software process received the response.
FIGURE 34.13 Remote reconfiguration of the FPX platform: (1) new module created; (2) full or partial bitstream sent over network to NID on the FPX and stored in configuration cache; (3) command issued to reconfigure hardware; (4) NID reads memory and reprograms RAD via SelectMAP interface.
34.5.4
Dynamic Hardware Plug-ins
Use of runtime reconfiguration in networking systems enables developers of hardware packet-processing applications to achieve a capability similar to that of the dynamically linked libraries (DLLs) used in software applications. Just as a DLL is a software module that can be attached to or removed from a running program as an application demands, DHPs can be loaded into or removed from a running FPGA without disturbing other circuits operating in it. The ability to change the hardware feature set in a running system is particularly useful in packet-processing applications such as firewalls and routers, where it is not desirable to suspend the network operation during reprogramming. A practical system for DHPs was implemented on the FPX. It provided sufficient resources for networking, well-defined interfaces to hardware, a complete design methodology, scripts that ran physical implementation tools to place and route logic, and tools that allowed selective reconfiguration of portions of the bitstream. These five elements were analogous to the operating system platform, application programming interface, modular programming methodology, compiler, and linker needed to implement DLLs in the software domain.
34.5.5
Partial Bitfile Generation
Tools and a design methodology were developed to support partial runtime reconfiguration of DHP modules on the FPX platform. The PARBIT tool was developed to transform and restructure bitstreams created by standard computer-aided design tools into partial bitstreams that programmed DHPs.
The methodology allowed the platform to hot-swap application-specific DHP modules without disturbing the operation of the rest of the system [12]. To partially reconfigure an FPGA, it is necessary to isolate a specific area in it and download the configuration only for the bits related to that area. PARBIT transformed and restructured the Xilinx bitstreams to extract and merge data from the bitfile's regions. To restructure the configuration bitfile, it read the original bitfile, a target bitfile, and parameters given by the user that specified the block coordinates of the logic implemented on a source FPGA, the coordinates of the area for a partially programmed target FPGA, and the programming options. After reading these data, PARBIT copied to the target bitstream only the part of the original bitstream related to the area defined by the user. The target bitstream was used by PARBIT to preserve the part of the configuration data that was in a column specified by the user but outside the partially reconfigurable area. On a Xilinx VirtexE FPGA, the use of the target bitstream was necessary because one reconfiguration frame could span all rows of a column while the partially reconfigurable area was smaller than the column's height. PARBIT allowed arbitrary block regions of a compiled design to be retargeted into any similarly sized region of an FPGA. To relocate blocks from the original bitfile, a user defined the start and end columns and rows for the block in the original design. Then the user defined where to put this block in a target bitfile of the same device type. The tool generated the partial bitfile containing the area selected by the user (from the original bitfile). This data was used to reconfigure the target device. The configuration bits for the top and bottom input/output blocks (IOBs) from the target device did not change after the partial bitfile was loaded. Those for the columns from the original and target bitfile were merged according to the rows defined by the user.
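The column-wise merge can be illustrated with a small sketch. It is a simplified model, not PARBIT itself: the configuration is assumed to be a grid of words indexed by column and row (real Virtex frames are organized differently), and `Block`, `ConfigGrid`, and `make_partial` are hypothetical names. Coordinates are assumed to be valid for both grids.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified configuration image, indexed as grid[column][row].
using ConfigGrid = std::vector<std::vector<std::uint32_t>>;

// Inclusive block coordinates in the source design (hypothetical type).
struct Block {
    std::size_t col0, col1;  // start and end columns
    std::size_t row0, row1;  // start and end rows
};

// Build the columns of a partial configuration. Each emitted column spans
// the full height of the device: rows inside the relocated block come from
// the source image, and all other rows are preserved from the target image.
ConfigGrid make_partial(const ConfigGrid& source, const ConfigGrid& target,
                        const Block& src,
                        std::size_t dst_col0, std::size_t dst_row0) {
    ConfigGrid partial;
    for (std::size_t c = src.col0; c <= src.col1; ++c) {
        const std::size_t tc = dst_col0 + (c - src.col0);
        std::vector<std::uint32_t> column = target[tc];  // keep target rows
        for (std::size_t r = src.row0; r <= src.row1; ++r) {
            column[dst_row0 + (r - src.row0)] = source[c][r];
        }
        partial.push_back(column);
    }
    return partial;
}
```

Starting each column from the target image is what keeps logic above and below the plug-in region intact, which is the reason the text gives for requiring a target bitstream on devices whose frames span a whole column.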
34.5.6
Control Channel Security
For devices deployed remotely on the Internet, security of the control channel is critical. Remote systems need to be safe from both passive and active network attacks by malicious users. In passive attacks, malicious users glean information by monitoring the system. In active attacks, they attempt to change the system’s behavior or paralyze it. Access control mechanisms have been developed to protect remotely configured systems from unauthorized use. Common attacks include passive eavesdropping, active tampering, replay, and denial of service (DoS). For a passive eavesdropping attack, a malicious user taps the network to copy and analyze its traffic. If the attacker can see clear text control and configuration information, he or she may discover how to control and configure the system. In an active tampering attack, an unauthorized user attempts to gain control of the remote system by issuing bogus control packets. For a replay attack, a malicious user passively captures legitimate traffic and then attempts to change the operation of the system by resending the captured traffic at a later time. For an active DoS attack, the user paralyzes the system by overloading the network with massive amounts of traffic.
Remotely configurable network systems can be made safe by mechanisms that ensure confidentiality of data, provide authentication of the administrator, and guarantee integrity of the messaging. By encrypting messages with the Advanced Encryption Standard (AES) or other secure encryption algorithms, data confidentiality can be protected. With digital signatures generated by public key algorithms, the administrator of the system can be authenticated to guarantee that no one else attempts to modify its operation. The integrity of messages can be ensured by verifying that exactly what is transmitted by the administrator is received by the system. Use of a message authentication code (MAC) can assure users that data are not modified and that no additional control messages are inserted. The Internet Protocol Security (IPSec) standard provides a mechanism to secure communications across the Internet. Many companies, such as Cisco, have implemented IPSec capability in their networking products. To secure a remotely reconfigurable FPGA, an IPSec implementation in transport mode was designed for a Xilinx Virtex-II Pro FPGA [10]. Security policies at network access points defined who could gain access and under what conditions access was granted. Encryption keys and hash keys remained secret using the security services previously described. The Internet key exchange (IKE) protocol negotiated and exchanged shared secrets between communication entities.
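One way a receiver can reject the replay attack described earlier is to require a strictly increasing sequence number on every authenticated control message. The sketch below is illustrative only and is not taken from the design in [10]: `ReplayGuard` is a hypothetical class, the sequence number is assumed to be covered by the MAC, and MAC verification itself is represented by a boolean input rather than implemented. IPSec uses sequence numbers with a sliding acceptance window; the simpler strict-increase rule is substituted here for brevity.

```cpp
#include <cstdint>

// Hypothetical receiver-side guard for control messages.
class ReplayGuard {
    std::uint64_t last_seq_ = 0;  // highest sequence number accepted so far

public:
    // Accept a message only if its MAC verified (assumed to be checked
    // elsewhere) and its sequence number is newer than any seen before.
    bool accept(std::uint64_t seq, bool mac_valid) {
        if (!mac_valid) return false;        // tampered or forged message
        if (seq <= last_seq_) return false;  // replayed or stale message
        last_seq_ = seq;
        return true;
    }
};
```

A captured message that is resent later carries a sequence number the guard has already passed, so it is dropped even though its MAC is still valid.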
34.6
SUMMARY
As the limits of processor clock scaling are reached, systems that route, process, filter, and transform Internet data scale better in reconfigurable hardware than in software alone. Networking platforms created with FPGA hardware are both fast and flexible. The FPX platform was used to implement over 30 core networking functions. The combination of Gigabit network interfaces, parallel banks of SRAM and SDRAM, and a large array of reconfigurable logic on the FPX platform enabled it to perform a wide range of networking applications. Modules and protocol wrappers created in reconfigurable hardware were developed on the FPX and provided functionality similar to the procedures and DLLs in software for network processing. Reconfiguration of the modules over the network proved to be as effective for remotely loading new functionality on the FPX as the reprogramming of software on remote PCs. By using IP wrappers, the FPX platform provided the ability to process ATM cells, AAL5 frames, IP packets, UDP datagrams, and/or TCP/IP flows. Parallel finite automata engines proved useful in detecting regular expressions in packet payloads and TCP traffic flows. Bloom filters that performed parallel hash lookups also proved to be effective for detecting fixed strings in packets and TCP flows. A complete IDS system was implemented that performed a large subset of SNORT using a combination of protocol-processing wrappers, IP header matching circuits, and Bloom filter payload-scanning circuits. A worm and virus
detection and blocking system was built using an FPX that demonstrated its utility in providing Internet security. Reconfigurable hardware holds great promise for new types of networking applications. A language detection circuit was demonstrated that routed traffic based on the language used in a document. A semantic-processing circuit was demonstrated that allowed documents to be classified based on their topic. Going forward, reconfigurable hardware is becoming the technology of choice for future networking systems. Reconfigurable hardware is the key feature of a new platform, called the NetFPGA. This open platform enables switching and routing of network packets on Gigabit Ethernet links. Because the NetFPGA has many of the same resources as the FPX, it can implement most of the features first prototyped on the FPX [38, 39].
References
[1] J. W. Lockwood. Evolvable Internet hardware platforms. NASA/DoD Workshop on Evolvable Hardware, July 2001.
[2] J. W. Lockwood, H. Duan, J. M. Morikuni, S. M. Kang, S. Akkineni, R. H. Campbell. Scalable optoelectronic ATM networks: The iPOINT fully functional testbed. IEEE Journal of Lightwave Technology, June 1995.
[3] H. Duan, J. W. Lockwood, S. M. Kang, J. D. Will. A high-performance OC-12/OC-48 queue design prototype for input-buffered ATM switches. IEEE Infocom '97, April 1997.
[4] W. Marcus, I. Hadzic, A. McAuley, J. Smith. Protocol boosters: Applying programmability to network infrastructures. IEEE Communications Magazine 36(10), 1998.
[5] Wikipedia. OSI model. http://wikipedia.org/wiki/OSI_model, July 2006.
[6] F. Braun, J. W. Lockwood, M. Waldvogel. Protocol wrappers for layered network packet processing in reconfigurable hardware. IEEE Micro 22(3), February 2002.
[7] D. E. Taylor, J. S. Turner, J. W. Lockwood, T. S. Sproull, D. B. Parlour. Scalable IP lookup for Internet routers. IEEE Journal on Selected Areas in Communications 21(4), May 2003.
[8] D. Schuehler, J. W. Lockwood. A modular system for FPGA-based TCP flow processing in high-speed networks. Proceedings of the 14th International Conference on Field-Programmable Logic and Applications, August 2004.
[9] T. S. Sproull, J. W. Lockwood, D. E. Taylor. Control and configuration software for a reconfigurable networking hardware platform. IEEE Symposium on Field-Programmable Custom Computing Machines, April 2002.
[10] J. Lu, J. W. Lockwood. IPSec implementation on Xilinx Virtex-II Pro FPGA and its application. Reconfigurable Architectures Workshop, April 2005.
[11] E. D. Horta, J. W. Lockwood, D. E. Taylor, D. Parlour. Dynamic hardware plugins in an FPGA with partial run-time reconfiguration. Design Automation Conference, June 2002.
[12] E. Horta, J. W. Lockwood. Automated method to generate bitstream intellectual property cores for Virtex FPGAs. Proceedings of the 14th International Conference on Field-Programmable Logic and Applications, August 2004.
[13] J. W. Lockwood, C. Neely, C. Zuver, D. Lim. Automated tools to implement and test Internet systems in reconfigurable hardware. SIGCOMM Computer Communications Review 33(3), July 2003.
[14] J. W. Lockwood, J. Moscola, D. Reddick, M. Kulig, T. Brooks. Application of hardware accelerated extensible network nodes for Internet worm and virus protection. International Working Conference on Active Networks, December 2003.
[15] J. Moscola, J. W. Lockwood, R. P. Loui, M. Pachos. Implementation of a content-scanning module for an Internet firewall. IEEE Symposium on Field-Programmable Custom Computing Machines, April 2003.
[16] R. Sidhu, V. K. Prasanna. Fast regular expression matching using FPGAs. IEEE Symposium on Field-Programmable Custom Computing Machines, April 2001.
[17] R. Franklin, D. Carver, B. L. Hutchings. Assisting network intrusion detection with reconfigurable hardware. IEEE Symposium on Field-Programmable Custom Computing Machines, April 2002.
[18] M. Roesch. Snort: Lightweight intrusion detection for networks. Proceedings of the 13th Systems Administration Conference, LISA, November 1999.
[19] S. Dharmapurikar, P. Krishnamurthy, T. S. Sproull, J. W. Lockwood. Deep packet inspection using parallel Bloom filters. IEEE Micro 24(1), January 2004.
[20] H. Song, S. Dharmapurikar, J. Turner, J. W. Lockwood. Fast hash table lookup using extended Bloom filter: An aid to network processing. ACM SIGCOMM, August 2005.
[21] M. Attig, J. W. Lockwood. SIFT: SNORT intrusion filter for TCP. IEEE Symposium on High Performance Interconnects (Hot Interconnects-13), August 2005.
[22] L. Schaelicke, T. Slabach, B. Moore, C. Freeland. Characterizing the performance of network intrusion detection sensors. Proceedings of the Sixth International Symposium on Recent Advances in Intrusion Detection, September 2003.
[23] B. Madhusudan, J. W. Lockwood. A hardware-accelerated system for real-time worm detection. IEEE Micro 25(1), January 2005.
[24] D. Moore, C. Shannon, G. Voelker, S. Savage. Internet quarantine: Requirements for containing self-propagating code. IEEE INFOCOM, 2002.
[25] S. Staniford, V. Paxson, N. Weaver. How to own the Internet in your spare time. Usenix Security Symposium, August 2002.
[26] S. Singh, C. Estan, G. Varghese, S. Savage. The Earlybird System for the Realtime Detection of Unknown Worms, Technical report CS2003-0761, University of California, San Diego, Department of Computer Science, 2003.
[27] C. Estan, G. Varghese. New directions in traffic measurement and accounting. ACM SIGCOMM, August 2002.
[28] C. M. Kastner, G. A. Covington, A. A. Levine, J. W. Lockwood. HAIL: A hardware-accelerated algorithm for language identification. Proceedings of the 15th Annual Conference on Field-Programmable Logic and Applications, August 2005.
[29] Global Reach. Global Internet statistics by language. http://www.glreach.com/globstats/index.php3, December 2004.
[30] Global Reach. Global Internet statistics: Sources and references. http://www.glreach.com/globstats/refs.php3, December 2004.
[31] R. Paulsen, M. Martino. Word Counting Natural Language Determination, U.S. Patent 6,704,698, 1996.
[32] J. Schmitt. Trigram-based Method of Language Identification, U.S. Patent 5,062,143, 1990.
[33] M. Damashek. Method of Retrieving Documents that Concern the Same Topic, U.S. Patent 5,418,951, 1994.
[34] J. B. Sharkey, D. Weishar, J. W. Lockwood, R. Loui, R. Rohwer, J. Byrnes, K. Pattipati, D. Cousins, M. Nicolletti, S. Eick. Information processing at very high-speed data ingestion rates. In Emergent Information Technologies and Enabling Policies for Counter-Terrorism, edited by R. Popp and J. Yin. IEEE Press/Wiley, 2006.
[35] J. W. Lockwood, S. G. Eick, D. J. Weishar, R. Loui, J. Moscola, C. Kastner, A. Levine, M. Attig. Transformation algorithms for datastreams. IEEE Aerospace Conference, March 2005.
[36] J. W. Lockwood, S. G. Eick, J. Mauger, J. Byrnes, R. Loui, A. Levine, D. J. Weishar, A. Ratner. Hardware accelerated algorithms for semantic processing of document streams. IEEE Aerospace Conference, March 2006.
[37] Y. H. Cho, J. Moscola, J. W. Lockwood. Context-free grammar based token tagger in reconfigurable devices. Proceedings of the International Workshop on Data Engineering, April 2006.
[38] J. W. Lockwood, N. McKeown, G. Watson, G. Gibb, P. Hartke, J. Naous, R. Raghuraman, J. Luo. NetFPGA--An open platform for Gigabit-rate network switching and routing. IEEE International Conference on Microelectronic Systems Education (MSE2007), June 2007.
[39] J. Luo, J. Pettit, M. Casado, N. McKeown, J. W. Lockwood. Prototyping fast, simple, secure switches for Ethane. IEEE Symposium on High-Performance Interconnects (Hot Interconnects-15), August 2007.
CHAPTER
35
ACTIVE PAGES: MEMORY-CENTRIC COMPUTATION
Diana Franklin
Department of Computer Science
California Polytechnic State University
Although field-programmable gate arrays (FPGAs) excel at tailoring the computation and interconnect to an application's needs, we can go one step further. In many applications, regardless of the speed of the computation, memory performance will always be the limiting factor. This problem, referred to as the memory wall, has two parts: memory latency and memory bandwidth. For large-scale data-parallel applications, the computation can be moved to memory. This allows for both parallel computation and increased bandwidth. The replication of small computation units provides parallelism, and the sum of their memory ports provides increased bandwidth. Because they are located in memory, there is no shared-bus resource to serialize communication. One such system, Active Pages, places computation with each page of DRAM. It is unique in that it targets the commodity DRAM market. This decision has both advantages and disadvantages. One advantage is that it supports both data streaming and general-purpose computation, and the computational resources scale automatically with memory allocation. One disadvantage is that, to keep costs low, there is no additional interconnect, and parallelism is only at the page level. Many of the characteristics of Active Pages are present in any memory-centric system. This case study explores several characteristics of the Active Pages design. It begins, in Section 35.1, with an overview of the Active Pages architecture and programming model. Section 35.2 shows the performance potential of a scalable, memory-centric design. Section 35.3 then looks at how this scaling of computational resources, but not the interconnect resources, affects the asymptotic properties of several algorithms. Finally, Sections 35.4 and 35.5 explore the parallelism properties and the defect tolerance provided by the Active Pages design. Active Pages is just one of many projects in this realm, and Section 35.6 presents related work, followed by some conclusions in Section 35.7.
35.1
ACTIVE PAGES
This section gives a brief description of the Active Pages system. We present three aspects of the design: the hardware design, the interface between Active Pages and the Central Processor, and the programming model that arises naturally from the design and interface.
35.1.1
DRAM Hardware Design
High-density DRAMs are divided into subarrays, complete with row and column decoders, to minimize column capacitance and decrease power consumption [1]. The proposed Active Pages implementation exploits this natural structure, treating each subarray as an Active Page. As shown in Figure 35.1, a small computational unit and cache (a Page Processor and Page Cache) are embedded next to each subarray to implement Active Page functions [2]. Using commodity 1-Gb DRAM technology as a target [3], we expect subarray size to be 512 KB and the embedded processing to consume less than 31 percent of the chip area. To minimize DRAM modification and reduce hardware overhead, the Active Pages implementations do not provide hardware support for communication between Active Pages. If two Active Pages need to share data, the Central Processor reads the data from one and writes to the other. The disadvantage of this processor-mediated approach is that interpage communication must be infrequent to maintain performance with a single processor.
35.1.2
Hardware Interface
To interface with the Central Processor, Active Pages leverage conventional page-based memory mechanisms to "virtualize" hardware for memory-based computation. Computations for each page can be suspended, restarted, and even swapped to disk. Computations for several pages can be multiplexed on a single embedded processing element. Further, Active Pages use the same interface as conventional memory systems. Active Pages data are modified with conventional memory reads and writes; Active Pages functions are invoked through memory-mapped writes. Synchronization is accomplished through user-defined memory locations.
FIGURE 35.1 The Active Pages architecture (8 pages). Each DRAM subarray (512 KB), with its row and column decoders, is paired with a page-based computational engine (uP) and a page cache ($; 512-bit data/1024-byte instruction), all behind a standard external memory interface.
35.1.3
Programming Model
The programming model of Active Pages was determined by several design decisions. First, communication between Active Pages and the Central Processor is accomplished through traditional reads and writes, allowing the Central Processor to operate on Active Pages data just as it does on any other data. Second, Active Pages were intended for commodity DRAM systems, which may be running general-purpose applications. Thus, we could not assume a traditional data-parallel, streaming model. Third, there is no interconnect between Active Pages processors. The model needs to limit the Pages to their own data, with no knowledge of neighboring cells. Finally, each Active Page has computation associated with it. This is a direct association of data with computation. For these last two reasons, the model of computation here is object-oriented programming. To program a Page Processor, the programmer creates an object in C++. The choice of C++ is not critical; it is used because it has no runtime system associated with it and has well-defined interfaces for object manipulation. The 512 KB allocated to each Page Processor is divided among code, stack, and data. These 512-KB pages, which are larger than a typical operating system's virtual pages, are referred to as superpages. The code must fit within the code segment, and the data size of the object is padded appropriately. The operating system (OS) is responsible for allocating Active Pages memory and loading the code into the correct region. The Page Processor begins on activation, first performing any initialization, similarly to a C++ object constructor, and then polling a variable while waiting for an invocation of a function. To maintain pin compatibility, all Active Pages functions are designed to use conventional reads and writes. The Central Processor invokes Active Pages functions by writing the parameters into appropriate places in the Active Pages memory.
The Central Processor then changes the Running variable, on which the Page Processor is polling, indicating which function to execute next. When the Page Processor has completed the function, it resets the looping variable (Running) and waits for the next invocation. Figure 35.2 shows the object declaration and implementation for execution on a Page Processor for LCS. More details on the LCS algorithm can be found in Section 35.3.3. In the LCS algorithm, the application requires only a single function, so the event loop is not actually necessary. It is shown, however, to illustrate how an application with many functions would use the Central Processor to invoke functions on the Page Processors. The main function run on the Central Processor is not shown. The Central Processor can poll the Running variable to determine whether a Page Processor has completed a particular function.
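Because the Central Processor's side of this protocol is not shown in the text, the following is a minimal single-threaded sketch of it. Everything here is an assumption made for illustration: `SumPage`, `AP_SUM`, `PollOnce`, and `InvokeSum` are hypothetical names, the command values are arbitrary, and the Page Processor is simulated by calling one iteration of its event loop from the Central Processor's polling loop rather than running on separate hardware.

```cpp
// Command codes: AP_WAIT and AP_STOP follow the naming in the text;
// AP_SUM and all numeric values are assumptions for this sketch.
enum : int { AP_WAIT = 0, AP_SUM = 1, AP_STOP = 2 };

// Hypothetical Active Page object: a control word, one parameter,
// one result, and some page-local data.
struct SumPage {
    static const int kWords = 16;
    volatile int Running = AP_WAIT;  // written by both processors
    int Count = 0;                   // parameter from the Central Processor
    long Result = 0;                 // result written by the Page Processor
    int Data[kWords] = {};

    // One iteration of the Page Processor's event loop. A real Page
    // Processor would spin on Running instead of being called.
    void PollOnce() {
        if (Running == AP_SUM) {
            const int n = Count > kWords ? kWords : Count;
            long sum = 0;
            for (int i = 0; i < n; ++i) sum += Data[i];
            Result = sum;
            Running = AP_WAIT;  // signal completion
        }
    }
};

// Central Processor side: write parameters, select the function,
// then poll Running until the Page Processor resets it.
long InvokeSum(SumPage& page, int count) {
    page.Count = count;     // 1. parameters via ordinary stores
    page.Running = AP_SUM;  // 2. memory-mapped write selects the function
    while (page.Running != AP_WAIT) {
        page.PollOnce();    // 3. simulation only: let the page make progress
    }
    return page.Result;     // 4. read the result with an ordinary load
}
```

All communication passes through ordinary loads and stores to the page's memory, which is what lets the scheme keep the conventional memory interface.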
35.2
PERFORMANCE RESULTS
Now that we have an idea of what the Active Pages architecture looks like and how it is programmed, this section presents performance results for several applications using a simulated Active Pages system. A more detailed study can be found in Oskin et al. [4].
class LCS {
  // int CodeAndStack[8192]; // added by compiler
public:
  int Running, Data[WIDTH-1][LENGTH-1];
  char X[WIDTH], Y[LENGTH];
  LCS() { Running = AP_WAIT; }
  void Start();
  void DoLCS();
};

void LCS::DoLCS() {
  // The loop bounds and the matching-character case were lost from this
  // listing; they are filled in here from the standard LCS recurrence,
  // assuming row 0 and column 0 of Data already hold zeros.
  int i, j;
  for (i = 1; i < WIDTH-1; i++)
    for (j = 1; j < LENGTH-1; j++) {
      if (X[i] == Y[j])
        Data[i][j] = Data[i-1][j-1] + 1;
      else if (Data[i-1][j] > Data[i][j-1])
        Data[i][j] = Data[i-1][j];
      else
        Data[i][j] = Data[i][j-1];
    }
}

void LCS::Start() {
  volatile int *act = &(Running);
  while (*act != AP_STOP) {
    while (*act == AP_WAIT)
      ; // wait for Central Processor
    switch (*act) {
      case (AP_LCS):
        DoLCS();
        *act = AP_WAIT; // it is done
        break;
    }
  }
}

FIGURE 35.2 A code example of an Active Pages object. Each Page Processor initializes its own space on allocation using the constructor. The Central Processor starts the Page Processor by writing to the Running variable. When the call is finished, the Page Processor sets Running back to AP_WAIT.
To estimate the performance of Active Pages configurations, each Active Pages function was hand-coded in a high-level circuit-description language, such as VHDL (see Chapter 6 and [5]), and synthesized to an Altera 10K FPGA. The mapping was carried out all the way to placed and routed designs [6]. To demonstrate effective partitioning of applications between the Central Processor and Active Pages, we chose a range of applications representing both memory- and processor-centric partitioning. Table 35.1 summarizes the attributes of these applications.
TABLE 35.1 Summary of the partitioning of applications between the Central Processor and Active Pages
Name | Application | Central Processor computation | Active Pages computation
Memory-centric applications
Array | C++ standard template library array class | C++ code using array class; cross-page moves | Array insert, delete, and find
Database | Address database | Initiates queries; summarizes results | Searches unindexed data
Median | Median filter for images | Image I/O | Median of neighboring pixels
Dynamic program | Protein sequence matching | Backtracking | Computes MINs and fills table
Processor-centric applications
Matrix | Matrix multiply for Simplex and finite element | Floating-point multiplies | Index comparison and data gathering and scattering
MPEG-MMX | MPEG decoder using MMX instructions | MMX dispatch; discrete cosine transform | MMX instructions
35.2.1
Speedup over Conventional Systems
To evaluate performance of the Active Pages memory system, each application was executed on a range of problem sizes. The speedup of the applications running on an Active Pages memory system compared to a conventional memory system is shown in Figure 35.3. Problem sizes are given in terms of number of Active Pages (512-KB superpages). The following are two primary observations about this graph.
First, the performance results qualitatively scale as expected. This shows the advantage of memory-centric computation. We observe that most applications show little growth in speedup as data size grows within the subpage region (below one page). In this region, Active Pages applications have little parallelism to offset activation costs. When leaving this region, however, we enter the scalable region and see that performance on all applications grows as data size increases. Several applications (database, MMX, matrix-simplex, matrix-boeing, and median-filtering) also reach the saturated region. Here, Active Pages performance is limited by the progress of the Central Processor. This limitation may be because of either too much work for a given-speed Central Processor or too much data traveling between the Central Processor and Active Pages across the memory bus. Performance can actually decrease as coordination costs dominate performance. Given a large enough problem size, all applications would eventually reach the saturated region.
Second, we see that the array-delete primitive performs poorly in the subpage region. This is because of the difference between the FPGA implementation and the instruction set used to implement the Central Processor. The Central Processor's instruction set is especially well suited for the array-delete primitive. Thus, unless there is sufficient parallelism to justify using Active Pages, it is faster to use the Central Processor. So, for small deletes, we use only the Central Processor. This benchmark was a combination of small deletes and large deletes.
FIGURE 35.3 Active Pages speedup as problem size varies. (Plot of speedup, 1 to 1000, against problem size, 1 to 100 in 512-KB pages, for array-delete, array-find, array-insert, database, dyn-prog, matrix-boeing, matrix-simplex, median-kernel, median-total, and MMX.)
As problem size grows, and the Central Processor is used for both the coordination of large deletes and the complete execution of small deletes, the Central Processor becomes the limiting factor in performance and the performance gets closer to that of the uniprocessor. This shows an interesting tradeoff between the FPGA and the Central Processor. Some computations, though not many, will perform better on the Central Processor. If this coincides with a part of the application that does not require parallelism, then the advantage of the memory-centric FPGA implementation will be reduced.
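The scalable and saturated regions can be reproduced with a toy analytic model. This is not the simulation used in the study: it assumes, purely for illustration, that a conventional system processes pages one after another, that all Page Processors work in parallel on their own pages, and that the Central Processor spends a fixed coordination time per page that overlaps with page computation. `ModelParams`, `active_time`, and `speedup` are hypothetical names, and activation costs (the subpage region) are not modeled.

```cpp
#include <algorithm>

// Illustrative per-page costs, in arbitrary time units (assumptions).
struct ModelParams {
    double conv_per_page;   // conventional system: time to process one page
    double page_compute;    // one Page Processor: time to process its page
    double coord_per_page;  // Central Processor: time to coordinate one page
};

// Active Pages time: pages compute in parallel, so total time is bounded
// below by whichever is slower, a single page's computation or the
// Central Processor's serial coordination of all pages.
double active_time(const ModelParams& p, double pages) {
    return std::max(p.page_compute, pages * p.coord_per_page);
}

// Speedup over the conventional system for a given problem size in pages.
double speedup(const ModelParams& p, double pages) {
    return (pages * p.conv_per_page) / active_time(p, pages);
}
```

With `ModelParams{8.0, 8.0, 0.125}`, speedup grows linearly with the number of pages until coordination time equals page computation time (64 pages), after which it stays at `conv_per_page / coord_per_page`, the point where the Central Processor becomes the limiting factor.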
35.2.2
Processor–Memory Nonoverlap
The saturated region of Active Pages performance emphasizes the importance of partitioning applications to efficiently use the Central Processor in a system. For processor-centric applications, this dependence is obvious. The goal is to keep the Central Processor computing by providing a steady stream of useful data from the memory system. For memory-centric partitions, however, the Central Processor is still a vital resource. Active Pages cannot compute without activation and interpage communication, both provided by the Central Processor. As data size grows in an Active Pages application, so does the load on the Central Processor. We measure the remaining capacity of a Central Processor to handle this load with a metric, processor–memory nonoverlap time. Nonoverlap is the time the Central Processor spends waiting for the memory system and can be used to estimate the boundary between the scalable and saturated regions of application performance. The relative percentage of time the Central Processor is stalled, waiting for memory system computation, is shown in Figure 35.4.
FIGURE 35.4 The percent of cycles that the Central Processor is stalled on Active Pages as problem size varies. (Plot of processor–memory nonoverlap, 0 to 100 percent, against problem size, 1 to 100 in 512-KB pages, for array-delete, array-find, array-insert, database, dyn-prog, matrix-boeing, matrix-simplex, median-kernel, median-total, and MMX.)
As described in the previous section, the applications that reached the saturated region of speedup were database, matrix-simplex, matrix-boeing, and median-filtering. As Figure 35.4 shows, these applications also reach a point of complete processor–memory overlap. We also observe that for the array primitives and the dynamic programming application, the nonoverlap percentage remains relatively high. These applications are largely memory-centric with very little Central Processor activity. In fact, the array primitives operate asynchronously to the end of the application and are artificially forced into synchronous operation for this study. This means that an application can use the array-insert and array-delete primitives with only the cost of Active Pages function invocation. Modulo dependencies on the array, the time spent by the memory system shifting data can be overlapped with operations outside of the STL array class. This overlap occurs in a natural way with no additional effort required by the programmer who uses the Active Pages STL array class. Opportunities for overlapping execution of data structure operations with data structure usage are intriguing and are being investigated further. The dynamic programming example maintains a very high processor–memory nonoverlap; however, preliminary results indicate that processor-mediated communication required by the Active Pages memory system eventually dominates performance. This occurs for extremely large problems that are well beyond the range of problem sizes presented in this study. Dedicating more resources to the interconnect increases the range of problems that Active Pages can help solve.
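As a small worked form of the metric, the nonoverlap percentage can be computed from two cycle counts. The function and parameter names are hypothetical, and how a simulator attributes cycles to "stalled on Active Pages" is assumed rather than taken from the study.

```cpp
#include <cstdint>

// Percentage of Central Processor cycles spent stalled waiting for the
// Active Pages memory system. Returns 0 when no cycles were measured.
double nonoverlap_percent(std::uint64_t stalled_cycles,
                          std::uint64_t total_cycles) {
    if (total_cycles == 0) return 0.0;
    return 100.0 * static_cast<double>(stalled_cycles) /
           static_cast<double>(total_cycles);
}
```

A value near 100 means the Central Processor is mostly idle and has capacity to coordinate more pages (the scalable region); a value near 0 means it is fully busy and has become the bottleneck (the saturated region).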
35.2.3 Summary
Memory-centric computation provides a scalable source of performance for large-scale applications. Active Pages provides a large number of simple, reconfigurable computational elements that can achieve speedups of up to 1000 times over conventional systems. Systems with rich interconnects have the potential for scalable gains on an even wider range of applications.
35.3 ALGORITHMIC COMPLEXITY
Although the simulated results show great promise, to truly understand how Active Pages improves runtimes as problem sizes grow, we need to explore asymptotic properties of algorithms in conventional systems as well as Active Pages systems [7]. For this study, we use a set of kernels whose asymptotic properties are well known in algorithmic literature. While it is unrealistic to expect the number of processors in a conventional multiprocessor to scale arbitrarily, the amount of DRAM in a system is expected to scale with problem size for a majority of problems. With Active Pages DRAMs, computational hardware also scales. This scaling provides parallelism that can improve asymptotic performance. Table 35.2 gives a preview of such gains for a variety of algorithms. Note that Active Pages execution times rely on the optimal page size given in the table. In practice, we expect Active Pages hardware to support a small range of page sizes designed to support target applications and problem sizes.
The challenge in the analysis is to take communication costs into account. In any system, the interconnect will affect the asymptotic properties of the performance as the problem scales. Active Pages, in particular, requires careful consideration of the communication between Page Processors as well as between the Central Processor and the Page Processors. The partitioned computations and restricted communication model here differ substantially from traditional parallel models such as PRAM [8]. This section presents an analysis of each algorithm that considers these issues. These analyses are also validated with simulation results.
TABLE 35.2 Algorithmic complexity (summary)
Application               Conventional execution time   Active Pages execution time   Page size
Array insert              O(n)                          O(√n)                         O(√n)
2D LCS                    O(n^2)                        O(n√n)                        O(n)
3D LCS                    O(n^3)                        O(n^(7/3))                    O(n^2)
All-pairs shortest path   O(n^3)                        O(n^(7/3))                    O(n^(4/3))
Sorting                   O(n · log2(n))                O(n · log2(log2(n)))          O(n/z) where n = z · e^z
Volume rendering          O(n^3)                        O(n^(5/2))                    O(n^(3/2))
35.3.1 Algorithms
Active Pages can dramatically improve the performance of many algorithms. This section maps several common algorithms to an Active Pages system and analyzes performance gains. Figure 35.5 introduces the notation used here. With these conventions, we analyze the worst-case execution time of the algorithms: insertion of an element into a linear array of elements, longest common subsequence of two- and three-dimensional sequences using a dynamic programming formulation, all-pairs shortest path using a dynamic programming formulation, sorting of a linear array of elements, and volume rendering using ray-tracing and linear absorption coefficients [7, 9].
Each analysis is provided by first presenting a general model for the algorithm’s execution time. Next, various model-specific parameters are assumed to be constants. After this simplification, the derivative of execution time with respect to page size is used to find an optimal page size. This page size is then substituted back into the model, and execution time is expressed again as a function of problem size.
These results are then validated with a high-level simulator. The simulator models Active Pages execution using parameters based on execution of the cycle-level simulator. The parameters used are given in Table 35.3. Typical parameters correspond to the target architecture studied here and often exhibit better performance than a purely asymptotic analysis would suggest. Asymptotic parameters emphasize the dominant terms in asymptotic performance while remaining within realistic problem sizes. These exaggerated parameters are used to validate the more conservative analyses.
n is the size of the input.
p is the number of data elements in an Active Page.
q is a problem-specific function of p that is used for most algorithms to define p. For instance, for dynamic programming algorithms where a two-dimensional result set is generated, it is convenient to describe p as p = q^2.
k is a function of the number of Active Pages used for the problem, usually k = n/q.
FIGURE 35.5 The notation used for algorithmic analysis.
TABLE 35.3 Summary of simulation parameters
Parameter                                            APSP*    Sort   Array insert   LCS*      LCS3   Render
Activation time (Ta)                                 100/0    0      2058           100/100   100    100
Central Processor per-page processing time (Tp)      –        1      387            –         –      5
Page processing per-element processing time (Tc)     10/10    1      2              –         10     10
Fixed communication overhead (Tsa)                   1/1      –      –              10/10     1      –
Per-element communication cost (Tsb)                 1/1      –      –              1/100     1      –
* Typical/asymptotic.
Table 35.3 summarizes the parameters used in the high-level simulator.
Ta is the amount of time required by the processor to invoke a function on a memory-based processor. This includes setup, argument passing, and invocation. This constant is per page.
Tp is the amount of time required by the processor to complete execution of an algorithm associated with a particular page. Generally, the “focus” of execution traverses from the Central Processor to the Active Pages and then back again. This may proceed many times and involve overlap throughout the execution of the algorithm. However, for the analysis presented here the focus is on a single set of transitions from host to memory and back. Hence, Tp is the time spent by the Central Processor per page when completing the Central Processor portion of the computation for that page.
Tc is the amount of time required by the memory-based processing element to compute its portion of the algorithm for a single data item within the page. For instance, on a conventional processor and memory system, an O(n) algorithm requires some time, Tc, to compute the solution for each element; hence, the execution time is described as T = Tc · n.
Tsa is the amount of time that corresponds to the “fixed overhead” associated with each interpage communication. Inter–Active Pages communication is a necessarily expensive process, and this constant quantifies the relatively large fixed overhead associated with each such communication request.
Tsb is the amount of time, per data item, associated with an interpage communication. Not all algorithms use interpage communication, and some use portions of Ta or Tp to perform such communication as part of activation and postprocessing, respectively.
This short section can present detailed analysis only of the array and LCS applications. We refer the reader to a technical report by Oskin et al. for the full set of analyses and results [9].
35.3.2 Array-Insert
The analysis begins with a simple array library. Specifically, we examine an insertion operation performed on an array of elements arranged in a linear fashion. A conventional system requires O(n) execution to complete this task. In an Active Pages memory system, we partition these n elements into k pages, with each Active Page managing n/k elements. To insert an element at position j within the array, each Active Page from the page containing j up to the last page of the array shifts the elements up by one to make room for the new element. These shifts proceed in parallel, however, since each Active Page operates independently. Note, though, that some form of communication between pages is required to migrate elements across page boundaries. This communication is grouped within the activation portion of each Active Page. The algorithm can be expressed as shown in Figure 35.6.
for j = 1 to k
    communicate the last element of page j to page j+1
    activate page j informing it to shift elements upward
FIGURE 35.6 The array-insert algorithm.
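The loop in Figure 35.6 can be mimicked in software. The following Python sketch is an illustration only, not the authors' implementation: it models each Active Page as a Python list, performs the per-page shifts one after another rather than in parallel, and starts at the page that contains the insertion point. The names `paged_insert`, `pages`, and `page_size` are hypothetical.

```python
def paged_insert(pages, page_size, pos, value):
    """Insert value at global index pos (0 <= pos <= n) in a paged array.

    pages is a list of lists; every page except possibly the last holds
    exactly page_size elements. The list is modified in place.
    """
    first = pos // page_size          # page that contains the insertion point
    offset = pos % page_size          # position inside that page
    carry, pending = value, True      # element still looking for a home
    for j in range(first, len(pages)):
        # "Activate" page j: shift its elements upward to admit the carry.
        pages[j].insert(offset, carry)
        if len(pages[j]) <= page_size:
            pending = False           # the page had room; nothing overflows
            break
        # "Communicate": the last element of page j moves to page j+1.
        carry = pages[j].pop()
        offset = 0                    # later pages receive it at their front
    if pending:
        pages.append([carry])         # overflow past the last page
    return pages


if __name__ == "__main__":
    print(paged_insert([[1, 2, 3], [4, 5, 6], [7, 8]], 3, 1, 99))
```

In the hardware model each page shifts at most `page_size` elements and all pages shift concurrently, which is where the parallel speedup over the conventional O(n) shift comes from; the sequential loop here only reproduces the data movement.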
The analysis begins with s(i), the nonoverlap (stall) time for page i. The nonoverlap time, discussed in Section 35.2.2, is the amount of time spent by the processor waiting for the Active Pages memory system to finish. Essentially, this algorithm (and many other Active Pages algorithms) proceeds by having the Central Processor set up and activate memory-based processing, then wait for a page to complete computing. After the memory-based computation section is complete, the processor can return to finish its section of the computation. It turns out that quantifying how much a processor stalls while waiting for memory-based computation to complete, for traditionally linear algorithms, is an important and measurable quantity that can be used to tune applications to achieve maximum performance. We use it to quantify execution time.
$$T = \sum_{i=1}^{k} \left( T_a(i) + T_p(i) + s(i) \right), \qquad s(i) = \max\left( 0,\; s'(i) \right), \qquad s'(i) = T_c(i) - \left( \sum_{j=i+1}^{k} T_a(j) + \sum_{j=1}^{i-1} \left( T_p(j) + s(j) \right) \right) \tag{35.1}$$
Three functions, $\hat{T}_a$, $\hat{T}_p$, and $\hat{T}_c$, are used to quantify portions of the execution time. These are expressed as functions because several linear-based algorithms can be mapped to an execution time analysis similar to that presented here. The functions correspond to activation time, host processor postexecution time, and per-page memory-based computation time, respectively. For array insertion, these are essentially constant functions; hence, Tc(i) = Tc, Ta(i) = Ta, and Tp(i) = Tp. Figure 35.7 shows the timing of the array-insert operation (or any other linear-based function) on the Active Pages system using Ta, Tc, and Tp.
Next, note that $\sum_{i=1}^{k} s(i) \le T_c \cdot p$ allows us to simplify execution time and take the derivative of T with respect to p. This gives us a new expression for T given the optimal value for p:
$$T = \sum_{i=1}^{k} \left( T_a + T_p + s(i) \right) = k \left( T_a + T_p \right) + \sum_{i=1}^{k} s(i) \le k \left( T_a + T_p \right) + T_c \cdot p = \frac{n}{p} \left( T_a + T_p \right) + T_c \cdot p$$
$$\frac{dT}{dp} = -\frac{n}{p^2} \left( T_a + T_p \right) + T_c \;\Rightarrow\; p = \sqrt{\frac{n \left( T_a + T_p \right)}{T_c}}$$
$$T_{opt} = \frac{n}{p} \left( T_a + T_p \right) + T_c \cdot p = 2 \cdot \sqrt{n \cdot (T_a + T_p) \cdot T_c} = O(\sqrt{n})$$
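As a numerical sanity check on the page-size derivation, the following Python sketch evaluates the upper bound on T, the optimal page size, and the closed form for the optimal time. It is a minimal illustration, assuming the array-insert column of Table 35.3 (Ta = 2058, Tp = 387, Tc = 2); the problem size n and the function names are hypothetical choices made for this example.

```python
import math


def exec_time_bound(n, p, ta, tp, tc):
    """Upper bound on T: (n/p) * (Ta + Tp) + Tc * p."""
    return (n / p) * (ta + tp) + tc * p


def optimal_page_size(n, ta, tp, tc):
    """Page size at which dT/dp = 0: sqrt(n * (Ta + Tp) / Tc)."""
    return math.sqrt(n * (ta + tp) / tc)


def optimal_time(n, ta, tp, tc):
    """Closed form at the optimum: 2 * sqrt(n * (Ta + Tp) * Tc)."""
    return 2.0 * math.sqrt(n * (ta + tp) * tc)


if __name__ == "__main__":
    ta, tp, tc = 2058, 387, 2      # array-insert column of Table 35.3
    n = 250_000_000                # illustrative problem size (assumption)
    p_opt = optimal_page_size(n, ta, tp, tc)
    print("optimal page size:", round(p_opt))
    print("bound at optimum: ", round(exec_time_bound(n, p_opt, ta, tp, tc)))
    print("closed form:      ", round(optimal_time(n, ta, tp, tc)))
```

Halving or doubling the page size away from the optimum raises the bound by 25 percent in this model, which gives a feel for how forgiving a small set of hardware-supported page sizes might be.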
FIGURE 35.7 An array-insert operation demonstrating processor and Active Page computations.
[Timing diagram: the processor issues activations Ta(1), Ta(2), Ta(3), ..., Ta(K), stalls for s(1), and then performs Tp(1), Tp(2), Tp(3), ..., Tp(K); Active Pages 1 through K compute Tc(1) through Tc(K) concurrently. Horizontal axis: time.]
FIGURE 35.8 Simulation results for the array-insert operation.
[Plot: simulated machine cycles (0 to 7.0E+06) versus n (0 to 625 M), with a fitted trend proportional to n^0.5.]
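The stall recurrence of Equation 35.1 and the timeline of Figure 35.7 can be traced with a few lines of Python. This is a sketch under the serialized model described in this section (all activations first, then per-page postprocessing in page order); the per-page times are passed as lists, and the function name `total_time` and the sample values are hypothetical.

```python
def total_time(ta, tp, tc):
    """Evaluate Equation 35.1 for per-page times ta[i], tp[i], tc[i].

    Returns (T, stalls), where stalls[i] is the processor stall s(i)
    before it can postprocess page i.
    """
    k = len(tc)
    stalls = []
    for i in range(k):
        # Work the processor does between activating page i and returning
        # to it: remaining activations plus earlier postprocessing and stalls.
        hidden = sum(ta[i + 1:]) + sum(tp[j] + stalls[j] for j in range(i))
        stalls.append(max(0, tc[i] - hidden))
    total = sum(ta[i] + tp[i] + stalls[i] for i in range(k))
    return total, stalls


if __name__ == "__main__":
    # Four pages with constant times (illustrative values, not from the text).
    print(total_time([10] * 4, [5] * 4, [100] * 4))
```

With constant per-page times, most of the stall is absorbed by the first page; later pages stall only for the small amount by which their start was staggered, so the sum of the stalls stays below one page's computation time.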
This analysis makes the conservative assumption that computation proceeds in serializable steps. First, all pages are activated; then all pages compute; finally, all pages finish and the processor performs some minimal postpage computation for each page. In reality, there is substantial overlap of these functions, and only during asymptotic performance is this serializing behavior observed. During practical application of this algorithm, the dominant term is Tc · p, and execution time is held relatively constant. This behavior is observed until the point at which the number of pages times the activation and postpage processing per page approaches Tc · p. Figure 35.8 depicts simulated application performance versus problem size. As can be seen from the graph, simulated performance follows an O(√n) growth curve, as predicted by the analytical model here.
35.3.3 LCS (Two-dimensional Dynamic Programming)
Moving to a more complex algorithm, we examine a dynamic programming formulation for computing the longest common subsequence in a protein. The conventional execution time of this algorithm is O(n^2). Figure 35.9 outlines the algorithm. For a more in-depth discussion of the LCS algorithm with fine-grained parallel execution in a systolic model, see Hoang [10]. Parallel execution of this algorithm proceeds in “wave-fronts,” as depicted in Figure 35.10. Once the first subproblem is solved and the results have been dispatched, two other problems can immediately start computing, and when they are done, three other Active Pages can start their computation in parallel. The processor is responsible for activating a wave-front. When processor-mediated communication is used, the wave-front is uneven, with certain pages of the computation executing slightly ahead of other pages. This is because of the overlapping nature of Active Pages computation and processor activity. In the model of computation here, this overlap is very important to performance, and we take advantage of it to lower overall execution time. Also note that the subproblem solution that an Active Page will make available consists only of the items on two edges of the page.
For this problem we assume the following constants. Tc is the time required by the Active Pages processor to compute the result of a single item of the LCS computation. Tsa is the fixed overhead cost associated with an interpage communication. Tsb is the cost to transfer items between pages on a per-item basis.
partition x and y into k segments
divide the computation into x/q and y/q smaller computations
initialize page (i, j) with the corresponding component i of string x and with component j of string y
let page (i, j) perform the conventional LCS algorithm after subproblems (i, j-1), (i-1, j), and (i-1, j-1) have been solved
page (i, j) dispatches results to neighboring subproblems
FIGURE 35.9 The two-dimensional LCS algorithm.
FIGURE 35.10 Parallel execution of two-dimensional LCS on Active Pages.
[Diagram: a grid of x/q pages by y/q pages, with the computation wave-front sweeping diagonally across it.]
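The wave-front order of Figures 35.9 and 35.10 can be illustrated with a tiled LCS in Python. This is a sequential sketch, not the Active Pages implementation: each q-by-q tile stands in for one page, tiles on the same anti-diagonal are mutually independent and could run in parallel, and a shared table replaces the explicit dispatch of edge results. The function name and the returned wave-front sizes are hypothetical conveniences.

```python
def lcs_length_blocked(x, y, q):
    """Length of the LCS of x and y, computed tile by tile in wave-fronts.

    Returns (length, wavefronts), where wavefronts[d] is the number of
    tiles (pages) that are active in wave-front d.
    """
    n, m = len(x), len(y)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    rows = -(-n // q)                 # ceil(n / q) tiles along x
    cols = -(-m // q)                 # ceil(m / q) tiles along y
    wavefronts = []
    for d in range(rows + cols - 1):  # one anti-diagonal per wave-front
        front = [(i, d - i) for i in range(rows) if 0 <= d - i < cols]
        wavefronts.append(len(front))
        for bi, bj in front:          # tiles here depend only on earlier fronts
            for i in range(bi * q + 1, min((bi + 1) * q, n) + 1):
                for j in range(bj * q + 1, min((bj + 1) * q, m) + 1):
                    if x[i - 1] == y[j - 1]:
                        table[i][j] = table[i - 1][j - 1] + 1
                    else:
                        table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[n][m], wavefronts


if __name__ == "__main__":
    print(lcs_length_blocked("ABCBDAB", "BDCABA", 3))
```

The wave-front sizes grow and then shrink across the grid, which is why the communication cost per wave-front in the analysis grows with the wave-front index.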
Further, since the dynamic programming model dictates that the number of items in a page be quadratic in terms of the length of sequence x and the length of sequence y, we define the page size p to be equal to q^2, where q is a variable. This makes the reasonable analytical assumption that x and y are of similar lengths. We can express application execution time as
$$T < 2 \cdot \sum_{i=1}^{j} \left( T_c \cdot q^2 + T_{sa} + q \cdot T_{sb} \right) + 2 \cdot \sum_{i=j+1}^{n/q} i \cdot \left( 3 \cdot T_{sa} + (2 \cdot q + 1) \cdot T_{sb} \right) \tag{35.2}$$
where j represents the particular wave-front in which the overall execution switches from being bounded by computation to being bounded by communication. Focusing on the first half of the computation-bound area, each wave-front has an ever-increasing cost of communication. This is because more Active Pages are involved in each wave-front. At first, the communication is hidden by computation, but eventually the cost of communicating the required data between wave-fronts exceeds the cost of computation for the wave-front. At this point, the algorithm crosses over from being bounded by computation to being bounded by communication; thus, computation completely overlaps with communication. We denote the wave-front where this occurs as j.
This chapter presents an analysis that achieves a better theoretical upper-bound than the conventional sequential solution. Based on particular protein sequence sizes, computer-assisted analysis can reveal the ideal j and q, which minimize the execution time of this algorithm, thus tailoring the behavior of Active Pages in terms of the given problem size. The simulation results show that computer-calculated ideal page sizes entail even a slightly better performance than the theoretical analysis. As will be seen, this is because of a simplification in the analysis.
Suppose we force j ≥ n/q. This implies that the algorithm will never become bounded by communication resources. We can do this by carefully selecting q and then demonstrating that this q does indeed force j ≥ n/q. To find a q that satisfies these conditions, we require that the communication always weighs less than computation:
$$\frac{n}{q} \cdot \left( 3 \cdot T_{sa} + (2 \cdot q + 1) \cdot T_{sb} \right) \le T_c \cdot q^2 + T_{sa} + q \cdot T_{sb} \tag{35.3}$$
Then simplify this inequality by:
$$\frac{n}{q} \cdot \left( 3 \cdot T_{sa} + (2 \cdot q + 1) \cdot T_{sb} \right) \le T_c \cdot q^2 + T_{sa} + q \cdot T_{sb}$$
$$\frac{n}{q} \cdot 3 \cdot q \cdot \left( T_{sa} + T_{sb} + 1 \right) \le T_c \cdot q^2 + T_{sa} + q \cdot T_{sb}$$
$$T_c \cdot q^2 \le T_c \cdot q^2 + T_{sa} + q \cdot T_{sb} \tag{35.4}$$
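The steps in Equation 35.4 are stated tersely. One way to read them, offered here as an interpretation rather than as the authors' own derivation, is that the left side of Equation 35.3 is replaced by something larger and the right side by something smaller, so that any q satisfying the simplified inequality also satisfies the original. The sketch below assumes q ≥ 1 and nonnegative constants.

```latex
% For q >= 1: 3 <= 3q and 2q + 1 <= 3q, hence
3 T_{sa} + (2q + 1)\, T_{sb} \;\le\; 3q\, T_{sa} + 3q\, T_{sb} \;\le\; 3q \left( T_{sa} + T_{sb} + 1 \right)

% The right side of (35.3) is bounded below by its leading term:
T_c q^2 \;\le\; T_c q^2 + T_{sa} + q\, T_{sb}

% It is therefore sufficient to require
\frac{n}{q} \cdot 3q \left( T_{sa} + T_{sb} + 1 \right) \;\le\; T_c q^2
\quad\Longleftrightarrow\quad
q \;\ge\; \sqrt{n} \cdot \sqrt{\frac{3 \left( T_{sa} + T_{sb} + 1 \right)}{T_c}}
```

Because both replacements are conservative, the resulting q is larger than strictly necessary, which is consistent with the simulated page sizes performing slightly better than the analysis predicts.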
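This simplification will not lead to an absolute lower-bound on execution time, but it does present a tractable alternative that can be used to find an “ideal” q:
$$q \ge \sqrt{n} \cdot \sqrt{\frac{3 \cdot (T_{sa} + T_{sb} + 1)}{T_c}} = \alpha \cdot \sqrt{n} \tag{35.5}$$
Then use this q to drop j from the equation, since the algorithm will never be bound by communication:
$$T < 2 \cdot \sum_{i=1}^{n/q} \left( T_c \cdot q^2 + T_{sa} + q \cdot T_{sb} \right) = 2 \cdot \frac{n}{q} \cdot \left( T_c \cdot q^2 + T_{sa} + q \cdot T_{sb} \right) = 2 \cdot \frac{\sqrt{n}}{\alpha} \cdot \left( T_c \cdot n \cdot \alpha^2 + T_{sa} + \sqrt{n} \cdot T_{sb} \cdot \alpha \right) = O(n \sqrt{n}) \tag{35.6}$$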
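The choice of q in Equation 35.5 can be checked numerically against the condition in Equation 35.3. The Python sketch below is illustrative only: the parameter values (Tc = 1, Tsa = 10, Tsb = 1) are assumptions chosen for the example rather than values taken from Table 35.3, and the function names are hypothetical.

```python
import math


def alpha(tc, tsa, tsb):
    """Constant from Equation 35.5: sqrt(3 * (Tsa + Tsb + 1) / Tc)."""
    return math.sqrt(3 * (tsa + tsb + 1) / tc)


def ideal_q(n, tc, tsa, tsb):
    """Smallest integer q >= 1 that satisfies q >= alpha * sqrt(n)."""
    return max(1, math.ceil(alpha(tc, tsa, tsb) * math.sqrt(n)))


def comm_cost(i, q, tsa, tsb):
    """Communication cost of wave-front i: i * (3*Tsa + (2q + 1)*Tsb)."""
    return i * (3 * tsa + (2 * q + 1) * tsb)


def compute_cost(q, tc, tsa, tsb):
    """Per-wave-front computation term: Tc*q^2 + Tsa + q*Tsb."""
    return tc * q * q + tsa + q * tsb


def time_bound(n, q, tc, tsa, tsb):
    """Right side of Equation 35.6: 2 * (n/q) * (Tc*q^2 + Tsa + q*Tsb)."""
    return 2 * (n / q) * compute_cost(q, tc, tsa, tsb)


if __name__ == "__main__":
    tc, tsa, tsb = 1, 10, 1           # assumed values for illustration
    for n in (100, 10_000, 1_000_000):
        q = ideal_q(n, tc, tsa, tsb)
        widest = comm_cost(n / q, q, tsa, tsb)   # widest wave-front, i = n/q
        print(n, q, widest <= compute_cost(q, tc, tsa, tsb),
              time_bound(n, q, tc, tsa, tsb))
```

Checking the widest wave-front is enough, because the communication cost grows linearly with the wave-front index while the computation term does not depend on it.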
While O(n√n) is a loose upper-bound, it is faster than the conventional runtime of O(n^2). The simulation results concurred with the findings and suggested a worst-case execution bound slightly better than O(n√n). Figure 35.11 depicts simulated performance of the LCS algorithm; two curves are shown. The first curve depicts the predicted performance of O(n√n) (using asymptotic parameters from Table 35.3). The second curve predicts a more realistic performance of O(n^(4/3)) (using typical parameters). The discrepancy is because of communication performance. If communication were more expensive, then the ideal page size would shift away from communication requirements and toward increased computational requirements, amplifying that term in the execution time expression. This in turn would reveal the asymptotic order of the LCS algorithm.
FIGURE 35.11 Simulation results for the two-dimensional LCS.
[Plot: simulated machine cycles (0 to 7.0E+07) versus n (0 to 35000), with two fitted curves: y = 53.031n^1.54 and y = 35.469n^1.3772.]
FIGURE 35.12 Simulation results for the three-dimensional LCS.
[Plot: simulated machine cycles (0 to 1.2E+10) versus n (0 to 9000), with fitted curve y = 6.8863x^2.3554.]
A more realistic depiction of application performance follows an O(n^(4/3)) trend. A similar analysis predicts performance of O(n^(7/3)) for three-dimensional LCS. Figure 35.12 shows that the simulated performance for three-dimensional LCS closely matches this prediction.
35.3.4 Summary
We can see that with a memory-centric architecture such as Active Pages, in which the computation scales with the communication, the asymptotic complexity can be reduced. We also see that the analysis is more involved than one might expect: the overhead of the Active Pages, the delay of any communication, and the page size all need to be taken into account. Two algorithms, along with validated simulations, have been presented to show their new asymptotic properties. We have found that the inexpensive parallelism provided by page-based intelligent memories can have a significant effect on asymptotic performance. We have also found the optimal page sizes that are required to maximize performance.
35.4 EXPLORING PARALLELISM
In any memory-centric system, we must decide the proper balance between memory resources and computation power. To save money, we could share a single computational element with twice as much memory. Allowing sharing can potentially even out the computational requirements of two processing elements because their needs may not always be identical.
This section looks at virtualizing the computational logic across superpages in the Active Pages chip. Virtualization is accomplished by time-slicing a VLIW processor (see VLIW datapath control subsection of Section 5.2.2) across one to eight Active Pages. We refer to this time-slicing as the multiplexing of the computational logic. This study presents an analysis of multiplexing and its effects on performance in a multiprocess environment. In addition, it looks at how varying individual processor widths affects performance. By combining these approaches, we demonstrate that multiplexing is a more effective technique for reducing logic area requirements than reducing individual Page Processor performance. In this study, we chose to use VLIW computational elements rather than an FPGA so that we could explore the trade-off between instruction-level parallelism and task-level parallelism. The results hold for FPGAs as well. From a high level, it is merely the trade-off between smaller dedicated resources per memory segment and shared resources between memory segments. The study is cleaner when using processor width rather than FPGA area.
35.4.1 Speedup over Conventional
We begin with the raw speedups of a commodity workload that is used for this study. Because the focus is on multi-programmed systems, we are using a slightly different workload than before. Figure 35.13 depicts application speedup when applications use an Active Pages memory system. Speedup is measured in terms of wall-clock time for the application in a conventional memory system divided by its wall-clock time using an Active Pages memory system. We observe that Active Pages applications continue to show substantial speedups when executed in a multiprocess environment. That is, even when many independent applications are executed at once, the applications experience speedup.
FIGURE 35.13 Speedup over conventional.
[Bar chart: speedup over conventional (axis 0 to 10) by application: Array (78 M), MPEG (8 M), Render (256 M), gcc (2.5 M), gzip (0.5 M), Perl (1 M).]
35.4.2 Multiplexing Performance
We continue by exploring how much performance degradation occurs as resources are shared between Active Pages. Figure 35.14 depicts relative application performance as the degree of multiplexing is increased. We normalize the results to a configuration with no multiplexing, where a one-to-one relationship exists between 4-wide VLIW processors and DRAM subarrays. Multiplexing factors of two, four, and eight make up the remaining data points. Note that hardware multiplexing of eight incurs no more than a 17 percent performance penalty, and a multiplexing factor of four incurs no more than a 6 percent performance penalty for all Active Page applications in the workload.
35.4.3 Processor Width Performance
It is promising that with a 4-wide VLIW, performance does not degrade substantially when it is shared between Active Pages. Is this because the VLIW processor is not being used efficiently? We now examine the inherent instruction-level parallelism (ILP) in our applications. Figure 35.15 depicts relative application performance as VLIW processor width is varied. Here, processor widths of one, two, four, and eight were evaluated. We observe that half of the applications show a 20 to 80 percent increase in performance from increasing processor width, but the other half do not. It should be noted that MPEG suffers adverse cache effects with a VLIW width of eight, thus lowering performance relative to a 4-wide VLIW. We note that the largest performance gains due to VLIW processor width are achieved with processor widths of two and four, and not with eight.
FIGURE 35.14 Performance versus hardware multiplexing.
[Bar chart: relative performance (normalized to no multiplexing, axis 0.80 to 1.05) for multiplexing factors of 1, 2, 4, and 8, by application: Array (78 M), MPEG (8 M), Render (256 M), gcc (2.5 M), gzip (0.5 M), Perl (1 M).]
FIGURE 35.15 Performance of multiplexing versus VLIW processor width.
[Bar chart: relative performance (normalized to processor width of one, axis 0.80 to 2.00) for widths of 1, 2, 4, and 8, by application: Array (78 M), MPEG (8 M), Render (256 M), gcc (2.5 M), gzip (0.5 M), Perl (1 M).]
35.4.4 Processor Width versus Multiplexing
Taking another look at Figure 35.15, we find that the Active Pages applications do not have the static instruction-level parallelism to use much beyond a 4-wide VLIW processor. In addition, Figure 35.14 shows that degradation because of multiplexing is superlinear, suggesting that too much coarse-grained parallelism exists within the application workloads to substantially multiplex processor resources. An experiment designed to compare these two forms of parallelism is depicted in Figure 35.16. Here we compare an Active Pages device using a single-issue processor with no multiplexing against a device using a 2-wide VLIW with two-way multiplexing, a 4-wide VLIW with four-way multiplexing, and an 8-wide VLIW with eight-way multiplexing. In the Active Pages applications, a 2-wide VLIW with two-way multiplexing shows a performance gain. This implies that the gain from the increased ILP outweighs the reduced coarse-grained parallelism. This makes sense: because several conventional applications are active in the workloads, many of the pages do not need the page processors. A 4-wide VLIW with four-way multiplexing is the best configuration studied. Hence, we use this configuration in the remainder of this study.
FIGURE 35.16 Performance versus processor width.
[Bar chart: relative performance (axis 0.8 to 2) for configurations 1, 2, 4, and 8, by application: Array (78 M), MPEG (8 M), Render (256 M), gcc (2.5 M), gzip (0.5 M), Perl (1 M).]
To describe why multiplexing performs well in a multiprocess environment, we identify three key factors: nonactive memory, Active Pages processing time, and partitioning.
Nonactive memory This helps mask the performance degradation because of multiplexing. By definition, all pages of memory in a conventional application require no computation in memory. Some pages in an Active Pages application also require no memory computation.
Active Pages processing time This is the amount of time spent by the Active Pages computing without main processor intervention. The time varies with Page Processor performance. Simple data manipulations are easily offloaded to the memory system. This leads to longer per-page computation times, most notably MPEG, with Active Pages processing time on the order of seconds. The combination of low Active Pages processing times and context switching in the Central Processor hides the effects of multiplexing in the memory system. In the absence of multiplexed Active Pages, when the main processor switches to another process, the Active Pages associated with the previous process quickly finish their work and stall until the process regains control of the Central Processor. Multiplexing allows efficient utilization of Page Processors by context-switching them to another Active Pages process when they would otherwise be idle. In an environment with Active Pages processing times longer than a Central Processor time slice, such as those observed in MPEG, we would expect multiplexing to degrade performance. Within this study, however, degradation is minimal due to the relatively low memory requirements of MPEG and the effects of conventional memory (without computational capability).
Partitioning This is the process of dividing an application into work done in Active Pages and work done in the Central Processor. As long as the main processor can keep up with the Active Pages, an application is scalable and will exhibit linear speedup as its dataset grows and more Active Pages are used. Once the main processor becomes saturated with work, however, performance will no longer increase as more Active Pages are used. We find that multiprocess environments change the position at which an application transitions from scalable to saturated. Multiprocessing time slices the Central Processor, which may be viewed as artificially slowing down the processor from the perspective of a single process. This will shift the scalable-saturated point toward smaller problem sizes. We may use multiplexing to reverse this shift. Essentially, multiplexing slows down the Active Pages computation, shifting the scalable-saturated point back toward larger problem sizes.
Because of the preceding properties of multi-programming environments, we observe that multiplexing is an efficient mechanism for reducing logic area requirements in an Active Pages memory device. A four-way multiplexed 4-wide VLIW Active Pages device is estimated to require 12 percent of the available chip area for computational logic while still providing substantial performance gains. This estimate is based on the reduced logic area coupled with a 20 percent logic area increase because of additional interconnect requirements.
35.4.5 Summary
This study has looked at a promising method for reducing the computational logic area requirements of an Active Pages memory device. Such an approach could be exploited by any memory-centric device. By multiplexing the computational logic among one to four Active Pages, hardware cost can be reduced by a factor of four with little performance impact in a multiprogrammed environment. Further, we find that it is more important to have fewer, faster computational logic elements that are time-shared across pages than more abundant, slower ones available for direct computation at each page. With a 4-wide VLIW processor multiplexed across every four Active Pages, computational logic area can be reduced to 12 percent of total chip area in a gigabit DRAM.
35.5 DEFECT TOLERANCE
The previous section explored the parallelism trade-offs gained by sharing computational units between pages. This section focuses on another major factor in cost: manufacturing defects. DRAM architectures use redundant cells to tolerate defects, dramatically increasing chip yields and reducing cost. Embedded processors, however, do not have an analogous unit of redundancy. While multiplexing several Active Pages with one embedded processor reduces chip area, multiplexing each group of pages with two processors allows each group to tolerate a processor defect. This associativity requires some additional interconnect, but tolerance to randomly distributed processor defects increases from 33 percent to more than 50 percent.
In this section, we use associativity to increase the defect tolerance of an Active Pages system. The focus is on manufacturing defects that render embedded processors inoperative. The goal is to provide some degree of processor redundancy under the assumption that memory cells already have their own redundancy techniques. Instead of four Active Pages sharing one 4-wide VLIW processor, we allow eight pages to share two processors. We study the effect of randomly distributed processor defects on this associative system. If a group suffers two defects, the operating system will only map conventional pages to that group (pages with no computation).
The performance degradation because of randomly distributed processor defects is depicted in Figure 35.17. We note that up to a 50-percent defect rate is tolerated. Increasing the defect rate to 60 percent decreased the number of functional Active Pages below that required by the workload without page swapping. Virtualizing Active Pages to disk was studied by Oskin et al. [11], and a similar mechanism can be used to further increase defect tolerance.
Associativity creates an increased tolerance to defects. The benefits are straightforward. Two processors must fail instead of one in order to disable any Active Pages. If 50 percent of embedded processors fail in the test system, we see that with two-way associativity up to 75 percent of the memory will still be available for Active Pages use.
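The 75 percent figure follows from a simple probability argument, which the Python sketch below reproduces. It is a minimal model, assuming that processor defects are independent and uniformly distributed and that all groups are the same size; the function names and the Monte Carlo check are hypothetical additions for illustration.

```python
import random


def active_fraction(defect_rate, processors_per_group):
    """Expected fraction of groups with at least one working processor.

    A group loses its Active Pages capability only if every processor
    in the group is defective, which happens with probability d**a.
    """
    return 1.0 - defect_rate ** processors_per_group


def simulate(defect_rate, processors_per_group, groups=100_000, seed=0):
    """Monte Carlo estimate of the same fraction, for cross-checking."""
    rng = random.Random(seed)
    usable = 0
    for _ in range(groups):
        # A processor works when its random draw falls outside the defect rate.
        if any(rng.random() >= defect_rate
               for _ in range(processors_per_group)):
            usable += 1
    return usable / groups


if __name__ == "__main__":
    for ways in (1, 2):
        print(ways, active_fraction(0.5, ways), simulate(0.5, ways))
```

With one processor per group, a 50 percent defect rate leaves half of the memory usable for Active Pages; with two processors per group the same defect rate leaves three quarters usable, in line with the figure quoted in the text.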
FIGURE 35.17 Performance versus random processor defects.
[Bar chart: relative performance (axis 0.7 to 1.05) at processor defect rates of 0%, 10%, 20%, 30%, 40%, and 50%, by application: Array (78 M), MPEG (8 M), Render (256 M), gcc (2.5 M), gzip (0.5 M), Perl (1 M).]
Second, not all of the system memory is required to be “active” at the same time. This allows the OS to map around defect areas and use fully defective functional groups for conventional applications. Further, the workloads do not require the full 512 MB available to the system, and the unutilized memory is available to map into defective regions. The OS can tolerate some defects without associativity by taking advantage of underutilization and conventional applications. As noted in this section, multiplexing, associativity, and clever OS resource allocation can map around manufacturing defects with only a 20 percent performance penalty with 50 percent random logic defects. An Active Pages–aware OS can be defect tolerant and allow a lower-cost system to be developed by increasing manufacturing chip yield. These incremental costs make Active Pages an attractive memory-based computation model, though the same principles would hold for FPGA-based systems (see Chapter 37).
35.6 RELATED WORK

DRAM densities have made intelligent memory attractive as commodity components. Intelligent memory, however, was proposed well before the current commodity thrust. The SWIM project [12] combined reconfigurable logic and memory to perform fast protocol computations. The J-Machine integrated processor, memory, and network router in a single chip to form building blocks for a fine-grained multiprocessor [13]. The RAW [14], MORPH [15], and RaPiD [16] projects continue to explore the use of reconfigurable technology to exploit parallelism. The RAW project, in particular, has also examined issues of processor width, dynamically trading off ILP and speculation. The HPAM project [17] takes a hierarchical approach to intelligent memory.

The project that is most similar to Active Pages is FlexRAM [18], which targeted general-purpose computation. The goal was to find computation that could take advantage of the bandwidth provided within a DRAM chip. FlexRAM proposed a hierarchical solution with simple computational elements within each page and a more complex processor for each DRAM. This allowed communication to be handled by an on-chip processor rather than the Central Processor. This had the disadvantage of adding pins to commodity DRAM packaging.

Several other projects explored placing processors in DRAM for more massively parallel computation. IRAM [19] solved this problem by placing a single vector processor in DRAM. For applications amenable to vectorization, this is an excellent match between a high-bandwidth memory and a processing element. Notre Dame’s PIM [20] project uses SIMD functional units to consume the extra bandwidth. DIVA [21] has the most sophisticated design, allowing for a kernel to run on the PIM processors. It also features a dedicated PIM communication network, allowing for communication between PIM processors without host processor intervention. Currently, there is a single computational element in each DRAM.
The Impulse project [22] has similar goals to Active Pages but focuses on adding address manipulation functions to the memory controller. Its applications, such as gather-scatter for multiplying a sparse matrix by a dense vector, are also enhanced by more efficiently feeding the microprocessor with data. All the Active Pages applications, however, require some small computations that cannot be supported without more generalized computation in the memory system than Impulse provides.
35.7 SUMMARY

This chapter presented the enormous potential for memory-centric computation, along with several issues specific to the Active Pages DRAM environment. The potential for all memory-centric designs is the bandwidth between memory and the nearest computational unit. The challenge, just as in Active Pages, is how to communicate between units. As the ratio of memory to processing units decreases, the total bandwidth increases, but the communication needs increase. This different balance between computation and communication can affect the asymptotic properties of algorithms.

The barriers for intelligent memory, in particular, are the need for explicit parallel programming and the buy-in by manufacturers to put it in commodity production to lower the price. DIVA is working on a migration path for this technology.

The advent of multicore commodity processors pushes the field in two directions. First, it provides performance improvements in multi-programmed environments without the need for parallel programming. This hurts the case for intelligent memory. The prevalence of parallel processors on the market, however, increases the utility of parallel programming so that this may not be such a rare skill in the future. If parallel programming becomes commonplace, then intelligent memory will be poised for success in the commodity market.

Acknowledgments

Like any large-scale project, Active Pages was the work of several people over several years. Fred Chong and Mark Oskin were the driving force behind the project. Matt Farrens provided valuable advice. Several graduate and undergraduate students contributed to the project, including Justin Hensley, Lucian Vlad-Lita, Tim Sherwood, Ravishankar Rao, Aneet Chopra, Paul Sultana, and Jennifer Hollfelder.
References

[1] K. Itoh et al. Limitations and challenges of multigigabit DRAM chip design. IEEE Journal of Solid-State Circuits 32(5), 1997.
[2] M. Oskin, J. Hensley, D. Keen, F. T. Chong, M. K. Farrens, A. Chopra. Exploiting ILP in page-based intelligent memory. International Symposium on Microarchitecture, 1999.
[3] Semiconductor Industry Association. The national technology roadmap for semiconductors. http://www.sematech.org/public/roadmap/, 1994.
[4] M. Oskin, F. T. Chong, T. Sherwood. Active pages: A computation model for intelligent memory. Proceedings of the 25th Annual International Symposium on Computer Architecture, 1998.
[5] P. Ashenden. The Designer’s Guide to VHDL, 2nd ed., Morgan Kaufmann, 2002.
[6] Altera Corporation. FLEX 10K Embedded Programmable Logic Family, May 1998.
[7] M. Oskin, L. V. Lita, F. T. Chong, J. Hensley, D. K. Franklin. Algorithmic complexity with page-based intelligent memory. Parallel Processing Letters 10(1), 2000.
[8] A. Kautonen, V. Leppänen, M. Penttonen. PRAM model. http://www.cs.joensuu.fi/pages/penttonen/parallel/pram.pram.html.
[9] M. Oskin, L.-V. Lita, F. T. Chong, J. Hensley, D. K. Franklin. Algorithmic Complexity with Page-Based Intelligent Memory. Technical Report CS-01-00, Department of Computer Science, University of California, Davis, February 2000.
[10] D. T. Hoang. Searching genetic database on Splash 2. In D. Buell, J. Arnold, W. Kleinfelder, Splash 2: FPGAs in a Custom Computing Machine, IEEE Computer Society Press, 1996.
[11] M. Oskin, F. T. Chong, T. Sherwood. ActiveOS: Virtualizing intelligent memory. Proceedings of the IEEE International Conference on Computer Design, 1999.
[12] A. Asthana, M. Cravatts, P. Krzyzanowski. Design of an active memory system for network applications. International Workshop on Memory Technology, Design and Testing, IEEE Computer Society Press, 1994.
[13] M. Noakes, D. Wallach, W. Dally. The J-Machine multicomputer: An architectural evaluation. Proceedings of the 20th Annual ACM International Symposium on Computer Architecture, May 1993.
[14] W. Lee. Space-time scheduling of instruction-level parallelism on a Raw machine. Proceedings of the 8th International Conference on Architectural Support for Programming Languages and Operating Systems, October 1998.
[15] A. A. Chien, R. K. Gupta. MORPH: A system architecture for robust high performance using customization. Frontiers, 1996.
[16] C. Ebeling et al. Mapping applications to the RaPiD configurable architecture. Symposium on FPGAs for Custom Computing Machines, April 1997.
[17] Z. Miled, R. Eigenmann, J. Fortes, V. Taylor. Hierarchical processors-and-memory architecture for high performance computing. Sixth Symposium on the Frontiers of Massively Parallel Computation, October 1996.
[18] Y. Kang, M. Huang, S. Yoon, Z. Ge, D. K. Franklin, V. Lam, P. Pattnaik, J. Torrellas. FlexRAM: An advanced intelligent memory system. International Conference on Computer Design, October 1999.
[19] D. Patterson. Microprocessors in 2020. Scientific American, September 1995.
[20] P. M. Kogge, T. Sunaga, E. A. E. Retter. Combined DRAM and logic chip for massively parallel applications. 16th IEEE Conference on Advanced Research in VLSI, 1995.
[21] J. Draper, J. Chame, M. Hall, C. Steele, T. Barrett, J. LaCoss, J. Granacki, J. Shin, C. Chen, C. W. Kang, I. Kim, G. Daglikoca. Architecture: The architecture of the DIVA processing-in-memory chip. International Conference on Supercomputing, 2002.
[22] J. Carter, et al. Impulse: Building a smarter memory controller. Proceedings of the International Symposium on High-Performance Computer Architecture, January 1999.
PART VI

THEORETICAL UNDERPINNINGS AND FUTURE DIRECTIONS

Parts I through V addressed what reconfigurable architectures look like (Part I), how we can develop reconfigurable solutions (Parts I, II, IV, V), and, by example, where reconfigurable solutions can be particularly beneficial (Part V). In this, the final part of the book, we examine why reconfigurable architectures are beneficial and we gain insight into the areas where the benefits of reconfigurable solutions lie. We also observe technology trends and examine why reconfigurable architectures may become increasingly important over time. To support and ground these discussions, the following chapters delve into the technology basis from which we build these architectures, and their alternatives, and discuss physical issues including area, defects, faults, and manufacturing trends.

Chapter 36 constructs a simplified model of the architectural design space in which postfabrication programmable architectures (e.g., processors, FPGAs, VLIWs, SIMD arrays) are built. Using this model, the chapter illustrates the trade-offs inherent in different architectures and the impact these trade-offs have on the architectures’ efficiency in implementing various applications. This simple analysis illuminates the appropriate roles for processors and FPGAs, underscores how we can use FPGAs efficiently, and suggests why, as component capacities continue to grow, reconfigurable architectures may be important for carrying out an ever-enlarging set of high-throughput tasks.

Chapters 37 and 38 explore how continued feature size scaling will influence the design of integrated circuits. As device feature sizes approach the atomic scale, our traditional techniques, abstractions, and solutions may no longer be appropriate. Manufacturing at the atomic scale demands higher regularity and produces less controlled structures. At the same time, physical imperfections (e.g., defects, faults, wear) occur at significantly higher rates. Postfabrication configurability appears to be an essential tool for dealing with these atomic-scale effects. This, too, suggests the growing importance of reconfigurable architectures for future technologies.
Chapter 37 addresses defect and fault tolerance. It shows how configurable designs can accommodate defects and suggests in what directions our design and usage paradigms should evolve in order to deal with increasing defect rates. The chapter also examines how transient faults will affect future configurable systems. Chapter 38 further explores the impact of technologies in which feature sizes are measured in single-digit atomic widths. It reviews emerging atomic-scale technologies and shows how they can be assembled into a complete reconfigurable architecture.
CHAPTER 36

THEORETICAL UNDERPINNINGS

André DeHon
Department of Electrical and Systems Engineering, University of Pennsylvania
Throughout this book there are examples for which reconfigurable designs offer superior performance to processor-based solutions. The reconfigurable implementation is typically orders of magnitude faster than the processor-based system. Even when we normalize the performance advantage to the number of components used in the solution, or to the number of square millimeters of silicon in the same process technology, we often see the reconfigurable solution providing one to two orders of magnitude higher computational capacity per square millimeter. These observations raise questions about reconfigurable computing systems.

• Why do we see this greater computational capacity per unit area?
• How can we predict when reconfigurable systems can deliver significantly higher performance than processor-based implementations?
• What does this tell us about how we should engineer reconfigurable designs?
This computational density advantage is not an accident. It occurs for real, structural reasons resulting from where silicon is allocated in reconfigurable architectures. Field-programmable gate arrays (FPGAs) and reconfigurable architectures organize their instructions differently from processors, making different trade-offs between instruction and computational density. Processors give up raw computational capacity for the ability to support large and irregular computations robustly, while FPGAs give up the ability to switch rapidly among diverse tasks to maximize available compute density and spatial parallelism. This chapter develops a simple model of programmable devices and uses it to illustrate the gross design space, which includes processors and FPGAs, the trade-offs each makes, and the consequences of those trade-offs.
36.1 GENERAL COMPUTATIONAL ARRAY MODEL

Let us start by focusing exclusively on a capabilities viewpoint, ignoring, for the moment, costs. What would be good to have for a general-purpose programmable computing architecture?

Copyright © 2008 by André DeHon. Published by Elsevier Inc.
The most general and flexible programmable architecture we might build would have:

• Computational operators (e.g., programmable gates) that compute an output bit from some number of input bits
• Full, bit-level interconnect among computational operators
• Local data storage for each bit operator
• The ability to issue a unique instruction to each bit-level computational operator on every cycle; this instruction should indicate:
  – Which computational function the operator should perform on each cycle
  – Where the inputs for the operator should come from, including both spatially from any other operator and temporally from local memory
  – Where the output of the operator on this cycle should go into local memory
Figure 36.1 shows a diagram of this architecture. For this simple model, we assume that all the programmable blocks are identical. We call the instruction that controls each programmable block (including interconnect and memory, as just summarized) a primitive instruction, or pinst for short (see Figure 36.2). With an array of N blocks, the full instruction word issued on every cycle to control the computational array is the composition of N pinsts.

FIGURE 36.1 The general computational array model. [Diagram: an array of bitops, each a compute unit with a local data store, all controlled by an array-wide instruction.]

FIGURE 36.2 Primitive instruction (pinst) for programmable bitops. [Diagram: the pinst fields are write/read addresses to the data store, input source selection, op selection, and interconnect selection.]

This array provides a computational capacity of N bit operations (bitops) on each cycle. We have great flexibility in using this array since every bitop can have a unique pinst on every cycle. So, if we need to process an irregular collection of operations, such as a 17-bit add, an 8-bit subtract, a 13-bit exclusive-or (XOR), the next state evaluation on a 23-state finite-state machine (FSM), and a 5-bit shift left by 3, we can direct each bitop independently to keep all bitops performing exactly the operations needed for the computation. Further, if the following cycle needs a very different set of operations, such as a 9-bit multiply by the constant 27, a 12-bit AND, the next state evaluation on a 23-state FSM, and an 11-bit shift right by 2, we can issue the next array-wide instruction to control the computational array accordingly. We get to use all the bitops all the time.

Mapping designs to this array is simply a matter of scheduling the bit-level computational needs onto the N bit operations provided by the array. With this full ability to control the cycle-by-cycle operation of each bitop independently, scheduling is relatively easy. (Strictly speaking, optimal scheduling remains NP-hard, but it can be approximated within a factor of 2 of optimal using a variant of Johnson’s Algorithm [1].) So, why is it that we do not have a popular architecture that provides this model?
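To make the general model concrete, the following is a minimal sketch of an array in which every bitop receives its own pinst on every cycle. The `Pinst` fields, the packed truth-table encoding, and the `step` function are hypothetical choices made for this illustration; the text specifies only what a pinst must select, not how it is encoded.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Pinst:
    """Hypothetical primitive instruction for one bitop on one cycle."""
    truth_table: int                      # output bit for each input pattern, packed LSB-first
    sources: Tuple[Tuple[int, int], ...]  # (bitop index, memory address) per input
    write_addr: int                       # local address that receives the output bit


def step(memories: List[List[int]], instruction: List[Pinst]) -> None:
    """Apply one array-wide instruction (one pinst per bitop), updating memories in place."""
    outputs = []
    for pinst in instruction:
        pattern = 0
        for position, (bitop, addr) in enumerate(pinst.sources):
            pattern |= memories[bitop][addr] << position
        outputs.append((pinst.truth_table >> pattern) & 1)
    # Commit after all reads so every bitop sees the pre-cycle state.
    for memory, pinst, bit in zip(memories, instruction, outputs):
        memory[pinst.write_addr] = bit


# Two bitops with two local bits each; both read bit 0 of each memory.
memories = [[1, 0], [1, 0]]
inputs = ((0, 0), (1, 0))
half_adder = [
    Pinst(truth_table=0b0110, sources=inputs, write_addr=1),  # XOR: sum bit
    Pinst(truth_table=0b1000, sources=inputs, write_addr=1),  # AND: carry bit
]
step(memories, half_adder)
print(memories)
```

Here the array-wide instruction is simply the list of pinsts, one per bitop, and a different list could be issued on the next cycle. The cost of delivering or storing those lists is exactly what the following sections examine.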
36.2 IMPLICATIONS OF THE GENERAL MODEL

From a purely logical standpoint, we cannot fault the general computational array model. However, we must implement any architecture in a physical computational medium (e.g., out of a number of discrete vacuum tubes or transistors, on a silicon die, ultimately out of molecules and atoms). To support the architecture, we must commit physical resources. Those resources have a cost in terms of area, delay, and energy. The general computational array model turns out to be extravagant, so much so that we are generally willing to compromise its power to build more practical architectures.
This section illustrates two ways in which the instruction organization of the general model is unreasonably expensive. The focus here is on silicon VLSI implementations, and we discuss the sizes and areas of components in VLSI. To make the discussion general, resource areas are measured in terms of technology-normalized units. In particular, we will measure widths in units of F, the minimum feature size in a VLSI process; as a consequence, areas are measured in units of $F^2$. VLSI technologies are normally named by their minimum feature size, so when we talk about a 45 nm technology, we are talking about a technology with F = 45 nm. Ideally, when we scale from a larger technology to a smaller technology, everything scales as F. Features 900 nm wide in a 90 nm technology are 10 F wide and should become 450 nm wide in a 45 nm technology. Features do not always scale perfectly linearly like this, but they scale close enough for illustrative purposes. Details and estimates on how the industry expects silicon technology to scale are summarized by the ITRS [2]; the industry collaborates to produce an updated or revised version of this document annually.
36.2.1 Instruction Distribution
This section starts by considering the resource implications of delivering a separate pinst to every bitop. We assume the following:

• The bitops are arranged in a dense $\sqrt{N} \times \sqrt{N}$ array (see Figure 36.3).
• The area required for each bitop, including compute, storage, and interconnect, is $A_{bop} = 250{,}000\,F^2$; we further assume that the bit operator itself is laid out as a square 500 F on a side. This size assumes that the interconnect has also been designed in a more restrictive way than the most general model (see Section 36.1), perhaps resembling something closer to traditional FPGA interconnect capabilities.
• The metal pitch available for distributing an instruction bit is $W_{metal} = 4F$. The minimum pitch possible in a given technology is 2 F because we need to leave one feature size worth of space between features so that they do not short together. The smallest feature sizes tend to be polysilicon for transistor gate widths, with metal pitches being a little wider. A modern VLSI process has many metal layers, and the ones higher in the stack (farther from the silicon base) tend to be wider.
• We have one complete horizontal metal layer and one complete vertical metal layer available to distribute instructions. As noted, modern VLSI processes generally have many metal layers; for example, an F = 65 nm process might have 11 metal layers. Some of the layers will be needed for local wiring in the cell, some for power and clock distribution, and some for interconnect. Dedicating two complete metal layers to instruction distribution is extravagant even with 11 metal layers.
• Each pinst requires $I_{bits} = 64$ bits to specify its instruction. This may seem small if we think about how many bits are required per 4-LUT in an FPGA, or large if you think about 32-bit processor instructions. Encoded densely, FPGA configurations could be much smaller [3]. The capabilities of the pinst might be closer to two processor instructions than one.
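FIGURE 36.3 Wiring for instruction distribution. [Diagram: a $\sqrt{N} \times \sqrt{N}$ array of bitops, each of area $A_{bop}$, with instruction wires of pitch $W_{metal}$ entering from the edges; wires per side = $\sqrt{N \times A_{bop}} / W_{metal}$.]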
As we will see, the preceding assumptions only affect the particular quantitative conclusion we reach. The qualitative effect remains even if we assume two or four times as many metal layers, half the metal pitch, more compact instruction encodings, or larger bitop cell sizes. If the instructions must all come into the computational array, then the total wiring capacity available for instruction distribution is equal to the perimeter of the array.
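$$A_{side}(N) = \sqrt{N \times A_{bop}} \tag{36.1}$$

$$L_{perimeter}(N) = 4 \times A_{side}(N) \tag{36.2}$$

Note that the two metal layers allow the connections on the top and bottom layers to cross over each other to reach into the array. However, if the lower layer is completely dense, we will have trouble making connections between the upper layer and the bit operations (i.e., we need to reserve space for vias through the lower level). To keep the math simple, general, and illustrative, we will not model that effect, which will only tend to make the problem more severe than the simple model indicates.

To feed the N bit operators into the array, we need:

$$I_{total\_bits}(N) = N \times I_{bits} \tag{36.3}$$

$$L_{instr\_dist}(N) = W_{metal} \times I_{total\_bits}(N) \tag{36.4}$$

For the distribution to be viable, we need:

$$L_{perimeter}(N) > L_{instr\_dist}(N) \tag{36.5}$$

Substituting into the previous equations, this results in:

$$4 \times \sqrt{N \times A_{bop}} > W_{metal} \times N \times I_{bits} \tag{36.6}$$

$$\frac{4 \times \sqrt{A_{bop}}}{W_{metal} \times I_{bits}} > \sqrt{N} \tag{36.7}$$

$$\left( \frac{4 \times \sqrt{A_{bop}}}{W_{metal} \times I_{bits}} \right)^{2} > N \tag{36.8}$$

With the values assumed above ($\sqrt{A_{bop}} = 500F$, $W_{metal} = 4F$, $I_{bits} = 64$), this bounds the array size:

$$N < \left( \frac{4 \times 500F}{4F \times 64} \right)^{2} \approx 61 \tag{36.9}$$

Alternatively, for the perimeter wiring to keep up with a larger array, the side of each bitop must grow with $\sqrt{N}$:

$$\sqrt{A_{bop}(N)} = \frac{W_{metal} \times \sqrt{N} \times I_{bits}}{4} \tag{36.10}$$

$$A_{bop}(N) = \left( \frac{W_{metal} \times \sqrt{N} \times I_{bits}}{4} \right)^{2} \tag{36.11}$$

$$A_{bop}(N) = 4096 \times N\,F^2 \tag{36.12}$$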
That is, the area of each bitop needs to grow linearly with N, meaning that the array area is actually growing quadratically with N. Equivalently, we can recognize this effect as a difference between the growth rate of the area and the perimeter. If we assume the bitop area is constant, then the total area in the array is growing linearly in the number of bitops. However, the perimeter of the array is only growing as the square root of the
array area. So it is not surprising that we reach a point where the array’s need for instructions, which is also growing linearly with bitops, exceeds the ability to feed instructions into the array, which grows only as the square root of the number of bitops in it. The particular assumptions used for this example starkly illustrate that this effect is already an issue for very small arrays. You can substitute your favorite assumptions about instruction bits, metal pitch, metal layers, or bit-operator area, but the qualitative conclusion remains as follows: If we support this model, either we are limited in the size of the arrays we can build, or instruction distribution wiring ends up dominating all other resources and forces us to scale only as the square root of the area we spend on the computational array.
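The perimeter argument is easy to evaluate numerically. The sketch below plugs the section's assumed values into the viability condition of equation 36.5; the function names are hypothetical, and the defaults are the assumptions stated in the text, so other values can be substituted to see how little the qualitative result changes.

```python
import math

A_BOP_F2 = 250_000   # bitop area in F^2 (assumption from the text)
W_METAL_F = 4        # metal pitch in F (assumption from the text)
I_BITS = 64          # instruction bits per pinst (assumption from the text)


def max_bitops(a_bop_f2: float = A_BOP_F2,
               w_metal_f: float = W_METAL_F,
               i_bits: int = I_BITS) -> float:
    """Bound on N from perimeter wiring: N < (4*sqrt(A_bop) / (W_metal*I_bits))**2."""
    return (4 * math.sqrt(a_bop_f2) / (w_metal_f * i_bits)) ** 2


def required_bitop_area(n: int,
                        w_metal_f: float = W_METAL_F,
                        i_bits: int = I_BITS) -> float:
    """Bitop area in F^2 at which the perimeter wiring just feeds n bitops."""
    return (w_metal_f * math.sqrt(n) * i_bits / 4) ** 2


print(max_bitops())               # about 61 bitops
print(required_bitop_area(1024))  # 4096 * 1024 F^2
```

Halving the metal pitch or the instruction width only quadruples the bound, which is why the conclusion survives more generous assumptions.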
36.2.2 Instruction Storage

The previous section illustrated that instruction distribution from outside the computational array is not scalable to large computations. Alternately, consider storing the instructions inside the array. In particular, each bitop could include an instruction memory that holds its instruction (see Figure 36.4). We would then only need to broadcast an address into the array, and each bitop could translate that through the instruction memory to its instruction. Even a 64-bit address is small compared to $L_{perimeter}(1)$, so this solution does not challenge wiring capacity. However, it does raise the question of how large the instruction memory should be to begin to approximate the general model.

FIGURE 36.4 A bitop with local instruction memory. [Diagram: a broadcast instruction address indexes a local instruction store (holds $N_{instr}$ pinsts), which controls the bitop's data store and compute unit.]

In any case, storing the instructions requires area. So we should assess the cost of storing these instructions. Assume that the instruction memory lives in SRAM, and that the area of an SRAM cell to hold an instruction bit is $A_{bit} = 200\,F^2$. This means that the area per instruction is:

$$A_{pinst} = A_{bit} \times I_{bits} \tag{36.13}$$

$$A_{pinst} = 200\,F^2 \times 64 = 12{,}800\,F^2 \tag{36.14}$$

The total area per bitop is now:

$$A_{bitop\_w\_imem} = A_{bop} + N_{instrs} \times A_{pinst} \tag{36.15}$$

$$A_{bitop\_w\_imem} = 250{,}000\,F^2 + N_{instrs} \times 12{,}800\,F^2 \tag{36.16}$$
Equation 36.16 now tells a very interesting story. The area required to store a single instruction is small compared to the area required for compute and interconnect in the bit operator (one-twentieth the area). If we store 20 instructions locally, we place half of the area into instruction memory. When we store 200 instructions locally, the instruction memory area ends up dominating (i.e., is 10 times the size of) the area required for computation. That is, given fixed area, the design with 200 instructions will only fit one-tenth the number of bitops as the design with a single local instruction. Unless we can limit the number of different, array-wide instructions we need to issue, the instruction memory needed to approximate the general model will end up dominating the computational area. Taken together with the result on instruction distribution, these examples illustrate why the general model is not typically supported: To support the general model, instruction resources would dominate all other resources, forcing limited computational density. We are left with the choice of either accepting very low computational density or looking for compromises in the general model that will allow us to avoid the huge instruction expense it implies.
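The proportions quoted above can be reproduced directly from equation 36.16. This is a minimal sketch using the section's assumed areas; the helper names are hypothetical.

```python
A_BOP_F2 = 250_000                 # compute, storage, and interconnect per bitop (F^2)
A_BIT_F2 = 200                     # one SRAM instruction bit (F^2)
I_BITS = 64                        # bits per pinst
A_PINST_F2 = A_BIT_F2 * I_BITS     # 12,800 F^2, as in equation 36.14


def bitop_area(n_instrs: int) -> int:
    """Area of one bitop plus its local instruction memory (equation 36.16)."""
    return A_BOP_F2 + n_instrs * A_PINST_F2


def instruction_share(n_instrs: int) -> float:
    """Fraction of the bitop's area that is instruction memory."""
    return n_instrs * A_PINST_F2 / bitop_area(n_instrs)


for n in (1, 20, 200):
    print(n, bitop_area(n), round(instruction_share(n), 3))
```

With these numbers the instruction memory takes roughly 5, 51, and 91 percent of the bitop area at depths of 1, 20, and 200, and a fixed die holds a little under one-tenth as many 200-instruction bitops as single-instruction bitops.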
36.3 INDUCED ARCHITECTURAL MODELS

If the general model was viable, we would not have the varied set of computer architectures that exist. That is, computer architectures arise because (1) the general model is too expensive, and (2) there is structure in typical computational tasks that permits more economical implementations. Having identified that it is unreasonable to support the general computational array model, we ask: Which structure exists in typical computations that can be exploited to provide a more economical implementation?
36.3.1 Fixed Instructions (FPGA)
If the instructions never change, we do not need to distribute them into the computational array, nor do we need to allocate instruction memory area to store more than a single instruction. We still allow each bitop a pinst, so each can perform a unique operation; however, we do not allow the pinst to change from cycle to cycle. Unchanging instructions is an extreme form of temporal locality, where computation remains the same over time. This allows us to build large arrays and keep the computation dense. If we need to, or can arrange to, perform the same computation on every cycle, then we use the array efficiently. This restriction on the general model effectively gives us an FPGA or spatially reconfigurable architecture. In Chapter 5, Section 5.2, we saw many system architectures that illustrate how we might organize computation to enhance this kind of structure.
36.3.2 Shared Instructions (SIMD Processors)
Another structure common to applications is SIMD datapaths (see Single program, multiple data subsection of Section 5.2.4)—that is, it is common for us to identify sequences of bit-level operations that are the same across a number of data bits. The most common case is word-wide operations, such as multibit adds or bitwise logical operations (e.g., OR, AND, XOR). At a higher level, we would perform a number of identical word-wide operations on different data (e.g., performing a component-wise multiplication on the elements of two arrays as part of a dot product). Here we perform the same operation across many bitops. Rather than providing a unique instruction for each bitop, we can arrange to share a single instruction across a large number of bit operators, amortizing the instruction distribution or storage expense. In the extreme, we would distribute a single instruction to all the bitops in the array. This is the opposite of the simplification used in the FPGA. Here, all bitops in the array must perform the same operation on a given cycle, but this operation may change from cycle to cycle. We can view conventional, word-wide processors as exploiting this idea. A processor instruction typically only tells the datapath to do one homogeneous thing—that is, the processor instruction asks every bit in the arithmetic logic unit (ALU) bit slice to perform the same computation (e.g., perform a full adder bit, perform an XOR, perform a shift). For example, a 32-bit processor datapath could perform many more operations if each individual bit slice of the ALU could operate independently; instead, ALUs are constrained to operate in SIMD fashion to keep the cycle-by-cycle instruction size small. In the general computational array model, we saw that the instruction memory took up the same area as the computation when we stored only 20 instructions in the array (equation 36.16). If we instead share each instruction across
$W_{simd} = 32$ bitops to form a SIMD datapath, it takes 625 instructions for the instruction memory to reach parity with the computation; that is:

$$A_{bitop\_w\_imem}(W_{simd}, N_{instrs}) = A_{bop} + \frac{N_{instrs} \times A_{pinst}}{W_{simd}} \tag{36.17}$$

$$A_{bitop\_w\_imem}(32, N_{instrs}) = 250{,}000\,F^2 + N_{instrs} \times 400\,F^2 \tag{36.18}$$
From these illustrations, we can see how the more familiar FPGA and processor architectures fall out as simplifications of the general computational array model that exploit different kinds of structure in typical computations.
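The amortization in equation 36.17 can be checked with a few lines. In this sketch the function names are hypothetical and the areas are the values assumed earlier in the chapter; `parity_depth` solves for the instruction depth at which the shared instruction memory equals the compute area.

```python
A_BOP_F2 = 250_000     # compute area per bitop (F^2), assumption from the text
A_PINST_F2 = 12_800    # area to store one pinst (F^2), from equation 36.14


def shared_bitop_area(w_simd: int, n_instrs: int) -> float:
    """Per-bitop area when each stored instruction is shared by w_simd bitops."""
    return A_BOP_F2 + n_instrs * A_PINST_F2 / w_simd


def parity_depth(w_simd: int) -> float:
    """Instruction depth at which per-bitop instruction memory equals compute area."""
    return A_BOP_F2 * w_simd / A_PINST_F2


for width in (1, 8, 32):
    print(width, parity_depth(width))
```

A width of 1 gives a depth of about 20, matching the unshared case of equation 36.16, and a width of 32 gives 625.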
36.4 MODELING ARCHITECTURAL SPACE

The demonstrations in Sections 36.2 and 36.3 highlight the fact that choices about instruction architecture can have a first-order impact on the area, and hence density, of programmable computing components. We can take this a step further and build models of the density, and ultimately relative efficiency, of architectural design points. Table 36.1 summarizes where some familiar architectures fall in the ($W_{simd}$, $N_{instr}$) architectural space. Nonetheless, remember that we are using a deliberately simple model and that many other effects and issues are associated with each architecture, some of which are mentioned in Section 36.4.3.
36.4.1 Raw Density from Architecture

Using equation 36.17, we can plot the relative densities of each bit operator as a function of the local instruction memory, $N_{instr}$, and the SIMD instruction width, $W_{simd}$. Figure 36.5 shows plots of the computational density for the instruction memory from 1 to 16,384 and the instruction width from 1 to 1024. Here, note that peak densities vary over three orders of magnitude. As we increase instruction depth ($N_{instr}$), we shift area into instructions rather than compute, often significantly reducing computational density. Wide-word architectures can reduce the memory costs at a particular instruction depth, but there also may be significant computational density reductions as instruction depth grows.

TABLE 36.1 Placement of sample architectures in (Wsimd, Ninstr) space

Architecture          Wsimd    Ninstr
FPGA                  1        1
GARP fabric           2        4
KiloCore256           8        16
MIPS-X                32       512
IA-64 (Montecito)     64       200,000
Cell SPU              128      65,536

Reference column entries: Chapter 2, Section 2.1.1; Chapter 2, Section 2.1.2; [4]; [5]; [6].
FIGURE 36.5 Relative peak computational density from the model (normalized to the density of $N_{instr} = 1$, $W_{simd} = 1024$ design points). [Surface plot: relative density from 0.001 to 1 over $N_{instr}$ from 1 to 16,384 and $W_{simd}$ from 1 to 1024.]
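The shape of this density surface can be regenerated from the area model. The sketch below evaluates equation 36.17 on a coarse grid and normalizes to the ($W_{simd} = 1024$, $N_{instr} = 1$) point, as in the figure; the grid spacing and function names are choices made for this illustration.

```python
A_BOP_F2 = 250_000     # compute area per bitop (F^2), assumption from the text
A_PINST_F2 = 12_800    # area to store one pinst (F^2), from equation 36.14


def area(w_simd: int, n_instr: int) -> float:
    """Per-bitop area from equation 36.17."""
    return A_BOP_F2 + n_instr * A_PINST_F2 / w_simd


def relative_density(w_simd: int, n_instr: int) -> float:
    """Density relative to the densest grid point (w_simd = 1024, n_instr = 1)."""
    return area(1024, 1) / area(w_simd, n_instr)


widths = [4 ** k for k in range(6)]   # 1, 4, 16, 64, 256, 1024
depths = [4 ** k for k in range(8)]   # 1, 4, ..., 16384

print("Ninstr", widths)
for n in depths:
    print(n, [f"{relative_density(w, n):.4f}" for w in widths])
```

The corner at $W_{simd} = 1$ and $N_{instr} = 16{,}384$ comes out near 0.001, which is the three-orders-of-magnitude spread noted in the text.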
36.4.2 Efficiency

The previous section showed peak raw densities achievable at various architectural points. If peak raw density was all that mattered, we would build SIMD designs with shallow instruction memories, as Figure 36.5 illustrates. However, it is seldom the case today that we can keep the millions of SIMD bit-processing elements we might be able to put on a die performing useful computations. When we cannot match the structure assumed by the architecture, the yield is only a fraction of the potential density; that is, another architecture, perhaps one with lower peak density, often can deliver more net density to the application. In particular, the architectural point whose structure assumptions exactly match the application will deliver the highest net density on that application. This leads to an interesting set of questions:

• How does the efficiency of an architecture fall off as it becomes mismatched to the structure of the application?
• How does the net density compare between various matched and mismatched architectures?
Since there is a model for the area of architectural design points in the (Wsimd, Ninstr) design space (equation 36.17), we can use that to measure efficiency. In particular, it is possible to measure the efficiency of an architecture design point, Arch(Wsimd, Ninstr), processing applications with a particular structure, App(Wapp, Lpath), as the ratio of the area of the architecture that exactly matches the application structure to the area of the point being evaluated:

\[
\mathrm{Efficiency}\bigl(\mathrm{Arch}(W_{simd}, N_{instr}),\ \mathrm{App}(W_{app}, L_{path})\bigr)
= \frac{\mathrm{Area}\bigl(\mathrm{Arch}(W_{app}, L_{path}),\ \mathrm{App}(W_{app}, L_{path})\bigr)}
       {\mathrm{Area}\bigl(\mathrm{Arch}(W_{simd}, N_{instr}),\ \mathrm{App}(W_{app}, L_{path})\bigr)}
\tag{36.19}
\]
TABLE 36.2 Sample applications in the (Wapp, Lpath) space

Application                      Wapp   Lcritpath   Lpath      Comments
Conway’s Game of “Life”          1      1           1          Bit-level CA [7]
Error correcting codes           1      1           1–10,000   At memory interface, need one per cycle; on audio-rate, real-time data can be low throughput
Entropy coding                   1      1–10        1–10,000   (similar to previous)
Video processing of pixel data   8      1–6         12         1024×1024 at 30 frames per second on a 500 MHz cycle can afford approximately 12 cycles per pixel
CD audio                         16     1–10        10,000     44 kHz real-time vs. 500 MHz cycle
SPIHT image compression          16     10          10+        Chapter 27
FDTD                             35     1–5         1–5        Chapter 32
To characterize the structure of the architecture separately from the structure of the application, equation 36.19 keeps Wsimd and Ninstr as parameters characterizing the architecture and adds the dual parameters Wapp and Lpath to characterize the application structure. Wapp is simply the natural SIMD datapath width of the application, while Lpath is the path length of the application (see the Mismatch in Ninstr subsection). For illustrative purposes, Table 36.2 summarizes where several applications appear in the (Wapp, Lpath) space. The area of the mismatched design is always larger, so the efficiency metric in equation 36.19 effectively tells us how much lower the mismatched point’s net density is than the matched point’s net density.

To develop the intuition and keep the explanation simple, we stay with the assumption that applications have homogeneous structure (i.e., a single characteristic Wapp and Lpath). One of the reasons we are interested in how well an architecture deals with different, mismatched structures is that a real application will typically contain heterogeneity in the structure it exhibits.

Mismatch in Wsimd

What happens when the application width Wapp is mismatched to the architectural width Wsimd?

- Wsimd > Wapp: Here we do not have as fine-grained control of the bit operators as the application requires. Consequently, bitops go unused. In particular, we will actually need a larger array so that we match the instruction control needs of the application. For example, if Wapp = 5 and Wsimd = 8, then three bitops in every architectural SIMD datapath will go idle. To satisfy the application requirements, we end up needing Wsimd/Wapp = 8/5 = 1.6 times as many physical bitops as the application actually requires.
- Wsimd < Wapp: There are two effects that can work to make implementations in this architecture larger than the optimally matched architecture:
  1. We have finer-grained control, but may still need more physical bit operators because of granularity problems. For example, when Wapp = 8 and Wsimd = 5, we need ⌈Wapp/Wsimd⌉ = 2 groups of Wsimd bitops to cover each application group, or ⌈8/5⌉ × Wsimd = 10 bitops, of which only Wapp = 8 are doing useful work.
  2. Since we have more control than necessary for the application, the area of each bitop is larger than necessary in order to accommodate additional instruction memory; this extra instruction memory holds redundant information. Continuing the Wapp = 8 and Wsimd = 5 example, each bit operator effectively pays for Wapp/Wsimd = 8/5 = 1.6 times as many instructions as necessary for the application.
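Assuming that instruction storage depth is matched to application path length (Ninstr = Lpath) to focus on the width mismatch, we can show this in an area model as:

\[
\begin{aligned}
\mathrm{Area}\bigl(\mathrm{Arch}(W_{simd}, L_{path}),\ \mathrm{App}(W_{app}, L_{path})\bigr)
&= \left\lceil \frac{W_{app}}{W_{simd}} \right\rceil \times \frac{W_{simd}}{W_{app}} \times A_{bitop\_w\_imem}(W_{simd}, L_{path}) \\
&= \left\lceil \frac{W_{app}}{W_{simd}} \right\rceil \times \frac{W_{simd}}{W_{app}} \times \left( A_{bop} + \frac{L_{path}}{W_{simd}} \times A_{pinst} \right)
\end{aligned}
\tag{36.20}
\]

This allows us to compute the efficiency of the mismatched SIMD datapath width at a matched Lpath as:

\[
\mathrm{Efficiency}_{[L_{path}]}(W_{simd}, W_{app})
= \frac{A_{bop} + \dfrac{L_{path}}{W_{app}} \times A_{pinst}}
       {\left\lceil \dfrac{W_{app}}{W_{simd}} \right\rceil \times \dfrac{W_{simd}}{W_{app}} \times \left( A_{bop} + \dfrac{L_{path}}{W_{simd}} \times A_{pinst} \right)}
\tag{36.21}
\]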
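Equations 36.20 and 36.21 are easy to evaluate directly. The sketch below is illustrative only: the constants `A_BOP` and `A_PINST` are assumed values in a 20:1 ratio rather than the chapter's calibrated areas, and the function names are hypothetical. It reproduces the 1.6× and 1.25× bitop overheads from the Wapp = 5 and Wapp = 8 examples and sweeps Wapp for two choices of Wsimd at Lpath = 640.

```python
A_BOP = 20.0    # ASSUMED compute + interconnect area per bitop (arbitrary units)
A_PINST = 1.0   # ASSUMED area per stored instruction (same units)


def width_overhead(w_simd: int, w_app: int) -> float:
    """Physical bitops provisioned per useful application bitop."""
    groups = -(-w_app // w_simd)  # integer ceiling of Wapp / Wsimd
    return groups * w_simd / w_app


def area_width_mismatch(w_simd: int, w_app: int, l_path: int) -> float:
    """Equation 36.20: area per application bitop with Ninstr = Lpath."""
    return width_overhead(w_simd, w_app) * (A_BOP + l_path / w_simd * A_PINST)


def efficiency_width(w_simd: int, w_app: int, l_path: int) -> float:
    """Equation 36.21: matched-architecture area over mismatched area."""
    matched = area_width_mismatch(w_app, w_app, l_path)
    return matched / area_width_mismatch(w_simd, w_app, l_path)


if __name__ == "__main__":
    print(width_overhead(8, 5))   # Wsimd = 8, Wapp = 5
    print(width_overhead(5, 8))   # Wsimd = 5, Wapp = 8
    for w_app in (1, 32, 1024, 16384):
        print(w_app,
              round(efficiency_width(32, w_app, 640), 3),
              round(efficiency_width(1, w_app, 640), 3))
```

With the assumed ratio, Wsimd = 32 splits area evenly between instructions and compute at Lpath = 640, so its efficiency stays at or above one half across the sweep, whereas Wsimd = 1 falls by more than an order of magnitude at large Wapp.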
Figure 36.6 shows plots of the efficiency from equation 36.21 versus Wapp for a collection of Wsimd’s and Lpath’s. Perhaps more significant than the large density range shown in Figure 36.5, we see that SIMD width mismatches can cost us orders of magnitude in net density delivered to an application.

[Figure 36.6: three log-log panels of efficiency (0.001 to 1) versus Wapp (1 to 1024), each with curves for Wsimd = 1, 3, 32, 64, and 1024.]
FIGURE 36.6 Efficiency as a function of Wapp for various Lpath values: (a) Lpath = 1, (b) Lpath = 640, and (c) Lpath = 64.

Interestingly, we see some SIMD width selections that do not show orders of magnitude efficiency losses (e.g., Wsimd = 1 for Lpath = 1, Wsimd = 3 for Lpath = 64, Wsimd = 32 for Lpath = 640). These robust points occur when the instruction area is equal to the compute and interconnect area. That is:

\[
A_{bop} = \frac{L_{path}}{W_{simd}} \times A_{pinst}
\tag{36.22}
\]

In these cases, half the area is allocated to storing instructions and half to compute. For illustration, consider the Lpath = 640 and Wsimd = 32 case. Here, if we are processing Wapp = 1 data, then we use only one-thirty-second of the compute area. However, we are able to use all the memory area; a matched architecture can be, at best, half the size of this design point since it still requires the 640 instructions, even if they drive a smaller datapath. At the opposite extreme, if Wapp = 16,384, we can use all the compute operators but we underutilize the instructions. Here, a matched architecture could have used a factor of 512 lower instruction area; however, since half the area is in compute, the matched architecture is, at best, only half the size of this robust point. It should be clear that this observation holds for any choice of Wapp when the area is allocated evenly between compute and instruction memory.

In contrast, if we make Wsimd = 1 for this Lpath = 640 case, then 97 percent of the area goes into memory; if this Wsimd = 1 architecture now has a task with Wapp = 16,384, it is much larger (about 33 times larger) than a design with matched width, which can put significantly less area into instruction memory.

If we can design to a single application width, or a small range of widths, it is best to select a matched width, or the width that provides the highest average efficiency over the range. However, if we don’t have tight bounds on the application width, these robust points show how we can select organizations that remain fairly efficient for any application width.

Mismatch in Ninstr

A similar phenomenon occurs when Ninstr does not match the structure of the application. First, we need to understand Lpath, the application demand for Ninstr. In particular, let us consider an inner loop in a kernel or the computation required for each invocation of a transform operator (see Transform or object subsection of Section 5.1.2). To compute each inner loop iteration, or each operator invocation, we need to evaluate a set of Nops bitops.
In general, there may be a set of cyclic sequential dependencies, or a critical path, of depth Lcritpath among the bitops in the computation that prevent us from starting the next iteration of the loop or invocation of the operator until the Lcritpath array cycles have completed. For example, consider the loop body of a saturated accumulation:

\[
y[i] = \max\bigl(\min(x[i] + y[i-1],\ 255),\ 0\bigr)
\]

Before performing the next addition to compute y[i + 1] from y[i], we must complete the computation of y[i], including both the addition and the selection of maximum or minimum bound limits (see Figure 36.7).¹ Assume the following:
- The addition requires a path length of six sequential bitops.
- The comparisons can be performed in parallel.
- Each comparison requires a path of three sequential bitops.
- The final selection requires a single bitop.
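Under the assumptions just listed, the path length of the loop body can be tallied as a longest path through a small dependency graph. This is a minimal sketch; the node names (`add`, `cmp_low`, `cmp_high`, `select`) and the graph encoding are hypothetical labels chosen for illustration, not notation from the chapter.

```python
# Longest-path tally for the saturated-accumulation loop body.
# Delays are the sequential-bitop path lengths assumed in the list above.
DELAY = {"add": 6, "cmp_low": 3, "cmp_high": 3, "select": 1}

# Each node lists the nodes whose results it must wait for.
PREDECESSORS = {
    "add": [],
    "cmp_low": ["add"],       # compare the sum against 0
    "cmp_high": ["add"],      # compare the sum against 255 (runs in parallel)
    "select": ["cmp_low", "cmp_high"],
}


def finish_time(node: str) -> int:
    """Cycle at which `node` completes, following its slowest predecessor."""
    start = max((finish_time(p) for p in PREDECESSORS[node]), default=0)
    return start + DELAY[node]


L_CRITPATH = finish_time("select")
print(L_CRITPATH)  # 6 + 3 + 1 = 10
```

Because the two comparisons run in parallel, only one of their three-bitop paths counts toward the total.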
The critical path Lcritpath is 10 for this computation. With a path length of Lcritpath, we can schedule the Nops required to evaluate the application into Lcritpath cycles on the array without slowing down the application; the sequentially dependent paths guarantee that it will always take at least Lcritpath cycles to perform the operation.

¹ With care, this actually can be avoided using sophisticated transformations [8].

[Figure 36.7: diagram in which x[i] and the registered output feed an adder, the sum is compared against 0 and 255, and the selected result is y[i].]
FIGURE 36.7 Saturated accumulator cyclic dependency.

The application may not actually demand that the computation be performed every Lcritpath cycles. Perhaps the data throughput is lower and new samples, x[i], are arriving every 20 ns while the array cycle time is 1 ns. Here, evaluating with Lcritpath = 10 leaves the array sitting idle for 10 cycles before the next input sample is available to compute. Consequently, it would be possible to schedule to Lpath = 20 > Lcritpath and cut the number of bitops needed by at least a factor of 2. In this way, the loop or transform body is efficiently implemented by scheduling the computations onto a minimum number of bitops in a period of Lpath cycles, with each operator potentially getting a unique instruction on each cycle (Ninstr = Lpath). For examples, see Table 36.2, which summarizes the Lpath that throughput requirements permit in a few applications. Now consider the two mismatched cases:
- Ninstr > Lpath: In this case, by scheduling the computation into Lpath cycles, Ninstr − Lpath instruction memory slots in each bitop go unused. The matched architecture is smaller because it does not spend area on these unused instruction memories. In the aforementioned saturated accumulation, if Lpath = 20 and an array with Ninstr = 100 is used, then 80 instruction slots go unused.
- Ninstr < Lpath: In this case, we cannot necessarily reuse each bit operator in different ways on each of the Lpath cycles. Since we can only use each operator in Ninstr ways, to solve the entire problem we may need a total of ⌈Lpath/Ninstr⌉ times as many bitops to perform the computation. Continuing with the example, if Ninstr = 5 and there is an Lpath = 20, we may need four times as many bitops as the optimally matched architecture. The total amount of memory is the same between these cases; however, an Ninstr = 5 architecture pays for four times as many compute blocks (Abop). There is also a granularity effect here; for example, we still need four times as many bitops even when Ninstr = 6.

Assuming that the datapath width is matched (Wsimd = Wapp) allows us to focus on the instruction mismatch; we can show this in an area model as:

\[
\begin{aligned}
\mathrm{Area}\bigl(\mathrm{Arch}(W_{app}, N_{instr}),\ \mathrm{App}(W_{app}, L_{path})\bigr)
&= \left\lceil \frac{L_{path}}{N_{instr}} \right\rceil \times A_{bitop\_w\_imem}(W_{app}, N_{instr}) \\
&= \left\lceil \frac{L_{path}}{N_{instr}} \right\rceil \times \left( A_{bop} + \frac{N_{instr}}{W_{app}} \times A_{pinst} \right)
\end{aligned}
\tag{36.23}
\]

This allows us to compute the efficiency of the mismatched instruction store at a matched Wapp as:

\[
\mathrm{Efficiency}_{[W_{app}]}(N_{instr}, L_{path})
= \frac{A_{bop} + \dfrac{L_{path}}{W_{app}} \times A_{pinst}}
       {\left\lceil \dfrac{L_{path}}{N_{instr}} \right\rceil \times \left( A_{bop} + \dfrac{N_{instr}}{W_{app}} \times A_{pinst} \right)}
\tag{36.24}
\]
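The two mismatch cases and equations 36.23 and 36.24 can be checked with a few lines of arithmetic. As before, this is a sketch under assumed constants (`A_BOP` and `A_PINST` in an illustrative 20:1 ratio) with hypothetical function names, not the chapter's calibrated model.

```python
A_BOP = 20.0    # ASSUMED compute + interconnect area per bitop (arbitrary units)
A_PINST = 1.0   # ASSUMED area per stored instruction (same units)


def bitop_multiplier(n_instr: int, l_path: int) -> int:
    """Integer ceiling of Lpath / Ninstr: extra bitops when Ninstr < Lpath."""
    return -(-l_path // n_instr)


def unused_slots(n_instr: int, l_path: int) -> int:
    """Instruction slots per bitop left idle when Ninstr > Lpath."""
    return max(n_instr - l_path, 0)


def area_instr_mismatch(n_instr: int, w_app: int, l_path: int) -> float:
    """Equation 36.23: area with the datapath width matched (Wsimd = Wapp)."""
    return bitop_multiplier(n_instr, l_path) * (A_BOP + n_instr / w_app * A_PINST)


def efficiency_instr(n_instr: int, w_app: int, l_path: int) -> float:
    """Equation 36.24: matched-architecture area over mismatched area."""
    matched = area_instr_mismatch(l_path, w_app, l_path)
    return matched / area_instr_mismatch(n_instr, w_app, l_path)


if __name__ == "__main__":
    print(unused_slots(100, 20))      # Ninstr = 100, Lpath = 20
    print(bitop_multiplier(5, 20))    # Ninstr = 5
    print(bitop_multiplier(6, 20))    # Ninstr = 6: granularity effect
    print(round(efficiency_instr(5, 8, 20), 3))
```

The first three values correspond to the worked numbers in the two bullets: 80 idle slots, and a fourfold bitop count for both Ninstr = 5 and Ninstr = 6.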
Figure 36.8 plots the efficiency from equation 36.24 versus Lpath for a collection of Ninstr’s and Wapp’s. Again, note that instruction store mismatches can cost orders of magnitude in net density. We also see robust points here where the net density remains within 50 percent of the matched architecture. The effect is the same as for datapath width mismatch (see previous section), and the efficient points are governed by an analogous equation:

\[
A_{bop} = \frac{N_{instr}}{W_{app}} \times A_{pinst}
\tag{36.25}
\]
For any of these robust points, at the minimum value, Lpath = 1, we are using all the compute area and only a fraction of the instruction memory area, so an optimally matched architecture could, at best, be implemented in half the area. Similarly, for arbitrarily large Lpath , if Ninstr < Lpath , all the instruction memory area is used to hold instructions, but this may leave the compute area idle most of the time. Here, again, with only 50 percent of the area in compute, the design is, at most, twice the size of an optimally matched architecture with less area allocated to computation. In contrast, if we put 90 percent of the area into compute, then we could end up wasting 90 percent of the area in scenarios where Lpath >> Ninstr ; matched architectures can be an order of magnitude smaller in such cases. Similarly, if 90 percent of the area is put into instruction memory, we can end up wasting almost 90 percent of the area when Lpath is small.
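The robust-point argument can be tested numerically by choosing Abop to satisfy equation 36.25 and then skewing the split toward compute or toward memory. In this sketch the design point (Ninstr = 160, Wapp = 8), the unit Apinst, and all names are assumptions made for illustration.

```python
def efficiency_instr(n_instr: int, w_app: int, l_path: int,
                     a_bop: float, a_pinst: float = 1.0) -> float:
    """Matched area over mismatched area with the datapath width matched."""
    multiplier = -(-l_path // n_instr)  # integer ceiling of Lpath / Ninstr
    matched = a_bop + l_path / w_app * a_pinst
    return matched / (multiplier * (a_bop + n_instr / w_app * a_pinst))


N_INSTR, W_APP = 160, 8           # hypothetical design point
IMEM_AREA = N_INSTR / W_APP       # instruction area per bitop, in A_pinst units

# Balanced split: A_bop equals the instruction area (equation 36.25).
balanced = [efficiency_instr(N_INSTR, W_APP, 4 ** k, a_bop=IMEM_AREA)
            for k in range(8)]    # Lpath = 1, 4, ..., 16384

# 90 percent of the area in compute, evaluated at Lpath >> Ninstr.
compute_heavy = efficiency_instr(N_INSTR, W_APP, 160 * 1024, a_bop=9 * IMEM_AREA)

# 90 percent of the area in instruction memory, evaluated at Lpath = 1.
memory_heavy = efficiency_instr(N_INSTR, W_APP, 1, a_bop=IMEM_AREA / 9)

print(min(balanced), compute_heavy, memory_heavy)
```

In this model the balanced design never drops below one half, while each skewed design falls to roughly one tenth in its unfavorable regime.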
[Figure 36.8: three log-log panels of efficiency (0.001 to 1) versus Lpath (1 to 4096), each with curves for Ninstr = 1, 20, 160, 1280, and 10240.]
FIGURE 36.8 Efficiency as a function of Lpath for various Wapp values: (a) Wapp = 1, (b) Wapp = 64, and (c) Wapp = 8.
Composite effects

Combining the effects of SIMD width mismatch and local instruction storage mismatch, we get the total efficiency:

\[
\mathrm{Efficiency}\bigl(\mathrm{Arch}(W_{simd}, N_{instr}),\ \mathrm{App}(W_{app}, L_{path})\bigr)
= \frac{A_{bop} + \dfrac{L_{path}}{W_{app}} \times A_{pinst}}
       {\left\lceil \dfrac{W_{app}}{W_{simd}} \right\rceil \times \dfrac{W_{simd}}{W_{app}} \times \left\lceil \dfrac{L_{path}}{N_{instr}} \right\rceil \times \left( A_{bop} + \dfrac{N_{instr}}{W_{simd}} \times A_{pinst} \right)}
\tag{36.26}
\]

Unfortunately, if both the SIMD width and the local instruction storage mismatch, it is not possible to pick a robust point as we did in previous sections. Returning to equations 36.22 and 36.25, we note that the robust points occur when we can match the instruction storage area, (Ninstr/Wsimd) × Apinst, and the computation and interconnect area, Abop. However, when both Wapp and Lpath vary, even when the area is matched, we can have cases where the allocation of width versus storage size within that area can prevent us from using the computational units efficiently.

Efficiency of processors and FPGAs

The previous section suggests that we will not find an architectural point in this (Wsimd, Ninstr) design space that is efficient across a wide range of application structures. To understand where processors and FPGAs are efficient, we can use the composite efficiency relation (equation 36.26) and estimate how efficient they each can be across a portion of the design space (see Figure 36.9). Here the FPGA is naturally modeled with Ninstr = 1 and Wsimd = 1. We model a processor as Wsimd = 64 and Ninstr = 16,384. Figure 36.9 shows starkly that the FPGA and processor are designed for different points in the application space. Notice that each can be less than 1 percent efficient in some portions of the space. Further, we note that in the places where the processor is very inefficient (< 1 percent), the FPGA is highly efficient; the reverse is true as well. This effect, coupled with the heterogeneous nature of applications, explains why it is often useful to have reconfigurable systems that mix FPGA or reconfigurable fabrics along with processors (e.g., Instruction augmentation subsection of Section 5.2.2 and Chapter 26).
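The FPGA and processor comparison follows directly from equation 36.26. The sketch below evaluates both modeled design points at two corners of the application space. The area constants are assumed (an illustrative 20:1 ratio), and `FPGA`, `PROCESSOR`, and the function names are hypothetical labels for the (Wsimd, Ninstr) pairs given in the text.

```python
A_BOP = 20.0    # ASSUMED compute + interconnect area per bitop (arbitrary units)
A_PINST = 1.0   # ASSUMED area per stored instruction (same units)


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b."""
    return -(-a // b)


def area(w_simd: int, n_instr: int, w_app: int, l_path: int) -> float:
    """Denominator of equation 36.26 for Arch(Wsimd, Ninstr) on App(Wapp, Lpath)."""
    width_factor = ceil_div(w_app, w_simd) * w_simd / w_app
    depth_factor = ceil_div(l_path, n_instr)
    return width_factor * depth_factor * (A_BOP + n_instr / w_simd * A_PINST)


def efficiency(w_simd: int, n_instr: int, w_app: int, l_path: int) -> float:
    """Equation 36.26: matched-architecture area over evaluated area."""
    return area(w_app, l_path, w_app, l_path) / area(w_simd, n_instr, w_app, l_path)


FPGA = (1, 1)             # (Wsimd, Ninstr) as modeled in the text
PROCESSOR = (64, 16384)   # (Wsimd, Ninstr) as modeled in the text

if __name__ == "__main__":
    for w_app, l_path in ((1, 1), (64, 16384)):
        print((w_app, l_path),
              efficiency(*FPGA, w_app, l_path),
              efficiency(*PROCESSOR, w_app, l_path))
```

Each design is fully efficient at its own corner and, with these assumed constants, below 1 percent at the other's, which mirrors the complementary behavior described for Figure 36.9.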
36.4.3 Caveats
As noted in the introduction to this chapter, we are deliberately using a simple model to illustrate key effects in instruction organization. There are many other application structural opportunities and architectural variables that can also have a large effect on resource balance and efficiency, including interconnect richness (e.g., [9]) and organization, data storage and memory hierarchy capacities, bandwidth and latencies, threads of control, dynamic instruction selection, and integration of hardware functional units (e.g., multipliers [10,11] and floating-point units [12]). In processors, the SIMD control of ALUs is coupled with fast logic to support carries in arithmetic (e.g., [13]), which serves to reduce Lcritpath; FPGAs also employ fast cascade structures for similar reasons (e.g., [14], Chapter 1) but do not tie them to SIMD datapaths. Nonetheless, the simple model shows that these instruction organization decisions can have a significant impact on computational density, and it illustrates why FPGAs can be more efficient than processors for important classes of applications.

[Figure 36.9: two surface plots of efficiency (log scale, 0.0001 to 1) against Lpath (1 to 16,384) and Wapp (1 to 1024): (a) FPGA efficiency and (b) processor efficiency.]
FIGURE 36.9 Efficiency of FPGA-like (a) and processor-like (b) designs across both Lpath and Wapp.
36.5 IMPLICATIONS

36.5.1 Density of Computation versus Description
From this model, we can clearly see a trade-off between computational density and instruction density. Equation 36.16 illustrates that the instruction store area for a single bitop can be an order of magnitude smaller than the computation to support it. This means an Ninstr = 1 design stores instructions an order of magnitude less densely than an Ninstr = 200 design, and an Ninstr = 200 design packs computation an order of magnitude less densely than an Ninstr = 1 design. When the goal is to simply pack a large, irregular computation into a small area, we are best off focusing on instruction density; this minimizes the area for the implementation, at the expense of lower performance. When the goal is to perform the computation at high throughput, designs with high computational density allow us to meet the throughput with the least area.
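The trade-off can be made concrete with a back-of-the-envelope calculation. This sketch assumes Wsimd = 1 and an illustrative 20:1 ratio between compute area and per-instruction area, in keeping with the order-of-magnitude relationship described above; the constants and function names are hypothetical and are not taken from equation 36.16.

```python
A_BOP = 20.0    # ASSUMED compute + interconnect area per bitop (arbitrary units)
A_PINST = 1.0   # ASSUMED area per stored instruction (same units)


def area_per_bitop(n_instr: int) -> float:
    """Area of one bitop with n_instr local instructions (Wsimd = 1 assumed)."""
    return A_BOP + n_instr * A_PINST


def area_per_instruction(n_instr: int) -> float:
    """Area charged to each stored instruction, compute overhead included."""
    return area_per_bitop(n_instr) / n_instr


# How much more densely an Ninstr = 200 design stores instructions.
instruction_density_gain = area_per_instruction(1) / area_per_instruction(200)

# How much more densely an Ninstr = 1 design packs computation.
compute_density_gain = area_per_bitop(200) / area_per_bitop(1)

print(round(instruction_density_gain, 1), round(compute_density_gain, 1))
```

Both ratios come out at roughly an order of magnitude under the assumed constants, one in each direction.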
36.5.2 Historical Appropriateness
When we first started building programmable integrated circuits, the premium for describing large computations was high. The capacity on a single integrated circuit was very low when they were built with F = 3 μm technology. In the mid-1980s, with Ninstr = 1 and Wsimd = 1, we could put only 64 bitops on a die [15], limiting computations to those that could be described by 64 instructions. At roughly the same time, one could put Ninstr = 512 instructions on the die along with 32 bitops controlled in an SIMD fashion by a single pinst on each cycle (Wsimd = 32) [4]. The struggle at this point in history was to fit an entire computational kernel onto a single die, and the deep instruction, word-wide processor design could begin to fit interesting kernels while the FPGA designs could fit only the most trivial computations. By 2005, however, with F ≤ 0.1 μm, the landscape had changed. Moore’s Law process scaling has given us more than a 10,000-fold increase in capacity per integrated circuit. Modern processors, still built with ever-deeper memories, have large enough instruction stores to contain large applications. At the same time, FPGAs hold hundreds of thousands of active bitops. Even kernels with thousands of 64-bit-wide operations can fit spatially on the FPGA and exploit the higher computational density.
The question with today’s silicon is less “Can we get the application to fit on the die?” and more “How do we turn the available die area into performance?” Consequently, as we continue to scale feature sizes, the fraction of tasks where high instruction density remains the premium is shrinking, while the fraction where the application fits on the die and high computational density offers a benefit is increasing.
36.5.3 Reconfigurable Applications
Understanding why FPGAs can be efficient and where they are most efficient (e.g., Figure 36.9) provides additional insight into where we should use FPGAs and how to fully exploit their strengths. Certainly, if the task has low throughput requirements (i.e., large Lpath), then FPGAs are often not an efficient implementation. The FPGA is efficient when we operate at minimum path length, preferably Lpath = 1, where we are performing the same operation over and over and keeping all the bitops active during the operation. For FPGAs with a variable clock cycle, we want to keep the cycle time to the minimum, maximizing the reuse rate of each operation. This underscores why retiming operations such as pipelining and C-slow (see Chapter 18) are important for optimizing FPGA efficiency, as well as behavioral transformations that reduce Lcritpath.

When Lpath is large simply because of a low throughput demand, we can often turn the SIMD structure, Wapp, into additional operation regularity. In particular, when Wapp > 1, that is an indication that a number of bit-level operators perform the same operation. By moving this regularity into time rather than space, we can reduce the number of unique instruction combinations needed and hence reduce the Ninstr required. For example, if Wapp = 16 and Lpath >> Lcritpath, we can implement the SIMD datapath bit serially so that the necessary instruction storage depth is a factor of 16 smaller (Ninstr ≈ Lpath/Wapp). As shown in Figure 36.10, this can increase the FPGA’s domain of efficiency.

[Figure 36.10: surface plot of FPGA efficiency (log scale, 0.0001 to 1) across the application space, with Lpath ranging from 1 to 16,384.]
FIGURE 36.10 FPGA efficiency when datapath regularity can be used to increase temporal regularity.
References
[1] D. S. Hochbaum, ed. Approximation Algorithms for NP-Hard Problems, PWS Publishing, 1997.
[2] International technology roadmap for semiconductors. http://www.itrs.net/Links/2005ITRS/Home2005.htm, 2005.
[3] A. DeHon. Entropy, counting, and programmable interconnect. Proceedings of the International Symposium on Field-Programmable Gate Arrays, ACM/SIGDA, 1996.
[4] M. Horowitz, J. Hennessy, P. Chow, G. Gulak, J. Acken, A. Agarwal, C.-Y. Chu, S. McFarling, S. Przybylski, S. Richardson, A. Salz, R. Simoni, D. Stark, P. Steenkiste, S. Tjiang, M. Wing. A 32b microprocessor with on-chip 2 Kbyte instruction cache. IEEE International Solid-State Circuits Conference, Digest of Technical Papers, IEEE, 1987.
[5] C. McNairy, R. Bhatia. Montecito: A dual-core, dual-thread Itanium processor. IEEE Micro 25(2), 2005.
[6] M. Gschwind, H. P. Hofstee, B. Flachs, M. Hopkins, Y. Watanabe, T. Yamazaki. Synergistic processing in Cell’s multicore architecture. IEEE Micro 26(2), 2006.
[7] M. Gardner. The fantastic combinations of John Conway’s new solitaire game “Life.” Scientific American 223, 1970.
[8] K. Papadantonakis, N. Kapre, S. Chan, A. DeHon. Pipelining saturated accumulation. Proceedings of the International Conference on Field-Programmable Technology, 2005.
[9] A. DeHon. Balancing interconnect and computation in a reconfigurable computing array (or, why you don’t really want 100% LUT utilization). Proceedings of the International Symposium on Field-Programmable Gate Arrays, 1999.
[10] A. DeHon. The density advantage of configurable computing. IEEE Computer 33(4), 2000.
[11] I. Kuon, J. Rose. Measuring the gap between FPGAs and ASICs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 26(2), 2007.
[12] M. J. Beauchamp, S. Hauck, K. D. Underwood, K. S. Hemmert. Embedded floating-point units in FPGAs. Proceedings of the International Symposium on Field-Programmable Gate Arrays, 2006.
[13] R. P. Brent, H. T. Kung. A regular layout for parallel adders. IEEE Transactions on Computers 31(3), 1982.
[14] S. Hauck, M. M. Hosler, T. W. Fry. High-performance carry chains for FPGAs. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 8(2), 2000.
[15] W. S. Carter, K. Duong, R. H. Freeman, H.-C. Hsieh, J. Y. Ja, J. E. Mahoney, L. T. Ngo, S. L. Sze. A user programmable reconfigurable logic array. Proceedings of the IEEE Custom Integrated Circuits Conference, 1986.
CHAPTER 37
DEFECT AND FAULT TOLERANCE
André DeHon
Department of Electrical and Systems Engineering, University of Pennsylvania
As device size F continues to shrink, it approaches the scale of individual atoms and molecules. In 2007, 65-nm integrated circuits are in volume production for processors and field-programmable gate arrays (FPGAs). With atom spacing in a silicon lattice around 0.5 nm, F = 65-nm drawn features are a little more than 100 atoms wide. Key features, such as gate lengths, are effectively half or a third this size. Continued geometric scaling (e.g., reducing the feature size by a factor of 2 every six years) will take us to the realm where feature sizes are measured in single-digit atoms sometime in the next couple of decades. Very small feature sizes will have several effects on integrated circuits, including:
- Increased defect rates: Smaller devices and wires made of fewer atoms and bonds are less likely to be “good enough” to function properly.
- Increased device variation: When dimensions are a few atoms wide, the addition, absence, or exact position of each atom has a significant effect on device parameters.
- Increased change in device parameters during operational lifetime: With only a few atoms making up the width of wires or devices, small changes have large impacts on performance, and the likelihood of a complete failure grows. The fragility of small devices reduces traditional opportunities to overstress them as a means of forcing weak devices to fail before the component is integrated into an end system. This means many weak devices will only turn into defects during operation.
- Increased single die capacity: Smaller devices allow integration of more devices per die. Thus, not only do we have devices that are more likely to fail, but there also are more of them, meaning more chances that some device on the die will fail.
- Increased susceptibility to transient upsets: Smaller nodes use less charge to hold state or configuration data, making them more susceptible to upset by noise, including ionizing particles, thermal noise, and shot noise. Coupled with the greater capacity, which means more nodes that can be upset, dies will have significantly increased upset rates.
Accommodating and exploiting these effects will demand an increasing role for postfabrication configurable architectures. Nonetheless, some usage paradigms will need to shift to fully exploit the potential benefits of reconfigurable architectures at the atomic scale. This chapter reviews defect tolerance approaches and points out how the configurability available in reconfigurable architectures is a key tool for coping with defects. It also touches briefly on lifetime and transient faults and their impact on configurable designs.

Copyright © 2008 by André DeHon. Published by Elsevier Inc.
37.1 DEFECTS AND FAULTS

A defect is a persistent error in a component. Because defects are persistent, we can test for defect occurrences and record their locations. We contrast defects with transient faults that may produce the wrong value on one or a few cycles but do not continue to corrupt calculations. For the sake of simple discussion here, we classify any persistent problem that causes the circuitry to work incorrectly for some inputs and environments as a defect.

Defects are often modeled as stuck-at-1, stuck-at-0, or shorted nodes. They can also be nodes that are excessively slow, such that they compute correctly but not in a timely fashion, or excessively leaky, such that they do not hold their value properly. A large number of physical effects and causes may lead to these manifestations, including broken wires, shorts or bridging between nodes that should be distinct, excessive or inadequate doping in a device, poor contacts between materials or features, or excessive variation in device size.

A transient fault is a temporary error in a circuit result. Transient faults can occur at random times. A transient fault may cause a gate output or node to take on the incorrect value on some cycle of operation. Sources of transient faults include ionizing particles (e.g., α-particles), thermal noise, and shot noise.
37.2 DEFECT TOLERANCE

37.2.1 Basic Idea
An FPGA or reconfigurable array is a set of identical (programmable) bit-processing operators with postfabrication configurable interconnect. When a device failure renders a bitop or an interconnect segment unusable, we can configure the computation to avoid the failing bitop or segment (see Figure 37.1). If the bitop is part of a larger SIMD word (Chapter 36, Section 36.3.2) or other structure that does not allow its independent use, we may be forced to avoid the entire structure. In any case, as long as the computation does not use all the resources on the reconfigurable array, we can substitute good resources for the bad ones. As defect rates increase, this suggests a need to strategically reserve spare resources on the die so that we can guarantee there are enough good resources to compensate for the unusable elements.
[Figure 37.1: three panels showing nodes A, B, and C; panel (b) marks a spare track and a spare logic block, and panel (c) marks a defect and a short to ground.]
FIGURE 37.1 Configuring computation to avoid defective elements in a reconfigurable array: (a) logical computation graph, (b) mapping to a defect-free array with spare, and (c) mapping to an array with defects.
This basic strategy of (1) provisioning spare resources, (2) identifying and avoiding bad resources, and (3) substituting spare resources for bad resources is well developed for data storage. DRAM and SRAM dies include spare rows and columns and substitute the spare rows and/or columns for defective rows and columns (e.g., see [1, 2]). Magnetic data storage (e.g., hard disk) routinely has bad sectors; the operating system (OS) maps the bad sectors and takes care not to allocate data to those sectors. These two forms of storage actually illustrate two models for dealing with defects:
1. Perfect component: In the perfect component model, the component has to look perfect; that is, we require every address visible to the user to perform correctly. The spare resources are added beyond those required to deliver the promised memory capacity and are substituted for defective elements behind the scenes so that users never see that there are defective elements in the component. DRAM and SRAM components are the traditional example of the perfect component model.
2. Defect map: The defect map model allows elements to be bad. We expose these defects to higher levels of software, typically the OS, which is responsible for tracking where the defects occur and avoiding them. Magnetic disks are a familiar example of the defect map model; we permit sectors to be bad and format the disk to avoid them.
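The difference between the two models can be sketched in a few lines. This is a simplified illustration rather than how any particular memory or disk implements sparing: the functions `build_row_remap` and `usable_sectors`, and the convention that spare rows are numbered after the visible rows, are assumptions made for the example.

```python
from typing import Dict, List, Set


def build_row_remap(rows: int, spares: int, defective: Set[int]) -> Dict[int, int]:
    """Perfect-component model: every visible row maps to a working physical row.

    Physical rows 0..rows-1 back the visible addresses; rows..rows+spares-1
    are spares (an assumed numbering). Defects stay hidden behind the map.
    """
    free_spares = [r for r in range(rows, rows + spares) if r not in defective]
    remap: Dict[int, int] = {}
    for row in range(rows):
        if row not in defective:
            remap[row] = row                  # good row, used as-is
        elif free_spares:
            remap[row] = free_spares.pop(0)   # substitute a good spare
        else:
            raise ValueError("not enough good spare rows")
    return remap


def usable_sectors(total: int, defect_map: Set[int]) -> List[int]:
    """Defect-map model: software sees the defects and allocates around them."""
    return [s for s in range(total) if s not in defect_map]


if __name__ == "__main__":
    print(build_row_remap(8, 2, {2, 5}))
    print(usable_sectors(6, {1, 4}))
```

In the first function the user still sees eight contiguous good rows. In the second, the visible capacity shrinks and the caller is responsible for skipping the bad sectors.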
37.2.2 Substitutable Resources
Some defects will be catastrophic for the entire component. While a reconfigurable array is composed largely of repeated copies of identical instances, the device infrastructure is typically unique; defects in this infrastructure may not be repairable by substitution. Common infrastructures include power and ground distribution, clocking, and configuration loading or instruction distribution. It is useful to separate the resources in the component into nonrepairable and repairable resources. Then we can quantify the fraction of resources that are nonrepairable. We can minimize the impact of nonrepairable resources either by reducing the fraction of things that cannot be repaired or by increasing the reliability of the constituent devices in the nonrepairable structures. Many of the infrastructure items, such as power and ground networks, are built with larger devices, wires, and feature sizes. As such, they are less susceptible to the failures that impact small features. Memory components (e.g., DRAMs) also have distinct repairable and nonrepairable components; they typically use coarser feature sizes for the nonrepairable infrastructure. Memory designs only use the smallest features for the dense memory array, where row and column sparing can be used to repair defects. In FPGAs, it may be reasonable to provide spares for some of the traditional infrastructure items to reduce the size of the nonrepairable region. For example, modern FPGAs already include multiple clock generators and configurable clock trees; as such, it becomes feasible to repair defective clock generators or portions of the clock tree by substitution. We simply need to guarantee that there are sufficient alternative resources to use instead of the defective elements. For any design there will be a minimum substitutable unit that defines the granularity of substitution. For example, in a memory array we cannot substitute out individual RAM cells. 
Rather, with a technique like row sparing, the substitutable unit is an entire row. In the simplest sparing schemes, a defect anywhere within a substitutable unit may force the discard of the entire element. Consequently, the granularity of substitution can play a big role in the viable yield of a component (see the Perfect yield subsection that follows). Section 37.2.5 examines more sophisticated sparing schemes that relax this constraint.
37.2.3
Yield
This section reviews simple calculations for the yield of components and substitutable units. We assume uniform device defect rates and independent, random
failure (i.e., independent and identically distributed, or iid). Using these simple models, we can illustrate the kinds of calculations involved and build intuition on the major trends.

Perfect yield
A component with no substitutable units will be nondefective only if all of its devices are not defective. Similarly, in the simplest models each substitutable unit is nondefective only when all of its constituent devices are not defective. If we have a device defect probability $P_d$ and a unit contains $N$ devices, the probability that the entire component or unit is nondefective is:

$$P_{\text{defect-free}}(N, P_d) = (1 - P_d)^N \tag{37.1}$$

We can expand this as a binomial:

$$P_{\text{defect-free}}(N, P_d) = \sum_{i=0}^{N} \binom{N}{i} (-P_d)^i = 1 - N \cdot P_d + \binom{N}{2} (P_d)^2 - \cdots \tag{37.2}$$

If $N \times P_d \ll 1$, the higher-order terms are negligible and we can approximate the yield as:

$$P_{\text{defect-free}}(N, P_d) \approx 1 - N \cdot P_d \tag{37.3}$$

Equivalently, the probability that the component or unit is defective is approximately:

$$P_{\text{defect}}(N, P_d) \approx N \cdot P_d \tag{37.4}$$

For example, for a chip with $N > 10^9$ devices, the defect rate $P_d$ must be below $10^{-10}$ to expect 90 percent or greater chip yield. To maintain constant yield ($P_{\text{defect-free}}$) for a chip as $N$ scales, we must continually decrease $P_d$ at the same rate. For example, a 10× increase in device count, $N$, must be accompanied by a 10× decrease in per-device defect rate. As noted in this chapter’s introduction, we expect the opposite effect for atomic-scale devices; smaller devices mean a higher likelihood of defects. This exacerbates the challenge of increasing device counts.

At the same defect rate, $P_d$, a finer-grained substitutable unit (e.g., an individual LUT or bitop) will have a higher unit yield rate than a coarser-grained unit (e.g., a cluster of 10 LUTs, such as an Altera LAB (Section 1.5.1) or an SIMD collection of 32 bitops). Alternatively, if one reasons about defect rates of the substitutable units, a defect rate of $P_{sd} = 0.05$ for a coarse-grained block corresponds to a much lower device defect rate, $P_d$, than the same $P_{sd}$ for a fine-grained substitutable unit.
To keep substitutable unit yield rates at some high value, we must decrease unit size, $N$, as $P_d$ increases. For example, if we design for a $P_{sd} = 10^{-4}$ and the device defect rate doubles, we need to cut the substitutable block size in half to achieve the same block yield; this suggests a trend toward fine-grained resource sparing as defect rates increase (e.g., see Fine-grained Pterm matching subsection of Section 37.2.5 and Section 38.6).
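The perfect-yield arithmetic above can be checked with a few lines of Python. This is a minimal sketch under the same iid assumption; the function names are our own, and the block size of 100 devices at a per-device defect rate of 10^-6 is a hypothetical choice that gives a block defect rate near 10^-4.

```python
import math

def defect_free_yield(n_devices: int, p_defect: float) -> float:
    """Exact yield of a unit with no spares: (1 - Pd)^N, as in equation 37.1."""
    # log1p keeps precision when p_defect is many orders of magnitude below 1.
    return math.exp(n_devices * math.log1p(-p_defect))

def approx_defect_probability(n_devices: int, p_defect: float) -> float:
    """First-order estimate N * Pd, meaningful only when N * Pd << 1."""
    return n_devices * p_defect

# A billion-device chip at Pd = 1e-10: compare exact yield with 1 - N * Pd.
chip_yield = defect_free_yield(10**9, 1e-10)
estimate = 1 - approx_defect_probability(10**9, 1e-10)
print(f"chip yield: {chip_yield:.4f}, first-order estimate: {estimate:.4f}")

# Doubling Pd while halving the block size leaves block yield almost unchanged.
before = defect_free_yield(100, 1e-6)
after = defect_free_yield(50, 2e-6)
print(f"block yield before: {before:.8f}, after: {after:.8f}")
```

The two block yields agree to many decimal places because, to first order, yield depends only on the product of device count and device defect rate.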
Yield with sparing
We can significantly increase overall yield by providing spares so that there is no need to demand that every substitutable unit be nondefective. Assume for now that all substitutable units are interchangeable. With $P_{sd}$ denoting the probability that a substitutable unit is defective, the probability that we will have exactly $i$ nondefective substitutable units out of $N$ is:

$$P_{\text{yield}}(N, i) = \binom{N}{i} (1 - P_{sd})^{i} (P_{sd})^{N-i} \tag{37.5}$$

That is, there are $\binom{N}{i}$ ways to select $i$ nondefective blocks from $N$ total blocks, and the yield probability of each case is $(1 - P_{sd})^{i} (P_{sd})^{N-i}$. An ensemble with at least $M$ items is obtained whenever $M$ or more items yield, so the ensemble yield is the upper tail of this binomial distribution, as follows:

$$P_{\text{yield}}(N, M) = \sum_{M \le i \le N} \binom{N}{i} (1 - P_{sd})^{i} (P_{sd})^{N-i} \tag{37.6}$$

As an example, consider an Island-style FPGA cluster (see Figure 37.2) composed of 10 LUTs (e.g., Altera LAB, Chapter 1). Assume that each LUT, along with its associated interconnect and configuration, is a substitutable unit and that the LUTs are interchangeable. Further, assume $P_{sd} = 10^{-4}$. The probability of yielding all 10 LUTs is:

$$P_{\text{yield}}(10, 10) = (1 - 10^{-4})^{10} (10^{-4})^{0} \approx 0.9990005 \tag{37.7}$$

Now, if we add two spare lookup tables, the probability of yielding at least 10 LUTs is:

$$\begin{aligned} P_{\text{yield}}(12, 10) &= (1 - 10^{-4})^{12} (10^{-4})^{0} + 12 (1 - 10^{-4})^{11} (10^{-4})^{1} + \frac{12 \cdot 11}{2} (1 - 10^{-4})^{10} (10^{-4})^{2} \\ &= 0.99880065978 + 0.0011986806598 + 0.0000006593402969 \\ &\approx 0.9999999998 > 1 - 10^{-9} \end{aligned} \tag{37.8}$$

Without the spares, a component with only 1000 such clusters would be difficult to yield (about 37 percent of components would have every cluster intact). With the spares, components with 1,000,000 such clusters yield more than 99.9 percent of the time.
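The cluster example can be reproduced numerically. The sketch below treats $P_{sd}$ as the per-unit defect probability, as the numerical example does, and sums the binomial terms directly; the function names are illustrative.

```python
from math import comb

def yield_exactly(n_units: int, n_good: int, p_sd: float) -> float:
    """Probability that exactly n_good of n_units are nondefective."""
    return comb(n_units, n_good) * (1 - p_sd) ** n_good * p_sd ** (n_units - n_good)

def yield_at_least(n_units: int, n_required: int, p_sd: float) -> float:
    """Probability that at least n_required of n_units are nondefective."""
    return sum(yield_exactly(n_units, i, p_sd) for i in range(n_required, n_units + 1))

no_spares = yield_at_least(10, 10, 1e-4)    # all 10 LUTs must be good
two_spares = yield_at_least(12, 10, 1e-4)   # any 10 of 12 LUTs suffice
print(f"cluster yield, no spares:  {no_spares:.7f}")
print(f"cluster yield, two spares: {two_spares:.10f}")
print(f"1,000 clusters, no spares:      {no_spares ** 1_000:.3f}")
print(f"1,000,000 clusters, two spares: {two_spares ** 1_000_000:.5f}")
```

Raising the cluster yield to the number of clusters assumes that clusters fail independently, which matches the iid model used throughout this section.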
FIGURE 37.2 An island-style FPGA cluster with five interchangeable 2-LUTs. (Labels: cluster inputs, cluster, cluster outputs.)
The assumption that all substitutable units are interchangeable is not directly applicable to logic blocks in an FPGA since their location strongly impacts the interconnections available to other logic block positions. Nonetheless, the sparing yield is illustrative of the trends even when considering interconnect requirements. To minimize the required spares, it would be preferable to have fewer large pools of mostly interchangeable resources rather than many smaller pools of interchangeable resources. This results from Bernoulli’s Law of Large Numbers (and related Central Limit Theorem) effects [3, 4]: the standard deviation of a sum of independent random variables grows only as the square root of the number of variables, so the spread in defect count relative to its mean shrinks as the pool grows, and a large pool needs proportionally fewer spares. For a more detailed development of the impact of the Law of Large Numbers on defect yield statistics and strategies see DeHon [5].
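To make the pool-size effect concrete, the following sketch searches for the smallest number of spares per pool that meets an overall yield target, once for many small pools and once for a single large pool. The defect rate of 1 percent, the 99 percent target, and the pool sizes are hypothetical values chosen only for illustration.

```python
from math import comb

def pool_yield(n_units: int, n_required: int, p_sd: float) -> float:
    """Probability that at least n_required of n_units interchangeable units are good."""
    return sum(comb(n_units, i) * (1 - p_sd) ** i * p_sd ** (n_units - i)
               for i in range(n_required, n_units + 1))

def spares_per_pool(pool_size: int, n_pools: int, p_sd: float, target: float) -> int:
    """Smallest spare count per pool so that every pool yields with probability >= target."""
    spares = 0
    while pool_yield(pool_size + spares, pool_size, p_sd) ** n_pools < target:
        spares += 1
    return spares

P_SD, TARGET = 0.01, 0.99                        # hypothetical defect rate and yield goal
small = spares_per_pool(10, 100, P_SD, TARGET)   # 100 pools of 10 required units
large = spares_per_pool(1000, 1, P_SD, TARGET)   # one pool of 1000 required units
print(f"100 small pools: {100 * small} spares in total")
print(f"1 large pool:    {large} spares in total")
```

Both arrangements deliver 1000 working units, but the single large pool reaches the target with far fewer spares in total.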
37.2.4
Defect Tolerance through Sparing
To exploit substitution, we need to locate the defects and then avoid them. Both testing (see next subsection) and avoidance could require considerable time for each individual device. This section reviews several design approaches, including approaches that exploit full mapping (see the Global sparing subsection) to minimize defect tolerance overhead, approaches that avoid any extra mapping (see the Perfect component model subsection), and approaches that require only minimal, local component-specific mapping (see the Local sparing subsection).

Testing
Traditional acceptance testing for FPGAs (e.g., [6]) attempts to validate that the FPGA is defect free. Locating the position of any defect is generally not important if any chip with defects is discarded. Identifying the location of all defects is more difficult and potentially more time consuming. Recent work on group testing [7–9] has demonstrated that it is possible to identify most of the nondefective resources on a chip with $N$ substitutable components in time proportional to $\sqrt{N}$. In group testing, substitutable blocks are configured together and given a self-test computation to perform. If the group comes back with the correct result, this is evidence that everything in the group is good. Conversely, if the result is wrong, this is evidence that something in the group may be bad. By arranging multiple tests where substitutable blocks participate in different groups (e.g., one test set groups blocks around rows while another groups them along columns), it is possible to identify which substitutable units are causing the failures. For example, if there is only one failure in each of two groupings, and the failing groups in each grouping contain a single, common unit, this is strong evidence that the common unit is defective while the rest of the substitutable units are good. As the failure rates increase such that multiple elements in each group fail in a grouping, it can be more challenging to precisely identify failing components with a small number of groupings. As a result, some group testing is conservative, marking some good components as potential defects; this is a trade-off that may be worthwhile to keep testing time down to a manageably low level as defect rates increase.

In both group testing and normal FPGA acceptance testing, array regularity and homogeneity make it possible to run tests in parallel for all substitutable units on the component. Consequently, testing time does not need to scale as the number of substitutable units, $N$. If the test infrastructure is reliable, group tests can run completely independently.
However, if we rely on the configurable logic itself to manage tests and route results to the test manager, it may be necessary to validate portions of the array before continuing with later tests. In such cases, testing can be performed as a parallel wave from a core test manager, testing the entire two-dimensional device in time proportional to the square root of the number of substitutable units (e.g., [8]).
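The row and column grouping described above can be sketched in software. This is an illustrative simulation only: `group_passes` is a hypothetical hook standing in for configuring a group and running its self-test, and the simulated tester assumes a group fails exactly when it contains a defective unit.

```python
def locate_suspects(n_rows, n_cols, group_passes):
    """Return the units implicated by both a failing row group and a failing column group.

    group_passes(units) is a hypothetical hook that configures the listed units
    together, runs a self-test, and returns True when the result is correct.
    """
    bad_rows = [r for r in range(n_rows)
                if not group_passes([(r, c) for c in range(n_cols)])]
    bad_cols = [c for c in range(n_cols)
                if not group_passes([(r, c) for r in range(n_rows)])]
    return {(r, c) for r in bad_rows for c in bad_cols}

def make_simulated_tester(defects):
    """Simulated hardware: a group passes only if it contains no defective unit."""
    return lambda units: not any(unit in defects for unit in units)

one_defect = locate_suspects(8, 8, make_simulated_tester({(2, 5)}))
two_defects = locate_suspects(8, 8, make_simulated_tester({(1, 1), (3, 4)}))
print(sorted(one_defect))    # the single defective unit is pinpointed
print(sorted(two_defects))   # two defects implicate four units, two of them good
```

With one defect the intersection of the failing row and failing column is exact. With two defects the same two groupings cannot separate the real defects from the good units at the other two intersections, which is the conservative behavior noted above. The sketch uses one test per row and one per column, so the number of group tests grows with the array edge length rather than with the number of units.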
Global sparing
A defect map approach coupled with component-specific mapping imposes low overhead for defect tolerance. Given a complete map of the defects, we perform a component-specific design mapping to avoid the defects. Defective substitutable units are marked as bad, and scheduling, placement, and routing are performed to avoid these resources. An annealing placer (Chapter 14) can mark the physical location of the defective units as invalid or expensive and penalize any attempts to assign computations to them. Similarly, a router (Chapter 17) can mark defective wires and switches as “in use” or very costly so that they are avoided. The Teramac custom-computing machine tolerated a 10 percent defect rate in logic cells ($P_{sd,\text{logic}} = 0.10$) and a 3 percent defect rate in on-chip interconnect ($P_{sd,\text{interconnect}} = 0.03$) using group testing and component-specific mapping [7].
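As a toy illustration of treating defective resources as "in use," the sketch below routes a two-point connection across a grid of routing sites with a breadth-first search that never expands into a defective site. The grid model and the function name are our own simplifications; a production router works on a far richer routing-resource graph.

```python
from collections import deque

def route(grid_w, grid_h, src, dst, defective):
    """Shortest path of adjacent sites from src to dst that avoids defective sites."""
    frontier = deque([src])
    came_from = {src: None}
    while frontier:
        current = frontier.popleft()
        if current == dst:
            path = []
            while current is not None:        # walk back to the source
                path.append(current)
                current = came_from[current]
            return path[::-1]
        x, y = current
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            inside = 0 <= nxt[0] < grid_w and 0 <= nxt[1] < grid_h
            # Defective sites are treated exactly like sites already in use.
            if inside and nxt not in defective and nxt not in came_from:
                came_from[nxt] = current
                frontier.append(nxt)
    return None                                # no defect-free route exists

detour = route(4, 4, (0, 0), (3, 0), defective={(1, 0), (2, 0)})
blocked = route(4, 4, (0, 0), (3, 0), defective={(1, y) for y in range(4)})
print(detour)
print(blocked)
```

The first call detours through the neighboring row; the second returns `None` because a full column of defects separates source and destination.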
With place-and-route times sometimes running into hours or days, the component-specific mapping approach achieves low overhead for defect tolerance at the expense of longer mapping times. As introduced in Chapter 20, there are several techniques we could employ to reduce this mapping time, including:
• Tuning architectures to facilitate faster mapping by overprovisioning resources and using simple architectures that admit simple mapping; the Plasma chip (an FPGA-like component that was the basis of the Teramac architecture) takes this approach and was highlighted in Chapter 20.
• Trading mapping quality in order to reduce mapping time.
• Using hardware to accelerate placement and routing (also illustrated in Sections 9.4.2 and 9.4.3).
Perfect component model
To avoid the cost of component-specific mapping, an alternative technique is the perfect component model (Section 37.2.1). Here, the goal is to use the defect map to preconfigure the allocation of spares so that the component looks to the user like a perfect component. Like row or column sparing in memory, entire rows or columns may be the substitutable units. Since reconfigurable arrays, unlike memories, have communication lines between blocks, row or column sparing is much more expensive to support than in memories. All interconnect lines must be longer, and consequently slower, to allow configuration to reach across defective rows or columns. The interconnect architecture must be designed such that this stretching across a defective row is possible, which can be difficult in interconnects with many short wires (see Figure 37.3).

FIGURE 37.3 Arrays designed to support row and column sparing. (Labels: spare row, spare column, row configuration, column configuration, segment extension beyond defective row (column), extended segment in use bypassing defective row (column).)
A row of FPGA logic blocks is a much coarser substitutable unit than a memory row. FPGAs from Altera have used this kind of sparing to improve component yield [10, 11], including the Apex 20KE series.

Local sparing
With appropriate architecture or stylized design methodology, it is possible to avoid the need to fully remap the user design to accommodate the defect map. The idea here is to guarantee that it is possible to locally transform the design to avoid defects. For example, in cases where all the LUTs in a cluster are interchangeable, if we provision spares within each cluster as illustrated earlier in the Yield with sparing subsection of Section 37.2.3, it is simply a matter of locally reassigning the functions to LUTs to avoid the defective LUTs.

For regular arrays, Lach et al. [12] show how to support local interchange at a higher level without demanding that the LUTs exist in a locally interchangeable cluster. Consider a $k \times k$ tile in the regular array. Reserve $s$ spares within each $k \times k$ tile so that we only populate $k^2 - s$ LUTs in each such region. We can now compute placements for the $k^2 - s$ LUTs for each of the possible combinations of $s$ defects. In the simplest case, $s = 1$, we precalculate $k^2$ placements for each region (e.g., see Figure 37.4). Once we have a defect map, as long as each region has no more than $s$ defects, we simply assemble the entire configuration by selecting an appropriate configuration for each tile.

Similarly, when a routing channel provides full crossbar connectivity, it may be possible to locally swap interconnect assignments. However, typical FPGA routing architectures do not use fully populated switching; as a result, interconnect sparing is not a local change. Yu and Lemieux [13, 14] show that FPGA switchboxes can be augmented to allow local sparing at the expense of 10 to 50 percent area overhead.
The key idea is to add flexibility to each switchbox that allows a route to shift one (or more) wire track(s) up or down; this allows routes to be locally redirected around broken tracks or switches and then restored to their normal track (see Figure 37.5). To accommodate a particular defect rate and yield target, local interchange will require more spares than global mapping (see the Global sparing subsection). Consider any of the local strategies discussed in this section where we allocate one spare in each local interchange region (e.g., cluster, tile, or channel). If there are two defects in one such region, the component will not be repairable. However, the component may well have adequate spares; they are just assigned to different interchange regions. With the same number of resources, a global remapping would be able to accommodate the design. Consequently, to achieve the same yield rate as the global scheme, the local scheme always has to allocate more spares. This is another consequence of the Law of Large Numbers (see the Yield with sparing subsection): The more locally we try to contain replacement, the higher variance we must accommodate, and the larger overhead we pay to guarantee adequate yield.
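The precomputed-placement idea for regular arrays can be sketched for the simplest case of one spare per tile. The code below is an illustration of the bookkeeping only: it assigns gates to the usable sites in a fixed order, whereas a real flow would run placement and routing for each defect case. All names are hypothetical.

```python
def precompute_tile_placements(k, gates):
    """Build one placement per possible defective site in a k x k tile (one spare)."""
    sites = [(row, col) for row in range(k) for col in range(k)]
    if len(gates) > len(sites) - 1:
        raise ValueError("tile must keep one site spare")
    placements = {}
    for bad_site in sites:
        usable = [site for site in sites if site != bad_site]
        # Naive stand-in for a real placer: fill usable sites in order.
        placements[bad_site] = dict(zip(gates, usable))
    return placements

def select_placement(placements, defects_in_tile):
    """Pick the precomputed placement that avoids the tile's (at most one) defect."""
    if len(defects_in_tile) > 1:
        raise ValueError("more defects than spares in this tile")
    # With no defect any precomputed placement works; take the first one.
    key = next(iter(defects_in_tile), next(iter(placements)))
    return placements[key]

table = precompute_tile_placements(2, ["A", "B", "C"])
chosen = select_placement(table, {(0, 1)})
print(len(table), chosen)
```

For a 2 × 2 tile holding three gates this produces four placements, one for each possible defect location. Assembling a full configuration then reduces to one table lookup per tile.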
FIGURE 37.4 Four placements of a three-gate subgraph (gates A, B, and C) on a 2 × 2 tile.

FIGURE 37.5 Added switchbox flexibility allows local routing around interconnect defects: (a) defect free with spare and (b) configuration avoiding defective track. (Labels: spare track, track defect.)
37.2.5
Defect Tolerance with Matching
In the simple sparing case (Section 37.2.4), we test to see whether each substitutable unit is defect free. Substitutable units with defects are then avoided. This works well for low defect rates such that $P_{sd}$ remains low. However, it can also be highly conservative. In particular, not all capabilities of the substitutable unit are always needed. A configuration of the substitutable unit that avoids the particular defect may still work correctly. Examples where we may not need to use all the devices inside a substitutable unit include the following:
• A typical FPGA logic block, logic element, or slice includes an optional flip-flop and carry-chain logic. Many of the logic blocks in the user’s design leave the flip-flop or carry chain unused. Consequently, “defective” blocks whose defects are confined to the flip-flop or carry chain may still be usable, just for a subset of the logical blocks in the user’s design.
• When the substitutable unit is a collection of $W_{simd}$ bitops, a defect in one of the bitops leaves the unit imperfect. However, the unit may work fine on smaller data. For example, suppose a $W_{simd} = 8$ substitutable unit has a defect in bit position 5. If the application requires some computations on $W_{app} = 4$ bit data elements, the defective 8-bit unit may still perform adequately to support 4 bitops.
• A product term (Pterm) in a programmable logic array (PLA) or programmable array logic (PAL) is typically a substitutable unit. Each Pterm can be configured to compute the AND of any of the inputs to the array (see Figure 37.6). However, no Pterm configured in the array ever needs to be connected to all the inputs. Consequently, defects that prevent a Pterm from connecting to a subset of the inputs may not inhibit it from being configured to implement some of the Pterms required to configure the user’s logic.
Instead of discarding substitutable units with defects, we characterize their capabilities. Then, for each logical configuration of the substitutable unit demanded by the user’s application, we can identify the set of (potentially defective) substitutable units capable of supporting the required configuration. Our mapping then needs to ensure that assignments of logical configurations to physical substitutable units obey the compatibility requirements.

FIGURE 37.6 A PAL OR-term with a collection of substitutable Pterm inputs. (Labels: inputs; an enabled crosspoint allows an input to participate in a Pterm.)

Matching formulation
To support the use of partially defective units as substitutable elements, we can formulate the mapping between logical configurations and substitutable units as a bipartite matching problem. For simplicity of exposition, it is assumed that all the substitutable units are interchangeable. This is likely to be an accurate assumption for LUTs in a cluster or Pterms in a PAL or PLA, but it is not an accurate assumption for clusters in a two-dimensional FPGA routing array. Nonetheless, this assumption allows precise formulation of the simplest version of the problem.

We start by creating two sets of nodes. One set, $R = \{r_0, r_1, r_2, \ldots\}$, represents the physical substitutable resources. The second set, $L = \{l_0, l_1, l_2, \ldots\}$, represents the logic computations from the user’s design that must be mapped to these substitutable units. We add a link $(l_i, r_j)$ if and only if logical configuration $l_i$ can be supported by physical resource $r_j$. This results in a bipartite graph, with $L$ being one side of the graph and $R$ being the other. What we want to find is a complete matching between nodes in $L$ and nodes in $R$; that is, we want every $l_i \in L$ to be matched with exactly one node $r_j \in R$, and every node $r_j \in R$ to be matched with at most one node $l_i \in L$. We can optimally compute the maximum matching between $L$ and $R$ in polynomial time using the Ford–Fulkerson maximum flow algorithm [15] with time complexity $O(|V| \cdot |E|)$ or the Hopcroft–Karp algorithm [16] with time complexity $O(\sqrt{|V|} \cdot |E|)$. In the graph, $|V| = |L| + |R|$ and $|E| = O(|L| \cdot |R|)$.
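The bipartite formulation can be sketched with the classic augmenting-path method, which is what Ford–Fulkerson reduces to on a unit-capacity bipartite graph. This is a minimal illustration rather than an optimized implementation; the compatibility lists in the example are hypothetical.

```python
def complete_matching(compatible, n_resources):
    """Assign each logical configuration to a distinct compatible resource.

    compatible[i] lists the resource indices able to support logical
    configuration i. Returns match with match[i] = resource index, or None
    when no complete matching exists.
    """
    owner = [None] * n_resources   # owner[j] = configuration currently holding resource j

    def try_assign(i, visited):
        for j in compatible[i]:
            if j in visited:
                continue
            visited.add(j)
            # Take a free resource, or evict the holder if it can move elsewhere.
            if owner[j] is None or try_assign(owner[j], visited):
                owner[j] = i
                return True
        return False

    for i in range(len(compatible)):
        if not try_assign(i, set()):
            return None            # no augmenting path: the design cannot be supported
    match = [None] * len(compatible)
    for j, i in enumerate(owner):
        if i is not None:
            match[i] = j
    return match

print(complete_matching([[0, 1], [0], [1, 2]], 3))   # configuration 0 is displaced to resource 1
print(complete_matching([[0], [0]], 2))              # both need resource 0, so no matching
```

Each call to `try_assign` searches for one augmenting path, so the total work is bounded by the number of logical configurations times the number of compatibility edges.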
Since there must be at least as many resources as logical configurations, $|L| \le |R|$, the Hopcroft–Karp algorithm is thus $O(|R|^{2.5})$; for local sparing schemes, $|R|$ might be reasonably in the 10 to 100 range, meaning that the matching problem is neither large nor growing with array size. If the maximum matching fails to be a complete matching (i.e., one that assigns each $l_i$ to a unique match $r_j$), we know that it is not possible to support the design on a particular set of defective resources.

Fine-grained Pterm matching
Naeimi and DeHon use this matching to assign logical Pterms to physical nanowires in a nanoPLA (Chapter 38, Section 38.6) [17, 18]. Before considering defects, all the Pterm nanowires in the PLA are freely interchangeable. Each nanowire that implements a Pterm has a programmable diode between the input nanowires and the nanowire itself. If the diode is programmed into an off state, it disconnects the input from the nanowire Pterm. If the diode is in the on state, it connects the input to the nanowire, allowing it to participate in the AND that the Pterm is computing. The most common defect anticipated in this technology is that the programmable diode is stuck in an off state; that is, it cannot be programmed into a valid on state. Consequently, a Pterm nanowire with a stuck-off diode at a
particular input location cannot be programmed to include that input in the AND it is performing. A typical PLA will have 100 inputs, meaning each product-term nanowire is connected to 100 programmable diodes. A plausible failure rate for the product-term diodes is 5 percent ($P_d = 0.05$). If we demanded that each Pterm be defect free in order to use it, the yield of product terms would be:

$$P_{\text{nwpterm}}(100, 0.05) = (1 - 0.05)^{100} \approx 0.006 \tag{37.9}$$

However, since none of the product terms use all 100 inputs, the probability that a particular Pterm nanowire can support a logical Pterm is much higher. For example, if the Pterm only uses 10 inputs, then the probability that a particular Pterm nanowire can support it is:

$$P_{\text{nwpterm}}(10, 0.05) = (1 - 0.05)^{10} \approx 0.599 \tag{37.10}$$
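The two probabilities above are the ends of a simple trend, which a short calculation makes visible. The sketch assumes independent stuck-off diodes at the 5 percent rate quoted above; the intermediate input counts are our own illustrative choices.

```python
def pterm_support_probability(inputs_used: int, p_diode_defect: float) -> float:
    """Probability that every diode a logical Pterm needs on one nanowire is programmable."""
    return (1 - p_diode_defect) ** inputs_used

P_D = 0.05
for inputs_used in (5, 10, 20, 50, 100):
    p = pterm_support_probability(inputs_used, P_D)
    print(f"{inputs_used:3d} inputs used: P(compatible) = {p:.3f}")
```

Compatibility falls off geometrically with the number of inputs a logical Pterm actually uses, which is why matching pays off most when Pterms are sparse in their inputs.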
Further, typical arrays will have 100 product-term nanowires. This suggests that, on average, this Pterm will be compatible with roughly 60 of the Pterm nanowires in the array; that is, the $l_i$ for this Pterm will end up with compatibility edges to 60 $r_j$’s in the bipartite matching graph described before. As a result, DeHon and Naeimi [18] were able to demonstrate that we can tolerate stuck-off diode defects at $P_d = 0.05$ with no allocated spare nanowires. In other words, we can have $|L|$ as large as $|R|$ and, in practice, always find a complete matching for every PLA. This is true even though the probability of a perfect nanowire is below 1 percent (equation 37.9), suggesting that most arrays of 100 nanowires contain no perfect Pterm nanowires.

This strategy follows the defect map model and does demand component-specific mapping. Nonetheless, the required mapping is local (see the Local sparing section) and can be fast. Naeimi and DeHon [17] demonstrate the results quoted previously using a greedy, linear-time assignment algorithm rather than the slower, optimal algorithm. Further, if it is possible to test the compatibility of each Pterm as part of the trial assignment, it is not necessary to know the defect map prior to mapping.

FPGA component level
It is also possible to apply this matching idea at the component level. Here, the substitutable unit is an entire FPGA component. Unused resources will be switches, wires, and LUTs that are not used by a specific user design. Certainly, if the specific design does not fill the logic blocks in the component, there will be unused logic blocks whose failure may be irrelevant to the proper functioning of the design. Even if the specific design uses all the logic blocks, it will not use all the wires or all the features of every logic block. So, as long as the defects in the component do not intersect with the resources used by a particular FPGA configuration, the FPGA can perfectly support the configuration.
Xilinx’s EasyPath series is one manifestation of this idea. At a reduced cost compared to perfect FPGAs, Xilinx sells FPGAs that are only guaranteed to
work with a particular user design, or a particular set of user designs. The user provides their designs, and Xilinx checks to see whether any of their defective devices will successfully implement those designs. Here, Xilinx’s resource set, R, is the nonperfect FPGAs that do not have defects in the nonrepairable portion of the logic. The logical set, L, is the set of customer designs destined for EasyPath. Xilinx effectively performs the matching and then supplies each customer with FPGA components compatible with their respective designs. Hyder and Wawrzynek [19] demonstrate that the same idea can be exploited in board-level FPGA systems. Here, their resource set, R, is the set of FPGAs on a particular board with multiple FPGAs. Their logical set is the set of FPGA configurations intended for the board. If all the FPGAs on the board were interchangeable, this would also reduce to the previous simple matching problem. However, in practice, the FPGAs on a board typically have different connections. This provides an additional set of topological constraints that must be considered along with resource compatibility during assignment. Rather than creating and maintaining a full defect map of each FPGA in the system, they also use application-specific testing (e.g., Tahoori [20]) to determine whether a particular FPGA configuration is compatible with a specific component on the FPGA board.
37.3
TRANSIENT FAULT TOLERANCE
Recall that transient faults are randomly occurring, temporary deviations from the correct circuit behavior. It is not possible to test for transient faults and configure around them as we did with defects. The impact of a transient fault depends on the structure of the logic and the location of the transient fault. The fault may be masked (hidden by downstream gates that are not currently sensitive to this input), may simply affect the circuit output temporarily, or may corrupt state so that the effect of the transient error persists in the computation long after the fault has occurred. Examples include the following:
• If both inputs to an OR gate should be 1, but one of the inputs is erroneously 0, the output of the OR gate will still have the correct value.
• If the transient fault impacts the combinational output from a circuit, only the output on that cycle is affected; subsequent output cycles will be correct until another transient fault occurs.
• If the transient fault results in the circuit incorrectly calculating the next state transition in a finite-state machine (FSM), the computation may proceed in the incorrect state for an indefinite period of time.
To deal with the general case where transient faults impact the observable behavior of the computation, we must be able to prevent the errors from propagating into critical state or to observable outputs from the computation. This demands that we add or exploit some form of redundancy in the calculation to detect or correct errors as they occur. This section reviews two general
approaches to transient fault tolerance: feedforward correction (Section 37.3.1) and rollback error recovery (Section 37.3.2).
37.3.1
Feedforward Correction
One common strategy to tolerate transient faults is to provide adequate redundancy to correct any errors that occur. This allows the computation to continue without interruption. The simplest example of this redundancy is replication. That is, we arrange to perform the intended computation $R$ times and vote on the result, using the majority result as the value allowed to update state or to be sent to the output. The smallest example uses $R = 3$ and is known as triple modular redundancy (TMR) (see Figure 37.7). In general, for there to be a clear majority, $R$ must be odd, and a system with $R$ replicas can tolerate at least $\frac{R-1}{2}$ simultaneous transient faults. We can perform the multiple calculations either in space, by concurrently placing $R$ copies of the computation on the reconfigurable array, or in time, by performing the computation multiple times on the same datapath.

In the simple design in Figure 37.7, a failure in the voter may still corrupt the computation. This can be treated similarly to nonrepairable area in defect-tolerance schemes:
• If the computation is large compared to the voter, the probability of voter failure may be sufficiently small so that it is acceptable.
• The voter can be implemented in a more reliable technology, such as a coarser-grained feature size.
• The voter can be replicated as well. For example, von Neumann [21] and Pippenger [22] showed that one can tolerate high transient fault rates (up to 0.4 percent) using a gate-level TMR scheme with replicated voters.
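The voting step and its reliability benefit can be sketched as follows. The model is deliberately simple and rests on assumptions not made in the text: replicas fail independently with the same probability per evaluation, the voter itself is fault free, and any case with more than $(R-1)/2$ faulty replicas is counted as a failure.

```python
from collections import Counter
from math import comb

def majority_vote(results):
    """Return the value produced by a strict majority of replicas."""
    value, count = Counter(results).most_common(1)[0]
    if count <= len(results) // 2:
        raise ValueError("no majority among replicas")
    return value

def replicated_failure_probability(r: int, p_replica_fail: float) -> float:
    """Probability that more than (R - 1) / 2 of R independent replicas are faulty."""
    tolerated = (r - 1) // 2
    return sum(comb(r, i) * p_replica_fail ** i * (1 - p_replica_fail) ** (r - i)
               for i in range(tolerated + 1, r + 1))

print(majority_vote([7, 7, 9]))                       # one corrupted replica is outvoted
for r in (1, 3, 5):
    print(r, replicated_failure_probability(r, 1e-3))
```

At a hypothetical per-replica failure probability of 10^-3, moving from one replica to three reduces the failure probability from 10^-3 to about 3 × 10^-6, since two replicas must now fail in the same evaluation.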
FIGURE 37.7 A simple TMR design. (Labels: inputs, three replicas of the computation, vote, outputs.)

TMR strategies have been applied to Xilinx’s Virtex series [23]. Rollins et al. [24] evaluate various TMR schemes on Virtex components, including strategies with replicated voters and replicated clock distribution.

A key design choice in modular redundancy schemes is the granularity at which voting occurs. At the coarsest grain, the entire computational circuit could be the unit of replication and voting. At the opposite extreme, we can replicate and vote individual gates as the von Neumann design suggests. The appropriate choice will balance area overhead and fault rate. From an area overhead standpoint, we would prefer to vote on large blocks; this allows the area of the voters to be amortized across large logic blocks so that the total area grows roughly as the replication factor, $R$. From an area overhead standpoint, we also want to keep $R$ low. From a reliability standpoint, we want to make it sufficiently unlikely that more than $\frac{R-1}{2}$ replicas are corrupted by transient errors in a single cycle. Similar to defects (equation 37.4), the failure rate of a computation, and hence a replica, scales with the number of devices in the computation and the transient fault rate per device; consequently, we want to scale the unit of replication down as fault rate increases to achieve a target reliability with low $R$.

Memory
A common form of feedforward correction is in use today in memories. Memories have traditionally been the most fault-sensitive portions of components because: (1) A value in a memory may not be updated for a large number of cycles; as such, memories integrate faults over many cycles. (2) Memories are optimized for density; as such, they often have low capacitance and drive strength, making them more susceptible to errors. We could simply replicate memories, storing each value in $R$ memories or memory slots and voting the results. However, over the years information theory research has developed clever encoding schemes that are much more efficient for protecting groups of data bits than simple replication [25, 26]. For example, DRAMs used in main memory applications generally tolerate a single-bit fault in a 64-bit data-word using a 72-bit error correcting code. Like the nonrepairable area in DRAMs, the error correcting circuitry in memories is generally built from coarser technology than the RAM memory array and is assumed to be fault free.
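To show how an error correcting code protects data with less overhead than replication, the sketch below implements the small Hamming(7,4) code, which corrects any single flipped bit in a 7-bit codeword carrying 4 data bits. It is a textbook code chosen for brevity, not the 72-bit code used in DRAM systems.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits into a 7-bit codeword (bit positions 1 through 7)."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4          # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4          # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_correct(codeword):
    """Correct up to one flipped bit and return the 4 data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3    # 1-based position of the flipped bit, 0 if none
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]

word = [1, 0, 1, 1]
stored = hamming74_encode(word)
stored[4] ^= 1                          # a transient fault flips one stored bit
print(hamming74_correct(stored) == word)
```

Triplicating 4 data bits would cost 12 stored bits; this code protects them against a single-bit fault with 7. The relative overhead shrinks further for wider words, which is the effect the 72-bit code for 64-bit words exploits.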
37.3.2
Rollback Error Recovery
An alternative technique to feedforward correction is to simply detect when errors occur and repeat the computation when an error is detected. We can detect errors with less redundancy than we need to correct errors (e.g., two copies of a computation are sufficient to detect a single error, while three are required for correction); consequently, detection schemes generally require lower overhead than feedforward correction schemes. If fault rates are low, it is uncommon for errors to occur in the logic. In most cycles, no errors occur and the normal computation proceeds uninterrupted. In the uncommon case in which a transient fault does occur, we stop processing and repeat the computation in time without additional hardware. With reasonably low transient-fault rates, it is highly unlikely that repeated computation will also be in error; in any case, detection guards against errors in the repeated computation as well. To be viable, the rollback technique demands that the application tolerate stalls in computation during rollback. This is easily accommodated in streaming models (Chapter 5, Section 5.1.3) that exploit data-presence signaling (see Data
presence subsection of Section 5.2.1) to tolerate variable timing for operator implementations. When detection and rollback are performed on an operator level, stream buffers between operator datapaths can isolate and minimize the performance impact of rollback.
Detection
To detect errors we use some form of redundancy. Again, this can be either temporal or spatial redundancy. To minimize the performance impact, we can employ a concurrent error detection (CED) technique; that is, in parallel with the normal logic, we compute some additional function or property of the output (see Figure 37.8). We continuously check consistency between the logical output and this concurrent calculation. If the concurrent calculation ever disagrees with the base computation, there is an error in the logic. In the simplest case, the parallel function could be a duplicate copy of the intended logic (see Figure 37.8(b)). Checking then consists of verifying that the two computations produced the same result. However, it is often possible to avoid recomputing the entire function and, instead, compute a property of the output, such as its parity (see Figure 37.8(c)) [27]. The choice of detection granularity is based on the same basic considerations discussed before for feedforward replica granularity. Larger blocks can amortize comparison overhead but will increase block error rates and hence the rate of rollback. For a given fault rate, we reduce comparison block granularity until the rollback rate is sufficiently low that it has little impact on system throughput.
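FIGURE 37.8 A concurrent error-detection strategy and options: (a) generic formulation, (b) duplication, and (c) parity. [Figure not reproduced. In (a), the inputs drive F and a block that computes a property of F, and a checker compares the two to raise Error; in (b), the outputs of F and of a copy of F are compared for inequality; in (c), the parity of the outputs of F is compared with a separately computed parity of F.]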
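The two checking options in Figure 37.8(b) and (c) can be sketched in software. The example logic block F (a bitwise XOR of two words) is an illustrative choice, not taken from the text; it is convenient because the parity of its output can be predicted from the inputs without recomputing F. All function names are hypothetical.

```python
def parity(word: int) -> int:
    """XOR of all bits of a non-negative integer."""
    return bin(word).count("1") & 1


def f(a: int, b: int) -> int:
    """Example logic block F (illustrative): bitwise XOR of two words."""
    return a ^ b


def error_by_duplication(observed: int, a: int, b: int) -> bool:
    """Option (b): recompute F in a second copy and compare outputs."""
    return observed != f(a, b)


def error_by_parity(observed: int, a: int, b: int) -> bool:
    """Option (c): predict the output parity from the inputs alone.
    For XOR, parity(a ^ b) == parity(a) ^ parity(b)."""
    return parity(observed) != (parity(a) ^ parity(b))


if __name__ == "__main__":
    a, b = 0b1011, 0b0110
    good = f(a, b)
    one_flip = good ^ 0b0100    # single-bit upset in the output
    two_flips = good ^ 0b0011   # two-bit upset in the output
    print(error_by_duplication(one_flip, a, b), error_by_parity(one_flip, a, b))
    print(error_by_duplication(two_flips, a, b), error_by_parity(two_flips, a, b))
```

The second case shows the trade-off: a single parity bit is cheaper than a full duplicate but misses upsets that flip an even number of output bits.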
Recovery
When we do detect an error, it is necessary to repeat the computation. This typically means making sure to preserve the inputs to a computation until we can be certain that we have reliably produced a correct result. Conceptually, we read inputs and current state, calculate outputs, detect errors, then produce outputs and save state if no errors are detected. In practice, we often want to pipeline this computation so that we detect errors from a previous cycle while the computation continues, and we may not save state to reliable storage on every calculation. However, even in sequential cases, it may be more efficient to perform a sequence of computations between error checks.
A common idiom is to periodically store, or snapshot, state to reliable memory, store inputs as they arrive into reliable memory, perform a series of data computations, and store results to reliable memory. If no errors are detected between snapshots, then we continue to compute with the new state and discard the inputs used to produce it. If errors are detected, we discard the new state, restore the old state, and rerun the computation using the inputs stored in reliable memory. As noted earlier in the Memory subsection, we have particularly compact techniques for storing data reliably in fault-prone memories; this efficient protection of memories allows rollback recovery techniques to be robust and efficient.
In streaming systems, we already have FIFO streams of data between operators. We can exploit these memories to support rollback and retry. Rather than discarding the data as soon as the operator reads it, we keep it in the FIFO but advance the head pointer past it. If the operator needs to roll back, we effectively reset the head pointer in the FIFO to recover the data for reexecution. When an output is correctly produced and stored in an output FIFO, we can then discard the associated inputs from the input FIFOs.
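The head-pointer scheme just described can be sketched as follows. This is a software model under stated assumptions, not a hardware design: the class and function names are hypothetical, and `detect_error` stands in for whatever concurrent error detection the operator uses.

```python
class RollbackFifo:
    """Input FIFO that retains consumed items until they are committed."""

    def __init__(self):
        self._items = []   # index 0 is the oldest uncommitted item
        self._head = 0     # index of the next item the operator will read

    def push(self, item):
        self._items.append(item)

    def read(self):
        """Return the next item and advance the head without discarding it."""
        item = self._items[self._head]
        self._head += 1
        return item

    def rollback(self):
        """Reset the head so the retained items are read again."""
        self._head = 0

    def commit(self):
        """Discard everything read so far; call once the output is known good."""
        del self._items[:self._head]
        self._head = 0

    def retained(self) -> int:
        return len(self._items)


def run_with_retry(fifo, count, compute, detect_error):
    """Read `count` inputs, compute, and rerun from retained inputs on error."""
    while True:
        inputs = [fifo.read() for _ in range(count)]
        result = compute(inputs)
        if not detect_error(result):
            fifo.commit()
            return result
        fifo.rollback()
```

The loop retries indefinitely, which matches the assumption in the text that transient faults are rare enough for a repeated computation to succeed; a real system would likely bound the number of retries.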
For operators that have bounded depth from input to output, we typically know that we can discard an input set for every output produced.
Communications
Data transmission between two distant points, especially when it involves crossing between chips and computers, is highly susceptible to external noise (e.g., crosstalk from nearby wires, power supply noise, clock jitter, interference from RF devices). As such, for a long time we have protected communication channels with redundancy. As with memories, we simply need to reliably deliver the data sent to the destination. Unlike memories, we do not necessarily need to guarantee that the correct data can be recovered from the potentially corrupted data that arrive at the destination. When the data are corrupted in transmission, it suffices to detect the error. The sender holds onto a copy of the data until the receiver indicates they have been successfully received. When an error is detected, the sender can retransmit the data. The detection and retransmission are effectively a rollback technique. When the error rates on the communication link are low, such that error detection is the uncommon event, this allows data to be protected with low overhead error-detecting codes, or checksums, instead of more expensive
error correcting codes. The Transmission Control Protocol (TCP) used for communication across the Internet includes packet checksums and retransmission when data fail to arrive error free at the intended destination [28].
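The detect-and-retransmit pattern described above can be sketched in a few lines. This is an illustration only: CRC-32 from Python's `zlib` module is used as a convenient stand-in for a checksum (TCP itself uses a 16-bit ones'-complement sum), and the channel model and function names are hypothetical.

```python
import zlib


def make_frame(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 of the payload (stand-in checksum)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check_frame(frame: bytes):
    """Return the payload if the checksum matches, otherwise None."""
    payload, received = frame[:-4], int.from_bytes(frame[-4:], "big")
    return payload if zlib.crc32(payload) == received else None


def send_reliably(payload: bytes, channel, max_attempts: int = 8):
    """The sender keeps `payload` and retransmits until a frame is accepted.
    Returns (delivered_payload, attempts_used)."""
    for attempt in range(1, max_attempts + 1):
        delivered = check_frame(channel(make_frame(payload)))
        if delivered is not None:   # receiver accepts; sender may drop its copy
            return delivered, attempt
    raise RuntimeError("retransmission limit reached")


def corrupt_first_frame():
    """Hypothetical channel that flips one bit in the first frame only."""
    state = {"calls": 0}

    def channel(frame: bytes) -> bytes:
        state["calls"] += 1
        if state["calls"] == 1:
            return bytes([frame[0] ^ 0x01]) + frame[1:]
        return frame

    return channel
```

With this channel, `send_reliably(b"hello", corrupt_first_frame())` rejects the first frame and succeeds on the second attempt, so the only cost of the error is one extra transmission.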
37.4
LIFETIME DEFECTS
Over the lifetime of a component, the physical device will change and degrade, potentially introducing new defects into the device. Individual atomic bonds may break or metal may migrate, increasing the resistance of the path or even breaking a connection completely. Device characteristics may shift because of hot-carrier injection (e.g., [29, 30]), negative bias temperature instability (NBTI) (e.g., [31]), or even accumulated radiation doses (e.g., [32, 33]). These effects become more acute as feature sizes shrink. To maintain correct operation, we must detect the errors (Section 37.4.1) and repair them (Section 37.4.2) during the lifetime of the component.
37.4.1
Detection
One way to detect lifetime failures is to periodically retest the device—that is, we stop normal operation, run a testing routine (see the Testing subsection in Section 37.2.4), then resume normal operation if there are no errors. It can be an application-specific test, determining whether the FPGA can still support the user’s mapping [20], or an application-independent test of the FPGA substrate. Application-specific tests have the advantage of both being more compact and ignoring new defects that do not impact the current design. Substrate tests may require additional computation to determine whether the newly defective devices will impact the design. While two consecutive, successful tests generally mean that the computation between these two points was correct, the component may begin producing errors at any time inside the interval between tests and the error will not be detected until the next test is run. Testing can also be interleaved more directly with operation. In partially reconfigurable components (see Section 4.2.3), it is possible to reconfigure portions of a component while the rest of the component continues operating. This allows the reservation of a fraction of the component for testing. If we then arrange to change the specific portions of the component assigned to testing and operation over time, we can incrementally test the entire component without completely pulling it out of service (e.g., [34, 35]). In some scenarios, the component may need to stall operation during the partial reconfiguration, but the component only needs to stall for the reconfiguration period and not the entire testing period. When the total partial reconfiguration time is significantly shorter than the testing time, this can reduce the fraction of cycles the application must be removed from normal operation. This still means that we may not detect the presence of a new defect until long after it occurred and started corrupting data. 
If it is necessary to detect an error immediately, we must employ one of the fault tolerance techniques reviewed in Section 37.3. CED (see the Detection subsection in Section 37.3.2) can identify an error as soon as it occurs and stall computation. TMR (Section 37.3.1) can continue correct operation if only a single replica is affected; the TMR scheme can be augmented to signal higher-level control mechanisms when the voters detect disagreement.
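The interleaved testing described earlier in this section, where the portion of the component reserved for testing changes over time, can be sketched as a simple rotation. The region model is an assumption for illustration: the component is divided into equal, interchangeable regions, and the design fits in all but one of them. The function name is hypothetical.

```python
def roving_test_schedule(num_regions: int, rounds: int):
    """Yield (region_under_test, regions_in_service) for each round.
    The region reserved for testing advances by one each round."""
    if num_regions < 2:
        raise ValueError("need at least one region in service and one under test")
    for step in range(rounds):
        under_test = step % num_regions
        in_service = [r for r in range(num_regions) if r != under_test]
        yield under_test, in_service


if __name__ == "__main__":
    for under_test, in_service in roving_test_schedule(4, 4):
        print("testing region", under_test, "while running on", in_service)
```

After as many rounds as there are regions, every region has been tested once, and the design only stalls (if at all) while it is moved between rounds.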
37.4.2
Repair
Once a new error has occurred, we can repeat global (see the Global sparing subsection in Section 37.2.4) or local mapping (see the Local sparing subsection in Section 37.2.4) to avoid the new error. However, since the new defect map is most likely to differ from the old defect map by only one or a few defects, it is often easier and faster to incrementally repair the configuration. In local mapping schemes, we only need to perform local remapping in the interchangeable region(s) where the new defect(s) have occurred. This may mean that we only need to move LUTs within a single cluster, reassign wires in a channel, or remap a single tile. Even in global schemes the incremental work required may be modest. Lakamraju and Tessier [36] show that incrementally rerouting connections severed by new lifetime defects can be orders of magnitude faster than performing a complete reroute from scratch. A rollback scheme (Section 37.3.2) can stall execution during the repair. A replicated, feedforward scheme (Section 37.3.1) with partial reconfiguration may be able to continue operating on the functional replicas while the newly defective replica is being repaired. Lifetime repair strategies depend on the ability to perform defect mapping and reconfiguration. Consequently, the perfect component model cannot support lifetime repair. Even if the component retains spare redundancy, redundancy and remapping mechanisms are not exposed to the user for in-field use.
37.5
CONFIGURATION UPSETS
Many reconfigurable components, such as FPGAs, rely on volatile memory cells to hold their configuration, typically static memory cells (e.g., SRAM). Dynamic memory cells have long had to cope with upsets from ionizing particles (e.g., α-particles). As the feature sizes shrink, even static RAM cells can be upset by ionizing particles (e.g., Harel et al. [37]). In storage applications, we can typically cope with memory soft errors using error correcting codes (see the Memory subsection in Section 37.3.1) so that bit upsets can be detected and corrected. However, in reconfigurable components, we use the memory cells directly and continuously as configuration bits to define logic and interconnect. Upsets of these configuration memories will change, and potentially corrupt, the logic operation. Unfortunately, although memories can amortize the cost of a large error correction unit across a deep memory, FPGA configurations are shallow (i.e., N_instr = 1); an error correction scheme similar to DRAM memories would end up being as large as or larger than the configuration memory it protects. Data
and projections from Quinn and Graham [38] suggest that ionizing radiation upsets can be a real concern for current, large FPGA-based systems and will be an ongoing concern even for modest systems as capacity continues to increase. Because these are transient upsets of configuration memories, they can be corrected simply by reloading the correct bitstream once we detect that the bitstream has been corrupted. Logic corruption can be detected using any of the strategies described earlier for lifetime defects (Section 37.4.1). Alternatively, we can check the bitstream directly for errors. That is, we can compute a checksum for the correct bitstream, read the bitstream back periodically, compute the checksum of the readback bitstream, and compare it to the intended bitstream checksum to detect when errors have occurred. When an error has occurred, the bitstream can be reloaded [38, 39]. Like interleaved testing, bitstream readback introduces a latency, which can be seconds long, between configuration corruption and correction. If the application can tolerate infrequent corruption, this may be acceptable. Asadi and Tahoori [40] detail a rollback scheme for tolerating configuration upsets. Pratt et al. [41] use TMR and partial TMR schemes to tolerate configuration upsets; their partial TMR scheme uses less area than a full TMR scheme in cases where it is acceptable for the outputs to be erroneous for a number of cycles as long as the state is protected so that the results return to the correct values when the configuration is repaired.
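The readback check described above (compute a checksum of the correct bitstream, read the configuration back, compare, and reload on mismatch) can be sketched as follows. `FakeDevice` is a hypothetical stand-in for a configuration port and is not a vendor API; CRC-32 is an assumed choice of checksum.

```python
import zlib


class FakeDevice:
    """Hypothetical model of a configuration memory with readback and reload."""

    def __init__(self, bitstream: bytes):
        self.config = bytearray(bitstream)

    def readback(self) -> bytes:
        return bytes(self.config)

    def load(self, bitstream: bytes) -> None:
        self.config = bytearray(bitstream)


def scrub_once(device, golden: bytes, golden_checksum: int) -> bool:
    """Compare the readback checksum with the intended one and reload
    the golden bitstream on mismatch. Returns True if a repair was made."""
    if zlib.crc32(device.readback()) == golden_checksum:
        return False
    device.load(golden)
    return True


if __name__ == "__main__":
    golden = bytes(range(16))
    device = FakeDevice(golden)
    checksum = zlib.crc32(golden)
    device.config[3] ^= 0x10                      # simulated configuration upset
    print(scrub_once(device, golden, checksum))   # repair performed
    print(scrub_once(device, golden, checksum))   # nothing left to repair
```

Calling `scrub_once` on a timer gives the periodic behavior in the text; the interval between calls bounds how long a corrupted configuration can persist.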
37.6
OUTLOOK
The regularity in reconfigurable arrays, coupled with the resource configurability they already possess, allows these architectures to tolerate defects. As features shrink and defect rates increase, all devices, including ASICs, are likely to need some level of regularity and configurability; this will be one factor that serves to narrow the density and cost gap between FPGAs and ASICs. Further, at increased defect rates, it will likely make sense to ship components with defects and defect maps. Since each component will be different, some form of component-specific mapping will be necessary. Transient upsets and lifetime defects further suggest that we should continuously monitor the computation to detect errors. To tolerate lifetime defects, repair will become part of the support system for components throughout their operational lifetime. Increasing defect rates further drive us toward architectures with finer-grained substitutable units. FPGAs are already fairly fine grained, with each bit-processing operator potentially serving as a substitutable unit, but finer-grained architectures that substitute individual wires, Pterms, or LUTs may be necessary to exploit the most aggressive technologies.
References
[1] S. E. Schuster. Multiple word/bit line redundancy for semiconductor memories. IEEE Journal of Solid State Circuits 13(5), 1978.
[2] B. Keeth, R. J. Baker. DRAM Circuit Design: A Tutorial. Microelectronic Systems, IEEE Press, 2001.
[3] J. Bernoulli. Ars Conjectandi. Impensis thurnisiorum, fratrum, Basel, Switzerland, 1713.
[4] A. W. Drake. Fundamentals of Applied Probability Theory. McGraw-Hill, 1988.
[5] A. DeHon. Law of large numbers system design. Nano, Quantum and Molecular Computing: Implications to High Level Design and Validation, S. K. Shukla, R. I. Bahar (eds.), Kluwer Academic, 2004.
[6] W. K. Huang, F. J. Meyer, X.-T. Chen, F. Lombardi. Testing configurable LUT-based FPGAs. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 6(2), 1998.
[7] W. B. Culbertson, R. Amerson, R. Carter, P. Kuekes, G. Snider. Defect tolerance on the TERAMAC custom computer. Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines, 1997.
[8] M. Mishra, S. C. Goldstein. Defect tolerance at the end of the roadmap. Proceedings of the International Test Conference (ITC), 2003.
[9] M. Mishra, S. C. Goldstein. Defect tolerance at the end of the roadmap. Nano, Quantum and Molecular Computing: Implications to High Level Design and Validation, S. K. Shukla, R. I. Bahar (eds.), Kluwer Academic, 2004.
[10] R. G. Cliff, R. Raman, S. T. Reddy. Programmable logic devices with spare circuits for replacement of defects. U.S. Patent number 5,434,514, July 18, 1995.
[11] C. McClintock, A. L. Lee, R. G. Cliff. Redundancy circuitry for logic circuits. U.S. Patent number 6,034,536, March 7, 2000.
[12] J. Lach, W. H. Mangione-Smith, M. Potkonjak. Low overhead fault-tolerant FPGA systems. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 26(2), 1998.
[13] A. J. Yu, G. G. Lemieux. Defect-tolerant FPGA switch block and connection block with fine-grain redundancy for yield enhancement. Proceedings of the International Conference on Field-Programmable Logic and Applications, 2005.
[14] A. J. Yu, G. G. Lemieux. FPGA defect tolerance: Impact of granularity. Proceedings of the International Conference on Field-Programmable Technology, 2005.
[15] T. Cormen, C. Leiserson, R. Rivest. Introduction to Algorithms. MIT Press, 1990.
[16] J. E. Hopcroft, R. M. Karp. An n^2.5 algorithm for maximum matching in bipartite graphs. SIAM Journal on Computing 2(4), 1973.
[17] H. Naeimi, A. DeHon. A greedy algorithm for tolerating defective crosspoints in nanoPLA design. Proceedings of the International Conference on Field-Programmable Technology, IEEE, 2004.
[18] A. DeHon, H. Naeimi. Seven strategies for tolerating highly defective fabrication. IEEE Design and Test of Computers 22(4), 2005.
[19] Z. Hyder, J. Wawrzynek. Defect tolerance in multiple-FPGA systems. Proceedings of the International Conference on Field-Programmable Logic and Applications, 2005.
[20] M. B. Tahoori. Application-dependent testing of FPGAs. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 14(9), 2006.
[21] J. von Neumann. Probabilistic logic and the synthesis of reliable organisms from unreliable components. Automata Studies, C. Shannon, J. McCarthy (eds.), Princeton University Press, 1956.
[22] N. Pippenger. Developments in “the synthesis of reliable organisms from unreliable components.” Proceedings of the Symposia of Pure Mathematics 50, 1990.
[23] C. Carmichael. Triple Module Redundancy Design Techniques for Virtex FPGAs. San Jose, 2006 (XAPP 197, http://www.xilinx.com/bvdocs/appnotes/xapp197.pdf).
[24] N. Rollins, M. Wirthlin, P. Graham, M. Caffrey. Evaluating TMR techniques in the presence of single event upsets. Proceedings of the International Conference on Military and Aerospace Programmable, 2003.
[25] G. C. Clark Jr., J. B. Cain. Error-Correction Coding for Digital Communications. Plenum Press, 1981.
[26] R. J. McEliece. The Theory of Information and Coding. Cambridge University Press, 2002.
[27] S. Mitra, E. J. McCluskey. Which concurrent error detection scheme to choose? Proceedings of the International Test Conference, 2000.
[28] J. Postel (ed.). Transmission Control Protocol: DARPA Internet Program Protocol Specification, RFC 793, Information Sciences Institute, University of Southern California, Marina del Rey, 1981.
[29] E. Takeda, N. Suzuki, T. Hagiwara. Device performance degradation to hot-carrier injection at energies below the Si-SiO2 energy barrier. Proceedings of the International Electron Devices Meeting, 1983.
[30] S.-H. Renn, C. Raynaud, J.-L. Pelloie, F. Balestra. A thorough investigation of the degradation induced by hot-carrier injection in deep submicron N- and P-channel partially and fully depleted unibond and SIMOX MOSFETs. IEEE Transactions on Electron Devices 45(10), 1998.
[31] D. K. Schroder, J. A. Babcock. Negative bias temperature instability: Road to cross in deep submicron silicon semiconductor manufacturing. Journal of Applied Physics 94(1), 2003.
[32] J. Osborn, R. Lacoe, D. Mayer, G. Yabiku. Total dose hardness of three commercial CMOS microelectronics foundries. Proceedings of the European Conference on Radiation and Its Effects on Components and Systems, 1997.
[33] C. Brothers, R. Pugh, P. Duggan, J. Chavez, D. Schepis, D. Yee, S. Wu. Total-dose and SEU characterization of 0.25 micron CMOS/SOI integrated circuit memory technologies. IEEE Transactions on Nuclear Science 44(6), 1997.
[34] J. Emmert, C. Stroud, B. Skaggs, M. Abramovici. Dynamic fault tolerance in FPGAs via partial reconfiguration. Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines, 2000.
[35] S. K. Sinha, P. M. Kamarchik, S. C. Goldstein. Tunable fault tolerance for runtime reconfigurable architectures. Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines, 2000.
[36] V. Lakamraju, R. Tessier. Tolerating operational faults in cluster-based FPGAs. Proceedings of the International Symposium on Field-Programmable Gate Arrays, 2000.
[37] S. Harel, J. Maiz, M. Alavi, K. Mistry, S. Walsta, C. Dai. Impact of CMOS process scaling and SOI on the soft error rates of logic processes. Proceedings of Symposium on VLSI Digest of Technology Papers, 2001.
[38] H. Quinn, P. Graham. Terrestrial-based radiation upsets: A cautionary tale. Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines, 2005.
[39] C. Carmichael, M. Caffrey, A. Salazar. Correcting Single-Event Upsets Through Virtex Partial Configuration. Xilinx, Inc., San Jose, 2000 (XAPP 216, http://www.xilinx.com/bvdocs/appnotes/xapp216.pdf).
[40] G.-H. Asadi, M. B. Tahoori. Soft error mitigation for SRAM-based FPGAs. Proceedings of the VLSI Test Symposium, 2005.
[41] B. Pratt, M. Caffrey, P. Graham, K. Morgan, M. Wirthlin. Improving FPGA design robustness with partial TMR. Proceedings of the IEEE International Reliability Physics Symposium, 2006.
CHAPTER
38
RECONFIGURABLE COMPUTING AND NANOSCALE ARCHITECTURE
André DeHon
Department of Electrical and Systems Engineering
University of Pennsylvania
For roughly four decades integrated circuits have been patterned top down with optical lithography, and feature sizes, F, have shrunk in a predictable, geometric fashion. With feature sizes now far below optical wavelengths (cf. 400 nm violet light and 65 nm feature sizes) and approaching atomic lattice spacings (cf. 65 nm feature sizes and 0.5 nm silicon lattice), it becomes more difficult and more expensive to pattern arbitrary features.
At the same time, fundamental advances in synthetic chemistry allow the assembly of structures made of a small and precise number of atoms, providing an alternate, bottom-up approach to constructing nanometer-scale devices. Rather than relying on ever-finer precision and control of lithography, bottom-up techniques exploit physical phenomena (e.g., molecular dimensions, film thicknesses composed of a precise number of atomic layers, nanoparticles constructed by self-limiting chemical processes) to directly define key feature sizes at the nanometer scale. Bottom-up fabrication gives us access to smaller feature sizes and promises more economical construction of atomic-scale devices and wires.
Both bottom-up structure synthesis and extreme subwavelength top-down lithography can produce small feature sizes only for very regular topologies. In optical lithography, regular interference patterns can produce regular structures with finer resolution than arbitrary topologies [1]. Bottom-up syntheses are limited to regular structures amenable to physical self-assembly. Further, as noted in Chapter 37, construction at this scale, whether by top-down or bottom-up fabrication, exhibits high defect rates. High defect rates also drive increasing demand for regularity to support resource substitution.
At the same time, new technologies offer configurable switchpoints that can fit in the space of a nanoscale wire crossing (Section 38.2.3).
The switches are much smaller than current SRAM configurable switches and can reduce the cost of reconfigurable architectures relative to ASICs. Smaller configurable switchpoints are particularly fortuitous because they make fine-grained configurability for defect tolerance viable. High demand for regularity and fine-grained defect tolerance, coupled with less expensive configurations, increases the importance of reconfigurable architectures. Reconfigurable architectures can accommodate the requirements of these atomic-scale technologies and exploit the density benefits they offer. Nonetheless, to fully accommodate and exploit these cost shifts, reconfigurable architectures continue to evolve.
This chapter reviews proposals for nanoscale configurable architectures that address the demands and opportunities of atomic-scale, bottom-up fabrication. It focuses on the nanoPLA architecture (see Section 38.6 and DeHon [2]), which has been specifically designed to exploit nanowires (Section 38.2.1) as the key building block. Despite the concrete focus on nanowires, many of the design solutions employed by the nanoPLA are applicable to other atomic-scale technologies. The chapter also briefly reviews nanoscale architectures (Section 38.7), which offer alternative solutions to key challenges in atomic-scale design.
Copyright © 2008 by André DeHon. Published by Elsevier Inc.
38.1
TRENDS IN LITHOGRAPHIC SCALING
In the conventional, top-down lithographic model, we define a minimum, lithographically imageable feature size (i.e., half pitch, F) and build devices that are multiples of this imageable feature size. Within the limits of this feature size, VLSI layout can perfectly specify the size of features and their location relative to each other in three dimensions, both in the two-dimensional plane of each lithographic layer and with adequate registration between layers. This gives complete flexibility in the layout of circuit structures as long as we adhere to the minimum imageable and repeatable feature size rules.
Two simplifying assumptions effectively made this possible: (1) Feature size was large compared to atoms, and (2) feature size was large compared to the wavelength of light used for imaging. With micron feature sizes, features were thousands of atoms wide and multiple optical wavelengths. As long as the two assumptions held, we did not need to worry about the discreteness of atoms nor the limits of optical lithography.
Today, however, we have long since passed the point where optical wavelengths are large compared to feature sizes, and we are rapidly approaching the point where feature sizes are measured in single-digit atom widths. We have made the transition to optical lithography below visible light (e.g., 193 nm wavelengths) and subwavelength imaging. Phase shift masking exploits interference of multiple light sources with different phases in order to define feature sizes finer than the wavelength of the source. This has allowed continued feature size scaling but increases the complexity and, hence, the cost of lithographic imaging.
Topology in the regions surrounding a pattern now impacts the fidelity of reproduction of the circuit or interconnect, creating the demand for optical proximity correction. As a result, we see an increase both in the complexity of lithographic mask generation and in the number of masks required. Region-based topology effects also limit the structures we can build. Because of both limitations in patterning and limitations in the analysis of region-based patterning effects, even in “full-custom” designs, we are driven to compose functions from a small palette of regular structures.
Rock’s Law is a well-known rule of thumb in the semiconductor industry that suggests that semiconductor processing equipment costs increase geometrically as feature sizes shrink geometrically. One version of Rock’s Law estimates that the cost of a semiconductor fabrication plant doubles every four years. Fabrication plants for the 90 nm generation were reported to cost $2 to 3 billion. The increasing cost comes from several sources, including the following:
• Increasing demand for accuracy: Alignment of features must scale with feature sizes.
• Increasing demand for purity: Smaller features mean that even smaller foreign particles (e.g., dust and debris) must be eliminated to prevent defects.
• Increasing demand for device yield: As noted in Chapter 37 (see Perfect yield, Section 37.2.3), to keep component yield constant, the per-device defect rate, P_d, must decrease as more devices are integrated onto each component.
• Increasing processing steps: More metal layers plus increasingly complex masks for optical resolution enhancement (described before) demand more equipment and processing.
It is already the case that few manufacturers can afford the capital investment required to develop and deploy the most advanced fabrication plants. Rising fabrication costs continue to raise the bar, forcing consolidation and centralization in integrated circuit manufacturing. Starting at around 90 nm feature sizes, the mask cost per component typically exceeds $1 million. This rising cost comes from the effects previously noted: more masks per component and greater complexity per mask. Coupled with rising component design and verification complexity, this raises the nonrecurring engineering (NRE) costs per chip design. The economics of rising NRE ultimately lead to fewer unique designs. That is, if we hope to keep NRE costs to a small fraction—for example 10 percent—of the potential revenue for a chip, the market must be at least 10 times the NRE cost. With total NRE costs typically requiring tens of millions of dollars for 90 nm designs, each chip needs a revenue potential in the hundreds of millions of dollars to be viable. The bar continues to rise with NRE costs, decreasing the number of unique designs that the industry can support. This decrease in unique designs creates an increasing demand for differentiation after fabrication (i.e., reconfigurability).
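The revenue argument above is simple arithmetic, sketched here for concreteness. The $30 million NRE figure is an illustrative value chosen from within the "tens of millions" range quoted in the text, and the function name is hypothetical.

```python
def min_revenue_for_nre(nre_cost: float, nre_percent: float) -> float:
    """Revenue needed so that NRE is at most `nre_percent` percent of revenue."""
    if not 0 < nre_percent <= 100:
        raise ValueError("nre_percent must be in (0, 100]")
    return nre_cost * 100 / nre_percent


if __name__ == "__main__":
    # Illustrative: $30 million NRE held to 10 percent of revenue.
    print(min_revenue_for_nre(30_000_000, 10))
```

Under these assumptions the chip needs a revenue potential of $300 million, which is the "hundreds of millions of dollars" the text refers to.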
38.2
BOTTOM-UP TECHNOLOGY
In contrast, bottom-up synthesis techniques give us a way to build devices and wires without relying on masks and lithography to define their atomic-scale features. They potentially provide an alternative path to device construction that may provide access to these atomic-scale features more economically than traditional lithography.
This section briefly reviews the bottom-up technology building blocks exploited by the nanoPLA, including nanowires (Section 38.2.1), ordered assembly of nanowires (Section 38.2.2), and programmable crosspoints (Section 38.2.3). These technologies are sufficient for constructing and understanding the basic nanoPLA design. For a roundup of additional nanoscale wire and crosspoint technologies, see the appendix in DeHon’s 2005 article [2].
38.2.1
Nanowires
Chemists and materials scientists are now regularly producing semiconducting and metallic wires that are nanometers in diameter and microns long using bottom-up synthesis techniques. To bootstrap the process and define the smallest dimensions, self-limiting chemical processes (e.g., Tan et al. [3]) can be used to produce nanoparticles of controlled diameter. From these nanoparticle seed catalysts, we can grow nanowires with diameters down to 3 nm [4]. The nanowire self-assembles into a crystalline lattice similar to planar silicon; however, growth is only enabled in the vicinity of the nanoparticle catalyst. As a result, catalyst size defines the diameter of the grown nanowires [5]. Nanowires can be grown to millimeters in length [6], although it is more typical to work with nanowires tens of microns long [7].
Bottom-up synthesis techniques also allow the definition of atomic-scale features within a single nanowire. Using timed growth, features such as composition of different materials and different doping levels can be grown along the axis of the nanowire [8–10]. This effectively allows the placement of device features into nanowires, such as a field effect gateable region in the middle of an otherwise ungateable wire (see Figure 38.1). Further, radial shells of different materials can be grown around nanowires with controlled thickness using timed growth [11, 12] or atomic-layer deposition [13, 14] (see Figure 38.2). These shells can be used to force the spacing between device and wire features, to act as dielectrics for field effect gating, or to build devices integrating heterogeneous materials with atomic-scale dimensions.
After a nanowire has been grown, it can be converted into a metal–silicon compound with lower resistance. For example, by coating select regions of the nanowire with nickel and annealing, we can form a nickel–silicide (NiSi) nanowire [15]. The NiSi resistivity is much lower than the resistivity of heavily doped bulk silicon. Since nanowires have a very small cross-sectional area, this conversion is very important to keep the resistance, and hence the delay, of nanowires low. Further, this conversion is particularly important in reducing contact resistance between nanowires and lithographic-scale power supplies.
FIGURE 38.1 An axial doping profile. By varying doping along the axis of the nanowire, selectively gateable regions can be integrated into the nanowire. [Figure not reproduced; region labels: "Conduct only with field < 1 V" and "Conduct any field < 5 V".]
FIGURE 38.2 A radial doping profile. [Figure not reproduced.]
FIGURE 38.3 The Langmuir–Blodgett alignment of nanowires. [Figure not reproduced.]
38.2.2 Nanowire Assembly
Langmuir–Blodgett (LB) flow techniques can be used to align a set of nanowires into a single orientation, close-pack them, and transfer them onto a surface [16, 17] (see Figure 38.3). The resulting wires are all parallel, but their ends may not be aligned. By using wires with an oxide sheath around the conducting core, the wires can be packed tightly without shorting together. The oxide sheath defines the spacing between conductors and can, optionally, be etched away after assembly. The LB step can be rotated and repeated so that we get multiple layers of nanowires [16, 18], such as crossed nanowires for building a wired-OR plane (Section 38.4.1).
38.2.3 Crosspoints

Many technologies have been demonstrated for nonvolatile, switched crosspoints. Common features include the following:

• Resistance that changes significantly between on and off states
• Ability to be made rectifying (i.e., to act as diodes)
• Ability to turn the device on or off by applying a voltage differential across the junction
• Ability to be placed within the area of a crossed nanowire junction
FIGURE 38.4 Switchable molecules sandwiched between nanoscale wires. (Layer labels: Pt, Ti, [2]rotaxane molecules, Pt, Ti.)
Chen et al. [19, 20] demonstrate a nanoscale Ti/Pt-[2]rotaxane-Ti/Pt sandwich (see Figure 38.4), which exhibits hysteresis and nonvolatile state storage showing an order of magnitude resistance difference between on and off states. The state of these devices can be switched at ±2 V and read at ±0.2 V. The basic hysteretic molecular memory effect is not unique to the [2]rotaxane, and the junction resistance is continuously tunable [21]. The exact nature of the physical phenomena involved is the subject of active investigation. LB techniques also can be used to place the switchable molecules between crossed nanowires (e.g., Collier et al. [22], Brown et al. [23]).

In conventional VLSI, the area of an SRAM-based programmable crosspoint switch is much larger than the area of a wire crossing. A typical CMOS switch might be 600F² [24], compared to a 3F × 3F bottom-level metal wire crossing, making the crosspoint more than 60 times the area of the wire crossing. Consequently, the nanoscale crosspoints offer an additional device size reduction beyond that implied by the smaller nanowire feature sizes. This particular device size benefit reduces the overhead for configurability associated with programmable architectures (e.g., FPGAs, PLAs) in this technology, compared to conventional CMOS.
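The area comparison above is simple arithmetic, and a short sketch makes the "more than 60 times" figure explicit. The function name and default arguments are illustrative only; the two numbers (a 600F² switch and a 3F × 3F crossing) are the ones quoted in the text.

```python
def crosspoint_to_crossing_ratio(switch_area_f2: float = 600.0,
                                 crossing_side_f: float = 3.0) -> float:
    """Ratio of an SRAM-based CMOS switch area to a bare wire-crossing area.

    Both areas are expressed in units of F^2, where F is the lithographic
    feature size, so F itself cancels out of the ratio.
    """
    crossing_area_f2 = crossing_side_f * crossing_side_f  # 3F x 3F = 9 F^2
    return switch_area_f2 / crossing_area_f2


ratio = crosspoint_to_crossing_ratio()
print(f"CMOS crosspoint is about {ratio:.1f}x the area of a wire crossing")
```

Because the ratio is independent of F, the same overhead applies at any lithographic node; the nanoscale crosspoint removes it by fitting inside the crossing itself.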
38.3 CHALLENGES

Although the techniques reviewed in the previous section provide the ability to create very small feature sizes using the basic physical properties of materials to define dimensions, they also bring with them a number of challenges that any nanoscale architecture must address, including the following:

• Required regularity in assembly and architecture: These techniques do not allow the construction of arbitrary topologies; the assembly techniques limit us to regular arrays and crossbars of nanowires.
• Lack of correlation in features: The correlation between features is limited. It is possible to have correlated features within a nanowire, but only in a single nanowire; we cannot control which nanowire is placed next to which other nanowire or how they are aligned.
• Differentiation: If all the nanowires in a regular crossbar assembly behaved identically (e.g., were gated by the same inputs or were diode-connected to the same inputs), we would not get a benefit out of the nanoscale pitch. It is necessary to differentiate the function performed by the individual nanowires in order to exploit the benefits of their nanoscale pitch.
• Signal restoration: The diode crosspoints described in the previous section are typically nonrestoring; consequently, it is necessary to provide signal restoration for diode logic stages.
• Defect tolerance: We expect a high rate of defects in nanowires and crosspoints. Nanowires may break or make poor contacts. Crosspoints may have poor contact to the nanowires or contain too few molecules to be switched into a low-resistance state.

38.4 NANOWIRE CIRCUITS

It is possible to build a number of key circuits from the nanoscale building blocks introduced in the previous section, including a diode-based wired-OR logic array (Section 38.4.1) and a restoring nanoscale inverter (Section 38.4.2).
38.4.1 Wired-OR Diode Logic Array
The primary configurable structure we can build is a set of tight-pitched, crossed nanowires. With a programmable diode crosspoint at each nanowire intersection, this crossed nanowire array can serve as a programmable OR-plane. Assuming the diodes point from columns to rows (see Figure 38.5), each row output nanowire serves as a wired-OR for all of the inputs programmed into the low-resistance state. In the figure, programmed on crosspoints are shown in black; off crosspoints are shown in gray. Bold lines represent a nanowire pulled high, while gray lines remain low. Output nanowires are shown bold starting at the diode that pulls them high to illustrate current flow; the entire output nanowire would be pulled high in actual operation. Separate circuitry, not shown, is responsible for pulling wires low or precharging them low so that an output remains low when no inputs can pull it high.

Consider a single-row nanowire, and assume for the moment that there is a way to pull a nondriven nanowire down to ground. If any of the column nanowires that cross this row nanowire are connected with low-resistance crosspoint junctions and are driven to a high voltage level, the current into the column nanowire will be able to flow into the row nanowire and charge it up to a higher voltage value (see O1, O3, O4, and O5 in Figure 38.5). However, if none of the connected column nanowires is high, the row nanowire will remain low (see O2 and O6 in the figure). Consequently, the row nanowire effectively computes the OR of its programmed inputs. The output nanowires do pull their current directly off the inputs and may not be driven as high as the input voltage. Consequently, these outputs will require restoration (Section 38.4.2).

A special use of the wired-OR programmable array is for interconnect. That is, if we restrict ourselves to connecting a single row wire to each column wire, the crosspoint array can serve as a crossbar switch. This allows any input (column) to be routed to any output (row) (see Figure 38.6). This structure is useful for postfabrication programmable routing to connect logic functions and to avoid defective resources. In the figure, programmed on crosspoints are shown in black; off crosspoints are shown in gray. This means that the crossbar shown in the figure is programmed to connect A→T, B→Q, C→V, D→S, E→U, and F→R.

FIGURE 38.5 The wired-OR plane operation. Output equations shown in the figure: O1 = A + C + E; O2 = B + E; O3 = D + E + F; O4 = A + E; O5 = C + D; O6 = B + F.
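The behavior of the wired-OR plane and its crossbar special case can be sketched in a few lines. The sketch below reads the output equations of Figure 38.5 as O1 = A + C + E, O2 = B + E, O3 = D + E + F, O4 = A + E, O5 = C + D, and O6 = B + F, and uses the routing of Figure 38.6. The input assignment is hypothetical: it is not read from the figure, only chosen so that O1, O3, O4, and O5 go high while O2 and O6 stay low, as the text states. It models logic values only and ignores the voltage degradation that motivates restoration.

```python
from typing import Dict, List

# Programmed-on crosspoints per output row (row name -> connected columns).
OR_PLANE: Dict[str, List[str]] = {
    "O1": ["A", "C", "E"],
    "O2": ["B", "E"],
    "O3": ["D", "E", "F"],
    "O4": ["A", "E"],
    "O5": ["C", "D"],
    "O6": ["B", "F"],
}

# Crossbar use: exactly one on crosspoint per column (A->T, B->Q, ...).
CROSSBAR: Dict[str, List[str]] = {
    "T": ["A"], "Q": ["B"], "V": ["C"], "S": ["D"], "U": ["E"], "R": ["F"],
}


def eval_wired_or(plane: Dict[str, List[str]],
                  inputs: Dict[str, bool]) -> Dict[str, bool]:
    """A row is pulled high if any column connected to it is high."""
    return {row: any(inputs[col] for col in cols) for row, cols in plane.items()}


# Hypothetical input values, chosen to agree with the outputs named in the text.
inputs = {"A": True, "B": False, "C": False, "D": True, "E": False, "F": False}

outputs = eval_wired_or(OR_PLANE, inputs)
routed = eval_wired_or(CROSSBAR, inputs)
print(outputs)
print(routed)
```

The same evaluation function serves both uses, which mirrors the point made above: a crossbar is just a wired-OR plane restricted to one programmed crosspoint per column.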
38.4.2 Restoration
As noted in Section 38.4.1, the programmable, wired-OR logic is passive and nonrestoring, drawing current from the input. Further, OR logic is not universal. To build a good, composable logic family, we need to be able to isolate inputs from output loads, restore signal strength and current drive, and invert signals.
FIGURE 38.6 An example crossbar routing configuration. (Inputs A to F on the columns; outputs Q to V on the rows.)
Fortunately, nanowires can be field effect controlled. This provides the potential to build gates that behave like field effect transistors (FETs) for restoration. However, to realize them, we must find ways to create the appropriate gate topology within regular assembly constraints (Section 38.5). If two nanowires are separated by an insulator, perhaps using an oxide core shell, we can use the field from one nanowire to control the other nanowire. Figure 38.7 shows an inverter built using this basic idea. The horizontal nanowire serves as the input and the vertical nanowire as the output. This gives a voltage transfer equation of

$$V_{out} = V_{high}\,\frac{R_{pd}}{R_{pd} + R_{fet}(\mathrm{Input}) + R_{pu}} \qquad (38.1)$$
For the sake of illustration, the vertical nanowire has a lightly doped P-type depletion-mode region at the input crossing that forms a FET controlled by the input voltage (Rfet(Input)). Consequently, a low voltage on the input nanowire allows conduction through the vertical nanowire (Rfet = Ron-fet is small), and a high input depletes the carriers from the vertical nanowire and prevents conduction (Rfet = Roff-fet is large). As a result, a low input allows the nanowire to conduct and pull the output region of the vertical nanowire up to a high voltage. A high input prevents conduction and the output region remains low.

A second crossed region on the nanowire is used for the pulldown (Rpd). This region can be used as a gate for predischarging the output so that the inverter is pulled low before the input is applied, then left high to disconnect the pulldown voltage during evaluation. Alternatively, it can be used as a static load for PMOS-like ratioed logic. By swapping the location of the high- and low-power supplies, this same arrangement can be used to buffer rather than invert the input.

Note that the gate only loads the input capacitively. Consequently, the output current is isolated from the input current at this inverter or buffer. Further, nanowire field effect gating has sufficient nonlinearity so that this gate provides gain to restore logic signal levels [25].

FIGURE 38.7 A nanowire inverter. (Labeled elements: ohmic contact to voltage source Vhigh; precharge or isolation control Rpu; input crossing Rfet with oxide separation over a lightly doped field effect controllable region; inverted (restored) output; voltage control for static load or precharge control Rpd; ground.)
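To see how Equation 38.1 yields inversion in the static-load (ratioed) configuration, it helps to plug in numbers. The sketch below is a minimal illustration only: every resistance value is an assumption picked to show the trend (Ron-fet small and Roff-fet large relative to Rpd), not a measured device parameter from the chapter or its references.

```python
def inverter_vout(v_high: float, r_pd: float, r_fet: float, r_pu: float) -> float:
    """Equation 38.1: resistive divider from Vhigh through Rpu and Rfet to the
    output node, with Rpd as the load to ground."""
    return v_high * r_pd / (r_pd + r_fet + r_pu)


# Assumed, illustrative values (ohms and volts); not taken from the source.
V_HIGH = 1.0
R_PU = 1e5        # pull-up / isolation region, turned on
R_PD = 1e7        # static pulldown load
R_ON_FET = 1e5    # low input: the P-type region conducts
R_OFF_FET = 1e10  # high input: the region is depleted

v_out_for_low_input = inverter_vout(V_HIGH, R_PD, R_ON_FET, R_PU)
v_out_for_high_input = inverter_vout(V_HIGH, R_PD, R_OFF_FET, R_PU)

print(f"low input  -> Vout = {v_out_for_low_input:.3f} V")
print(f"high input -> Vout = {v_out_for_high_input:.6f} V")
```

With these assumed ratios a low input gives an output close to Vhigh and a high input gives an output close to ground. The divider form also makes the design constraint visible: Rpd must sit well above Ron-fet + Rpu and well below Roff-fet for both output levels to be clean.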
38.5 STATISTICAL ASSEMBLY

One challenge posed by regular structures, such as tight-pitch nanowire crossbars, is differentiation. If all the wires are the same and are fabricated at a pitch smaller than we can build arbitrary topologies lithographically, how can we selectively address a single nanowire? If we had enough control to produce arbitrary patterns at the nanometer scale, we could build a decoder (see Figure 38.8) to provide pitch-matching between this scale and the scale at which we could define arbitrary topologies.

The trick is to build the decoder statistically. That is, differentiate the nanowires by giving each one an address, randomly select the nanowires that go into each array, and carefully engineer the statistics to guarantee a high probability that there will be a unique address associated with each nanowire in each nanowire array. We can use axial doping to integrate the address into each nanowire [26]. If we pick the address space sparsely enough, Law of Large Numbers statistics can guarantee unique addressability of the nanowires. For example, if we select 10 nanowires out of a large pool with 10⁶ different nanowire types, we get a unique set of nanowires more than 99.99 percent of the time. In general, we can guarantee more than 99 percent probability of uniqueness of N nanowires using only 100N² addresses [26]. By allowing a few duplications, the address space can be much smaller [27].

Statistical selection of coded nanowires can also be used to assemble nanoscale wires for restoration [2]. As shown in Figure 38.9(a), if coded nanowires can be perfectly placed in an array, we can build the restoration circuit shown in Section 38.4.2 (Figure 38.7) and arrange them to restore the outputs of a wired-OR array. However, the bottom-up techniques that can assemble these tight-pitch feature sizes cannot order or place individual nanowires and cannot provide correlation between nanowires. As shown in Figure 38.9(b), statistical alignment and placement of the restoration nanowires can be used to construct the restoration array. Here, not every input will be restored, but the Law of Large Numbers guarantees that we can restore a reliably predictable fraction of the inputs. For further details, see DeHon [2, 27].

FIGURE 38.8 A decoder for addressing individual nanowires assembled at nanoscale pitch. (Labeled elements: microscale wires; ohmic contact to voltage source; nanoscale wires.)
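The uniqueness figures quoted above follow from a birthday-problem calculation, sketched here. The model is an assumption made for illustration: each nanowire's code is drawn independently and uniformly from the address space, which approximates sampling from a very large, well-mixed pool. The function name is hypothetical.

```python
def prob_all_unique(n_wires: int, n_codes: int) -> float:
    """Probability that n_wires independent, uniform draws from n_codes
    address codes are all distinct (the birthday-problem product)."""
    p = 1.0
    for i in range(n_wires):
        p *= (n_codes - i) / n_codes
    return p


# 10 nanowires drawn from a pool with 10^6 code types.
p_ten_of_million = prob_all_unique(10, 10**6)
print(f"N = 10, 10^6 codes: {p_ten_of_million:.6f}")

# The 100 * N^2 rule of thumb for several array sizes.
rule_of_thumb = {n: prob_all_unique(n, 100 * n * n) for n in (10, 100, 1000)}
for n, p in rule_of_thumb.items():
    print(f"N = {n}, {100 * n * n} codes: {p:.4f}")
```

Under this model the collision probability is roughly N(N − 1)/(2C) for C codes, so choosing C = 100N² holds it near 0.5 percent regardless of N, which is consistent with the "more than 99 percent" claim.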
FIGURE 38.9 A restoration array: (a) ideal and (b) stochastic.

38.6 NANOPLA ARCHITECTURE

With these building blocks we can assemble a complete reconfigurable architecture. This section starts by describing the PLA-based logic block (Section 38.6.1), then shows how PLAs are connected together into an array of interconnected logic blocks (Section 38.6.2). It also notes that nanoscale memories can be integrated with this array (Section 38.6.3), reviews the defect tolerance approach for this architecture (Section 38.6.4), describes how designs are mapped to nanoPLA designs (Section 38.6.5), and highlights the density benefits offered by the technology (Section 38.6.6).
38.6.1 Basic Logic Block
The nanoPLA architecture combines the wired-OR plane, the stochastically assembled restoration array, and the stochastic address decoder to build a simple, regular PLA array (see Figure 38.10). The stochastic decoder described in Section 38.5 allows individual nanowires to be addressed from the lithographic scale for testing and programming (see Figures 38.11 and 38.12). The output of the programmable, wired-OR plane is restored via a restoration plane using field effect gating of the crossed nanowire set as described in Section 38.5 and shown in Figure 38.9.
FIGURE 38.10 A simple nanoPLA block. (Labeled elements include the stochastic address decoder on lines A0 to A3 for configuring the array, the programmable diode crosspoint OR-planes, the stochastic inversion and buffer arrays with their lightly doped control regions, the restoration columns, the precharge or static load devices, the /precharge and /eval control lines, and the ohmic contacts to the supply voltages.)

FIGURE 38.11 Addressing a single nanowire.

FIGURE 38.12 Programming a nanowire–nanowire crosspoint.
As shown in Figure 38.11, an address is applied on the lithographic-scale address lines (A0 . . . A3). The applied address (1100) allows conduction through only a single nanowire. By monitoring the voltage at the common lithographic node at the far end of the nanowire (Vcommon), it is possible to determine whether the address is present and whether the wire is functional (e.g., not broken). By monitoring the timing of the signal on Vcommon, we may be able to determine the resistance of the nanowire.

As shown in Figure 38.12, addresses are applied to the lithographic-scale address lines of both the top and bottom planes to select individual nanowires in each plane. We use the stochastic restoration columns to turn the corner between the top plane and the restoration inputs to the bottom plane. Note that since column 3 is an inverting column, we arrange for the single, selected signal on the top plane to be a low value. Since the stochastic assembly resulted in two restoration wires for this input, both nanowire inputs are activated. As a result, we place the designated voltage across the two marked crosspoints to turn on the crosspoint junctions between the restored inputs and the selected nanowire in the bottom plane.

The restoration planes can provide inversion such that the pair of planes serve as a programmable NOR. The two back-to-back NOR planes can be viewed as a traditional AND–OR PLA with suitable application of DeMorgan's Law. A second set of restoration wires provides buffered, noninverted inputs to the next wired-OR plane; in this manner, each plane gets the true and complement version of each logical signal just as is normally provided at the inputs to a VLSI PLA. Microscale field effect gates (e.g., /evalA and /evalB) control when nanowire logic can evaluate, allowing the use of a familiar 2-phase clocking discipline. As such, the PLA cycle shown in Figure 38.10 can directly implement an FSM. Programmable crosspoints can be used to personalize the array, avoid defective wires and crosspoints (Section 38.6.4), and implement a deterministic function despite fabrication defects and stochastic assembly.
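The NOR–NOR to AND–OR correspondence can be checked exhaustively for a small example. The function below, f = (a AND b) OR (b AND NOT c), is a hypothetical choice made for illustration and does not come from the chapter. The first NOR plane is fed complemented literals so that each NOR term equals a product term; the second NOR plane then produces the complement of their OR.

```python
from itertools import product


def nor(*bits: bool) -> bool:
    """One wired-OR row followed by an inverting restoration wire."""
    return not any(bits)


def nor_nor_planes(a: bool, b: bool, c: bool) -> bool:
    # Plane 1: NOR over complemented literals gives AND terms (DeMorgan).
    t1 = nor(not a, not b)  # equals a AND b
    t2 = nor(not b, c)      # equals b AND (NOT c)
    # Plane 2: NOR of the product terms gives the complemented sum of products.
    return nor(t1, t2)


def and_or_reference(a: bool, b: bool, c: bool) -> bool:
    """Conventional two-level AND-OR form of the same example function."""
    return (a and b) or (b and not c)


# Exhaustive truth-table check over all 8 input combinations.
equivalent = all(
    nor_nor_planes(a, b, c) == (not and_or_reference(a, b, c))
    for a, b, c in product([False, True], repeat=3)
)
print("NOR-NOR equals complemented AND-OR:", equivalent)
```

The NOR–NOR pair yields the complement of the sum of products, so one more inversion recovers the AND–OR function. Since each plane can be followed by either an inverting or a buffering restoration array, both polarities are available without extra logic levels.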
38.6.2 Interconnect Architecture
To construct larger components using the previously described structures, we can build an array of nanoPLA blocks, where each block drives outputs that cross the input (wired) regions of many other blocks (Figure 38.13) [2, 28]. This allows the construction of modest-size PLAs (e.g., 100 Pterms), which are efficient for logic mapping and keep the nanowire runs short (e.g., 10 μm) in order to increase yield and avoid the high resistance of long nanowires. The nanoPLA blocks provide logic units, signal switching, and signal buffering for long wire runs. With an appropriate overlap topology, such nanoPLAs can support Manhattan (orthogonal X–Y) routing similar to conventional, island-style FPGA architectures (Chapter 1).

By stacking additional layers of nanowires, the structure can be extended vertically into the third dimension [29]. Programmable and gateable junctions between adjacent nanowire layers allow routing up and down the nanowire stack. This provides a path to continue scaling logic density when nanowire diameters can shrink no further.

The resulting nanoPLA structure is simple and very regular. Its high-density features are built entirely from tight-pitched nanowire arrays. All the nanowire array features are defined using bottom-up techniques. The overlap topology between nanowires is carefully arranged so that the output of a function (e.g., wired-OR, restoration, routing) is a segment of a nanowire that then crosses the active or input portion of another function. Regions (e.g., wired-OR, restoration) are differentiated at a lithographic scale. Small-scale differentiation features are built into the nanowires and statistically populated (e.g., addressing, restoration).

In the nanoPLA, the wired-OR planes combine the roles of switchbox, connection box, and logic block into one unified logic and switching plane. The wired-OR plane naturally provides the logic block in a nanoPLA block. It also serves to select inputs from the routing channel that participate in the logic. Signals that must be rebuffered or switched through a block are also routed through the same wired-OR plane. Since the configurable switchpoints fit within the space of a nanowire crossing, the wired-OR plane (hence the interconnect switching) can be fully populated, unlike traditional FPGA switch blocks that have a very limited population to reduce their area requirements.

FIGURE 38.13 nanoPLA block tiling with edge I/O to lithographic scale. (Labeled elements: microscale input and output; Y route channel; input (AND) and output (OR) planes; inversion and buffer arrays; nanoPLA block.)
38.6.3 Memories
The same basic crosspoints and nanowire crossbar used for the wired-OR plane (Section 38.4.1) can also serve as the core of a memory bank. An address decoder similar to the one used for programming the wired-OR array (see Section 38.5 and Figure 38.8) supports read/write operations on the memory core [26, 30]. Unique, random addresses can be used to configure deterministic memory addresses, avoiding defective memory rows and columns [31]. A full-component architecture would interleave these memory blocks with the nanoPLA logic blocks similar to the way memory blocks are embedded in conventional FPGAs (Chapter 1).
38.6.4 Defect Tolerance
Nanowires in each wired-OR plane and interconnect channel are locally substitutable (see the Local sparing subsection in Section 37.2.4). The full population of the wired-OR crossbar planes guarantees this is true even for the interconnect channels. We provision spare nanowires based on their defect rate, as suggested in the Yield with sparing subsection of Section 37.2.3. For each array, we test for functional wires as illustrated in Section 38.6.1. Logical Pterms are assigned to nanowires using the matching approach described in the Fine-grained Pterm matching subsection of Section 37.2.5. For a detailed description of nanoPLA defect tolerance, see DeHon and Naeimi [32].
38.6.5 Design Mapping
Logic-level designs can be mapped to the nanoPLA. The logic and physical mapping for the nanoPLA uses similar techniques to those introduced in Part III. Starting from a logic netlist, technology mapping can be performed using PLAmap (see Section 13.3.4) to generate two-level clusters for each nanoPLA block, which can then be placed using an annealing-based placer (Chapter 14). Routing is performed with a PathFinder-based router (Chapter 17). Because of the full population of the switchboxes, the nanoPLA router need only perform global routing. Since nanoPLA blocks provide both logic and routing, the router must also account for the logic assigned to each nanoPLA block when determining congestion. As noted before, at design loadtime, logical Pterms are assigned to specific nanowires using a greedy matching approach (see the Fine-grained Pterm matching subsection of Section 37.2.5).
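The load-time assignment of logical Pterms to physical nanowires can be pictured with a small greedy matcher. This is only a sketch under stated assumptions, not the procedure from Section 37.2.5 or from DeHon and Naeimi [32]: it assumes each tested nanowire is summarized by the set of input columns whose crosspoints are programmable, that a Pterm fits a nanowire when every column it needs is in that set, and that trying the most demanding Pterms first is an acceptable heuristic. All names and the example data are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def greedy_match(
    pterms: Dict[str, Set[str]],
    wires: Dict[str, Set[str]],
) -> Tuple[Dict[str, str], List[str]]:
    """Assign each Pterm to a distinct nanowire that can host it.

    pterms: Pterm name -> input columns that must have ON crosspoints.
    wires:  nanowire id -> columns whose crosspoints are programmable.
    Returns (assignment, unmatched Pterm names).
    """
    free = dict(wires)  # nanowires not yet used
    assignment: Dict[str, str] = {}
    unmatched: List[str] = []
    # Heuristic: place Pterms needing the most crosspoints first.
    for name, needed in sorted(pterms.items(), key=lambda kv: -len(kv[1])):
        wire = next((w for w, usable in free.items() if needed <= usable), None)
        if wire is None:
            unmatched.append(name)
            continue
        assignment[name] = wire
        del free[wire]
    return assignment, unmatched


# Hypothetical test results: w0, w2, and w3 each have some unusable crosspoints.
pterms = {"p0": {"A", "C"}, "p1": {"B"}, "p2": {"A", "B", "D"}}
wires = {
    "w0": {"A", "B", "C"},
    "w1": {"A", "B", "C", "D"},
    "w2": {"B", "D"},
    "w3": {"A", "C", "D"},
}

assignment, unmatched = greedy_match(pterms, wires)
print(assignment, unmatched)
```

The point of the example is that a nanowire with a few bad crosspoints is not wasted: it can still carry any Pterm that does not need those crosspoints, which is why spare nanowires plus matching can absorb high crosspoint defect rates.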
38.6.6 Density Benefits

Despite statistical assembly, lithographic overheads for nanowire addressing, and high defect rates, small feature sizes and compact crosspoints can offer a significant density advantage compared to lithographic FPGAs. When mapping the Toronto 20 benchmark suite [33] to 10-nm full-pitch nanowires (e.g., 5-nm-diameter nanowires with 5-nm spacing between nanowires), we typically see two orders of magnitude greater density than with defect-free 22-nm lithographic FPGAs [2]. As noted earlier, areal density can be further increased by using additional layers of nanowires [29].
38.7 NANOSCALE DESIGN ALTERNATIVES

Several architectures have been proposed for nanoscale logic. A large number are also based on regular crossbar arrays and look similar to the nanoPLA at a gross level (see Table 38.1). Like the nanoPLA, all these schemes employ fine-grained configurability to tolerate defects. Within these architectures there are different ways to address the key challenges (Section 38.3). These architectures enrich the palette of available component solutions, increasing the likelihood of assembling a complementary set of technology and design elements to practically realize nanoscale configurable logic.
38.7.1 Imprint Lithography
In the concrete technology described in Section 38.2, seeded nanowire growth was used to obtain small feature sizes and LB flow to assemble them into parallel arrays. Another emerging technique for producing regular, nanoscale structures (e.g., a set of parallel, tight-pitched wires) is imprint lithography. The masks for imprint lithography can be generated using bottom-up techniques.
TABLE 38.1 A comparison of nano-electronic programmable logic designs

| Component element | HP/UCLA crossbar architecture | CMU nanoFabric | nanoPLA | Stony Brook CMOL | Hewlett-Packard FPNI |
|---|---|---|---|---|---|
| Crosspoint technology | Programmable diode | Programmable diode | Programmable diode | Programmable diode | Programmable diode |
| Nanowire technology | Nano-imprint lithography | Nanopore templates | Catalyst nanowires | Nano-imprint lithography | Nano-imprint lithography |
| Logic implementation | Nanoscale wired-OR | Nanoscale wired-OR | Nanoscale wired-OR | Nanoscale wired-OR | Lithoscale (N)AND2 |
| CMOS↔Nanowire interface | Random particles | – | Coded nanowires | Crossbar tilt | Crossbar tilt |
| Restoration | CMOS | RTD latch | nanowire FET | CMOS | CMOS |
| References | [34, 35, 36] | [37, 38] | [28, 39] | [40] | [41] |
In one scheme, timed vertical growth or atomic-layer deposition on planar semiconductors is used to define nanometer-scale layers of differentially etchable materials. Cut orthogonally, the vertical cross-section can be etched to produce a comblike structure where the teeth, as well as the spacing between them, are single-digit nanometers wide (e.g., 8 nm). The resulting structure can serve as a pattern for nanoscale imprint lithography [42,43] to produce a set of tight-pitched, parallel lines. That is, the long parallel lines resulting from the differential etch can be stamped into a resist mask [43], which is then etched to produce a pattern in a polymer or coated with metal to directly transfer metallic lines to a substrate [42]. These techniques can produce regular nanostructures but cannot produce arbitrary topologies.
38.7.2 Interfacing
When nanowires are fabricated together using imprint lithography, it is not possible to uniquely construct and code nanowires as exploited for addressing in the nanoPLA (Section 38.5). Williams and Kuekes [36] propose the first randomized decoder scheme for differentiating nanoscale wires and interfacing between lithographic and nanoscale feature sizes. They use a physical process to randomly deposit metal particles between the lithographic-scale address lines and the nanoscale wires. A nanowire is controllable by an address wire only if it has a metal particle bridging it to the address line. Unlike the nanowire-coding scheme where addresses are selected from a carefully chosen address space and grown into each nanowire (Section 38.5), in this scheme the address on each nanowire is randomly generated. As a result, this scheme requires 2 to 2.5 times as many address wires as the statistically assembled nanowire-coding scheme.

Alternatively, Strukov and Likharev [40, 44] observe that it should be possible to directly connect each long crossbar nanowire by a nanovia to lithographic-scale circuitry that exists below the nanoscale circuits. The nanovia is a semiconductor pin spaced at lithographic distances and grown with a taper to a nanoscale tip for interfacing with individual nanowires. An array of these pins (e.g., Jensen [45]) can provide nanovia interfaces. The key idea is to pitch-match the lithographically spaced nanovia pins with the nanoscale pitch nanowires and guarantee that there is space in the CMOS below the nanoscale circuitry for the CMOS restoration and programming circuits. Note the following:

• Nanoscale wires can be angled relative to the CMOS circuitry to match the pitch of the CMOS nanovias to the nanoscale wires. Figure 38.14 shows this tilt interfacing to a single nanowire array layer. Nanovias that connect to the CMOS are arranged in a square array with side 2βF_CMOS, where F_CMOS is the half-pitch of the CMOS subsystem, and β is a dimensionless factor larger than 1 that depends on CMOS cell complexity. The nanowire crossbar is turned by an angle α = arcsin(F_nano / (βF_CMOS)) relative to the CMOS pin array, where F_nano is the nanowire half-pitch.
• If sufficiently long nanowires are used, the area per nanowire can be as large as each CMOS cell (e.g., restoration buffer and programming transistors). For example, if we use 10 μm nanowires at 10 nm pitch, each nanowire occupies 10⁵ nm²; each such nanowire could have its own 300 nm × 300 nm CMOS cell (β ≈ 3 for F_CMOS = 45 nm) and keep the CMOS area contained below the nanowire area.

FIGURE 38.14 Nanoscale and CMOS pitch matching via tilt. (Dimension labels: nanowire pitch 2F_nano; tilt angle α; nanovia array pitch 2βF_CMOS.)

For detailed development of this interface scheme, see Likharev and Strukov [44]. Hewlett-Packard employs a variant of the tilt scheme for their field-programmable nanowire interconnect (FPNI) architecture [41].
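The tilt-interface numbers above can be reproduced with a short calculation. The sketch uses the relations stated in the text, α = arcsin(F_nano / (βF_CMOS)) and a nanovia array side of 2βF_CMOS, together with the worked example (10 μm nanowires at 10 nm pitch, a 300 nm × 300 nm cell, F_CMOS = 45 nm). Taking F_nano = 5 nm as half of the 10 nm pitch and deriving β from the cell side are assumptions made to tie the two bullets together; the function names are illustrative.

```python
import math


def tilt_angle_deg(f_nano_nm: float, f_cmos_nm: float, beta: float) -> float:
    """Crossbar tilt relative to the CMOS pin array, in degrees."""
    return math.degrees(math.asin(f_nano_nm / (beta * f_cmos_nm)))


def nanowire_area_nm2(length_nm: float, pitch_nm: float) -> float:
    """Footprint attributable to one nanowire: its length times the full pitch."""
    return length_nm * pitch_nm


F_CMOS_NM = 45.0      # CMOS half-pitch from the example
CELL_SIDE_NM = 300.0  # CMOS cell side from the example
F_NANO_NM = 5.0       # assumed: half of the 10 nm nanowire pitch

beta = CELL_SIDE_NM / (2.0 * F_CMOS_NM)       # cell side = 2 * beta * F_CMOS
wire_area = nanowire_area_nm2(10_000.0, 10.0)  # 10 um long, 10 nm pitch
cell_area = CELL_SIDE_NM ** 2
alpha = tilt_angle_deg(F_NANO_NM, F_CMOS_NM, beta)

print(f"beta = {beta:.2f}")
print(f"nanowire area = {wire_area:.0f} nm^2, CMOS cell area = {cell_area:.0f} nm^2")
print(f"tilt angle = {alpha:.2f} degrees")
```

Under these assumptions the CMOS cell (9 × 10⁴ nm²) fits within the footprint of one nanowire (10⁵ nm²), and the required tilt is only about two degrees, so the crossbar is nearly aligned with the CMOS grid.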
38.7.3 Restoration
Enabled by the array-tilt scheme that allows each nanowire to be directly connected to CMOS circuitry, the hybrid semiconductor–molecular electronics (CMOL) and FPNI nanoscale array designs use lithographic-scale CMOS buffers to perform signal restoration and inversion. CMOS buffers with large feature sizes will be larger than nanowire FETs and have less variation. The FPNI scheme uses nanoscale configurability only to provide programmable interconnect, using a nonconfigurable 2-input CMOS NAND/AND gate for logic. Alternatively, it may be possible to build latches that provide gain and isolation from 2-terminal molecular devices [38]. Specifically, molecules that serve as resonant-tunneling diodes (RTDs) or negative differential resistors have been synthesized [46, 47]. These devices are characterized by a region of negative resistance in their IV-curve. The CMU nanoFabric design shows how to build and integrate latches based on RTD devices. The latches draw their power from the clock and provide restoration and isolation.
38.8 SUMMARY

Between highly regular structures and high defect rates, atomic-scale design appears to demand postfabrication configurability. This chapter shows how configurable architectures can accommodate the extreme regularity required. It further shows that configurable architectures can tolerate extremely limited control during the fabrication process by exploiting large-scale assembly statistics. Consequently, we obtain a path to denser logic using building blocks roughly 10 atoms wide, as well as a path to continued integration in the third dimension.

Spatially configurable design styles become even more important when all substrates are configurable at their base level. We can always configure sequential processors on top of these nanoscale substrates when tasks are irregular and low throughput (see Chapter 36 and the Processor subsection of Section 5.2.2). However, when tasks can be factored into regular subtasks, direct spatial implementation on the configurable substrate will be more efficient, reducing both runtime and energy consumption.
References [1] S. R. J. Brueck. There are no fundamental limits to optical lithography. International Trends in Applied Optics, SPIE Press, 2002. [2] A. DeHon. Nanowire-based programmable architectures. ACM Journal on Emerging Technologies in Computing Systems 1(2), 2005. [3] Y. Tan, X. Dai, Y. Li, D. Zhu. Preparation of gold, platinum, palladium and silver nanoparticles by the reduction of their salts with a weak reductant–potassium bitartrate. Journal of Material Chemistry 13, 2003. [4] Y. Wu, Y. Cui, L. Huynh, C. J. Barrelet, D. C. Bell, C. M. Lieber. Controlled growth and structures of molecular-scale silicon nanowires. Nanoletters 4(3), 2004. [5] Y. Cui, L. J. Lauhon, M. S. Gudiksen, J. Wang, C. M. Lieber. Diameter-controlled synthesis of single crystal silicon nanowires. Applied Physics Letters 78(15), 2001. [6] B. Zheng, Y. Wu, P. Yang, J. Liu. Synthesis of ultra-long and highly-oriented silicon oxide nanowires from alloy liquid. Advanced Materials 14, 2002. [7] M. S. Gudiksen, J. Wang, C. M. Lieber. Synthetic control of the diameter and length of semiconductor nanowires. Journal of Physical Chemistry B 105, 2001. [8] M. S. Gudiksen, L. J. Lauhon, J. Wang, D. C. Smith, C. M. Lieber. Growth of nanowire superlattice structures for nanoscale photonics and electronics. Nature 415, 2002. [9] Y. Wu, R. Fan, P. Yang. Block-by-block growth of single-crystalline Si/SiGe superlattice nanowires. Nanoletters 2(2), 2002. ¨ [10] M. T. Bjork, B. J. Ohlsson, T. Sass, A. I. Persson, C. Thelander, M. H. Magnusson, K. Depper, L. R. Wallenberg, L. Samuelson. One-dimensional steeplechase for electrons realized. Nanoletters 2(2), 2002. [11] L. J. Lauhon, M. S. Gudiksen, D. Wang, C. M. Lieber. Epitaxial core-shell and core-multi-shell nanowire heterostructures. Nature 420, 2002. [12] M. Law, J. Goldberger, P. Yang., Semiconductor nanowires and nanotubes. Annual Review of Material Science 34, 2004. [13] M. Ritala. Advanced ALE processes of amorphous and polycrystalline films. 
Applied Surface Science 112, 1997. ¨ anen, ¨ ¨ T. Sajavaara, J. Keinonen. [14] M. Ritala, K. Kukli, A. Rahtu, P. I. Rais M. Leskela, Atomic layer deposition of oxide thin films with metal alkoxides as oxygen sources. Science 288, 2000.
874
Chapter 38
I
Reconfigurable Computing and Nanoscale Architecture
[15] Y. Wu, J. Xiang, C. Yang, W. Lu, C. M. Lieber. Single-crystal metallic nanowires and metal/semiconductor nanowire heterostructures. Nature 430, 2004. [16] Y. Huang, X. Duan, Q. Wei, C. M. Lieber. Directed assembly of one-dimensional nanostructures into functional networks. Science 291, 2001. [17] D. Whang, S. Jin, C. M. Lieber. Nanolithography using hierarchically assembled nanowire masks. Nanoletters 3(7), 2003. [18] D. Whang, S. Jin, Y. Wu, C. M. Lieber. Large-scale hierarchical organization of nanowire arrays for integrated nanosystems. Nanoletters 3(9), 2003. [19] Y. Chen, D. A. A. Ohlberg, X. Li, D. R. Stewart, R. S. Williams, J. O. Jeppesen, K. A. Nielsen, J. F. Stoddart, D. L. Olynick, E. Anderson. Nanoscale molecularswitch devices fabricated by imprint lithography. Applied Physics Letters 82(10), 2003. [20] Y. Chen, G.-Y. Jung, D. A. A. Ohlberg, X. Li, D. R. Stewart, J. O. Jeppesen, K. A. Nielsen, J. F. Stoddart, R. S. Williams. Nanoscale molecular-switch crossbar circuits. Nanotechnology 14, 2003. [21] D. R. Stewart, D. A. A. Ohlberg, P. A. Beck, Y. Chen, R. S. Williams, J. O. Jeppesen, K. A. Nielsen, J. F. Stoddart. Molecule-independent electrical switching in Pt/organic monolayer/Ti devices. Nanoletters 4(1), 2004. [22] C. Collier, G. Mattersteig, E. Wong, Y. Luo, K. Beverly, J. Sampaio, F. Raymo, J. Stoddart, J. Heath. A [2]catenane-based solid state reconfigurable switch. Science 289, 2000. [23] C. L. Brown, U. Jonas, J. A. Preece, H. Ringsdorf, M. Seitz, J. F. Stoddart. Introduction of [2]catenanes into Langmuir films and Langmuir–Blodgett multilayers: A possible strategy for molecular information storage materials. Langmuir 16(4), 2000. [24] A. DeHon. Reconfigurable Architectures for General-Purpose Computing. AI Technical Report 1586, MIT Artificial Intelligence Laboratory, Cambridge, MA, 1996. [25] A. DeHon. Array-based architecture for FET-based, nanoscale electronics. IEEE Transactions on Nanotechnology 2(1), 2003. [26] A. DeHon, P. 
Lincoln, J. Savage. Stochastic assembly of sublithographic nanoscale interfaces. IEEE Transactions on Nanotechnology 2(3), 2003. [27] A. DeHon. Law of Large Numbers system design. In Nano, Quantum and Molecular Computing: Implications to High Level Design and Validation, Kluwer Academic, 2004. [28] A. DeHon. Design of programmable interconnect for sublithographic programmable logic arrays. Proceedings of the International Symposium on FieldProgrammable Gate Arrays, 2005. [29] B. Gojman, R. Rubin, C. Pilotto, T. Tanamoto, A. DeHon. 3D nanowire-based programmable logic. Proceedings of the International Conference on Nano-Networks 2006. [30] A. DeHon, S. C. Goldstein, P. J. Kuekes, P. Lincoln. Non-photolithographic nanoscale memory density prospects. IEEE Transactions on Nanotechnology 4(2), 2005. [31] A. DeHon. Deterministic addressing of nanoscale devices assembled at sublithographic pitches. IEEE Transactions on Nanotechnology 4(6), 2005. [32] A. DeHon, H. Naeimi. Seven strategies for tolerating highly defective fabrication. IEEE Design and Test of Computers 22(4), 2005. [33] V. Betz, J. Rose. FPGA Place-and-Route Challenge. http://www.eecg.toronto.edu/∼ vaughn/challenge/challenge.html, 1999.
38.8 Summary
875
[34] J. R. Heath, P. J. Kuekes, G. S. Snider, R. S. Williams. A defect-tolerant computer architecture: Opportunities for nanotechnology. Science 280(5370), 1998. [35] Y. Luo, P. Collier, J. O. Jeppesen, K. A. Nielsen, E. Delonno, G. Ho, J. Perkins, H.-R. Tseng, T. Yamamoto, J. F. Stoddart, J. R. Heath. Two-dimensional molecular electronics circuits. ChemPhysChem 3(6), 2002. [36] S. Williams, P. Kuekes. Demultiplexer for a molecular wire crossbar network. U.S. Patent number 6,256,767, July 3, 2001. [37] S. C. Goldstein, M. Budiu. NanoFabrics: Spatial computing using molecular electronics. Proceedings of the International Symposium on Computer Architecture 178–189, 2001. [38] S. C. Goldstein, D. Rosewater. Digital logic using molecular electronics. ISSCC Digest of Technical Papers, IEEE, 2002. [39] A. DeHon, M. J. Wilson. Nanowire-based sublithographic programmable logic arrays. Proceedings of the International Symposium on Field-Programmable Gate Arrays, 2004. [40] D. B. Strukov, K. K. Likharev. CMOL FPGA: A reconfigurable architecture for hybrid digital circuits with two-terminal nanodevices. Nanotechnology 16(6), 2005. [41] G. S. Snider, R. S. Williams. Nano/CMOS architectures using a field-programmable nanowire interconnect. Nanotechnology 18(3), 2007. [42] N. A. Melosh, A. Boukai, F. Diana, B. Gerardot, A. Badolato, P. M. Petroff, J. R. Heath. Ultra high-density nanowire lattices and circuits. Science 300, 2003. [43] M. D. Austin, H. Ge, W. Wu, M. Li, Z. Yu, D. Wasserman, S. A. Lyon, S. Y. Chou. Fabrication of 5 nm linewidth and 14 nm pitch features by nanoimprint lithography. Applied Physics Letters 84(26), 2004. [44] K. K. Likharev, D. B. Strukov. CMOL: Devices, circuits, and architectures. In Introducing Molecular Electronics, Springer, 2005. [45] K. L. Jensen. Field emitter arrays for plasma and microwave source applications. Physics of Plasmas 6(5), 1999. [46] J. Chen, M. Reed, A. Rawlett, J. Tour. 
Large on-off ratios and negative differential resistance in a molecular electronic device. Science 286, 1999. [47] J. Chen, W. Wang, M. A. Reed, M. Rawlett, D. W. Price, J. M. Tour. Roomtemperature negative differential resistance in nanoscale molecular junctions. Applied Physics Letters 77, 2000.
INDEX * (wildcards), 152, 761 0-1 knapsack problem, 553 1:1 mapping, 329–30 area/delay trade-offs, 329 PEs, 337 pitch matching, 330 topology matching, 329–30 Absorbing boundary conditions (ABC), 702 Abstract Physical Model (APM), 322 Abstracted hardware resources, 234–36 Accelerated PathFinder, 418–22 limiting search expansion, 419 multi-terminal nets and, 420, 421 parallelized, 215–16, 421 routing high-fanout nets first, 419 scaling sharing/history costs, 419 See also PathFinder Accelerated simulated annealing, 415–18 communication bandwidth and, 416–17 distributed, 415 hardware-assisted, 418 parallelized, 416 See also Simulated annealing Accelerating technology, 56–59 Actel ProASIC3, 83 Active Pages, 779–802 activation portion, 788 algorithmic complexity, 786–94 array-insert, 788–90 Central Processor, 782, 784–85, 788 configurations, 782 defect tolerance, 779, 799–801 DRAM hardware design, 780 execution with parameters, 787 hardware interface, 780 LCS, 791–94 multiplexing performance, 796 Page Processor, 781 performance results, 781–86 performance versus random processor defects, 800 processing time, 798 processor width performance, 796–97 processor-memory nonoverlap, 784–85 programming model, 781
related work, 801–2 speedup over conventional systems, 782–84 Ad hoc testing, 96 Adaptive Computing Systems (ACS), 57 Adaptive lattice structures, 514 Adaptive nulling, CORDIC algorithm and, 514 Add/subtract FUs, 531, 532 Adder trees computation, 598 creation, 598 template-specific, 596 Adders, 504 floating point implementation, 675–77 in reconfigurable dynamic ATR system, 609 Address indirection, 178 Advanced Encryption Standard (AES), 459, 775 A* heuristic, 373–374 AIG, 285 Algebraic layout specification, 352–60 calculation, 353 case study, 357–60 Altera SignalTap, 271 Altera Stratix, 19–23 block diagram, 19 DSP block, 21 LAB structure, 21 logic architecture, 19–21 logic element, 20 MultiTrack, 21–22 routing architecture, 21–23 Altera Stratix–II, 68, 83, 300 configuration information, 68 horizontal/vertical routing, 308 Alternative region implementations, 544, 549 heterogeneous, 550 number of, 550 obtaining, 550 parallel program partitioning, 557 sequential program partitioning, 549–50 See also Hardware/software partitioning ALTOR, 313–14, 315
878
Index AMD/Intel, 55–56 Amdahl’s Law, 62, 542 equation, 542 in hardware/software partitioning, 542, 543 solution space pruning, 544 Amtel AT40K, 70 Analytic peak estimation, 479–84 data range propagation, 482–84 LTI system, 479–82 See also Peak estimation Analytic placement, 315 AND gates, 133 Angle approximation error, 522–23 Annotations absence of loop-carried memory dependence, 178–79 pointer independence, 178 Antifuse, 17–18 use advantages, 18 Application development, 435–38 challenges, 435 compute models, 93–107 system architectures, 107–25 Application-specific computation unit, 603–4 Application-specific integrated circuits. See ASICs Applications arithmetic implementation, 448–52 characteristics and performance, 441–44 computational characteristics/ performance, 441–43 configure-once implementation, 445 embedded, 476 implementation strategies, 445–48 implementing with FPGAs, 439–52 RTR, 446–47 Architectural space modeling, 816–26 efficiency, 817–25 raw density from architecture, 816–17 Area flow, 280 Area models, 485–96 high-level, 493–96 intersection mismatch, 823 for multiple-wordlength adder, 495 width mismatch, 819–21 Area-oriented mapping, 280–82 Arithmetic BFP, 450 complexity, 442–43 distributed, 503–11
fixed-point, 448–49 implementation, 448–52 infinite-precision, 519 Arithmetic logic units (ALUs), 5, 61, 114, 401 Array processors, 48, 226–30, 790, 2191–222 Array-insert algorithm, 788–90 processor and Active Pages computations, 790 simulation results, 790 Arrays block reconfigurable, 74–75 FPTAs, 745 local, 177 reconfigurable (RAs), 43 See also FPGAs Artificial evolution, 727–29 ASICs cost, 440 debug and verification, 440–41 design time, 638 development, 440 general-purpose hardware implementation, 458 power consumption, 440 replacement, 2 time to market, 439–40 vendors, 754 verification, 637, 638 Associativity, 799–800 Asynchronous transfer mode (ATM) networking, 755 ATM adaptation layer 5 (AAL5), 758 ATR, 591–610 algorithms, 592–94 dynamically reconfigurable designs, 594–600, 604–6 FOA algorithm, 592 with FPGAs, 591–610 implementation methods, 604–7, 608 implementations, 604–9 Mojave system, 604–6 Myrinet system, 606–7 reconfigurable computing models, 607–9 reconfigurable static design, 600–4 in SAR imagery, 591 SLD, 592–94 statically reconfigurable system, 606–7 Automated worm detection, 766–67 Automatic compilation, 162–75, 212–13 dataflow graphs, building, 164–69
Index DFG optimization, 169–73 DFG to reconfigurable fabric, 173–75 hyperblocks, 164 memory node connections, 175 operation packing, 173–74 pipelined scheduling, 174–75 runtime netlist, 413–14 scheduling, 174 TDF, 212 See also C for spatial computing Automatic HW/SW partitioning, 175–76 Automatic partitioning trend, 540–42 Automatic Target Recognition. See ATR Back-pressure signal, 210 Backtrack algorithm, 615–17 conflict analysis, 625 distributed control architecture, 620 efficiency, 616 FSM, 621–22 implementing, 619–24 implication circuit, 620 improved, 617–18, 626–27 improved, implementing, 624–27 nonchronological backtracking, 618 reconfigurable solver, 618–27 static variable ordering, 617 terminating conditions, 616 variable values, 619 Basic blocks, 163 Batcher bitonic sorter, 357–60 BEE Platform Studio (BPS), 192, 193 BEE2 platform, 191–94 design flow, 194 I/O, 200 Bellman–Ford algorithm, 386 Bernoulli’s Law of Large Numbers, 835 Bidirectional switches, 377 Binary-level partitioning, 559 Binding flexible, 236–38 install time, 236–37 runtime, 237–38 Bipartitioning, 312, 646 Bitonic sorter, 357–60 ilv combinator, 359 layout and behavior specification, 358 merger, 359 recursion and layout, 360 recursive structure, 357 Bitops (bit operations), 808 BLAS routines, 685
879
Block floating point (BFP), 450 Block reconfigurable arrays, 74–75 BlockRAMs, 585, 708, 713, 766 caching modules, 715 dual-ported, 716 latency, 713 Bloom filters, 762 payload scanning with, 762 SIFT used, 766 Boolean expressions, 464 Boolean operators, 465 Boolean satisfiability (SAT), 613–35 algorithms, 615–18 applications, 614 backtrack algorithm, 615–17 backtrack algorithm improvement, 617–18 clauses, 614 CNF, 613 complete algorithms, 615 formulas, mapping, 634 formulation, 282, 613–14 incomplete algorithms, 615 parallel processing, 618–19 problem, 613 problem analysis, 618–19 test pattern generation, 614, 615 See also SAT solvers Booth technique, 495 BORPH, 197 Bottom-up structure synthesis, 853 Bottom-up technology, 855–58 crosspoints, 857–58 nanowires, 856–57 Bulk Synchronous Parallelism (BSP), 118–19 Butterflies, 688 C++ language, 541 C compiler flow, 163 C compiler frontend, 163–64 CFG, 163 live variable analysis, 163 processing procedures, 164 C for spatial computing, 155–80 actual control flow, 159–60 automatic compilation, 162–75 automatic HW/SW partitioning, 175–76 common path optimization, 161–62 data connections between operations, 157 full pushbutton path benefits, 155–56
880
Index C for spatial computing (cont.) hyperblocks, 164 if-then-else with multiplexers, 158–59 memory, 157–58 mixed operations, 157 partitioning, 155 programmer assistance, 176–80 C language, 155–159, 171, 179, 541 C-slow retiming, 390–93, 827 architectural change requirement, 395–96 benefits, 390 FPGA effects on, 391 interface, 391 latency improvement, 392 low-power environment effect, 392 memory blocks, 391 microprocessor application, 395 as multi-threading, 395–98 results, 392 as threaded design, 391 throughput, 392 See also Retiming Caches configurations, 83 virtually addressed, 397 CAD JHDL system, 255, 265–68 Mentor Graphics, 56 PipeRench tools, 34 runtime, 411 runtime processes, 238 Teramac for, 58 tools, 44–45, 66 Cadence Xcite, 642 Case studies Altera Stratix, 19–23 Xilinx Virtex-II Pro, 23–26 CDFG, See Control dataflow graphs Cellular automata (CA), 122–23, 702–3 folded, 123 two-dimensional, 122 well-known, 122 Cellular programming evolutionary algorithm, 738 Central Limit Theorem, 835 Centralized evolution, 736–37 Chameleon architecture, 40–41 price/performance, 41 Channel width, 430 Checkpointing, 272 Checksums, 847
Chimaera architecture, 42–44 high-level user design language, 44 overview illustration, 43 RFUOPs, 43 VICs, 43–44 Choice networks creating, 285 mapping on, 286 Church–Turing Thesis, 96 Circuit combinators, 352 Circuit emulation, 54–56, 637–68 AMD/Intel, 55–56 impacts, 56 in-circuit, 650 multi-FPGA, 641–44 single-FPGA, 640–41 system uses, 639–40 Virtual Wires, 56 VLE, 653–65 Circuit graph bidirectional switches, 377 de-multiplexers, 376–77 edges, 377 extensions, 376–77 model, 367 symmetric device inputs, 376 Circuit layout algebraic specification, 352–60 calculation, 353 deterministic, 352 explicit Cartesian specification, 351–52 no and totally explicit, 350 problem, 347–51 regularity, 319 specifying, 347–63 verification for parameterized designs, 360–62 CLAP tool, benchmarks, 336 Clause modules, 629–30 Clearspeed SIMD array, 221 Clock cycles for circuit mapping, 649 latency, 507 N/L, 510 packing operations into, 173–74 reducing number of, 506 Clock frequency, 506 Cloning, 54 Clustering, 213, 227, 228, 304–6 benefits, 304 goals, 304 iRAC algorithm, 306
Index mechanical, 423 RASP system, 304–5 T-VPack algorithm, 305 VPack algorithm, 305 CMOS scaling, 507 CMX-2X, 60 Coarse-grained architectures, 32–33 PipeRench, 32–34 Codesign ladder, 541 Coding phase, 582, 585–86 block diagram, 586 See also SPIHT Col combinator, 354–55 Columns, skipping, 602 Common path, 161–62 Common subexpression elimination (CSE), 171 Communicating Sequential Processes (CSP), 93, 106 Communication, 243–48 I/O, 247 intertask, 251 latency, 247 method calls, 244 point-to-point, 251 shared memory, 243–44 streams, 244–46 styles, 243–46 virtual memory, 246–47 Compaction, 324, 337–44 HWOP selection, 338 optimization techniques, 338–42 phases, 337–38 regularity analysis, 338 Compilation, 212–13 accelerating classical techniques, 414–22 architecture effect, 427–31 automatic, 162–75, 212–13 C, uses and variations, 175–80 fast, 411–32 incremental place and route, 425–27 multiphase solutions, 422–25 partitioning-based, 423 PathFinder acceleration, 418–22 runtime netlist, 411, 413–14, 432 simulated annealing acceleration, 415–18 slow, 411 for spatial computing, 155–80 Compilation flow, 150–52 Complete evolution, 736–38 centralized, 736–37
population-oriented, 737–38 See also Evolvable hardware (EHW) Complete matching, 841 Complex programmable logic devices (CPLDs), 292 Component reuse, 198–200 signal-processing primitives, 198 tiled subsystems, 198–200 See also Streaming FPGA applications Computations data-centric, 110 data-dependent, 104 on dataflow graph, 99 density of, 826 deterministic, 95 feedforward, 389 fixed-point, 475–99 memory-centric, 779–802 models, 96 nondeterministic, 96 phased, 104 SCORE, 205 spatial, 157 stream, 203–17 Compute bound algorithms, 443 Compute models, 92–107 applications and, 94 challenges, 93–97 correctness reasoning, 95 data parallel, 105 data-centric, 105–6 dataflow, 98–103 in decomposing problems, 94–95 diversity, 92 functions, 97 multi-threaded, 93, 106 object-oriented, 98 objects, 97–98 parallelism existence, 95 SCORE, 74, 203–17 sequential control, 103–5 taxonomy, 93 transformation permissibility, 95 Turing–Complete, 97 Compute units, 319 Computing primitives, 95 Concurrent statements, 144, 150 Concurrent-error detection (CED), 846 Configurable Array Logic (CAL), 53 Configurable bitstreams, 16, 402–6 closed architecture, 402 configuration, generation, 401–9
881
882
Index Configurable bitstreams (cont.) control bits, 405 data generation software, 407–8 downloading mechanisms, 406–7 generation, 401–9 open, 408 sizes, 405, 406 tool flow, 408 underlying data structure, 402 Configurable logic blocks (CLBs), 23, 325 complexity, 507 flip-flops, 508 multiple, 509 resource reduction, 508 XC6200, 741 Configuration transfer time reduction, 80–82 architectural approaches, 81 compression, 81–82 data reuse, 82 Configuration upsets, 849–50 Configuration(s) architectures, 66–76 block reconfigurable, 74–75 cache, 83 caching, 77 compression, 81–82 controller, 66, 73 cycles, number of, 68 data reuse, 82 data transfer, 67 grouping, 76 multi-context, 68–70 partially reconfigurable, 70–71 pipeline reconfigurable, 73–74 relocation and defragmentation, 71–73 scheduling, 77–79 security, 82–83 single-context, 67–68 swapping, 72 Configure-once implementation, 445 Configured switches, 216 Conjunctive normal form (CNF), 291, 613 Connection blocks, 8 detail, 10 island-style architecture with, 9 Connection Machine, 221, 223 Constant coefficient multipliers, 459, 495 Constant folding, 169, 450–51 automated, 473 constant propagation, 463 implementations with/without, 451
in instance-specific designs, 456–57 in logical expressions, 464–66 Constrained 2D placement, 335–6 Content-addressable memories (CAMs), 444 Context switching, 80 Context-sensitive optimization, 340–42 superslices, 340, 341 See also Compaction Control dataflow graphs (CDFGs), 319 conversion to forest of trees, 330 primitive operators, 332 sequence, 334–35 Control flow, 159–60 implementation, 159 subcircuits, 160 See also C for spatial computing Control flow graph (CFG), 163, 164 Control nets, 322 Controller design, 124, 194–98 with Matlab M language, 195–97 with Simulink blocks, 194–95 with Verilog, 197 with VHDL, 197 Controllers configuration, 66, 73 delay line, 195–96 FSM, 124 RaPiD, 39 sequential, 120 vector architecture, 120 Coordinate systems, CORDIC, 520–21 Coprocessors independent, 36–40 scalar processor with, 117 streaming, 109–10 vector, 121–22 CORDIC, 437, 513–35 adaptive lattice structures and, 514 adaptive nulling and, 514 alternatives, 513, 520 angle approximation error, 522–23 architectural design, 526–27 computation noise, 522 computational accuracy, 521–26 convergence, 527–28 coordinate systems, 520–21 datapath rounding error, 523–26 engine, 527, 534 in FFT, 514 folded architecture, 528–30 functions computed by, 521 implementation, 526–27
Index input mapping, 527 input sample, 525 iterations, 516, 527 Kalman filters and, 514 micro-rotations, 526 parallel linear array, 530–33 PE, 532 processing, 522 quantization effects, 524 realizations, 513–14 result vector error, 523 rotation mode, 514–17 scaling, 517–19 scaling compensation, 534 shift sequences, 522 as shift-and-add algorithm, 513 unified description, 520–21 variable format, 524 vector rotation, 518 vectoring mode, 514, 519–20, 525 in VLSI signal processing, 514 y-reduction mode, 519 z-reduction mode, 517 CORDIC processors datapath format, 526 datapath width, 523 effective number of result bites, 525 FPGA implementation, 527–34 FPGA realizations, 523 full-range, 527, 528 with multiplier-based scaling compensation, 535 PE, 533 COordinate Rotation DIigital Computer. See CORDIC Cosine, 437 Cost function, 440 PathFinder, 368, 375 power-aware, 284 in simulated annealing, 306 Coverification, 639–40, 650–51 flow between workstation and emulator, 665 performance, 650 simulation, 651 use of, 640 VLE interfaces for, 664–65 See also Logic emulation systems CPU blocks, 15 Cray supercomputers, 60 Crosspoints, 857–58 diode, 859 nanowire-nanowire, 866
883
Custom evolvable FPGAs, 743–45 axes, 744 POEtic tissue, 743–45 See also Evolvable hardware (EHW) Customizable instruction processors, 121, 461–62 Cut enumeration-based algorithm, 287 Cut generation, 279–80 Cvt class, 266–68 GUI, 267 implementation, 266–67 D-flip-flops, 596 DA. See Distributed arithmetic DAOmap, 282–83 area improvement, 283 multiple cut selection passes, 283 DAP, 221 Data Encryption Standard (DES), 459 Data nets, 321 Data parallel, 119–22 application programming, 219–30 compute model, 105 languages, 222–23 SIMD, 120 SPMD, 120 system architecture, 119–22 Data presence, 108–9, 110 Data queuing, 756 Data range propagation, 482–84 Data-centric, 105–6, 110 Data-dependent branching, 221 Data-element size, 442–43 Data-oriented specialization, 450–52 Dataflow, 98–103 analysis-based operator size reduction, 172 direction, 321 dynamic streaming, 100–2 dynamic streaming, with peeks, 102 single-rate synchronous, 99 streaming, with allocation, 102–3 synchronous, 99–100 techniques, 93 Dataflow graphs (DFGs), 78, 319 building, 164 circuit generation, 164 computation on, 99 control (CDFG), 319 DSP, 93 edges, 165 edges, building and ordering, 166–68
884
Index Dataflow graphs (DFGs) (cont.) implicit type conversions, 172 live variables at exits, 168–69 multirate, 100 muxes, building, 167 nodes, 165 operations in clock cycles, 173–74 optimization, 164 predicates, 167 scalar variables in memory, 169 single-rate static, 100 as “stepping stone,” 164–65 top-level build algorithms, 165–66 See also DFG optimization Dataflow Intermediate Language (DIL), 34 Dataflow single-rate synchronous, 99 Datapath composition, 319–44 device architecture impact, 324–26 interconnect effect, 326 interface to module generators, 326–29 layout, 322–23 mapping, 329–33 regularity, 320–22 tool flow overview, 323–24 Datapath pruning, 524 Datapath rounding error, 523–26 Datapaths butterfly, 688 with explicit control, 195 FSM, 138–49 FSM communication with, 123–24 high-performance, 184 HWOPs, 320, 321–22 layout, 322–23 sharing, 109 SIMD, 815, 818 word-wide, 216 dbC language, 224 Deadlock, 96 Deadlock prevention, 249 Debug circuitry synthesis, 271–72 Debugging ASICs, 441 FPGAs, 440–41 JHDL, 270–72 Decoders, 376–77, 862–63 Dedicated-wire systems, 641 channel graph, 647 recursive bipartitioning, 646 routing problem, 646 See also Multiplexed-wire systems Deep pipelining, 706
Defect maps with component-specific mapping, 836 model, 832 Defect tolerance, 830–43 Active Pages and, 779, 799–801 associativity and, 800 concept, 830–32 defect map model, 832 global sparing, 836–37 local sparing, 838–39 with matching, 840–43 models, 831–32 nanoPLA, 869 perfect component model, 831, 837–38 with sparing, 835–39 substitutable resources, 832 testing, 835–36 yield, 832–35 Defects faults and, 830 lifetime, 848–49 rates, 829, 832 Defragmentation, 71–73 device support, 77 software-based, 79–80 Delay lines controller, 195–96 synchronous, 194 VPR computation, 309–10 Delay Optimal Mapping algorithm. See DAOmap Delay(s) configurable, 187 as cost approximation, 375 delta, 150 Delta delay, 150 De-multiplexers, 376–77, 862–63 Denial of service (DoS), 774 Denormals, 673 Depth-first search order, 585 Derivative monitors, 490 Deterministic Finite Automata, 103 Device architecture, 3–27 DFG. See Dataflow graphs DFG optimization, 169–73 Boolean value identification, 171 constant folding, 169 CSE, 171 dataflow analysis-based operator size reduction, 172 dead node elimination, 170–71 identity simplification, 170
Index memory access optimization, 172 redundant loads removal, 172–73 strength reduction, 170 type-based operator size reduction, 171–72 See also Dataflow graphs (DFGs) Digital signal processors (DSPs), 49, 93 Direct memory access (DMA), 246 Discrete cosine transform (DCT), 389, 479, 511 Discrete Fourier transform (DFT) output vector, 534 symmetries, 687 Discrete wavelet transform (DWT), 567 architecture illustration, 575 architectures, 571–75 computational complexity, 572 engine runtime, 574 folded architecture, 571, 572 generic 2D biorthogonal, 573–74 partitioned, 572, 573 phase, 582 two-dimensional, 571 Distributed arithmetic (DA), 503–11 algorithm, 504 application on FPGA, 511 FIR filters, 575 implementation, 504–7 LUT size and, 505 performance, improving, 508–11 reduced memory implementation, 507 theory, 503–4 two-bit-at-a-time reduced memory implementation, 509 Division operation, 437 Djikstra’s algorithm, 371 Dot product, 506, 683–86 FPGA implementation, 685 maximum sustainable floating–point rate, 685 multiply–accumulate, 686 multiply–add, 686 performance, 685–87 Downloading mechanisms, 406–7 DRAMs computational hardware, 786 dies, 831 hardware design, 780 high-density, 780 Dtb class, 270 Dynamic FPGAs, 600
885
Dynamic Instruction Set Computer (DISC), 447 Dynamic partial reconfiguration, 742–43 Dynamic reconfiguration, 552 Dynamic RPF, 29 Dynamic scheduling, 240–41 frontier, 240–41 runtime information, 240, 241 See also Scheduling Dynamic streaming dataflow, 101–2 with peeks, 102 primitives, 101 Dynamic testbench, 269–70 Dynamically linked libraries (DLLs), 235, 773 Dynamically reconfigurable ATR system, 604–6 Dynamically reconfigurable designs, 594–600 algorithm modifications, 594 FPGAs over ASICs, 595–96 image correlation circuit, 594–96 implementation method, 599–600 performance analysis, 596–97 template partitioning, 598–99 See also ATR Edge mask display, 190 Edges building, 166–67 circuit graph model, 377 detection design driver, 185 liveness, 165 ordering, 167–68, 172, 173 EDIF (Electronic Design Interchange Format), 407 Effective area, 280 Electric and magnetic field-updating algorithms, 700–1 Embedded memory blocks (EMBs), mapping logic to, 291–92 Embedded microprocessors, 197–98 Embedded multipliers, 514 Embedded multiply–accumulator (MACC) tiles, 514, 680 EMB_Pack algorithm, 292 Epigenesis, 726 Epigenetic axis, 727, 744 Error checking, 233 Error estimation, 485–96 fixed-point error, 486 high-level area models, 493–96
886
Index Error estimation (cont.) LTI systems, 487–89 noise model, 487–88 noise propagation, 488–89 nonlinear differentiable systems, 489–93 quantization, 711–12 simulation, 486 simulation-based methods, 487 Evolution artificial, 727–29 centralized, 736–37 complete, 736–38 extrinsic, 733 intrinsic, 734–35 open-ended, 738–39 population-oriented, 737–38 Evolutionary algorithms (EAs), cellular programming, 738 Evolutionary circuit design, 731, 733 Evolutionary computation, 727 Evolvable hardware (EHW), 729–46 as artificial evolution subdomain, 731 commercial FPGAs, 741–43 complete evolution, 736–38 custom, 743–45 digital platforms, 739–45 dynamic partial reconfiguration, 742–43 evolvable components, 739–40 extrinsic evolution, 733 future directions, 746 genome encoding, 731–32 intrinsic evolution, 734–35 JBits for, 743 living beings analogy, 729–30 off-chip, 732 on-chip, 732 open-ended evolution, 738–39 taxonomy, 733–39 virtual reconfiguration, 741–42 Xilinx XC6200 family, 740–41 Exit nodes, 165 Explicit layout Cartesian, 351–52 no, 350 totally, 350 in VHDL, 351 Explicit synchronization, 248 Exploration 0-1 knapsack problem, 553 complex formulations, 555–56 formulation with asymmetric communication, 553–55
parallel program partitioning, 558 sequential program partitioning, 552–57 simple formulation, 552–53 See also Hardware/software partitioning Extended logic, 12–16 elements, 12–15 fast carry chain, 13–14 multipliers, 14–15 processor blocks, 15 RAM, 15 Extreme subwavelength top-down lithography, 853 Extrinsic EHW, 733 F2 PGA, 735 Factoring, 515–16 False alarm rate (FAR), 592 Fast carry chain, 13–14 Fast Fourier transform (FFT), 21, 389, 479 butterflies, 688 CORDIC algorithm and, 514 data dependencies, 692 FPGA implementation, 689–91 implementation factors, 692 parallel architecture, 689, 690 parallel–pipelined architecture, 690, 691 performance, 691–93 pipelined architecture, 689–91 radix-2, 687–88 FDTD, 697–723 ABCs, 702 accelerating, 702 advantages on FPGA, 705–7 algorithm, 701–3 applications, 703–5 background, 697–701 breast cancer detection application, 703–4 as CA, 702–3, 723 as data and computationally intense, 702 deep pipelining, 706 field-updating algorithms, 700–1 fixed-point arithmetic, 706–7 flow diagram, 701 ground-penetrating radar application, 703 landmine detection application, 704 method, 697–707 model space, 698, 702, 712 parallelism, 705–6 PMLs, 702 reconfigurable hardware implementation, 704 spiral antenna model, 704, 705
Index UPML, 702, 706 See also Maxwell’s equations FDTD hardware design case study, 707–23 4 x 3 row caching model, 719 4-slice caching design, 718 background, 707 data analysis, 709–12 dataflow and processing core optimization, 716–18 expansion to three dimensions, 718–19 fixed-point quantization, 709–12 floating-point results comparison, 710, 711 hardware implementation, 712–22 managed-cache module, 717 memory hierarchy and interface, 712–15 memory transfer bottleneck, 715–16 model specifications, 711 parallelism, 720–21 performance results, 722 pipelining, 719–20 quantization errors, 710 relative error, 710, 711 relative error for different widths, 712 requirements, 707–8 results, 722 two hardware implementations, 721–22 WildStar–II Pro PFGA board, 708–9 Feedforward correction, 844–45 memory, 845 TMR, 844 FF. See Flip-flops Field effect transistors (FETs), 861 Field Programmable Port Extender (FPX) platform, 755, 756 applications developed for, 756 multiple copies, 770 physical implementation block diagram, 757 RAD circuits on, 770, 772 remote configuration on, 773 in WUGS, 756–57 Field-programmable gate arrays. See FPGAs Field-programmable interconnect chips (FPICs), 643 Field-programmable transistor arrays (FPTAs), 745 Field-updating algorithms, 700–1 FIFO, 37, 585 blocks, 586 buffers, 759 queues between operators, 108
887
streams, 847 token buffers, 102 Fine-grained architectures, 30–32 Finite-difference time-domain. See FDTD Finite-impulse response (FIR) filters, 21, 98, 389, 479 4-tap, 510 16-tap, 507, 508 distributed arithmetic, 575 general multipliers, 460 instance-specific multipliers, 460 mapping onto FPGA fabric, 507 SPIHT implementation and, 576 taps, 503 Finite-precision arithmetic, 519 Finite-State Machine with Datapath (FSMD), 112, 124 Finite-state machines (FSMs), 112, 620, 621 coarse-grained, 125 communicating with datapaths, 123–24 controller, 124 datapath example, 138–49 states, 621–22 VHDL programming, 130 Firewalls, 754 First-in, first-out. See FIFO Fixed instructions, 815 Fixed Order SPIHT, 578–80 basis, 579 order, 579 PSNR curve, 579 SPIHT comparison, 581 See also SPIHT Fixed-frequency FPGAs, 394–95 Fixed-Plus-Variable (F + V) computer, 48 Fixed-point computation, 475–99, 706–7 analytic peak estimation, 479–84 FDTD algorithm, 708 peak value estimation, 478–85 precision analysis for, 475–99 relative error, 712 simulation-based peak estimation, 484 Fixed-point error, 486 Fixed-point number system, 448–49, 475–78 2’s complement, 709 data structure, 710 in embedded applications, 476 flexibility, 476 multiple wordlength paradigm, 476–77 reconfigurable logic, 476 Fixed-point precision analysis, 575–78 final variable representation, 578
magnitude calculations, 576 variable representation, 577 See also SPIHT FLAME, 327–28 design data model, 327–28 library specification, 328 Manager, 327 topology description, 328 Flash memory, 17 Flexible API for Module-based Environments. See FLAME Flexible binding, 236–38 fast CAD for, 238 install time binding, 236–37 preemption and, 242 runtime binding, 237–38 See also Operating systems (OSs) FlexRAM, 801 Flip-flops (FFs), 286, 597 CLB, 508 D, 5–6, 596 retiming and, 286 Floating point, 449–50, 671–79, 706 adder block, 676 adder implementation, 675–77 adder layout, 676 application case studies, 679–92 denormals, 673 difficulty, 671–78 dot product, 683–86 FFT, 686–92 IEEE double-precision format, 672 implementation, 692 implementation considerations, 673–75 matrix multiply, 679–83 maximum sustainable rate, 685 multiplier block, 678 multiplier implementation and layout, 678 numbers, 672 summary, 692–94 Floating region, 303 Flow graphs, 78, 79 FlowMap algorithm, 279, 282 Focus of Attention (FOA) algorithm, 592 Folded CA, 123 Folded CORDIC architecture, 528–30 Follow-on SAT solver, 627–33 characteristics, 628 clause modules, 629–30 compilation time reduction, 627–33
conflict analysis, 630 creation methodology, 632 global topology, 628 implementation issues, 631–32 main control unit, 630 optimized pipelined bus system, 628, 629 performance, 630–33 shared-wire global signaling, 628 structural regularity, 628 system architecture, 627–30 See also Boolean satisfiability; SAT solvers Forward error correction (FEC), 755 Forward propagation, 482–84 FPGA fabrics, 14–15, 40–41 arbitrary-precision high-speed adder/subtractors support, 530 architectures, 30–34 dedicated paths, 511 footprint, 527 FPGA placement, 297–98 alternative, 297–98 analytic, 315 challenge, 316 clustering, 304–6 designer directives, 302–4 device legality constraints, 300–1 difficulty, 275 general–purpose FPGAs, 299–316 homogeneous, 503 importance, 299 independence tool, 312 inputs, 299 legal, 300 optimization goals, 301–2 partition-based, 312–15 problem, 299–304 PROXI algorithm, 311–12 routability-driven algorithms, 301 routing architecture influence, 302 simulated annealing, 306–12 simultaneous routing, 311–13 timing-driven algorithms, 301 tools, 301 See also FPGAs FPGAs, 1, 47 antifuse, 17–18 application implementation with, 439–52 arithmetic implementation, 448–52 ATR systems with, 591–610 backend phase, 151
as blank hardware, 16 case studies, 18–23 circuit layout specification, 347–63 clock rates, 441 compilation flow, 151 computing, CORDIC architectures for, 513–35 configuration, 16–18 configuration data transfer to, 67 configuration memory systems, 2 CORDIC processor implementation, 527–34 cost, 440 DA application on, 511 debug and verification, 440–41 dedicated processors, 15 development, 440 dynamic, 600 efficiency of processors and, 825 emulation system, 55 evolvable, 725–46 fabric, 15 fixed-frequency, 394–95 flash memory, 17 flexibility, 87 floating point for, 671–94 general-purpose hardware implementation, 458 island-style, 6, 7, 314 K-gate, 600 LUTs, 4–6, 279 low-quality ASICs use, 1 multi-context, 68–70 network data processing with, 755–56 number formats, 436 partially reconfigurable, 70–71 performance, 438 power consumption, 440 in reconfigurable computing role, 3 routing resources, 348, 367 scaling, 411, 412, 431–32 SIMD computing on, 219–21 single-context, 67–68 SRAM, 16–17 static, 600 streaming application programming, 183–202 strengths/weaknesses, 439–41 testing after manufacture, 407 time to market, 439–40 volatile static-RAM (SRAM), 6 See also FPGA placement
FPgrep, 761 FPsed, 761 FPX. See Field Programmable Port Extender platform Fractional fixed-point data, 523 Fractional guard bits, 522 FSM. See Finite-state machines FSM datapath, 138–49 adder representation, 144 concurrent statements, 144 control signal generation, 145–48 control signal generation illustration, 146 design illustration, 139 multiplexer representation, 144 multiplier representation, 144 next-state decoder, 149 registers, 144–45 sequential statement execution, 149 structural representation, 138–41 time-shared datapath, 141–44 FSMD. See Finite-State Machine with Datapath Full-range CORDIC processors, 527, 528 input quadrant mapping, 528 micro-rotation engine, 529 See also CORDIC Function blocks. See Logic blocks Functional blocks (FBs), 741 Functional mapping algorithms, 277 Functional Unit model, 41–43, 115–16 Functions, 97 GAMA, 331, 333 Garp’s nonsymmetrical RPF, 30–32, 40 configuration bits, 31 configurator, 32 number of rows, 30 partial array configuration support, 31 See also Fine-grained architectures; RPF General computational array model, 807–14 implications, 809–14 instruction distribution, 810–13 instruction storage, 813–14 General-purpose FPGA placement. See FPGA placement General-purpose programming languages (GPLs), 255, 256 Generic 2D biorthogonal DWT, 573–74 Genetic algorithms (GAs), 727–29 components, 729 crossover, 729 decoding, 728
fitness evaluation, 728 genetic operators, 728–29 initialization, 728 mutation, 729 steps, 728–29 variable-length (VGA), 735 Genome encoding, 731–32 fitness calculation, 732 high-level languages, 731 low-level languages, 732 Genomes, 728 Givens rotations, 514 Global RTR, 446, 447 Global sparing, 836–37 Globally Asynchronous, Locally Synchronous (GALS) model, 109 Glue-logic, 441 Granularity, 30–34 coarse, 32–34, 546 dynamically determined, 547 fine-grained, 30–32, 546 heterogeneous, 546 manual partitioning, 546 parallel program partitioning, 557 region, 545 sequential program partitioning, 545–47 See also Hardware/software partitioning Graph bipartitioning, 553 GRASP, 618, 625, 632–33 Greedy heuristics, 553–55 Ground-penetrating radar (GPR), 703, 704, 711 Group migration, 554 Hard macros, 336, 424 Hardware-Accelerated Identification of Languages (HAIL), 768 Hardware-assisted simulated annealing, 418 Hardware description languages (HDLs), 183, 235, 407, 541 Hardware execution checkpoints, 272 Hardware operators (HWOPs) boundary dissolution, 337 compaction, 324 linear stripes, 335 mapping, 323 module generation, 323 multibit wide, 320 neighboring, 338 non-bit-sliced, 324 pitch, 321
pitch-matched, 322 placement, 324 regular structure, 320–21 selection for compaction, 338 swaps, 334 See also Datapaths; HWOP placement Hardware protection, 250–51 Hardware prototyping, 411, 412–13, 432 reasons for employing, 412 Teramac system and, 427 Hardware/software partitioning, 539–59 alternative region implementation, 544, 549–50 exploration, 544, 552–57 FPGA technology and, 539 granularity, 544, 545–47 implementation models, 544, 550–52 of parallel programs, 557–58 partition evaluation, 544, 547–48 problem, 539–40 of sequential programs, 542–57 speedup following Amdahl’s Law, 543 Hash tables, 762 HDL Coder, 183 Heuristic search procedure, 496–97 Heuristics, 553, 555 greedy, 553–55 neighborhood search, 556 nongreedy, 553–55 simulated annealing, 555–56 Hierarchical annealing algorithm, 310–11 Hierarchical composition, 125 Hierarchical FPGAs, 313 Hierarchical routing, 10–12 FPGA placements, 301 long wires, 11 High-fanout nets, 419, 425 High-level languages (HLLs), 44–45, 52, 401 enabling use of, 44–45 genome encoding, 731 Huffman decoding, 233 HWOP placement, 333–37 constrained two-dimensional, 335–36 linear, 333–35 simultaneous tree covering and, 334 styles, 333 two-dimensional, 336–37 HWSystem class, 272 Hyperblocks basic block selection for, 166 building DFGs for, 164–69 formation, 168
I/O, 247 bound algorithms, 443 performance, 443–44 IDCT, 233 IEEE double-precision floating-point format, 672 If-then-else, 158–59 IKOS Logic Emulator, 630–31 IKOS VirtualLogic SLI Emulator, 623 Illinois Pulsar-based Optical Interconnect (iPOINT), 755 Ilv combinator, 359 Image correlation circuit, 594–96 Image-processing design driver, 185–94 2D video filtering, 187–91 horizontal gradient, 188, 189 mapping video filter to BEE2 FPGA platform, 191–94 RGB video conversion, 185–87 vertical gradient, 188, 189 See also Streaming FPGA applications IMap algorithm, 281–82 Implementation models dynamic reconfiguration parameter, 552 parallel program partitioning, 557–58 parameters, 551–52 real-time scheduling, 558 sequential program partitioning, 550–52 See also Hardware/software partitioning Implicit synchronization, 248–49 Imprint lithography, 870–71 Impulse project, 802 In-circuit emulation, 639, 650 Incremental mapping, 425–27 design clock cycle, 663 See also Mapping Incremental partitioning, 661 Incremental place and route, 425–27 Incremental rerouting, 374–75 Incremental routing, 661 Independence tool, 312 Induced architectural models, 814–16 fixed instructions, 815 shared instructions, 815–16 Infinite-impulse response (IIR) filters, 21, 98, 479 Install time binding, 236–37 Instance-specific design, 411, 413, 432, 455–73 approaches, 457–58 architecture adaptation, 457 changing at runtime, 456
concept, 455 constant coefficient multipliers, 459 constant folding, 456–57 customizable instruction processors, 461–62 examples, 459–62 function adaptation, 457 implementation, 456 key-specific crypto-processors, 459–60 NIDS, 460–61 optimizations, 456–57 partial evaluation, 462–73 requirements, 456 taxonomy, 456–57 use examples, 457 Instruction augmentation, 115–16 coprocessor model, 116 Functional Unit model, 115–16 instruction augmentation model, 116 manifestations, 115 Instruction distribution, 810–13 assumptions, 811 wiring, 811 Instruction Set Architecture (ISA) processor models, 103 Instruction-level parallelism, 796 Instructions array-wide, 814 base, 115 controller issuance, 113 fixed, 815 shared, 815–16 storage, 813–14 Integer linear programming (ILP), 497, 553 Integrated mapping algorithms, 284–89 integrated retiming, 286–87 MIS-pga, 288 placement-driven, 287–89 simultaneous logic synthesis, 284–86 See also Technology mapping Integrated retiming, 286–87 Interconnect Altera Stratix MultiTrack, 21–22 connection block, 8–10 effect on datapath placement, 326 hierarchical, 10–12 nearest neighbor, 7–8 optimization, 110 programmability, 12 segmented, 8–10 sharing, 110 structures, 7–12 switch block, 8–10
Internet key exchange (IKE), 775 Internet Protocol Security (IPSec), 775 Internet worms, 760 Interslice nets, 322 Intertask communication, 251 Intraslice nets, 322 Intrinsic evolution, 734–35 Intrusion detection, 756, 762–67 Intrusion detection and prevention system (IDPS), 763 Intrusion detection system (IDS), 762 Intrusion prevention, 756, 762–67 Intrusion prevention system (IPS), 754, 763 IP processing, 758 iRAC clustering algorithm, 305–6 Island-style FPGAs, 6, 7 with connect blocks, 9 partitioning, 314 Isolation, 251 Iterative mapping, 288 Java, 541 JBits, 408, 631 for evolving circuits, 743 JHDL with, 271 JHDL, 88, 89, 255–72 advanced capabilities, 269–72 behavior synthesis, 270 CAD system, 255, 265–68 checkpointing, 272 circuit data structure, 257 as circuit design language, 264–65 debug circuitry synthesis, 271–72 debugging capabilities, 270–72 descriptions, 264 design process illustration, 257 dynamic testbenches, 269–70 as embedded design language, 256 hardware mode, 268–69 Logic Library, 270 module generators, 263 motivation, 255–57 open-source license, 272–73 placement attributes, 263 primitive instantiation, 257–59 primitives library, 257 programmatic circuit generation, 261–63 Sea Cucumber and, 270 simulation/hardware execution environment, 268 as structural design language, 263–64 testbenches, 265–66
JHDL classes cvt, 266–68 dtb, 270 HWSystem, 272 Logic, 259–61, 272 Techmapper, 260, 264, 272 Johnson’s algorithm, 809 K-input lookup tables (K-LUTs), 277 K-Means clustering algorithm, 227, 228 Kalman filters, 514 Key-specific crypto-processors, 459–60 Lagrangian multipliers and relaxation, 376 Lambda Calculus model, 96 Langmuir–Blodgett (LB) flow techniques, 857 Language identification, 767–68 Latency BlockRAMs, 713 butterfly path, 694 C-slow retiming, 392 clock cycle, 507 communication, 247 Lattice ECP2, 83 Lava, 352 LCS algorithm, 791–94 parallel execution, 791 simulation results, 793, 794 three-dimensional, 793–94 two-dimensional, 791–92 Least significant bit (LSB), 321, 510 Leiserson’s algorithm, 384–86 LEKO, 282 LEON benchmark, 398 Lifetime defects, 848–49 detection, 848–49 repair, 849 See also Defects; Defect tolerance Linear placement, 333–35 Linear time-invariant (LTI) systems, 479–82 analytic technique, 487–89 error sensitivity, 489 scaling with transfer functions, 481–82 transfer function calculation, 479–80 Linear-feedback shift registers (LFSRs), 98 Linearization, 490 List of insignificant pixels (LIP), 569 List of insignificant sets (LIS), 569, 570 List of significant pixels (LSP), 569, 570, 586 Lithographic scaling, 854–55
Liveness edges, 165 Local arrays, 177 Local minima, 554 Local RTR, 446–47, 448 Local sparing, 838–39 Location update chain, 417 Logic, 3–6 duplication, 284 elements, 4–6 extended, 12–16 fast carry chain, 13–14 glue, 441 mapping to EMBs, 291–92 multivalued, 150 optimization, 342 programmability, 6 in RTL, 133 simultaneous synthesis, 284–86 unnecessary removal, 466 verification, 638 Logic blocks, 5, 6, 13 Logic class, 259–61 methods, 260–61 MUX example, 259–60 subroutines, 259 Logic emulation systems, 411, 412–13, 432, 637–68 background, 637–39 case study, 653–65 complexity, 639 configuration illustration, 639 coverification, 639–40, 650–51 fast FPGA mapping, 652–53 FPGA-based, 637–39 FPGA-based, advantages, 667 future trends, 666–67 in-circuit emulation, 639, 650 issues, 650–51 logic analysis, 651 multi-FPGA, 641–44 processor-based, 666 single-FPGA, 640–41 types, 640–50 use of, 639–40, 651 VirtuaLogic VLE, 639, 653–65 Logic fabric, 3–34, 14–15, 514 Logic gates, 278 Logic networks, 278 Logic processors, 666–67 LogicGen, 332–33 Lookup table (LUT), 4–6, 264, 409, 503 4-input, 507
DA implementation and, 505 defective, 838 exponential growth, 504 functionality, 403 inputs, 404 K-input, 277 logic block illustration, 6 as logic “islands,” 404 mapping to, 289–90 as memory element, 403 memory size, 509 number per logic block, 5 outputs, 404 physical, 151 size, 5 synchronous, 510 Loops fission, 177 fusion, 177 interchange, 177 memory dependencies, 178–79 nest, 177 reversal, 177 Loosely coupled RPF and processor architecture, 41 Lossless synthesis, 285 Low-level languages, genome encoding, 732 Low-temperature anneal, 311 Low-voltage differential signaling (LVDS), 667 LTI. See Linear time-invariant systems M-tap filter, 509–10 Macrocells, mapping to, 292 Macros hard, 336, 424 identification, 424 parameterizable, 493 soft, 336, 424 Malware, 762 appearance, 764 propagation, 764 Manual partitioning, 540, 546 Mapping, 329–33 1:1, 329–30 combined approach, 332–33 component-specific, 837 DA onto FPGAs, 507–8 dedicated-wire, 641 design, with multiple asynchronous clocks, 657–61 incremental, 425–27, 662–63
LUT, 471–72 multi-FPGA emulator flow, 645 multiplexed-wire, 642 multiported memory, 657 N:1, 330–32 stages, 414 Mapping algorithms, 277–93 area-oriented, 280–82 complex logic blocks, 290–91 DAOmap, 282–83 delay optimal, 283 FlowMap, 279, 282 functional, 277 for heterogeneous resources, 289–92 IMap, 281–82 integrated, 278, 284–89 iterative, 288 LEKO, 282 logic to EMBs, 291–92 LUTs of different input sizes, 289–90 macrocells, 292 matching formulation, 841 MIS-pga, 288 optimal-depth, 287 performance-driven, 282–83 placement-driven, 287–89 PLAmap, 292 power-aware, 283–84 PRAETOR, 280–81 structural, 277, 278–84 times, 837 Markov Models, 78, 768 Mask parameters, 184, 187 MasPar, 221 Master slices, 320, 321 Matching complete, 841 defect tolerance with, 840–43 fine-grained Pterm, 841–42 formulation, 841 maximal, 841 MATLAB, 88, 195–97, 198 Matrix multiply, 679–83 decomposition, 680 FPGA implementation, 680–81 implementation, 681 MACC operations, 680 maximum achievable performance versus memory bandwidth, 683 memory accesses, 682 performance, 679, 682–83
performance of FPGAs and microprocessors, 684 Maximal matching, 841 Maximum magnitude phase, 582, 583–85 block diagram, 585 calculation, 583 See also SPIHT Maxwell’s equations, 697 curl, 698 discovery, 697 in rectangular coordinates, 699 as set of linear equations, 700 solving, 697 Memory access operations, 158 access optimization, 172 C for spatial computing, 157–58 CAM, 444 FDTD hardware implementation, 712–15 FPGA elements, 444 instruction, 814 nodes, 175 PE, 221 ports, 175, 444 retiming, 387 scalar variables in, 169 SDRAM, 760 shared, 124–25, 243–44 single pool, 104–5 total amount of, 444 virtual, 246–47 Memory management unit (MMU), 246, 247 Memory-centric computation, 779–802 algorithmic complexity, 786–94 parallelism, 794–99 performance results, 781–86 See also Active Pages Message authentication code (MAC), 775 Message passing, 124, 244 Method calls, 244 Microplacement, 342, 343 Microprocessors, 439, 441 MIS-pga algorithm, 288 Modular robotics, 739 Module generator interface, 326–29 data model, 327–28 flow, 327 intra-module layout, 328–29 library specification, 328 Module generators FLAME-based libraries, 327
flexibility, 326 PARAMOG library, 338 Mojave ATR system, 594, 604–6 machine comparison, 606 photograph, 605 results, 604 used resources, 605 Moore’s Law, 637, 753 circuit density growth, 49 process scaling, 826 MORPH project, 801 Morton Scan Ordering, 584 Most significant bit (MSB), 321, 493, 494, 510 Multi-context devices, 68–70 benefits, 69 configuration bits, 69 drawbacks, 69–70 physical capacity, 69 Multidomain signal transport, 658, 659, 660 requirement, 660 retimed, 660 Multi-FPGA emulation, 641–44 as complex verification platforms, 641 constraints, 644 crossbar topology, 643 dedicated-wire mapping, 641, 642 design mapping, 644–45 high-level flow, 644 inter- and intra-FPGA connections, 647 inter-partition logic communication, 641 interconnection, 647 mapping flow, 645 mesh topology, 643 multiplexed-wire mapping, 642 partitioning approach, 645–46 placement approach, 645–46 routing approaches, 646–50 topologies, 641, 643 See also Logic emulation systems Multi-SIMD coarse-grained array, 228 Multi-terminal nets, 420, 421, 425 Multi-threaded, 106, 123–25 FSMs with datapaths, 123–24 message passing, 124 model, 93 processors with channels, 124 shared memory, 124–25 Multiple wordlength adder formats, 494 optimization for, 478 paradigm, 476–77
Multiplexed-wire systems, 642 circuit mapping, 649 incremental compilation, 662 inter- and intra-FPGA connections, 647 partitioning for, 646 routing, 648 utilization of wires, 648 See also Dedicated-wire systems Multiplexers, 401 2-input, 130–32, 403 4-input, 134–35, 136–38, 404 FSM datapath, 144 if-then-else, 158–59 inputs, 403 logical equations, 133–34 primitive instantiation example, 258 pseudo, 377 Multiplexing factors, 796 nonactive memory and, 798 performance, 796 processor width versus, 797–99 Multiplication function, 405 Multipliers, 14–15 area estimation, 495 constant coefficient, 459, 495 embedded, 514, 712 floating point, 677–78 general cell, 466 instance-specific, 460 Lagrangian, 376 partial evaluation of, 466–70 shift-add, 467 Multiply–accumulate (MACC) operations, 680 Multiported memory mapping, 657 Multiprocessing environments, 799 Multivalued logic, 150 Multiway partitioning, 313 Muxes, building, 167 Myrinet ATR system, 606–7 host, 606 photograph, 607 simulations, 607 N:1 mapping, 330–32 NanoPLA, 841 architecture, 864–70 basic logic block, 864–67 block illustration, 865 blocks, 867 defect tolerance, 869
density benefits, 870 design mapping, 869 interconnect architecture, 867–69 memories, 869 tiling with edge I/O, 868 wired-OR planes, 867 Nanoscale architecture, 853–73 bottom-up technology, 855–58 challenges, 858–59 CMOS pitch matching via tilt, 872 design alternatives, 870–72 imprint lithography, 870–71 interfacing, 871–72 lithographic scaling, 854–55 nanoPLA, 864–70 nanowire circuits, 859–62 restoration, 872 statistical assembly, 862–64 Nanovia, 871 Nanowire circuits, 859–62 inverter, 862 restoration, 860–62 wired-OR diode logic array, 859–60 Nanowires, 856–57 addressing, 866 angled, 871 assembly, 857 decoder for, 863 doping profiles, 857–58 field effect controlled, 861 Langmuir–Blodgett alignment, 857 statistical selection, 863 switchable modules between, 858 NBitAdder design, 262 NBTI, 848 NCHARGE API, 772 Nearest-neighbor connectivity, 7–8 Negotiated Analytic Placement (NAP) algorithm, 315 Negotiated Congestion Avoidance algorithm, 369 Negotiated congestion router, 367–72 algorithm, 370–71 first-order congestion, 368 iterative, 369 priority queue, 371 second-order congestion, 370 Negotiated congestion/delay router, 372–73 NetFPGA, 776 Network Intrusion Detection System (NIDS), 460–61
Network processing build motivation, 753–54 complete system, 770–75 control and configuration, 771–72 control channel security, 774–75 data, with FPGAs, 755–56 dynamic hardware plug-ins, 773 hardware/software packet, 754–55 intrusion detection/prevention, 762–67 IP wrappers, 758 layered protocol wrapper implementation, 759 partial bitfile generation, 773–74 payload processing with regular expression scanning, 761–62 payload scanning with Bloom filters, 762 payload-processing modules, 760–61 protocol, 757–62 rack-mount chassis form factor, 770–71 with reconfigurable hardware, 753–57 reconfiguration mechanisms, 772–73 semantic, 767–70 system modularity, 756–57 TCP wrappers, 758–60 Next-state decoder, 149 Nodes dead, elimination, 170–71 exit, 165 memory, connecting, 175 Seed, 291 Noise injection, 490–93 Noise model, 487–88 Noise propagation, 488–89 Nonchronological backtracking, 618 Nondeterministic finite automata (NFA), 761 Nonlinear differentiable systems, 489–93 derivative monitors, 490 hybrid approach, 489–93 linearization, 490 noise injection, 490–93 perturbation analysis, 489 Nonrecurring engineering (NRE), 855 Not a number (NAN), 449 Number formats, 436 Object-oriented model, 98 Objects, 97–98 On-demand scheduling, 239 One-time programmable (OTP), 17 Ontogenetic axis, 727, 744 Ontogeny, 726
Open Systems Interconnection (OSI) Reference Model, 757 Open-ended evolution, 738–39 Operating system (OS) abstracted hardware resources, 234–36 communication, 243–48 demands, 232 dynamic scheduling, 240–41 flexible binding, 236–39 on-demand scheduling, 239 preemption, 242 protection, 231, 249–51 quasi-static scheduling, 241 real-time scheduling, 241–42 roles, 231 scheduling, 239–42 security, 231 static scheduling, 239–40 support, 231–52 Operations C for spatial computing, 157 DFG, 173–74 MACC, 680 memory access, 158 packing into clock cycles, 173–74 Operator size reduction, 171–72 dataflow analysis-based, 172 type-based, 171–72 Optimization(s) common path, 161–62 compaction, 338–42 context-sensitive, 340–42 decidable, 97 DFG, 164, 169–73 FPGA placement, 301–2 instance-specific, 456–57 interconnect, 110 logic, 342 memory access, 172 for multiple wordlength, 478 SPIHT, 586 undecidable, 97 wordlength, 485–97 word-level, 339–40 Ordering edges, 167–68 absence, 173 existence, 173 false, removing, 172 Packet inspection applications, 761 Packet switches, 216 Parallel compilation, VLE system, 665
Parallel linear array, 531 based on Virtex-4 DSP48 embedded tile, 533 CORDIC, 530–33 Parallel PathFinder, 377–79 Parallel program partitioning, 557–58 alternative region implementations, 557 evaluation, 557 exploration, 558 granularity, 557 implementation models, 557–58 Parallel programs, 540 data dependence, 102 data parallel, 105 data-centric, 105–6 multi-threaded, 106 sequentialization and, 104–5 synchronization, 248–49 Parallelism, 99, 105, 118, 248 artificial, 105 bulk synchronous, 118–19 in compute models, 95 data, 95, 234, 442 FDTD, 705–6 FDTD hardware design case study, 720–21 in FFT computation, 689 instruction-level, 95, 234, 796 maximum possible, 236 memory-centric computation, 794–99 PathFinder qualities, 379 raw spatial, 219 task, 95 Parameterizable macros, 493 Parametric generation, 136–38 PARAMOG module generator library, 338 PARBIT tool, 773–74 Partial evaluation, 462–73 accomplishing, 462 cell logic, 468–69 constant folding in logical expressions, 464–66 FPGA-specific concerns, 471–73 functional specialization, 468–70 geometric specialization, 470 LUT mapping, 471–72 motivation, 463 of multipliers, 466–70 optimized multiplication circuitry, 468 in practice, 464–66 process of specialization, 464 at runtime, 470–71
static resources, 472 true x value, 470 unnecessary logic removal, 466 verification of runtime specialization, 472–73 of XOR gate, 463 Partial evaluators, 464 Partially reconfigurable designs, 70–71 Partition evaluation, 544, 547 design metric, 547 dynamic, 548 heterogeneous, 548 objective function, 547 parallel program partitioning, 557 sequential program partitioning, 544, 547–48 trade-off, 547–48 Partition-based placement, 312–15 bipartitions, 312 hierarchical FPGAs, 313 multiway partitioning, 312 recursive partitioning, 313–14 See also FPGA placement Partitioned DWT, 572, 573 Partitioning, 155, 507 automatic HW/SW, 175–76 automatic, trend, 540–42 binary-level, 559 hardware/software, 539–59 incremental, 661 for island-style FPGAs, 314 manual, 540, 546 multi-FPGA, 645–46 for multiplexed-wire systems, 646 multiway, 312 recursive, 313–14 super-HWOP, 340 template, 598–99 Partitions, 540 PassAddOrConstant, 673, 674 PATH algorithm, 310 PathFinder, 216, 312, 365–80 accelerating, 418–22 applying A* to, 373–74 for asymmetric architectures, 373 bidirectional switches, 377 circuit graph extensions, 376–77 circuit graph model, 367 communication bandwidth, 421 cost function, 368, 375 de-multiplexers, 376–77
distributed memory multiprocessor implementation, 378 enhancements/extensions, 374–77 implementation, 366 incremental rerouting, 374–75 in incrementally rerouting signals, 379 Lagrangian relaxation relationship, 376 Nair algorithm versus, 370 negotiated congestion router, 367–72 negotiated congestion/delay router, 372–73 parallel, 377–78 parallelized, 421 QuickRoute and, 379 resource cost, 375 SC-PathFinder, 366 in scheduling communication in computing graphs, 379 single-processor, 421 symmetric device inputs, 376 Pattern matchers, 470–71 general bit-level, 471 instance-specific, 472 requirements, 470 Pattern matching, 470 Payload processing, 760–62 with Bloom filters, 762 modules, 760–61 with regular expression, 761–62 PE. See Processing elements Peak estimation, 478–85 analytic, 479–84 simulation-based, 484 See also Fixed-point computation Perfect component model, 831, 837–38 Perfectly matched layers (PMLs), 702 Performance Active Pages, 781–86 application, 441–44 computation, 441–43 coverification, 650 DA, 508–11 dot product, 685–86 FDTD hardware design case study, 722–23 FFT, 691–92 FPGA, 438 I/O, 443–44 matrix multiply, 682–83 multiplexing, 796 processor width, 796 Performance-driven mapping, 282–83
Perturbation analysis, 489 Peutil.exe utility, 587 Phased computations, 104 Phased reconfiguration, 210–11 manager, 117 schedule, 215 Phylogenetic axis, 727 POEtic tissue, 744 subdivision, 735 Phylogeny, 726 Physical synthesis, 316 PIM project, 801 Pipe and Filter, 108 Pipeline operators, 184 Pipeline reconfigurable architecture, 73–74 Pipelined scheduling, 174–75 Pipelined SIMD/vector processing, 228–29 Pipelining, 443 deep, 706 FDTD hardware design case study, 719–20 READ/CALCULATE/WRITE, 716 PipeRench, 32–35 CAD tools, 34 DIL, 34 PEs, 33 physical stripe, 32 pipelined configuration, 32 virtual pipeline stages, 34 See also Coarse-grained architectures; RPF Pipes, 99, 213 Pitch matching, 330 Placement directives, 302–4 fixed region, 303 floating region, 303 results, 304 See also FPGA placement Placement-driven algorithms, 287–89 PLAmap algorithm, 292, 869 Plasma architecture, 427, 428 POE model, 725–27 axes, 727 paradigms, 727 POEtic tissue, 743–45 Pointer independence, 178 Poly-phase filter bank (PFB), 200 Population-oriented evolution, 737–38 Port mapping, 133 Power cost, 284 Power estimation, 488–89 Power-aware mapping, 283–84
Power-based ranking, 284 PRAETOR algorithm, 280–82 area reduction techniques, 281 See also Mapping algorithms PRAM, 786 Predicates, 167 Preemption, 242 Prefetching, 77 Primary inputs (PIs), 278, 279 Primary outputs (POs), 278, 279 Primitive instantiation, 257–59 Primitive instruction, 808 PRISM, 53 Probability of detection (PD), 592 Processing elements (PEs), 29, 221–22, 225–26 data exchange, 221 index calculation, 227 memory, 221 resetting, 221 SIMD, 317 Processor width multiplexing versus, 797–99 performance, 796–97 Processors with channels, 124 connecting with communication channels, 124 customizable instruction, 461–62 SIMD, 219 VLIW, 164 Programmable Active Memories (PAM), 49–50 Programmable chips, 2 Programmable logic blocks (PLBs), 290–91 Programmatic circuit generators, 261–63 Programmer assistance (C compilation), 176–80 address indirection, 178 annotations, 178–79 control structure, 177–78 data size declaration, 178 large block integration, 179–80 local arrays, 177 loop fission and fusion, 177 loop interchange, 177 operator-level module integration, 179 useful code changes, 176–77 Protection, 249–51 hardware, 250–51 task configuration, 251 PROXI algorithm, 311–12 Pterm matching, 841–42
QRD-RLS (recursive least squares) filtering, 514 Quartz system, 361 Quasi-static scheduling, 241 QuickRoute, 379 Rack-mount chassis form factor, 770–71 RAM dedicated, 15 static (SRAM), 6, 15, 16–17, 767, 775 Range propagation, 482–84 Ranking, power-based, 284 RaPiD, 36–40, 801 application design, 36 architecture block diagram, 37 datapath overview, 38 instruction generator, 39 PEs, 38 programmable controller, 39 programming, 39–40 stream generator, 37 VICs, 39 RASP system, 304–5 RAW project, 801 Real-time scheduling, 241–42, 558 Reconfigurable Application Specific Processor (RASP), 60 Reconfigurable arrays (RAs), 43 Reconfigurable Communications Processor (RCP), 41 Reconfigurable computing architectures, 29–45 fabric, 30–34 impact on datapath composition, 324–26 independent RPF coprocessor, 36–40 processor + RPF, 40–44 RPF integration, 35–44 Reconfigurable computing systems, 47–62 accelerating technology, 56–59 AMD/Intel, 55–56 CAL, 53 circuit emulation, 54–56 cloning, 54 early, 47–49 F + V, 48 future, 62 issues, 61–62 non-FPGA research, 61 PAM, 49–50 PRISM, 53 small-scale, 52–54 Splash, 51–52
supercomputing, 59–60 Teramac, 57–59 traditional processor/coprocessor arrangement, 48 VCC, 50–51 Virtual Wires, 56 XC6200, 53–54 Reconfigurable functional units (RFUs), 41 processor pipeline with, 42 as RAs, 43 RFUOPs, 43 super-scalar processor with, 116 See also RFU and processor architecture Reconfigurable image correlator, 602–3 Reconfigurable Pipelined Datapaths. See RaPiD Reconfigurable processing fabric. See RPF Reconfigurable static design, 600–4 application-specific computation unit, 603–4 correlation task order, 601–2 design-specific parameters, 601 reconfigurable image correlator, 602–3 zero mask rows, 601–2 See also ATR Reconfigurable supercomputing, 59–60 CMX-2X, 60 Cray, 60 Silicon Graphics, 60 SRC, 60 Reconfiguration configuration, 66–76 overhead, 65 phased, 210–11 phased manager, 117 process management, 76–80 RTR, 65, 446–47 virtual, 741–42 Reconfiguration management, 65–83 configuration caching, 77 configuration compression, 81–82 configuration data reuse, 82 configuration grouping, 76 configuration scheduling, 77–79 configuration security, 82–83 configuration transfer time reduction, 80–82 context switching, 80 software-based relocation and defragmentation, 78–80 Recursive partitioning, 313–14
Recursive Pyramid Algorithm (RPA), 572 Reflection, 269 Register Transfer Level (RTL), 87, 129 logic organization, 133 VHDL description, 133–36 Regular expression (RE), 761 Regularity circuit layout, 319 datapath composition, 320–22 importance, 344 inter-HWOP, 339 Relocation, 71–73, 237 device support, 77 software-based, 79–80 support problem, 80 Rent’s Rule, 642 Repipelining, 389–90 feedforward computations, 389 FPGA effects on, 391 latency cycles, 390 retiming derivation, 389 throughput improvement, 390 Reprogrammable application devices (RADs), 756 Resonant-tunneling diodes (RTDs), 872 Resource cost, PathFinder, 375 Retiming adoption limitation factors, 398 area-time tradeoffs, 111 Bellman-Ford algorithm, 386 benefit, 388 constraint system, 385 correctness, 386 covering and, 286 design limitations, 387 effect, 287 FFs, 286 on fixed-frequency FPGAs, 394–95 FPGA effects on, 391 global set/reset constraint, 387 goal, 384 implementations, 393–94 with initial conditions, 387 integrated, 286–87 Leiserson’s algorithm, 384–86 memories, 387 multiple clocks and, 387–88 operation, 383 problem and results, 388 sequential control, 110 as superlinear, 398
See also C-slow retiming RFU and processor architecture, 41–42 datapath, 42 processor pipeline example, 42 RGB data conversion, 185–87 cycle alignment, 186 RightSize, 493 Rock’s Law, 855 Rollback, 845–48 communications, 847–48 detection, 846 recovery, 847 scheme, 849 for tolerating configuration upsets, 849–50 Rotation CORDIC, 515–18 Givens, 514 in matrix form, 515 micro-rotations, 526 as product of smaller rotations, 515 signal flow graph, 518 vector growth factor, 518 Rotation mode, 514–17 micro-rotation extensions, 516 as z-reduction mode, 517 See also CORDIC Routability-driven algorithms, 301 Routing, 215–16 congestion, 302 FPGA resources, 348 global, 366 hierarchical, 10–12, 301 horizontal, 308 incremental, 661 multi-FPGA emulation, 646–50 multiplexed-wire systems, 648 nearest-neighbor, 7–8 negotiated congestion, 367–72 PathFinder-style, 422 physical FPGA modifications for, 430 programmable resources, 12 SCORE, 215–16 search wave, 419 segmented, 8–10 simultaneous placement and, 311–13 solutions, 365–66 vertical, 308 VPR, 314, 372 Rows skipping, 602 zero mask, 601–2
RPF and processor architectures, 40–44 Chimaera, 42–44 loosely coupled, 41 tightly coupled, 41–42 RPFs, 29 architectures, 30–34 coarse-grained, 32–33 dynamic, 29 fine-grained, 30–32 independent coprocessor, 36–40 integration into traditional systems, 35–44 integration types, 35–36 locations in memory hierarchy, 35 RaPiD, 36–40 static, 29 RTL. See Register Transfer Level Rule tables, 738 Runtime binding, 237–38 Runtime netlist compilation, 213, 411, 413–14 dynamically compiled applications and, 414 requirement, 432 Runtime reconfiguration (RTR), 65, 446–47 applications, 447 global, 446, 447 local, 446–47, 448 Runtime Reconfigured Artificial Neural Network (RRANN), 447 Runtime specialization, 472–73 Sandia algorithm, 594 SAR. See Synthetic Aperture Radar SAT solvers, 618–27 algorithms, 633 backtrack algorithm implementation, 619–24 differences among, 633 follow-on, 627–33 future research, 634–35 global topology, 621 HW/SW organization, 633 implementation issues, 631–33 improved backtrack algorithm implementation, 624–27 logic engine implementation, 633 performance, 630–31 problem analysis, 618–19 runtime performance, 623 simultaneous exploration of multiple states, 635
system architecture, 627–30 system-level design and synthesis methodologies, 634 See also Boolean satisfiability Satisfiability (SAT) Boolean, 282, 613–35 FPGA-based solvers, 413 problem, 413 Sblocks, 742 SC-PathFinder, 366 Scaling CORDIC algorithm, 517–19 CORDIC, compensation, 534 FPGA, 411, 412, 431–32 Moore’s Law process, 826 with transfer functions, 481–82 wordlength, 477 Scheduling configuration, 77–79 dynamic, 240–41 module-mapped DFG, 174 on-demand, 239 operating system, 239–42 pipelined, 174–75 preemption, 242 quasi-static, 241 real-time, 241–42, 558 SCORE, 213–15 static, 239–40 window-based, 79 SCORE, 74, 203–217 application illustration, 204 back-pressure signal, 210 C++ integration and composition, 206–8 compilation, 212–13 compilation flow, 212 computations, 205 execution patterns, 208–12 fixed-size, 211–12 as higher-level programming model, 203 highlights, 217 operators, 205, 206, 207 phased reconfiguration, 210–11 platforms, 215 programming, 205–8 runtime, 203, 213–16 scalability, 203 scheduling, 213–15 sequential versus parallel, 211 standard I/O page, 211–12 stream support, 209–10 system architecture, 208–12
TDF, 205–6 virtualization model, 213 SCPlace algorithm, 310 SDF, 88, 99–100, 184 SDRAM memory, 760, 775 Sea Cucumber, 270 Search alternative procedures, 497 heuristic procedure, 496–97 techniques, 496–97 Search space, 728 Second-Level Detection (SLD), 592–94 as binary silhouette matcher, 593 shape sum, 593 steps, 593–94 target models, 593 See also ATR Semantic processing, 767–70 dataflow, 769 language identification, 767–68 of TCP data, 768–70 Sensitivity list, 135 Sequential control, 103–5, 110–18 with allocation, 104 compute task, 110 data dependencies, 110 data-dependent calculations, 104 Deterministic Finite Automata, 103 finite-state, 104 FSMD, 112 instruction augmentation, 115–16 phased computations, 104 phased reconfiguration manager, 117 processor, 114–15 single memory pool, 104–5 VLIW, 113–14 worker farm, 117–18 Sequential program partitioning, 540 alternative region implementation, 544, 549–50 Amdahl’s Law and, 542, 543 automatic, 175–76 exploration, 544, 552–57 granularity, 544, 545–47 ideal speedups, 543 implementation models, 550–52 manual, 176 partition evaluation, 544, 547–48 Sequential Turing Machines, 103 Sequentialization, 117 Set Partitioning in Hierarchical Trees. See SPIHT
Shared instructions, 815–16 Shared memory, 124–25, 243–44 abstraction, 244 implementations, 243 pools, 124, 125 Shared-wire global signaling, 628 Signal-processing primitives, 198 Signal-to-noise ratio (SNR), 486 Signal-to-quantization-noise ratio (SQNR), 486 Silicon Graphics supercomputers, 60 SIMD (single-instruction multiple data), 120, 219–22 algorithm compilation, 226 ALU control, 826 array size, 224 bit-processing elements, 817 computing on FPGAs, 219–21 datapaths, 815, 818 dot-product machine, 220–21 extended architecture, 227 interprocessor communication model, 224 multiple engines, 226–28 with pipelined vector units, 229 processing architectures, 221–22 processing array, 221, 222 processors, 219 width mismatches, 820 width selections, 820 SIMD/vector processing, 120–22 model, 229–30 multi-SIMD coarse-grained array, 228 multiple SIMD engines, 226–28 pipelined, 228–29 reconfigurable computers for, 223–26 SPMD model, 228 variations, 226–28 Simulated annealing, 306–12 accelerating, 415–18 annealing schedule, 307 complexity, 556 cost function, 306 distributed, 415 hardware-assisted, 418 hierarchical algorithm, 310–11 key feature, 556 low-temperature anneal, 311 meta-heuristics, 497 move generator, 306 parallelized, 416 schedule, 307
Simulated annealing (cont.) simultaneous placement and routing, 311–12 strengths, 307 temperature schemes, 415 VPR/related algorithms, 307–11 Simulated annealing placer, 836 Simulation, 486 Simulation-based peak estimation, 484 Simulink 2D video filtering, 187–91 component reuse, 198–200 control specification, 194–98 high-level algorithm designer, 188 image-processing design driver, 185–94 library browser, 196 mapping video filter to BEE2 platform, 191–94 Mask Editor, 198 mask parameters, 184 operator primitives, 183 pipeline operators, 184 programming streaming FPGA applications in, 183–202 RGB video conversion, 185–87 RGB-to-Y diagram, 286 SDF, 184 subsystems, 184 System Generator, 184 top-level testbench, 192–93 Simultaneous logic synthesis, 284–86 Sine, 437 Single-context FPGAs, 67–68 Single-FPGA emulation, 640–41 Single-instruction multiple data. See SIMD Single memory pool, 104–5 Single program, multiple data. See SPMD Single-rate synchronous dataflow, 99 Singular value decomposition (SVD), 514 SLD. See Second-Level Detection Small-scale reconfigurable systems, 52–54 SMAP algorithm, 291 Snapshots, 847 SNORT, 445, 775 CPU time, 461 database, 761 intrusion detection, 753 intrusion filter for TCP (SIFT), 765 rule-based NID sensor, 763 Sobel edge detection filter, 188, 191, 201 Soft macros, 336, 424
Sorter case study, 357–60 with layout information removed, 362 recursion and layout, 360–61 recursive structure, 357 Sparing defect tolerance through, 835–39 global, 836–37 local, 838–39 row and column, 837 yield with, 834–35 Spartan-3E, 530 Spatial computations, 157 Spatial computing, 155–80 Spatial orientation trees, 569, 584 Spatial simulated annealing, 215 SPIHT, 565–88 architecture phases, 581–82 bitstream, 578, 579 coding algorithm, 570 coding engine, 568–71 coding phase, 582, 585–86 design considerations/modifications, 571–80 design overview, 581–82 design results, 587–88 DWT architectures, 567, 571–75 DWT phase, 582 engine runtimes, 588 Fixed Order, 578–80 fixed-point precision analysis, 575–78 hardware implementation, 580–86 image compression, 565–88 image quality, 568 LIP, 569 LIS, 569, 570 LSP, 569, 570, 586 maximum magnitude phase, 582, 583–85 Morton Scan Ordering, 584 optimization, 586 performance numbers, 587 spatial orientation trees, 569, 584 target hardware platform, 581 wavelet coding, 569 Spiral antenna model, 704, 705 Splash, 51–52 SPMD (single program, multiple data), 120 in parallel processing clusters, 228 SIMD versus, 228 Springtime PCI (SPCI) card, 664 Square-root operation, 437 SRC supercomputers, 60
Standalone Board-level Evolvable System (SABLES), 745 Static FPGAs, 600 Static RPF, 29 Static scheduling, 239–40 Static-RAM (SRAM), 6, 15, 16–17, 814 analyzer, 767 cells, 17, 814 drawbacks, 17 parallel banks of, 775 Straight-line code, 156 Stream computations, 217 compilation, 212–13 execution patterns, 208–12 organization, 203–17 programming, 205–8 runtime, 213–16 system architecture, 107–10, 208–12 Stream generator, 37 Streaming dataflow, 107–10 with allocation, 102–3 data presence, 108–9 datapath sharing, 109 dynamic, 100–2 interconnect sharing, 110 streaming coprocessors, 109–10 Streaming FPGA applications, 183–202 component reuse, 198–200 high-performance datapaths, 184 image-processing design driver, 185–94 Streams, 37, 99, 244–46 abstraction, 245–46 input, 99 multirate, 100 persistence, 245–46 SCORE, 209–10 video, 185, 202 write, 206 Structural mapping algorithms, 278–84 area-oriented, 280–82 cut generation, 279 DAOmap, 282–83 dynamic programming basis, 278–79 FlowMap, 279, 282 IMap, 281–82 LEKO, 282 performance-driven, 282–83 power-aware, 283–84 PRAETOR, 281–82 See also Technology mapping Subsystems, 184 with configurable delays, 187
stream-based filtering, 190–91 tiled, 198–200 Super-HWOP, 340–41 building, 342–43 microplacement, 342, 343 partitioning, 340 Superslices, 340, 342 Swap negotiation, 417 Swappable logic units (SLU), 74 SWIM project, 801 Switch blocks example architecture, 10 island-style architecture with, 9 Switch boxes, 409 connectivity, 429 style and routability, 429 Synchronization, 248–49 deadlock prevention, 249 explicit, 248 implicit, 248–49 thread-style, 248 Synchronous Data Flow. See SDF Synopsys FPGA compiler, 393 Synoptix, 493 Synplicity Identify tool, 272 Synthetic Aperture Radar (SAR) ATR in, 591 Sandia real-time, 592 System architectures, 107–25 bulk synchronization pattern, 118–19 cellular automata, 122–23 data parallel, 119–22 hierarchical composition, 125 multi-threaded, 123–25 sequential control, 110–18 streaming dataflow, 107–10 System Generator library, 184 SystemC, 205, 542 Systolic image array pipeline, 603–4 T-VPack algorithm, 305, 306 Tail duplication, 164 Task configuration protection, 251 Task Description Format (TDF), 205–6 behavioral operator, 206 compositional operator, 208 operators, 208 as portable assembly language, 207 specification, 206, 207 Taylor coefficients, 490 Taylor expansion transformation, 490
TCP processing, 758–60 block diagram, 760 circuit development, 759 semantic, 768–70 See also Network processing Techmapper class, 260, 264, 272 Technology mapping, 277–93 algorithms, 277 algorithms for heterogeneous resources, 289–92 functional algorithms, 277 integrated, 278, 284–89 in logic synthesis flow, 278 optimal solutions, 285 structural algorithms, 277, 278–84 Templates correlation between, 598 grouping example, 599 partitioning, 598–99 Teramac, 57–59 applications, 58–59 features, 58 in hardware prototyping applications, 427 Terasys Integrated Circuit, 221 Terminal propagation, 314, 315 Ternary content addressable memory (TCAM), 764 Test pattern generation, 614, 615 Testbenches dynamic, 269–70 JHDL, 265–66 Theoretical underpinnings, 807–27 Tightly coupled RPF and processor architecture, 41–42 Tiled subsystems, 198–200 Timing-driven algorithms, 301, 302 Topology matching, 329 Transaction application protocol interface (TAPI), 664 Transaction-based host-emulator interfacing, 650–51 Transfer functions for nonrecursive systems, 480 scaling, 481–82 Transformations, 555 Transient faults, 830 feedforward correction, 844–45 rollback, 845–48 tolerance, 843–48 Translation lookaside buffer (TLB), 247, 397
Triple modular redundancy (TMR), 844–45, 849 Triple-key DES, 83 Truth tables, 4 Turing Machine, 96 Turing-complete compute models, 97, 119 Two-dimensional placement, 336–37 bin-based, 336 constrained, 335–36 Two-dimensional video filtering, 187–91 Uniaxial PML (UPML), 702, 706, 721 User datagram protocol (UDP), 758 Variable fixed-rate representation, 577 Variable-length chromosome GAs (VGAs), 735 Variables live at exits, 168–69 scalar, in memory, 169 Vector architectures, 120–21 functional units, 121 motivation, 120 sequential controller, 120 Vector coprocessors, 121–22 Vector functional units, 121, 229 Vectoring mode, 519–20 convergence, 520 implementations, 519 range extension, 525 simulation, 519 as y-reduction mode, 519 See also CORDIC Verilog, controller design with, 197 Very High-Speed Integrated Circuit Hardware Description Language (VHDL), 87–88, 129–53 Active Pages, 782 concurrent statements, 144, 150 controller design with, 197 delta delay, 150 design development, 130 FSM datapath example, 138–49 gates, 130 hardware compilation flow and, 150–52 hardware descriptions, 153 hardware module description, 132–33 limitations, 153 multivalued logic, 150 parametric hardware generation, 136–38 popularity, 129 port mapping, 133
ports, 133 programming, 130–50 RTL description, 133–36 sequential, comparison, 149 signals, 133 structural description, 130–33 submodules, 133 syntax, 153 Very long instruction word (VLIW), 61, 113–14, 795–97 computational elements, 795 processors, 164 of single multiply and add datapath, 113 time-slicing, 795 width, 797 Virtual circuit identifier (VCI), 756 Virtual Computer, 50–51 Virtual instruction configurations (VICs), 29 Chimaera architecture, 43–44 RaPiD, 39 speculative execution, 43 Virtual memory, 246–47 Virtual path identifier (VPI), 756 Virtual reconfiguration, 741–42 Virtual Wires, 56 Virtualized I/O, 72 Virtually addressed caches, 397 VirtuaLogic family, 642 VirtuaLogic VLE emulation system, 639, 653–65 array boards, 653–55 case study, 653–65 design clock cycle, 656 design partitions, 655 emulation mapping flow, 654 emulator system clock speed, 665 incremental compilation of designs, 661–64 incremental mapping, 662 incremental partitioning, 661 incremental path identification, 661 incremental routing, 661 inter-FPGA communication, 656 interfaces for coverification, 664–65 intra-FPGA computation, 656 multidomain signal transport, 658, 659, 660 multiported memory mapping, 657 netlist comparison, 661 parallel FPGA compilation, 665 partitioning, 654 software flow, 654–57
specialized mapping techniques, 657 statically scheduled routing, 656 structure, 653–54 See also Logic emulation systems Virus protection, 763–64 VLSI, CORDIC algorithm in, 514 VPack algorithm, 305 VPR, 307–11 annealing schedule, 307 delay computation, 309–10 enhancements, 310 move generator, 307 range limit update, 307–8 recomputation, 310 router, 314, 372 VStation family, 642 Washington University Gigabit Switch (WUGS), 756 Wavelets, 567 coding, 569 spatial orientation trees, 569 Wildcards (*), 152, 761 WildStar-II Pro FPGA board, 708–9 block diagram, 709 features, 708 memory hierarchy levels, 713 Xilinx Virtex-II Pro FPGAs on, 722 Window-based scheduling, 79 Wire congestion, 312 Wired-OR diode logic array, 859–60 Wordlength control over, 523 scaling, 477 Wordlength optimization, 485–97 area models, 485–96 error estimation, 485–96 problem, 499 search techniques, 496–97 simulation-based methods, 487 Word-level optimization, 339–40 Word-wide datapaths, 216, 815 Worker farms, 117–18 Worm detection, 766–67 Worm protection, 763–64 Xilinx 6200 series FPGA, 53–54, 81 cell configuration, 732 CLBs, 741 EHW platforms and, 740–41 “open” bitstream, 408 wildcard registers, 81
Xilinx ChipScope, 271 Core Generator IP, 348 EasyPath series, 842 Embedded Development Kit (EDK), 197 MicroBlaze, 194, 347 Virtex 2000E FPGA, 581 Virtex-4, 530, 533 XC 4036EX FPGA, 632 XC4VLX200 FPGA, 623 XC4000 library, 596 Xilinx Virtex-II Pro, 23–26, 68, 83, 530, 721 CLBs, 23–24 IBM PowerPC 405–D5 CPU cores, 25 logic architecture, 23–25 multiplier blocks, 24
routing architecture and resources, 25–26 SelectRAM+, 24, 25 on WildStar-II Pro board, 722 XC2VP100, 24 XOR gates, 463, 464 Y-reduction mode, 519 YaMoR, 739 Yield, 832–35 Law of Large Numbers impact, 835 perfect, 833–34 with sparing, 834–35 See also Defect tolerance Z-reduction mode, 517 Zero mask rows, 601–2